Data classes are like children. They are okay as a starting point, but to participate as a grownup object, they need to take some responsibility.1
Martin Fowler and Kent Beck
Python offers a few ways to build a simple class that is just a bunch of fields, with little or no extra functionality. That pattern is known as a “data class”—and dataclass
is the name of a Python decorator that supports it. This chapter covers three different class builders that you may use as shortcuts to write data classes:
collections.namedtuple
: the simplest way—since Python 2.6;
typing.NamedTuple
: an alternative that allows type annotations on the fields—since Python 3.5; class
syntax supported since 3.6;
@dataclasses.dataclass
: a class decorator that allows more customization than previous alternatives, adding lots of options and potential complexity—since Python 3.7.
After covering those class builders, we will discuss why Data Class is also the name of a code smell: a coding pattern that may be a symptom of poor object-oriented design.
The chapter ends with a section on a very different topic, but still closely related to record-like data: the struct
module, designed to parse and build packed binary records that you may find in legacy flat-file databases, network protocols, and file headers.
typing.TypedDict
may seem like another data class builder—it’s
described right after typing.NamedTuple
in the typing
module documentation,
and uses similar syntax.
However, TypedDict
does not build concrete classes that you can instantiate.
It’s just a way to write static annotations for function parameters and variables,
so we’ll study them in Chapter 8, “TypedDict”.
This chapter is new in Fluent Python 2nd edition. The sections “Classic Named Tuples” and “Structs and Memory Views” appeared in chapters 2 and 4 in the 1st edition, but the rest of the chapter is completely new.
We begin with a high-level overview of the three class builders.
Consider a simple class to represent a geographic coordinate pair:
class/coordinates.py
class Coordinate:

    def __init__(self, lat, long):
        self.lat = lat
        self.long = long
That Coordinate
class does the job of holding latitude and longitude attributes.
Writing the __init__
boilerplate becomes old real fast,
especially if your class has more than a couple of attributes:
each of them is mentioned three times!
And that boilerplate doesn’t buy us basic features we’d expect from a Python object:
>>> from coordinates import Coordinate
>>> moscow = Coordinate(55.76, 37.62)
>>> moscow
<coordinates.Coordinate object at 0x107142f10>
>>> location = Coordinate(55.76, 37.62)
>>> location == moscow
False
>>> (location.lat, location.long) == (moscow.lat, moscow.long)
True
__repr__
inherited from object
is not very helpful.
Meaningless equality; the __eq__
method inherited from object
compares object ids.
Comparing two coordinates requires explicit comparison of each attribute.
The data class builders covered in this chapter provide the necessary __init__
, __repr__
, and __eq__
methods automatically, as well as other useful features.
None of the class builders discussed here depend on inheritance to do their work.
Both collections.namedtuple
and typing.NamedTuple
build classes that are tuple
subclasses.
@dataclass
is a class decorator that does not affect the class hierarchy in any way.
Each of them uses different metaprogramming techniques to inject methods and data attributes
into the class under construction.
Here is a Coordinate
class built with namedtuple
—a factory function that builds a subclass of tuple
with the name and fields you specify:
>>> from collections import namedtuple
>>> Coordinate = namedtuple('Coordinate', 'lat long')
>>> issubclass(Coordinate, tuple)
True
>>> moscow = Coordinate(55.756, 37.617)
>>> moscow
Coordinate(lat=55.756, long=37.617)
>>> moscow == Coordinate(lat=55.756, long=37.617)
True
The newer typing.NamedTuple
provides the same functionality, adding a type annotation to each field:
>>> import typing
>>> Coordinate = typing.NamedTuple('Coordinate', [('lat', float), ('long', float)])
>>> issubclass(Coordinate, tuple)
True
>>> Coordinate.__annotations__
{'lat': <class 'float'>, 'long': <class 'float'>}
A typed named tuple can also be constructed with the fields given as keyword arguments, like this:
Coordinate = typing.NamedTuple('Coordinate', lat=float, long=float)
This is more readable, and also lets you provide the mapping of fields and types as **fields_and_types
.
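For example, a mapping of fields to types can be built first and then unpacked into the call. This is a small sketch with a hypothetical fields_and_types dict; note that recent Python versions deprecate this keyword form of typing.NamedTuple, so it may emit a warning:

```python
import typing

# Sketch: field names and types held in a dict, unpacked as
# keyword arguments; 'fields_and_types' is a hypothetical name.
fields_and_types = {'lat': float, 'long': float}
Coordinate = typing.NamedTuple('Coordinate', **fields_and_types)

print(Coordinate._fields)  # ('lat', 'long')
```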
Since Python 3.6, typing.NamedTuple
can also be used in a class
statement,
with type annotations written as described in PEP 526—Syntax for Variable Annotations.
This is much more readable, and makes it easy to override methods or add new ones.
Example 5-2 is the same Coordinate
class, with a pair of float
attributes
and a custom __str__
to display a coordinate formatted like 55.8°N, 37.6°E:
typing_namedtuple/coordinates.py
from typing import NamedTuple


class Coordinate(NamedTuple):
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'
Although NamedTuple
appears in the class
statement as a superclass, it’s actually not.
typing.NamedTuple
uses the advanced functionality of a
metaclass2
to customize the creation of the user’s class. Check this out:
>>> issubclass(Coordinate, typing.NamedTuple)
False
>>> issubclass(Coordinate, tuple)
True
In the __init__
method generated by typing.NamedTuple
, the fields appear as parameters in the same order they appear in the class
statement.
Like typing.NamedTuple
, the dataclass
decorator supports PEP 526 syntax to declare instance attributes.
The decorator reads the variable annotations and automatically generates methods for your class.
For comparison, check out the equivalent Coordinate
class written with the help of the dataclass
decorator:
dataclass/coordinates.py
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'
Note that the body of the classes in Example 5-2 and Example 5-3 are identical—the difference is in the class
statement itself.
The @dataclass
decorator does not depend on inheritance or a metaclass, so it should not interfere with your own use of these
mechanisms.3
The Coordinate
class in Example 5-3 is a subclass of object
.
The different data class builders have a lot in common. Here we’ll discuss the main features they share. Table 5-1 summarizes.
|  | namedtuple | NamedTuple | dataclass |
|---|---|---|---|
| mutable instances | NO | NO | YES |
| class statement syntax | NO | YES | YES |
| construct dict | x._asdict() | x._asdict() | dataclasses.asdict(x) |
| get field names | x._fields | x._fields | [f.name for f in dataclasses.fields(x)] |
| get defaults | x._field_defaults | x._field_defaults | [f.default for f in dataclasses.fields(x)] |
| get field types | N/A | x.__annotations__ | x.__annotations__ |
| new instance with changes | x._replace(…) | x._replace(…) | dataclasses.replace(x, …) |
| new class at runtime | namedtuple(…) | NamedTuple(…) | dataclasses.make_dataclass(…) |
A key difference between these class builders is that collections.namedtuple
and typing.NamedTuple
build tuple
subclasses, therefore the instances are immutable. By default, @dataclass
produces mutable classes. But the decorator accepts several keyword arguments to configure the class, including frozen
—shown in Example 5-3. When frozen=True
, the class will raise an exception if you try to assign values to the fields after the instance is initialized.
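A minimal sketch of that behavior, using a frozen Coordinate like the one in Example 5-3:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

c = Coordinate(55.76, 37.62)
try:
    c.lat = 0.0  # assignment on a frozen instance raises
except FrozenInstanceError as err:
    print(err)  # cannot assign to field 'lat'
```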
typing.NamedTuple
and dataclass
support the regular class
statement syntax, making it easier to add methods and docstrings to the class you are creating; collections.namedtuple
does not support that syntax.
Both named tuple variants provide an instance method (._asdict()) to construct a dict object from the fields in a data class instance. dataclass avoids injecting a similar method in the data class, but provides a module-level function to do it: dataclasses.asdict.
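Side by side, the two spellings look like this; CoordNT and CoordDC are hypothetical names used only for the sketch:

```python
from collections import namedtuple
from dataclasses import asdict, dataclass

# Named tuple variant: instance method
CoordNT = namedtuple('CoordNT', 'lat long')
print(CoordNT(55.76, 37.62)._asdict())  # {'lat': 55.76, 'long': 37.62}

# dataclass variant: module-level function
@dataclass
class CoordDC:
    lat: float
    long: float

print(asdict(CoordDC(55.76, 37.62)))    # {'lat': 55.76, 'long': 37.62}
```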
All three class builders let you get the field names and default values that may be configured for them.
In named tuple classes, that metadata is in the ._fields
and ._field_defaults
class attributes.
You can get the same metadata from a dataclass
decorated class using the fields
function from the dataclasses
module.
It returns a tuple of Field
objects which have several attributes, including name
and default
.
A mapping of field names to type annotations is stored in the __annotations__
class attribute in classes defined with the help of typing.NamedTuple
and dataclass
.
Given a named tuple instance x
, the call x._replace(**kwargs)
returns a new instance with some attribute values replaced according to the keyword arguments given. The dataclasses.replace(x, **kwargs)
module-level function does the same for an instance of a dataclass
decorated class.
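A brief sketch of both spellings, again with hypothetical CoordNT and CoordDC classes:

```python
from collections import namedtuple
from dataclasses import dataclass, replace

CoordNT = namedtuple('CoordNT', 'lat long')
nt = CoordNT(55.76, 37.62)
print(nt._replace(lat=0.0))   # CoordNT(lat=0.0, long=37.62)

@dataclass
class CoordDC:
    lat: float
    long: float

dc = CoordDC(55.76, 37.62)
print(replace(dc, lat=0.0))   # CoordDC(lat=0.0, long=37.62)
```

In both cases the original instance is left untouched; a new instance is returned.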
Although the class
statement syntax is more readable, it is hard-coded. A framework may need to build data classes on the fly, at runtime. For that, you can use the default function call syntax of collections.namedtuple
, which is likewise supported by typing.NamedTuple
. The dataclasses
module provides a make_dataclass
function for the same purpose.
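For instance, a runtime-built equivalent of the Coordinate examples might look like this sketch:

```python
from dataclasses import make_dataclass

# Build a data class at runtime, by analogy with namedtuple(...):
Coordinate = make_dataclass('Coordinate', [('lat', float), ('long', float)])
print(Coordinate(55.76, 37.62))  # Coordinate(lat=55.76, long=37.62)
```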
After this overview of the main features of the data class builders, let’s focus on each of them in turn, starting with the simplest.
The collections.namedtuple
function is a factory that builds subclasses of tuple
enhanced with field names, a class name, and a nice __repr__
—which helps debugging.
Classes built with namedtuple
can be used anywhere where tuples
are needed, and in fact many functions of the Python standard library that used to return tuples now
return named tuples for convenience, without affecting the user’s code at all.
Each instance of a class built by namedtuple
takes exactly the same amount of memory as a tuple because the field names are stored in the class. They use less memory than a regular object because they don’t store attributes as key-value pairs in one __dict__
for each instance.
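You can verify the absence of a per-instance __dict__ directly; this sketch relies on the fact that the generated class sets __slots__ to an empty tuple:

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')
p = Point(1, 2)
# The generated class defines __slots__ = (), so instances
# carry no per-instance __dict__ of attributes:
print(hasattr(p, '__dict__'))  # False
```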
Example 5-4 shows how we could define a named tuple to hold information about a city.
>>> from collections import namedtuple
>>> City = namedtuple('City', 'name country population coordinates')
>>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
>>> tokyo
City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722,
139.691667))
>>> tokyo.population
36.933
>>> tokyo.coordinates
(35.689722, 139.691667)
>>> tokyo[1]
'JP'
Two parameters are required to create a named tuple: a class name and a list of field names, which can be given as an iterable of strings or as a single space-delimited string.
Field values must be passed as separate positional arguments to the constructor (in contrast, the tuple
constructor takes a single iterable).
You can access the fields by name or position.
As a tuple
subclass, City
inherits useful methods such as __eq__
and
the special methods for comparison operators (__gt__
, __ge__
, etc.) which are useful for sorting
sequences of City
.
In addition to the methods inherited from tuple
, a named tuple offers a few attributes and methods.
Example 5-5 shows the most useful: the _fields
class attribute, the class method _make(iterable)
, and the _asdict()
instance method.
>>> City._fields
('name', 'country', 'population', 'coordinates')
>>> Coordinate = namedtuple('Coordinate', 'lat long')
>>> delhi_data = ('Delhi NCR', 'IN', 21.935, Coordinate(28.613889, 77.208889))
>>> delhi = City._make(delhi_data)
>>> delhi._asdict()
{'name': 'Delhi NCR', 'country': 'IN', 'population': 21.935,
'coordinates': Coordinate(lat=28.613889, long=77.208889)}
>>> import json
>>> json.dumps(delhi._asdict())
'{"name": "Delhi NCR", "country": "IN", "population": 21.935,
"coordinates": [28.613889, 77.208889]}'
._fields
is a tuple with the field names of the class.
._make()
builds City
from an iterable; City(*delhi_data)
would do the same.
._asdict()
returns a dict
built from the named tuple instance.
._asdict()
is useful to serialize the data in JSON format, for example.
The _asdict
method returned an OrderedDict
in Python 2.7, and in Python 3.1 to 3.7.
Since Python 3.8, a regular dict
is returned—which is probably fine now that we can rely on key insertion order.
If you must have an OrderedDict
, the
_asdict
documentation
recommends building one from the result: OrderedDict(x._asdict())
.
Since Python 3.7, namedtuple
accepts the defaults
keyword-only argument providing
an iterable of N default values for each of the N rightmost fields of the class.
Example 5-6 shows how to define a Coordinate
named tuple with a default value for a reference
field:
>>> Coordinate = namedtuple('Coordinate', 'lat long reference', defaults=['WGS84'])
>>> Coordinate(0, 0)
Coordinate(lat=0, long=0, reference='WGS84')
>>> Coordinate._field_defaults
{'reference': 'WGS84'}
In “Class statement syntax” I mentioned it’s easier to code methods with the
class syntax supported by typing.NamedTuple
and @dataclass
.
You can also add methods to a namedtuple
, but it’s a hack.
Skip the following box if you’re not interested in hacks.
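For the curious, one common version of that hack (a sketch, not necessarily the exact one in the box) is to subclass the class generated by namedtuple:

```python
from collections import namedtuple

# Subclass the generated named tuple to attach a method.
_Coordinate = namedtuple('Coordinate', 'lat long')

class Coordinate(_Coordinate):
    __slots__ = ()  # keep instances free of a per-instance __dict__

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'

print(Coordinate(55.76, 37.62))  # 55.8°N, 37.6°E
```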
Now let’s check out the typing.NamedTuple
variation.
The Coordinate
class with a default field from Example 5-6 can be written like this using typing.NamedTuple
:
typing_namedtuple/coordinates2.py
from typing import NamedTuple


class Coordinate(NamedTuple):
    lat: float
    long: float
    reference: str = 'WGS84'
Classes built by typing.NamedTuple
don’t have any methods beyond those that collections.namedtuple
also generates—and those that are inherited from tuple
.
The only difference at runtime is the presence of the __annotations__ class attribute—which Python completely ignores at runtime.
Classes built with typing.NamedTuple
also have a _field_types
attribute. Since Python 3.8, that attribute is deprecated in favor of __annotations__
which has the same information and is the canonical place to find type hints in Python objects that have them.
Given that the main feature of typing.NamedTuple
is the type annotations, we’ll take a brief look at them before resuming our exploration of data class builders.
Type hints—a.k.a. type annotations—are ways to declare the expected type of function arguments, return values, and variables.
This is a very brief introduction to type hints,
just enough to make sense of the syntax and meaning of the annotations used in typing.NamedTuple
and @dataclass
declarations.
We will cover type hints for function signatures in Chapter 8
and more advanced annotations like generics, unions etc. in [Link to Come].
Here we’ll mostly see hints with built-in types, such as str
, int
, and float
,
which are probably the most common types used to annotate fields of data classes.
The first thing you need to know about type hints is that they are not enforced at all by the Python bytecode compiler and runtime interpreter.
Type annotations don’t have any impact on the runtime behavior of Python programs. Check this out:
>>> import typing
>>> class Coordinate(typing.NamedTuple):
...     lat: float
...     long: float
...
>>> trash = Coordinate('Ni!', None)
>>> trash
Coordinate(lat='Ni!', long=None)
If you type the code of Example 5-9 in a Python module,
replacing the last line with print(trash)
,
it will happily run and display a meaningless Coordinate
, with no error or warning:
$ python3 nocheck_demo.py
Coordinate(lat='Ni!', long=None)
The type hints are intended primarily to support third-party type checkers, like Mypy or the PyCharm IDE built-in type checker. These are static analysis tools: they check Python source code “at rest”, not running code.
To see the effect of type hints, you must run one of those tools on your code—like a linter. For instance, here is what Mypy has to say about the previous example:
$ mypy nocheck_demo.py
nocheck_demo.py:8: error: Argument 1 to "Coordinate" has incompatible type
"str"; expected "float"
nocheck_demo.py:8: error: Argument 2 to "Coordinate" has incompatible type
"None"; expected "float"
As you can see, given the definition of Coordinate
,
Mypy knows that both arguments to create an instance must be of type float
,
but the assignment to trash
uses a str
and None
.5
Now let’s talk about the syntax and meaning of type hints.
Both typing.NamedTuple
and @dataclass
use the syntax of variable annotations defined in PEP 526.
This is a quick introduction to that syntax, in the context of defining attributes in class
statements.
The basic syntax of variable annotation is:
var_name: some_type
The “Acceptable type hints” section of PEP 484 explains what types are acceptable, but in the context of defining a concrete data class, these types are more useful:
a concrete class, for example str
or FrenchDeck
;
a parameterized collection type, like List[int]
, Tuple[str, float]
etc.
typing.Optional
, for example Optional[str]
;
Although possible in theory, it’s not very useful to define fields with abstractions such as:
special type constructs like Any
, Union
, Protocol
etc. from the typing
module;
an ABC (Abstract Base Class);
You can also initialize the variable with a value. In a typing.NamedTuple
or @dataclass
declaration,
that value will become the default for that attribute,
if the corresponding argument is omitted in the constructor call.
var_name: some_type = a_value
We saw in “No runtime effect” that type hints have no effect at runtime.
But at import time—when a module is loaded—Python does read them to build the __annotations__
dictionary that typing.NamedTuple
and @dataclass
then use to enhance the class.
We’ll start this exploration with a simple class,
so that we can later see what extra features are added by typing.NamedTuple
and @dataclass
.
class DemoPlainClass:
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation, but is otherwise discarded.
b
is saved as an annotation, and also becomes a class attribute with value 1.1
.
c
is just a plain old class attribute, not an annotation.
We can verify that in the console, first reading the __annotations__
of the DemoPlainClass
, then trying to get its attributes named a
, b
, and c
:
>>> from demo_plain import DemoPlainClass
>>> DemoPlainClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoPlainClass.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'DemoPlainClass' has no attribute 'a'
>>> DemoPlainClass.b
1.1
>>> DemoPlainClass.c
'spam'
Note that the __annotations__
special attribute is created by the interpreter to record the type hints that appear in the source code—even in a plain class.
The a
survives only as an annotation. It doesn’t become a class attribute because no value is bound to it.6
The b
and c
are stored as class attributes because they are bound to values.
None of those three attributes will be in a new instance of DemoPlainClass
.
If you create an object o = DemoPlainClass()
, o.a
will raise AttributeError
, while o.b
and o.c
will retrieve the class attributes with values 1.1
and 'spam'
—that’s just normal Python object behavior.
typing.NamedTuple
Now let’s examine a class built with typing.NamedTuple
, using the same attributes and annotations as DemoPlainClass
from Example 5-10.
demo_nt.py: a class built with typing.NamedTuple

import typing


class DemoNTClass(typing.NamedTuple):
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation and also an instance attribute.
b
is another annotation, and also becomes an instance attribute with default value 1.1
.
c
is just a plain old class attribute; no annotation will refer to it.
Inspecting the DemoNTClass
, we get:
>>> from demo_nt import DemoNTClass
>>> DemoNTClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoNTClass.a
<_collections._tuplegetter object at 0x101f0f940>
>>> DemoNTClass.b
<_collections._tuplegetter object at 0x101f0f8b0>
>>> DemoNTClass.c
'spam'
Here we see the same annotations for a
and b
as we saw in Example 5-10.
But DemoNTClass
has three class attributes a
, b
, and c
. The c
attribute is just a plain class attribute with the value 'spam'
.
The a
and b
class attributes are actually descriptors—an advanced feature covered in [Link to Come].
For now, think of them as similar to property getters: methods that don’t require the explicit call operator ()
to retrieve an instance attribute. In practice, this means a
and b
will work as read-only instance attributes—which makes sense when we recall that DemoNTClass
instances are just fancy tuples, and tuples are immutable.
DemoNTClass
also gets a custom docstring:
>>> DemoNTClass.__doc__
'DemoNTClass(a, b)'
Let’s inspect an instance of DemoNTClass
:
>>> nt = DemoNTClass(8)
>>> nt.a
8
>>> nt.b
1.1
>>> nt.c
'spam'
To construct nt
, we need to give at least the a
argument to DemoNTClass
. The constructor also takes a b
argument, but it has a default value of 1.1
, so it’s optional. The nt
object has the a
and b
attributes as expected; it doesn’t have a c
attribute, but Python retrieves it from the class, as usual.
If you try to assign values to nt.a
, nt.b
, nt.c
or even nt.z
you’ll get AttributeError
, with subtly different error messages. Try that and reflect on the messages.
dataclass
Now we’ll examine Example 5-12:
demo_dc.py: a class decorated with @dataclass

from dataclasses import dataclass


@dataclass
class DemoDataClass:
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation and also an instance attribute.
b
is another annotation, and also becomes an instance attribute with default value 1.1
.
c
is just a plain old class attribute; no annotation will refer to it.
Now let’s check out __annotations__
, __doc__
, and the a
, b
, c
attributes on DemoDataClass
:
>>> from demo_dc import DemoDataClass
>>> DemoDataClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoDataClass.__doc__
'DemoDataClass(a: int, b: float = 1.1)'
>>> DemoDataClass.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'DemoDataClass' has no attribute 'a'
>>> DemoDataClass.b
1.1
>>> DemoDataClass.c
'spam'
The __annotations__
and __doc__
are not surprising.
However, there is no attribute named a
in DemoDataClass
—in contrast with DemoNTClass
from Example 5-11,
which has a descriptor to get a
from the instances as read-only attributes (that mysterious <_collections._tuplegetter>
).
That’s because the a
attribute will only exist in instances of DemoDataClass
.
It will be a public attribute that we can get and set, unless the class is frozen.
But b
and c
exist as class attributes, with b
holding the default value for the b
instance attribute,
while c
is just a class attribute that will not be bound to the instances.
Now let’s see what a DemoDataClass instance looks like:
>>> dc = DemoDataClass(9)
>>> dc.a
9
>>> dc.b
1.1
>>> dc.c
'spam'
Again, a
and b
are instance attributes, and c
is a class attribute we get via the instance.
As mentioned, DemoDataClass
instances are mutable—and no type checking is done at runtime:
>>> dc.a = 10
>>> dc.b = 'oops'
We can do even sillier assignments:
>>> dc.c = 'whatever'
>>> dc.z = 'secret stash'
Now the dc
instance has a c
attribute—but that does not change the c
class attribute. And we can add a new z
attribute.
This is normal Python behavior: regular instances can have their own attributes that don’t appear in the class.7
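You can confirm that split with vars(), which shows only the per-instance attributes; this sketch reuses the DemoDataClass attributes from Example 5-12:

```python
from dataclasses import dataclass

@dataclass
class DemoDataClass:
    a: int
    b: float = 1.1
    c = 'spam'

dc = DemoDataClass(9)
dc.c = 'whatever'
dc.z = 'secret stash'
print(vars(dc))         # {'a': 9, 'b': 1.1, 'c': 'whatever', 'z': 'secret stash'}
print(DemoDataClass.c)  # 'spam'; the class attribute is untouched
```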
@dataclass
We’ve only seen simple examples of @dataclass
use so far. The decorator accepts several arguments. This is its signature:
@dataclass(*, init=True, repr=True, eq=True, order=False,
           unsafe_hash=False, frozen=False)
The *
in the first position means the remaining parameters are keyword-only. Table 5-2 describes them.
| option | default | meaning | notes |
|---|---|---|---|
| init | True | generate __init__ | Ignored if __init__ is implemented by user. |
| repr | True | generate __repr__ | Ignored if __repr__ is implemented by user. |
| eq | True | generate __eq__ | Ignored if __eq__ is implemented by user. |
| order | False | generate __lt__, __le__, __gt__, __ge__ | Raises exceptions if eq=False, or if any of the comparison methods that would be generated are defined or inherited. |
| unsafe_hash | False | generate __hash__ | Complex semantics and several caveats—see: dataclass documentation. |
| frozen | False | make instances “immutable” | Instances will be reasonably safe from accidental change, but not really immutable. |
The defaults are really the most useful settings for common use cases. The options you are more likely to change from the defaults are:
order=True
: to allow sorting of instances of the data class;
frozen=True
: to protect against accidental changes to the class instances.
Given the dynamic nature of Python objects, it’s not too hard for a nosy programmer to go around the protection afforded by frozen=True
. But the necessary tricks should be easy to spot on a code review.
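For instance, one such trick (shown only to illustrate why frozen instances are not truly immutable) bypasses the generated __setattr__:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

c = Coordinate(55.76, 37.62)
# Going behind the back of the frozen __setattr__; easy to spot in review:
object.__setattr__(c, 'lat', 0.0)
print(c.lat)  # 0.0
```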
If eq
and frozen
are both true, @dataclass
will produce a suitable __hash__
method, so the instances will be hashable.
The generated __hash__
will use data from all fields that are not individually excluded using a field option we’ll see in “Key-sharing dictionary”.
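A quick sketch of the practical effect: with frozen=True and the default eq=True, instances can serve as dict keys or set members:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

# frozen=True plus the default eq=True yields a usable __hash__:
places = {Coordinate(55.76, 37.62): 'Moscow'}
print(places[Coordinate(55.76, 37.62)])  # Moscow
```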
If frozen=False
(the default), @dataclass
will set __hash__
to None
, signalling that the instances are unhashable, therefore overriding __hash__
from any superclass.
Regarding unsafe_hash
, PEP 557 has this to say:
Although not recommended, you can force Data Classes to create a
__hash__
method withunsafe_hash=True
. This might be the case if your class is logically immutable but can nonetheless be mutated. This is a specialized use case and should be considered carefully.
I will leave unsafe_hash
at that. If you feel you must use that option, check the dataclasses.dataclass
documentation.
Further customization of the generated data class can be done at a field level.
We’ve already seen the most basic field option: providing (or not) a default value with the type hint.
Note that fields are read in order, and after you declare a field with a default value,
all remaining fields must also have default values.
This limitation makes sense: the fields will become parameters in the generated __init__
,
and Python does not allow non-default parameters following parameters with defaults.
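The same restriction can be seen in a plain def statement; in this sketch the source is compiled dynamically only so the SyntaxError can be caught:

```python
# A non-default parameter after a defaulted one is rejected at compile time:
try:
    compile("def f(a=1, b): pass", "<demo>", "exec")
except SyntaxError:
    print("non-default parameter after a default is a SyntaxError")
```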
Mutable default values are a common source of bugs for beginning Python developers.
In function definitions, a mutable default value is easily corrupted when one invocation of the function mutates the default,
changing the behavior of further invocations—an issue we’ll explore in “Mutable Types as Parameter Defaults: Bad Idea” (Chapter 6).
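The classic demonstration of that bug, ahead of the fuller discussion in Chapter 6, is a function with a list default:

```python
def append_to(item, target=[]):  # the default list is created only once
    target.append(item)
    return target

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2]; the same list, mutated across calls
```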
Class attributes are often used as default attribute values for instances, including in data classes.
And @dataclass
uses the default values in the type hints to generate parameters with defaults for __init__
.
To prevent bugs, @dataclass
rejects the class definition in Example 5-13.
dataclass/club_wrong.py: this class raises ValueError

@dataclass
class ClubMember:
    name: str
    guests: list = []
If you load the module with that ClubMember
class, this is what you get:
$ python3 club_wrong.py
Traceback (most recent call last):
  File "club_wrong.py", line 4, in <module>
    class ClubMember:
  ...several lines omitted...
ValueError: mutable default <class 'list'> for field guests is not allowed:
use default_factory
The ValueError
message explains the problem and suggests a solution: use default_factory
. This is how to correct ClubMember
:
dataclass/club.py: this ClubMember definition works

from dataclasses import dataclass, field


@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)
In the guests
field of Example 5-14, instead of a literal list, the default value is set by calling the dataclasses.field
function with default_factory=list
. The default_factory
parameter lets you provide a function, class, or any other callable, which will be invoked with zero arguments to build a default value each time an instance of the data class is created. This way, each instance of ClubMember
will have its own list
—instead of all instances sharing the same list
from the class, which is rarely what we want and is often a bug.
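A quick check of that behavior, reusing the corrected ClubMember definition from Example 5-14:

```python
from dataclasses import dataclass, field

@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)

ann = ClubMember('Ann')
ben = ClubMember('Ben')
ann.guests.append('Charlie')
print(ann.guests)  # ['Charlie']
print(ben.guests)  # []; each instance got its own list
```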
It’s good that @dataclass
rejects class definitions with a list
default value in a field.
However, be aware that it is a partial solution that only applies to list
, dict
and set
.
Other mutable values used as defaults will not be flagged by @dataclass
.
It’s up to you to understand the problem and remember to use a default factory to set mutable default values.
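For example, this sketch with a hypothetical mutable Bag class is accepted by @dataclass, yet every instance ends up sharing one Bag:

```python
from dataclasses import dataclass

class Bag:
    """Mutable, but not a list, dict or set, so @dataclass accepts it."""
    def __init__(self):
        self.items = []

@dataclass
class Member:
    name: str
    stuff: Bag = Bag()  # one Bag instance shared by every Member!

a = Member('Ann')
b = Member('Ben')
a.stuff.items.append('oops')
print(b.stuff.items)  # ['oops']; both instances share the same Bag
```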
If you browse the dataclasses
module documentation, you’ll see a list
field defined with a novel syntax, as in Example 5-15.
dataclass/club_generic.py: this ClubMember definition is more precise

from dataclasses import dataclass, field
from typing import List


@dataclass
class ClubMember:
    name: str
    guests: List[str] = field(default_factory=list)
The new syntax List[str]
is a generic type definition: the List
class from typing
accepts that bracket notation to specify the type of the list items. We’ll cover generics in [Link to Come]. For now, note that both Example 5-14 and Example 5-15 are correct, and the Mypy type checker does not complain about either of those class definitions. But the second one is more precise, and will allow the type checker to verify code that puts items in the list, or that read items from it.
The default_factory
is by far the most frequently used option of the field
function, but there are several others, listed in Table 5-3.
| option | default | meaning |
|---|---|---|
| default | _MISSING_TYPE | default value for the field |
| default_factory | _MISSING_TYPE | 0-parameter function used to produce a default |
| init | True | include field in parameters to __init__ |
| repr | True | include field in __repr__ |
| hash | None | use field to compute __hash__ |
| compare | True | use field in comparison methods __eq__, __lt__ etc. |
| metadata | None | mapping with user-defined data; ignored by the @dataclass machinery |
The default
option exists because the field
call takes the place of the default value in the field annotation.
If you want to create an athlete
field with default value of False
, and also ommit that field from the __repr__
method, you’d write this:
@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)
    athlete: bool = field(default=False, repr=False)
The __init__
method generated by @dataclass
only takes the arguments passed and assigns them—or their default values,
if missing—to the instance attributes that are instance fields.
But you may need to do more than that to initialize the instance.
If that’s the case, you can provide a __post_init__
method.
When that method exists, @dataclass
will add code to the generated __init__
to call __post_init__
as the last step.
Common use cases for __post_init__
are validation and computing field values based on other fields.
We’ll study a simple example that uses __post_init__
for both of these reasons.
First, let’s look at the expected behavior of a ClubMember
subclass named HackerClubMember
, as described by doctests in Example 5-16.
dataclass/hackerclub.py
: doctests for HackerClubMember
"""
``HackerClubMember`` objects accept an optional ``handle`` argument::
>>> anna = HackerClubMember('Anna Ravenscroft', handle='AnnaRaven')
>>> anna
HackerClubMember(name='Anna Ravenscroft', guests=[], handle='AnnaRaven')
If ``handle`` is omitted, it's set to the first part of the member's name::
>>> leo = HackerClubMember('Leo Rochael')
>>> leo
HackerClubMember(name='Leo Rochael', guests=[], handle='Leo')
Members must have a unique handle. The following ``leo2`` will not be created,
because its ``handle`` would be 'Leo', which was taken by ``leo``::
>>> leo2 = HackerClubMember('Leo DaVinci')
Traceback (most recent call last):
...
ValueError: handle 'Leo' already exists.
To fix, ``leo2`` must be created with an explicit ``handle``::
>>> leo2 = HackerClubMember('Leo DaVinci', handle='Neo')
>>> leo2
HackerClubMember(name='Leo DaVinci', guests=[], handle='Neo')
"""
Note that we must provide handle
as a keyword argument, because HackerClubMember
inherits name
and guests
from ClubMember
, and adds the handle
field. The generated docstring for HackerClubMember
shows the order of the fields in the constructor call:
>>> HackerClubMember.__doc__
"HackerClubMember(name: str, guests: list = <factory>, handle: str = '')"
Here, <factory>
is a short way of saying that some callable will produce the default value for guests
(in our case, the factory is the list
class).
The point is: to provide a handle
but no guests
, we must pass handle
as a keyword argument.
The Inheritance section of the dataclasses
module documentation explains how the order of the fields is computed when there are several levels of inheritance.
In [Link to Come] we’ll talk about misusing inheritance, particularly when the superclasses are not abstract.
Creating a hierarchy of data classes is usually a bad idea, but it served us well here to make Example 5-17 shorter,
focusing on the handle
field declaration and __post_init__
validation.
Example 5-17 is the implementation:
Example 5-17. dataclass/hackerclub.py: code for HackerClubMember.

from dataclasses import dataclass
from club import ClubMember

@dataclass
class HackerClubMember(ClubMember):
    all_handles = set()
    handle: str = ''

    def __post_init__(self):
        cls = self.__class__
        if self.handle == '':
            self.handle = self.name.split()[0]
        if self.handle in cls.all_handles:
            msg = f'handle {self.handle!r} already exists.'
            raise ValueError(msg)
        cls.all_handles.add(self.handle)
HackerClubMember extends ClubMember.
all_handles is a class attribute.
handle is an instance field of type str with an empty string as its default value; this makes it optional.
Get the class of the instance.
If self.handle is the empty string, set it to the first part of name.
If self.handle is in cls.all_handles, raise ValueError.
Add the new handle to cls.all_handles.
Example 5-17 works as intended, but is not satisfactory to a static type checker. Next, we’ll see why, and how to fix it.
If we typecheck Example 5-17 with Mypy, we are reprimanded:
$ mypy hackerclub.py
hackerclub.py:38: error: Need type annotation for 'all_handles'
    (hint: "all_handles: Set[<type>] = ...")
Found 1 error in 1 file (checked 1 source file)
Unfortunately, the hint provided by Mypy (version 0.750 as I write this) is not helpful in the context of @dataclass
usage.
If we add a type hint like Set[…]
to all_handles
, @dataclass
will find that annotation and make all_handles
an instance field.
We saw this happening in “Inspecting a class decorated with dataclass
”.
The work-around defined in PEP 526—Syntax for Variable Annotations
is a class variable annotation, written with a pseudo-type named typing.ClassVar
,
which leverages the generics []
notation to set the type of the variable and also declare it a class attribute.
To make the type checker happy, this is how we are supposed to declare all_handles
in Example 5-17:
all_handles: ClassVar[Set[str]] = set()
That type hint is saying: all_handles is a class attribute of type set-of-str, with an empty set as its default value.
To code that annotation, we must import ClassVar
and Set
from the typing
module.
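The effect is easy to verify: with the ClassVar annotation, dataclasses.fields does not list the attribute. Here is a trimmed-down sketch (the Member class is hypothetical, standing in for HackerClubMember):

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Set

@dataclass
class Member:
    name: str = ''
    # ClassVar tells @dataclass to leave this alone: it stays a class
    # attribute and no instance field is generated for it.
    all_handles: ClassVar[Set[str]] = set()

print([f.name for f in fields(Member)])  # ['name'] — all_handles excluded
```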
The @dataclass
decorator doesn’t care about the types in the annotations, except in two cases,
and this is one of them: if the type is ClassVar
, an instance field will not be generated for that attribute.
The other case where the type of the field is relevant to @dataclass
is when declaring init-only variables, our next topic.
Sometimes you may need to pass arguments to __init__
that are not instance fields.
Such arguments are called init-only variables by the dataclasses
documentation.
To declare an argument like that, the dataclasses module provides the pseudo-type InitVar, which uses the same syntax as typing.ClassVar.
The example given in the documentation is a data class that has a field initialized from a database,
and the database object must be passed to the constructor.
This is the code that illustrates the Init-only variables section:
Example 5-18. Example from the Init-only variables section of the dataclasses module documentation.

@dataclass
class C:
    i: int
    j: int = None
    database: InitVar[DatabaseType] = None

    def __post_init__(self, database):
        if self.j is None and database is not None:
            self.j = database.lookup('j')

c = C(10, database=my_database)
Note how the database
attribute is declared. InitVar
will prevent @dataclass
from treating database
as a regular field.
It will not be set as an instance attribute, and the dataclasses.fields
function will not list it.
However, database
will be one of the arguments that the generated __init__
will accept,
and it will be also passed to __post_init__
—if you write that method,
you must add a corresponding argument to the method signature, as shown in Example 5-18.
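The docs snippet above is not runnable as-is because DatabaseType and my_database are not defined. This self-contained sketch substitutes a hypothetical FakeDB class so the InitVar behavior can be observed directly:

```python
from dataclasses import dataclass, fields, InitVar

class FakeDB:  # hypothetical stand-in for the docs' DatabaseType
    def lookup(self, key):
        return 42

@dataclass
class C:
    i: int
    j: int = None                       # mirrors the docs snippet
    database: InitVar[FakeDB] = None    # init-only: not a regular field

    def __post_init__(self, database):  # receives the init-only argument
        if self.j is None and database is not None:
            self.j = database.lookup('j')

c = C(10, database=FakeDB())
print(c.j)                            # 42, computed in __post_init__
print([f.name for f in fields(c)])    # ['i', 'j'] — database is not listed
print('database' in vars(c))          # False: never set as an instance attribute
```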
This rather long overview of @dataclass
covered the most useful features—some of them appeared in previous sections, like “Main features” where we covered all three data class builders in parallel. The dataclasses
documentation and PEP 526 — Syntax for Variable Annotations have all details.
Often, classes built with @dataclass
will have more fields than the very short examples presented so far.
Dublin Core provides the foundation for a more typical @dataclass
example.
The Dublin Core Schema is a small set of vocabulary terms that can be used to describe digital resources (video, images, web pages, etc.), as well as physical resources such as books or CDs, and objects like artworks.
Dublin Core on Wikipedia
The standard defines 15 optional fields; the Resource class in Example 5-19 uses 8 of them.
Example 5-19. dataclass/resource.py: code for Resource, a class based on Dublin Core terms.

from dataclasses import dataclass, field
from typing import List, Optional
from enum import Enum, auto
from datetime import date

class ResourceType(Enum):
    BOOK = auto()
    EBOOK = auto()
    VIDEO = auto()

@dataclass
class Resource:
    """Media resource description."""
    identifier: str
    title: str = '<untitled>'
    creators: List[str] = field(default_factory=list)
    date: Optional[date] = None
    type: ResourceType = ResourceType.BOOK
    description: str = ''
    language: str = ''
    subjects: List[str] = field(default_factory=list)
This Enum will provide type-safe values for the Resource.type field.
identifier is the only required field.
title is the first field with a default. This forces all fields below to provide defaults.
The value of date can be a datetime.date instance, or None.
The type field default is ResourceType.BOOK.
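The field-ordering rule mentioned above is enforced at class creation time: once a field has a default, declaring a field without one after it makes @dataclass raise TypeError. A minimal sketch, with a hypothetical Broken class:

```python
from dataclasses import dataclass

error = None
try:
    @dataclass
    class Broken:
        title: str = '<untitled>'
        identifier: str  # no default after a defaulted field: rejected
except TypeError as exc:
    error = exc

print(error)  # message names the offending non-default argument
```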
Example 5-20 is a doctest to demonstrate how a Resource
record appears in code:
Example 5-20. dataclass/resource.py: doctest for Resource.

>>> description = 'Improving the design of existing code'
>>> book = Resource('978-0-13-475759-9', 'Refactoring, 2nd Edition',
...     ['Martin Fowler', 'Kent Beck'], date(2018, 11, 19),
...     ResourceType.BOOK, description, 'EN',
...     ['computer programming', 'OOP'])
>>> book  # doctest: +NORMALIZE_WHITESPACE
Resource(identifier='978-0-13-475759-9', title='Refactoring, 2nd Edition',
creators=['Martin Fowler', 'Kent Beck'], date=datetime.date(2018, 11, 19),
type=<ResourceType.BOOK: 1>, description='Improving the design of existing code',
language='EN', subjects=['computer programming', 'OOP'])
The __repr__
generated by @dataclass
is OK, but we can do better.
This is the format we want from repr(book)
:
>>> book  # doctest: +NORMALIZE_WHITESPACE
Resource(
    identifier = '978-0-13-475759-9',
    title = 'Refactoring, 2nd Edition',
    creators = ['Martin Fowler', 'Kent Beck'],
    date = datetime.date(2018, 11, 19),
    type = <ResourceType.BOOK: 1>,
    description = 'Improving the design of existing code',
    language = 'EN',
    subjects = ['computer programming', 'OOP'],
)
Example 5-21 is the code of __repr__
to produce the format above.
This example uses dataclasses.fields to get the names of the data class fields.
Example 5-21. dataclass/resource_repr.py: code for the __repr__ method implemented in the Resource class from Example 5-19.

    def __repr__(self):
        cls = self.__class__
        cls_name = cls.__name__
        indent = ' ' * 4
        res = [f'{cls_name}(']
        for f in fields(cls):
            value = getattr(self, f.name)
            res.append(f'{indent}{f.name} = {value!r},')
        res.append(')')
        return '\n'.join(res)
Start the res list to build the output string with the class name and open parenthesis.
For each field f in the class…
Get the named attribute from the instance.
Append an indented line with the name of the field and repr(value)—that's what the !r does.
Append the closing parenthesis.
Build a multiline string from res and return it.
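Pulled together into a self-contained sketch, the same technique works on any dataclass. Here it is applied to a tiny hypothetical Point class instead of Resource:

```python
from dataclasses import dataclass, fields

@dataclass
class Point:
    x: int = 0
    y: int = 0

    def __repr__(self):
        # One indented "name = value," line per field, one field per line.
        cls = self.__class__
        indent = ' ' * 4
        res = [f'{cls.__name__}(']
        for f in fields(cls):
            res.append(f'{indent}{f.name} = {getattr(self, f.name)!r},')
        res.append(')')
        return '\n'.join(res)

print(Point(1, 2))
# Point(
#     x = 1,
#     y = 2,
# )
```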
With this example inspired by the soul of Dublin, Ohio, we conclude our tour of Python’s data class builders.
Data classes are handy, but your project may suffer if you overuse them. The next section explains.
Whether you implement a data class writing all the code yourself or leveraging one of the class builders described in this chapter, be aware that it may signal a problem in your design.
In Refactoring, Second Edition, Martin Fowler and Kent Beck present a catalog of “code smells”—patterns in code that may indicate the need for refactoring. The entry titled Data Class starts like this:
These are classes that have fields, getting and setting methods for fields, and nothing else. Such classes are dumb data holders and are often being manipulated in far too much detail by other classes.
In Fowler’s personal Web site there’s an illuminating post explaining what is a “code smell”. That post is very relevant to our discussion because he uses data class as one example of a code smell and suggests how to deal with it. Here is the post, reproduced in full.8
The main idea of Object Oriented Programming is to place behavior and data together in the same code unit: a class. If a class is widely used but has no significant behavior of its own, it’s possible that code dealing with its instances is scattered (and even duplicated) in methods and functions throughout the system—a recipe for maintenance headaches. That’s why Fowler’s refactorings to deal with a data class involve bringing responsibilities back into it.
Taking that into account, there are a couple of common scenarios where it makes sense to have a data class with little or no behavior.
In this scenario, the data class is an initial, simplistic implementation of a class to jump start a new project or module. With time, the class should get its own methods, instead of relying on methods of other classes to operate on its instances. Scaffolding is temporary; eventually your custom class may become fully independent from the builder you used to start it.
Python is also used for quick problem solving and experimentation, and then it's OK to leave the scaffolding in place.
A data class can be useful to build records about to be exported to JSON or some other interchange format, or to hold data that was just imported, crossing some system boundary. Python’s data class builders all provide a method or function to convert an instance to a plain dict
, and you can always invoke the constructor with a dict
used as keyword arguments expanded with **
. Such a dict
is very close to a JSON record.
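For classes built with @dataclass, that conversion function is dataclasses.asdict. This sketch, using a hypothetical Coordinate record, shows the dict round trip described above:

```python
from dataclasses import dataclass, asdict

@dataclass
class Coordinate:  # hypothetical minimal record
    lat: float
    lon: float

moscow = Coordinate(55.76, 37.62)
d = asdict(moscow)        # {'lat': 55.76, 'lon': 37.62}: nearly a JSON record
clone = Coordinate(**d)   # the dict expands back into the constructor
print(moscow == clone)    # True
```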
In this scenario, the data class instances should be handled as immutable objects—even if the fields are mutable, you should not change them while they are in this intermediate form. If you do, you’re losing the key benefit of having data and behavior close together. When importing/exporting requires changing values, you should implement your own builder methods instead of using the given “as dict” methods or standard constructors.
After reviewing Python’s data class builders, we’ll end the chapter with the struct
module, also used for importing/exporting records, but at a much lower level.
The struct
module provides functions to parse fields of bytes into a tuple of Python objects,
and to perform the opposite conversion, from a tuple into packed bytes.
struct can be used with bytes, bytearray, and memoryview objects.
Suppose you need to read a binary file containing data about metropolitan areas, produced by a program in C with a record defined as in Example 5-22.
struct MetroArea {
    int year;
    char name[12];
    char country[2];
    float population;
};
Here is how to read one record in that format, using struct.unpack
:
>>> from struct import unpack
>>> FORMAT = 'i12s2sf'
>>> data = open('metro_areas.bin', 'rb').read(24)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)
Note how unpack
returns a tuple with four fields, as specified by the FORMAT
string.
The letters and numbers in FORMAT
are Format Characters described in the struct
module documentation.
Table 5-4 explains the elements of the format string from Example 5-23.
part | size | C type | Python type | limits to actual content |
---|---|---|---|---|
i | 4 bytes | int | int | 32 bits; range -2,147,483,648 to 2,147,483,647 |
12s | 12 bytes | char[12] | bytes | length = 12 |
2s | 2 bytes | char[2] | bytes | length = 2 |
f | 4 bytes | float | float | 32 bits; approximate range ±3.4×10^38 |
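The same format string drives the opposite conversion: struct.pack builds a binary record from a tuple of values. A minimal round trip, sketched under the assumption of native alignment (which pads 2 bytes before the float, so calcsize usually reports 24, matching the read(24) in the example above):

```python
from struct import pack, unpack, calcsize

FORMAT = 'i12s2sf'
print(calcsize(FORMAT))  # usually 24: 4 + 12 + 2 + 2 bytes of padding + 4

# pack pads b'Tokyo' with NUL bytes to fill the 12-byte name field
record = pack(FORMAT, 2018, b'Tokyo', b'JP', 43_868_228.0)
year, name, country, pop = unpack(FORMAT, record)
print(year, name, country, pop)
```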
One detail about the layout of metro_areas.bin is not clear from the code in Example 5-22: size is not the only difference between the name and country fields. The country field always holds a 2-letter country code, but name is a null-terminated sequence of up to 12 bytes, including the terminating b'\x00', which you can see in Example 5-23 right after the word Tokyo.9
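Because whatever follows the terminating NUL is garbage, a field like name must be cleaned before use. A hypothetical helper for that, keeping only the bytes before the first NUL:

```python
def decode_cstr(raw: bytes) -> str:
    """Decode a fixed-width, null-terminated C string field."""
    return raw.split(b'\x00', 1)[0].decode('ascii')

print(decode_cstr(b'Tokyo\x00\xc5\x05\x01\x00\x00\x00'))  # Tokyo
print(decode_cstr(b'JP'))                                 # JP (no NUL is fine too)
```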
Now let’s review a script to extract all records from metro_areas.bin
and produce a simple report like this:
$ python3 metro_read.py
2018  Tokyo, JP       43,868,228
2015  Shanghai, CN    38,700,000
2015  Jakarta, ID     31,689,592
Example 5-24 showcases the handy struct.iter_unpack
function.