Data classes are like children. They are okay as a starting point, but to participate as a grownup object, they need to take some responsibility.1
Martin Fowler and Kent Beck
Python offers a few ways to build a simple class that is just a bunch of fields, with little or no extra functionality. That pattern is known as a “data class”—and dataclass
is the name of a Python decorator that supports it. This chapter covers three different class builders that you may use as shortcuts to write data classes:
collections.namedtuple
: the simplest way—since Python 2.6;
typing.NamedTuple
: an alternative that allows type annotations on the fields—since Python 3.5; class
syntax supported since 3.6;
@dataclasses.dataclass
: a class decorator that allows more customization than previous alternatives, adding lots of options and potential complexity—since Python 3.7.
After covering those class builders, we will discuss why Data Class is also the name of a code smell: a coding pattern that may be a symptom of poor object-oriented design.
The chapter ends with a section on a very different topic, but still closely related to record-like data: the struct
module, designed to parse and build packed binary records that you may find in legacy flat-file databases, network protocols, and file headers.
typing.TypedDict
may seem like another data class builder—it’s
described right after typing.NamedTuple
in the typing
module documentation,
and uses similar syntax.
However, TypedDict
does not build concrete classes that you can instantiate.
It’s just a way to write static annotations for function parameters and variables,
so we’ll study them in Chapter 8, “TypedDict”.
This chapter is new in Fluent Python 2nd edition. The sections “Classic Named Tuples” and “Structs and Memory Views” appeared in chapters 2 and 4 in the 1st edition, but the rest of the chapter is completely new.
We begin with a high-level overview of the three class builders.
Consider a simple class to represent a geographic coordinate pair:
class/coordinates.py
class Coordinate:

    def __init__(self, lat, long):
        self.lat = lat
        self.long = long
That Coordinate
class does the job of holding latitude and longitude attributes.
Writing the __init__
boilerplate becomes old real fast,
especially if your class has more than a couple of attributes:
each of them is mentioned three times!
And that boilerplate doesn’t buy us basic features we’d expect from a Python object:
>>> from coordinates import Coordinate
>>> moscow = Coordinate(55.76, 37.62)
>>> moscow
<coordinates.Coordinate object at 0x107142f10>
>>> location = Coordinate(55.76, 37.62)
>>> location == moscow
False
>>> (location.lat, location.long) == (moscow.lat, moscow.long)
True
__repr__
inherited from object
is not very helpful.
Meaningless equality; the __eq__
method inherited from object
compares object ids.
Comparing two coordinates requires explicit comparison of each attribute.
The data class builders covered in this chapter provide the necessary __init__
, __repr__
, and __eq__
methods automatically, as well as other useful features.
None of the class builders discussed here depend on inheritance to do their work.
Both collections.namedtuple
and typing.NamedTuple
build classes that are tuple
subclasses.
@dataclass
is a class decorator that does not affect the class hierarchy in any way.
Each of them uses different metaprogramming techniques to inject methods and data attributes
into the class under construction.
Here is a Coordinate
class built with namedtuple
—a factory function that builds a subclass of tuple
with the name and fields you specify:
>>> from collections import namedtuple
>>> Coordinate = namedtuple('Coordinate', 'lat long')
>>> issubclass(Coordinate, tuple)
True
>>> moscow = Coordinate(55.756, 37.617)
>>> moscow
Coordinate(lat=55.756, long=37.617)
>>> moscow == Coordinate(lat=55.756, long=37.617)
True
The newer typing.NamedTuple
provides the same functionality, adding a type annotation to each field:
>>> import typing
>>> Coordinate = typing.NamedTuple('Coordinate', [('lat', float), ('long', float)])
>>> issubclass(Coordinate, tuple)
True
>>> Coordinate.__annotations__
{'lat': <class 'float'>, 'long': <class 'float'>}
A typed named tuple can also be constructed with the fields given as keyword arguments, like this:
Coordinate = typing.NamedTuple('Coordinate', lat=float, long=float)
This is more readable, and also lets you provide the mapping of fields and types as **fields_and_types
.
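For example, a mapping of fields to types can be built first and then unpacked into the call. This is a small sketch with a hypothetical fields_and_types dict; note that recent Python versions deprecate this keyword form of typing.NamedTuple, so it may emit a warning:

```python
import typing

# Sketch: field names and types held in a dict, unpacked as
# keyword arguments; 'fields_and_types' is a hypothetical name.
fields_and_types = {'lat': float, 'long': float}
Coordinate = typing.NamedTuple('Coordinate', **fields_and_types)

print(Coordinate._fields)  # ('lat', 'long')
```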
Since Python 3.6, typing.NamedTuple
can also be used in a class
statement,
with type annotations written as described in PEP 526—Syntax for Variable Annotations.
This is much more readable, and makes it easy to override methods or add new ones.
Example 5-2 is the same Coordinate
class, with a pair of float
attributes
and a custom __str__
to display a coordinate formatted like 55.8°N, 37.6°E:
typing_namedtuple/coordinates.py
from typing import NamedTuple


class Coordinate(NamedTuple):
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'
Although NamedTuple
appears in the class
statement as a superclass, it’s actually not.
typing.NamedTuple
uses the advanced functionality of a
metaclass2
to customize the creation of the user’s class. Check this out:
>>> issubclass(Coordinate, typing.NamedTuple)
False
>>> issubclass(Coordinate, tuple)
True
In the __init__
method generated by typing.NamedTuple
, the fields appear as parameters in the same order they appear in the class
statement.
Like typing.NamedTuple
, the dataclass
decorator supports PEP 526 syntax to declare instance attributes.
The decorator reads the variable annotations and automatically generates methods for your class.
For comparison, check out the equivalent Coordinate
class written with the help of the dataclass
decorator:
dataclass/coordinates.py
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'
Note that the body of the classes in Example 5-2 and Example 5-3 are identical—the difference is in the class
statement itself.
The @dataclass
decorator does not depend on inheritance or a metaclass, so it should not interfere with your own use of these
mechanisms.3
The Coordinate
class in Example 5-3 is a subclass of object
.
The different data class builders have a lot in common. Here we’ll discuss the main features they share. Table 5-1 summarizes.
|  | namedtuple | NamedTuple | dataclass |
|---|---|---|---|
| mutable instances | NO | NO | YES |
| class statement syntax | NO | YES | YES |
| construct dict | x._asdict() | x._asdict() | dataclasses.asdict(x) |
| get field names | x._fields | x._fields | [f.name for f in dataclasses.fields(x)] |
| get defaults | x._field_defaults | x._field_defaults | [f.default for f in dataclasses.fields(x)] |
| get field types | N/A | x.__annotations__ | x.__annotations__ |
| new instance with changes | x._replace(…) | x._replace(…) | dataclasses.replace(x, …) |
| new class at runtime | namedtuple(…) | NamedTuple(…) | dataclasses.make_dataclass(…) |
A key difference between these class builders is that collections.namedtuple
and typing.NamedTuple
build tuple
subclasses, therefore the instances are immutable. By default, @dataclass
produces mutable classes. But the decorator accepts several keyword arguments to configure the class, including frozen
—shown in Example 5-3. When frozen=True
, the class will raise an exception if you try to assign values to the fields after the instance is initialized.
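A minimal sketch of that behavior, using a frozen Coordinate like the one in Example 5-3:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

c = Coordinate(55.76, 37.62)
try:
    c.lat = 0.0  # assignment on a frozen instance raises
except FrozenInstanceError as err:
    print(err)  # cannot assign to field 'lat'
```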
typing.NamedTuple
and dataclass
support the regular class
statement syntax, making it easier to add methods and docstrings to the class you are creating; collections.namedtuple
does not support that syntax.
Both named tuple variants provide an instance method (._asdict()) to construct a dict object from the fields in a data class instance. dataclass avoids injecting a similar method in the data class, but provides a module-level function to do it: dataclasses.asdict.
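Side by side, the two spellings look like this; CoordNT and CoordDC are hypothetical names used only for the sketch:

```python
from collections import namedtuple
from dataclasses import asdict, dataclass

# Named tuple variant: instance method
CoordNT = namedtuple('CoordNT', 'lat long')
print(CoordNT(55.76, 37.62)._asdict())  # {'lat': 55.76, 'long': 37.62}

# dataclass variant: module-level function
@dataclass
class CoordDC:
    lat: float
    long: float

print(asdict(CoordDC(55.76, 37.62)))    # {'lat': 55.76, 'long': 37.62}
```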
All three class builders let you get the field names and default values that may be configured for them.
In named tuple classes, that metadata is in the ._fields
and ._field_defaults
class attributes.
You can get the same metadata from a dataclass
decorated class using the fields
function from the dataclasses
module.
It returns a tuple of Field
objects which have several attributes, including name
and default
.
A mapping of field names to type annotations is stored in the __annotations__
class attribute in classes defined with the help of typing.NamedTuple
and dataclass
.
Given a named tuple instance x
, the call x._replace(**kwargs)
returns a new instance with some attribute values replaced according to the keyword arguments given. The dataclasses.replace(x, **kwargs)
module-level function does the same for an instance of a dataclass
decorated class.
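A brief sketch of both spellings, again with hypothetical CoordNT and CoordDC classes:

```python
from collections import namedtuple
from dataclasses import dataclass, replace

CoordNT = namedtuple('CoordNT', 'lat long')
nt = CoordNT(55.76, 37.62)
print(nt._replace(lat=0.0))   # CoordNT(lat=0.0, long=37.62)

@dataclass
class CoordDC:
    lat: float
    long: float

dc = CoordDC(55.76, 37.62)
print(replace(dc, lat=0.0))   # CoordDC(lat=0.0, long=37.62)
```

In both cases the original instance is left untouched; a new instance is returned.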
Although the class
statement syntax is more readable, it is hard-coded. A framework may need to build data classes on the fly, at runtime. For that, you can use the default function call syntax of collections.namedtuple
, which is likewise supported by typing.NamedTuple
. The dataclasses
module provides a make_dataclass
function for the same purpose.
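For instance, a runtime-built equivalent of the Coordinate examples might look like this sketch:

```python
from dataclasses import make_dataclass

# Build a data class at runtime, by analogy with namedtuple(...):
Coordinate = make_dataclass('Coordinate', [('lat', float), ('long', float)])
print(Coordinate(55.76, 37.62))  # Coordinate(lat=55.76, long=37.62)
```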
After this overview of the main features of the data class builders, let’s focus on each of them in turn, starting with the simplest.
The collections.namedtuple
function is a factory that builds subclasses of tuple
enhanced with field names, a class name, and a nice __repr__
—which helps debugging.
Classes built with namedtuple
can be used anywhere where tuples
are needed, and in fact many functions of the Python standard library that used to return tuples now
return named tuples for convenience, without affecting the user’s code at all.
Each instance of a class built by namedtuple
takes exactly the same amount of memory as a tuple because the field names are stored in the class. They use less memory than a regular object because they don’t store attributes as key-value pairs in one __dict__
for each instance.
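You can verify the absence of a per-instance __dict__ directly; this sketch relies on the fact that the generated class sets __slots__ to an empty tuple:

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')
p = Point(1, 2)
# The generated class defines __slots__ = (), so instances
# carry no per-instance __dict__ of attributes:
print(hasattr(p, '__dict__'))  # False
```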
Example 5-4 shows how we could define a named tuple to hold information about a city.
>>> from collections import namedtuple
>>> City = namedtuple('City', 'name country population coordinates')
>>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
>>> tokyo
City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722,
139.691667))
>>> tokyo.population
36.933
>>> tokyo.coordinates
(35.689722, 139.691667)
>>> tokyo[1]
'JP'
Two parameters are required to create a named tuple: a class name and a list of field names, which can be given as an iterable of strings or as a single space-delimited string.
Field values must be passed as separate positional arguments to the constructor (in contrast, the tuple
constructor takes a single iterable).
You can access the fields by name or position.
As a tuple
subclass, City
inherits useful methods such as __eq__
and
the special methods for comparison operators (__gt__
, __ge__
, etc.) which are useful for sorting
sequences of City
.
In addition to the methods inherited from tuple
, a named tuple offers a few attributes and methods.
Example 5-5 shows the most useful: the _fields
class attribute, the class method _make(iterable)
, and the _asdict()
instance method.
>>> City._fields
('name', 'country', 'population', 'coordinates')
>>> Coordinate = namedtuple('Coordinate', 'lat long')
>>> delhi_data = ('Delhi NCR', 'IN', 21.935, Coordinate(28.613889, 77.208889))
>>> delhi = City._make(delhi_data)
>>> delhi._asdict()
{'name': 'Delhi NCR', 'country': 'IN', 'population': 21.935,
'coordinates': Coordinate(lat=28.613889, long=77.208889)}
>>> import json
>>> json.dumps(delhi._asdict())
'{"name": "Delhi NCR", "country": "IN", "population": 21.935,
"coordinates": [28.613889, 77.208889]}'
._fields
is a tuple with the field names of the class.
._make()
builds City
from an iterable; City(*delhi_data)
would do the same.
._asdict()
returns a dict
built from the named tuple instance.
._asdict()
is useful to serialize the data in JSON format, for example.
The _asdict
method returned an OrderedDict
in Python 2.7, and in Python 3.1 to 3.7.
Since Python 3.8, a regular dict
is returned—which is probably fine now that we can rely on key insertion order.
If you must have an OrderedDict
, the
_asdict
documentation
recommends building one from the result: OrderedDict(x._asdict())
.
Since Python 3.7, namedtuple
accepts the defaults
keyword-only argument providing
an iterable of N default values for each of the N rightmost fields of the class.
Example 5-6 shows how to define a Coordinate
named tuple with a default value for a reference
field:
>>> Coordinate = namedtuple('Coordinate', 'lat long reference', defaults=['WGS84'])
>>> Coordinate(0, 0)
Coordinate(lat=0, long=0, reference='WGS84')
>>> Coordinate._field_defaults
{'reference': 'WGS84'}
In “Class statement syntax” I mentioned it’s easier to code methods with the
class syntax supported by typing.NamedTuple
and @dataclass
.
You can also add methods to a namedtuple
, but it’s a hack.
Skip the following box if you’re not interested in hacks.
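For the curious, one common version of that hack (a sketch, not necessarily the exact one in the box) is to subclass the class generated by namedtuple:

```python
from collections import namedtuple

# Subclass the generated named tuple to attach a method.
_Coordinate = namedtuple('Coordinate', 'lat long')

class Coordinate(_Coordinate):
    __slots__ = ()  # keep instances free of a per-instance __dict__

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'

print(Coordinate(55.76, 37.62))  # 55.8°N, 37.6°E
```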
Now let’s check out the typing.NamedTuple
variation.
The Coordinate
class with a default field from Example 5-6 can be written like this using typing.NamedTuple
:
typing_namedtuple/coordinates2.py
from typing import NamedTuple


class Coordinate(NamedTuple):
    lat: float
    long: float
    reference: str = 'WGS84'
Classes built by typing.NamedTuple
don’t have any methods beyond those that collections.namedtuple
also generates—and those that are inherited from tuple
.
The only difference at runtime is the presence of the __annotations__ class attribute—which Python completely ignores at runtime.
Classes built with typing.NamedTuple
also have a _field_types
attribute. Since Python 3.8, that attribute is deprecated in favor of __annotations__
which has the same information and is the canonical place to find type hints in Python objects that have them.
Given that the main feature of typing.NamedTuple
is the type annotations, we’ll take a brief look at them before resuming our exploration of data class builders.
Type hints—a.k.a. type annotations—are ways to declare the expected type of function arguments, return values, and variables.
This is a very brief introduction to type hints,
just enough to make sense of the syntax and meaning of the annotations used in typing.NamedTuple
and @dataclass
declarations.
We will cover type hints for function signatures in Chapter 8
and more advanced annotations like generics, unions etc. in [Link to Come].
Here we’ll mostly see hints with built-in types, such as str
, int
, and float
,
which are probably the most common types used to annotate fields of data classes.
The first thing you need to know about type hints is that they are not enforced at all by the Python bytecode compiler and runtime interpreter.
Type annotations don’t have any impact on the runtime behavior of Python programs. Check this out:
>>> import typing
>>> class Coordinate(typing.NamedTuple):
...     lat: float
...     long: float
...
>>> trash = Coordinate('Ni!', None)
>>> trash
Coordinate(lat='Ni!', long=None)
If you type the code of Example 5-9 in a Python module,
replacing the last line with print(trash)
,
it will happily run and display a meaningless Coordinate
, with no error or warning:
$ python3 nocheck_demo.py
Coordinate(lat='Ni!', long=None)
The type hints are intended primarily to support third-party type checkers, like Mypy or the PyCharm IDE built-in type checker. These are static analysis tools: they check Python source code “at rest”, not running code.
To see the effect of type hints, you must run one of those tools on your code—like a linter. For instance, here is what Mypy has to say about the previous example:
$ mypy nocheck_demo.py
nocheck_demo.py:8: error: Argument 1 to "Coordinate" has incompatible type
"str"; expected "float"
nocheck_demo.py:8: error: Argument 2 to "Coordinate" has incompatible type
"None"; expected "float"
As you can see, given the definition of Coordinate
,
Mypy knows that both arguments to create an instance must be of type float
,
but the assignment to trash
uses a str
and None
.5
Now let’s talk about the syntax and meaning of type hints.
Both typing.NamedTuple
and @dataclass
use the syntax of variable annotations defined in PEP 526.
This is a quick introduction to that syntax, in the context of defining attributes in class
statements.
The basic syntax of variable annotation is:
var_name: some_type
The “Acceptable type hints” section of PEP 484 explains what types are acceptable, but in the context of defining a concrete data class, these types are more useful:
a concrete class, for example str
or FrenchDeck
;
a parameterized collection type, like List[int]
, Tuple[str, float]
etc.
typing.Optional
, for example Optional[str]
;
Although possible in theory, it’s not very useful to define fields with abstractions such as:
special type constructs like Any
, Union
, Protocol
etc. from the typing
module;
an ABC (Abstract Base Class);
You can also initialize the variable with a value. In a typing.NamedTuple
or @dataclass
declaration,
that value will become the default for that attribute,
if the corresponding argument is omitted in the constructor call.
var_name: some_type = a_value
We saw in “No runtime effect” that type hints have no effect at runtime.
But at import time—when a module is loaded—Python does read them to build the __annotations__
dictionary that typing.NamedTuple
and @dataclass
then use to enhance the class.
We’ll start this exploration with a simple class,
so that we can later see what extra features are added by typing.NamedTuple
and @dataclass
.
class DemoPlainClass:
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation, but is otherwise discarded.
b
is saved as an annotation, and also becomes a class attribute with value 1.1
.
c
is just a plain old class attribute, not an annotation.
We can verify that in the console, first reading the __annotations__
of the DemoPlainClass
, then trying to get its attributes named a
, b
, and c
:
>>> from demo_plain import DemoPlainClass
>>> DemoPlainClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoPlainClass.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'DemoPlainClass' has no attribute 'a'
>>> DemoPlainClass.b
1.1
>>> DemoPlainClass.c
'spam'
Note that the __annotations__
special attribute is created by the interpreter to record the type hints that appear in the source code—even in a plain class.
The a
survives only as an annotation. It doesn’t become a class attribute because no value is bound to it.6
The b
and c
are stored as class attributes because they are bound to values.
None of those three attributes will be in a new instance of DemoPlainClass
.
If you create an object o = DemoPlainClass()
, o.a
will raise AttributeError
, while o.b
and o.c
will retrieve the class attributes with values 1.1
and 'spam'
—that’s just normal Python object behavior.
typing.NamedTuple
Now let’s examine a class built with typing.NamedTuple
, using the same attributes and annotations as DemoPlainClass
from Example 5-10.
demo_nt.py: a class built with typing.NamedTuple

import typing


class DemoNTClass(typing.NamedTuple):
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation and also an instance attribute.
b
is another annotation, and also becomes an instance attribute with default value 1.1
.
c
is just a plain old class attribute; no annotation will refer to it.
Inspecting the DemoNTClass
, we get:
>>> from demo_nt import DemoNTClass
>>> DemoNTClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoNTClass.a
<_collections._tuplegetter object at 0x101f0f940>
>>> DemoNTClass.b
<_collections._tuplegetter object at 0x101f0f8b0>
>>> DemoNTClass.c
'spam'
Here we see the same annotations for a
and b
as we saw in Example 5-10.
But DemoNTClass
has three class attributes a
, b
, and c
. The c
attribute is just a plain class attribute with the value 'spam'
.
The a
and b
class attributes are actually descriptors—an advanced feature covered in [Link to Come].
For now, think of them as similar to property getters: methods that don’t require the explicit call operator ()
to retrieve an instance attribute. In practice, this means a
and b
will work as read-only instance attributes—which makes sense when we recall that DemoNTClass
instances are just fancy tuples, and tuples are immutable.
DemoNTClass
also gets a custom docstring:
>>> DemoNTClass.__doc__
'DemoNTClass(a, b)'
Let’s inspect an instance of DemoNTClass
:
>>> nt = DemoNTClass(8)
>>> nt.a
8
>>> nt.b
1.1
>>> nt.c
'spam'
To construct nt
, we need to give at least the a
argument to DemoNTClass
. The constructor also takes a b
argument, but it has a default value of 1.1
, so it’s optional. The nt
object has the a
and b
attributes as expected; it doesn’t have a c
attribute, but Python retrieves it from the class, as usual.
If you try to assign values to nt.a
, nt.b
, nt.c
or even nt.z
you’ll get AttributeError
, with subtly different error messages. Try that and reflect on the messages.
dataclass
Now we’ll examine Example 5-12:
demo_dc.py: a class decorated with @dataclass

from dataclasses import dataclass


@dataclass
class DemoDataClass:
    a: int
    b: float = 1.1
    c = 'spam'
a
becomes an annotation and also an instance attribute.
b
is another annotation, and also becomes an instance attribute with default value 1.1
.
c
is just a plain old class attribute; no annotation will refer to it.
Now let’s check out __annotations__
, __doc__
, and the a
, b
, c
attributes on DemoDataClass
:
>>> from demo_dc import DemoDataClass
>>> DemoDataClass.__annotations__
{'a': <class 'int'>, 'b': <class 'float'>}
>>> DemoDataClass.__doc__
'DemoDataClass(a: int, b: float = 1.1)'
>>> DemoDataClass.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'DemoDataClass' has no attribute 'a'
>>> DemoDataClass.b
1.1
>>> DemoDataClass.c
'spam'
The __annotations__
and __doc__
are not surprising.
However, there is no attribute named a
in DemoDataClass
—in contrast with DemoNTClass
from Example 5-11,
which has a descriptor to get a
from the instances as read-only attributes (that mysterious <_collections._tuplegetter>
).
That’s because the a
attribute will only exist in instances of DemoDataClass
.
It will be a public attribute that we can get and set, unless the class is frozen.
But b
and c
exist as class attributes, with b
holding the default value for the b
instance attribute,
while c
is just a class attribute that will not be bound to the instances.
Now let’s see what a DemoDataClass instance looks like:
>>> dc = DemoDataClass(9)
>>> dc.a
9
>>> dc.b
1.1
>>> dc.c
'spam'
Again, a
and b
are instance attributes, and c
is a class attribute we get via the instance.
As mentioned, DemoDataClass
instances are mutable—and no type checking is done at runtime:
>>> dc.a = 10
>>> dc.b = 'oops'
We can do even sillier assignments:
>>> dc.c = 'whatever'
>>> dc.z = 'secret stash'
Now the dc
instance has a c
attribute—but that does not change the c
class attribute. And we can add a new z
attribute.
This is normal Python behavior: regular instances can have their own attributes that don’t appear in the class.7
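You can confirm that split with vars(), which shows only the per-instance attributes; this sketch reuses the DemoDataClass attributes from Example 5-12:

```python
from dataclasses import dataclass

@dataclass
class DemoDataClass:
    a: int
    b: float = 1.1
    c = 'spam'

dc = DemoDataClass(9)
dc.c = 'whatever'
dc.z = 'secret stash'
print(vars(dc))         # {'a': 9, 'b': 1.1, 'c': 'whatever', 'z': 'secret stash'}
print(DemoDataClass.c)  # 'spam'; the class attribute is untouched
```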
@dataclass
We’ve only seen simple examples of @dataclass
use so far. The decorator accepts several arguments. This is its signature:
@dataclass(*, init=True, repr=True, eq=True, order=False,
           unsafe_hash=False, frozen=False)
The *
in the first position means the remaining parameters are keyword-only. Table 5-2 describes them.
| option | default | meaning | notes |
|---|---|---|---|
| init | True | generate __init__ | Ignored if __init__ is implemented by user. |
| repr | True | generate __repr__ | Ignored if __repr__ is implemented by user. |
| eq | True | generate __eq__ | Ignored if __eq__ is implemented by user. |
| order | False | generate __lt__, __le__, __gt__, __ge__ | Raises exceptions if eq=False, or if any of the comparison methods that would be generated are defined or inherited. |
| unsafe_hash | False | generate __hash__ | Complex semantics and several caveats—see: dataclass documentation. |
| frozen | False | make instances “immutable” | Instances will be reasonably safe from accidental change, but not really immutable. |
The defaults are really the most useful settings for common use cases. The options you are more likely to change from the defaults are:
order=True
: to allow sorting of instances of the data class;
frozen=True
: to protect against accidental changes to the class instances.
Given the dynamic nature of Python objects, it’s not too hard for a nosy programmer to go around the protection afforded by frozen=True
. But the necessary tricks should be easy to spot on a code review.
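For instance, one such trick (shown only to illustrate why frozen instances are not truly immutable) bypasses the generated __setattr__:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

c = Coordinate(55.76, 37.62)
# Going behind the back of the frozen __setattr__; easy to spot in review:
object.__setattr__(c, 'lat', 0.0)
print(c.lat)  # 0.0
```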
If eq
and frozen
are both true, @dataclass
will produce a suitable __hash__
method, so the instances will be hashable.
The generated __hash__
will use data from all fields that are not individually excluded using a field option we’ll see in “Key-sharing dictionary”.
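A quick sketch of the practical effect: with frozen=True and the default eq=True, instances can serve as dict keys or set members:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float

# frozen=True plus the default eq=True yields a usable __hash__:
places = {Coordinate(55.76, 37.62): 'Moscow'}
print(places[Coordinate(55.76, 37.62)])  # Moscow
```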
If frozen=False
(the default), @dataclass
will set __hash__
to None
, signalling that the instances are unhashable, therefore overriding __hash__
from any superclass.
Regarding unsafe_hash
, PEP 557 has this to say:
Although not recommended, you can force Data Classes to create a
__hash__
method withunsafe_hash=True
. This might be the case if your class is logically immutable but can nonetheless be mutated. This is a specialized use case and should be considered carefully.
I will leave unsafe_hash
at that. If you feel you must use that option, check the dataclasses.dataclass
documentation.
Further customization of the generated data class can be done at a field level.
We’ve already seen the most basic field option: providing (or not) a default value with the type hint.
Note that fields are read in order, and after you declare a field with a default value,
all remaining fields must also have default values.
This limitation makes sense: the fields will become parameters in the generated __init__
,
and Python does not allow non-default parameters following parameters with defaults.
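The same restriction can be seen in a plain def statement; in this sketch the source is compiled dynamically only so the SyntaxError can be caught:

```python
# A non-default parameter after a defaulted one is rejected at compile time:
try:
    compile("def f(a=1, b): pass", "<demo>", "exec")
except SyntaxError:
    print("non-default parameter after a default is a SyntaxError")
```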
Mutable default values are a common source of bugs for beginning Python developers.
In function definitions, a mutable default value is easily corrupted when one invocation of the function mutates the default,
changing the behavior of further invocations—an issue we’ll explore in “Mutable Types as Parameter Defaults: Bad Idea” (Chapter 6).
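The classic demonstration of that bug, ahead of the fuller discussion in Chapter 6, is a function with a list default:

```python
def append_to(item, target=[]):  # the default list is created only once
    target.append(item)
    return target

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2]; the same list, mutated across calls
```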
Class attributes are often used as default attribute values for instances, including in data classes.
And @dataclass
uses the default values in the type hints to generate parameters with defaults for __init__
.
To prevent bugs, @dataclass
rejects the class definition in Example 5-13.
dataclass/club_wrong.py: this class raises ValueError

@dataclass
class ClubMember:
    name: str
    guests: list = []
If you load the module with that ClubMember
class, this is what you get:
$ python3 club_wrong.py
Traceback (most recent call last):
  File "club_wrong.py", line 4, in <module>
    class ClubMember:
  ...several lines omitted...
ValueError: mutable default <class 'list'> for field guests is not allowed:
use default_factory
The ValueError
message explains the problem and suggests a solution: use default_factory
. This is how to correct ClubMember
:
dataclass/club.py: this ClubMember definition works

from dataclasses import dataclass, field


@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)
In the guests
field of Example 5-14, instead of a literal list, the default value is set by calling the dataclasses.field
function with default_factory=list
. The default_factory
parameter lets you provide a function, class, or any other callable, which will be invoked with zero arguments to build a default value each time an instance of the data class is created. This way, each instance of ClubMember
will have its own list
—instead of all instances sharing the same list
from the class, which is rarely what we want and is often a bug.
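A quick check of that behavior, reusing the corrected ClubMember definition from Example 5-14:

```python
from dataclasses import dataclass, field

@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)

ann = ClubMember('Ann')
ben = ClubMember('Ben')
ann.guests.append('Charlie')
print(ann.guests)  # ['Charlie']
print(ben.guests)  # []; each instance got its own list
```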
It’s good that @dataclass
rejects class definitions with a list
default value in a field.
However, be aware that it is a partial solution that only applies to list
, dict
and set
.
Other mutable values used as defaults will not be flagged by @dataclass
.
It’s up to you to understand the problem and remember to use a default factory to set mutable default values.
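For example, this sketch with a hypothetical mutable Bag class is accepted by @dataclass, yet every instance ends up sharing one Bag:

```python
from dataclasses import dataclass

class Bag:
    """Mutable, but not a list, dict or set, so @dataclass accepts it."""
    def __init__(self):
        self.items = []

@dataclass
class Member:
    name: str
    stuff: Bag = Bag()  # one Bag instance shared by every Member!

a = Member('Ann')
b = Member('Ben')
a.stuff.items.append('oops')
print(b.stuff.items)  # ['oops']; both instances share the same Bag
```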
If you browse the dataclasses
module documentation, you’ll see a list
field defined with a novel syntax, as in Example 5-15.
dataclass/club_generic.py: this ClubMember definition is more precise

from dataclasses import dataclass, field
from typing import List


@dataclass
class ClubMember:
    name: str
    guests: List[str] = field(default_factory=list)
The new syntax List[str]
is a generic type definition: the List
class from typing
accepts that bracket notation to specify the type of the list items. We’ll cover generics in [Link to Come]. For now, note that both Example 5-14 and Example 5-15 are correct, and the Mypy type checker does not complain about either of those class definitions. But the second one is more precise, and will allow the type checker to verify code that puts items in the list, or that read items from it.
The default_factory
is by far the most frequently used option of the field
function, but there are several others, listed in Table 5-3.
| option | default | meaning |
|---|---|---|
| default | _MISSING_TYPE | default value for the field |
| default_factory | _MISSING_TYPE | 0-parameter function used to produce a default |
| init | True | include field in parameters to __init__ |
| repr | True | include field in __repr__ |
| hash | None | use field to compute __hash__ |
| compare | True | use field in comparison methods __eq__, __lt__ etc. |
| metadata | None | mapping with user-defined data; ignored by the @dataclass machinery |
The default
option exists because the field
call takes the place of the default value in the field annotation.
If you want to create an athlete
field with default value of False
, and also ommit that field from the __repr__
method, you’d write this:
@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)
    athlete: bool = field(default=False, repr=False)
The __init__
method generated by @dataclass
only takes the arguments passed and assigns them—or their default values,
if missing—to the instance attributes that are instance fields.
But you may need to do more than that to initialize the instance.
If that’s the case, you can provide a __post_init__
method.
When that method exists, @dataclass
will add code to the generated __init__
to call __post_init__
as the last step.
Common use cases for __post_init__
are validation and computing field values based on other fields.
We’ll study a simple example that uses __post_init__
for both of these reasons.
First, let’s look at the expected behavior of a ClubMember
subclass named HackerClubMember
, as described by doctests in Example 5-16.
dataclass/hackerclub.py
: doctests for HackerClubMember
"""
``HackerClubMember`` objects accept an optional ``handle`` argument::
>>> anna = HackerClubMember('Anna Ravenscroft', handle='AnnaRaven')
>>> anna
HackerClubMember(name='Anna Ravenscroft', guests=[], handle='AnnaRaven')
If ``handle`` is omitted, it's set to the first part of the member's name::
>>> leo = HackerClubMember('Leo Rochael')
>>> leo
HackerClubMember(name='Leo Rochael', guests=[], handle='Leo')
Members must have a unique handle. The following ``leo2`` will not be created,
because its ``handle`` would be 'Leo', which was taken by ``leo``::
>>> leo2 = HackerClubMember('Leo DaVinci')
Traceback (most recent call last):
...
ValueError: handle 'Leo' already exists.
To fix, ``leo2`` must be created with an explicit ``handle``::
>>> leo2 = HackerClubMember('Leo DaVinci', handle='Neo')
>>> leo2
HackerClubMember(name='Leo DaVinci', guests=[], handle='Neo')
"""
Note that we must provide handle
as a keyword argument, because HackerClubMember
inherits name
and guests
from ClubMember
, and adds the handle
field. The generated docstring for HackerClubMember
shows the order of the fields in the constructor call:
>>> HackerClubMember.__doc__
"HackerClubMember(name: str, guests: list = <factory>, handle: str = '')"
Here, <factory>
is a short way of saying that some callable will produce the default value for guests
(in our case, the factory is the list
class).
The point is: to provide a handle
but no guests
, we must pass handle
as a keyword argument.
The Inheritance section of the dataclasses
module documentation explains how the order of the fields is computed when there are several levels of inheritance.
In [Link to Come] we’ll talk about misusing inheritance, particularly when the superclasses are not abstract.
Creating a hierarchy of data classes is usually a bad idea, but it served us well here to make Example 5-17 shorter,
focusing on the handle
field declaration and __post_init__
validation.
Example 5-17 is the implementation:
Example 5-17. dataclass/hackerclub.py: code for HackerClubMember.

from dataclasses import dataclass
from club import ClubMember

@dataclass
class HackerClubMember(ClubMember):
    all_handles = set()
    handle: str = ''

    def __post_init__(self):
        cls = self.__class__
        if self.handle == '':
            self.handle = self.name.split()[0]
        if self.handle in cls.all_handles:
            msg = f'handle {self.handle!r} already exists.'
            raise ValueError(msg)
        cls.all_handles.add(self.handle)
HackerClubMember extends ClubMember.
all_handles is a class attribute.
handle is an instance field of type str with an empty string as its default value; this makes it optional.
Get the class of the instance.
If self.handle is the empty string, set it to the first part of name.
If self.handle is in cls.all_handles, raise ValueError.
Add the new handle to cls.all_handles.
Example 5-17 works as intended, but is not satisfactory to a static type checker. Next, we’ll see why, and how to fix it.
If we typecheck Example 5-17 with Mypy, we are reprimanded:
$ mypy hackerclub.py
hackerclub.py:38: error: Need type annotation for 'all_handles'
    (hint: "all_handles: Set[<type>] = ...")
Found 1 error in 1 file (checked 1 source file)
Unfortunately, the hint provided by Mypy (version 0.750 as I write this) is not helpful in the context of @dataclass
usage.
If we add a type hint like Set[…]
to all_handles
, @dataclass
will find that annotation and make all_handles
an instance field.
We saw this happening in “Inspecting a class decorated with dataclass
”.
The work-around defined in PEP 526—Syntax for Variable Annotations
is a class variable annotation, written with a pseudo-type named typing.ClassVar
,
which leverages the generics []
notation to set the type of the variable and also declare it a class attribute.
To make the type checker happy, this is how we are supposed to declare all_handles
in Example 5-17:
all_handles: ClassVar[Set[str]] = set()
That type hint is saying: all_handles is a class attribute of type set-of-str, with an empty set as its default value.
To code that annotation, we must import ClassVar
and Set
from the typing
module.
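The effect is easy to verify: with the ClassVar annotation, dataclasses.fields does not list the attribute. Here is a trimmed-down sketch (the Member class is hypothetical, standing in for HackerClubMember):

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Set

@dataclass
class Member:
    name: str = ''
    # ClassVar tells @dataclass to leave this alone: it stays a class
    # attribute and no instance field is generated for it.
    all_handles: ClassVar[Set[str]] = set()

print([f.name for f in fields(Member)])  # ['name'] — all_handles excluded
```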
The @dataclass
decorator doesn’t care about the types in the annotations, except in two cases,
and this is one of them: if the type is ClassVar
, an instance field will not be generated for that attribute.
The other case where the type of the field is relevant to @dataclass
is when declaring init-only variables, our next topic.
Sometimes you may need to pass arguments to __init__
that are not instance fields.
Such arguments are called init-only variables by the dataclasses
documentation.
To declare an argument like that, the dataclasses module provides the pseudo-type InitVar, which uses the same syntax as typing.ClassVar.
The example given in the documentation is a data class that has a field initialized from a database,
and the database object must be passed to the constructor.
This is the code that illustrates the Init-only variables section:
Example 5-18. Example from the Init-only variables section of the dataclasses module documentation.

@dataclass
class C:
    i: int
    j: int = None
    database: InitVar[DatabaseType] = None

    def __post_init__(self, database):
        if self.j is None and database is not None:
            self.j = database.lookup('j')

c = C(10, database=my_database)
Note how the database
attribute is declared. InitVar
will prevent @dataclass
from treating database
as a regular field.
It will not be set as an instance attribute, and the dataclasses.fields
function will not list it.
However, database
will be one of the arguments that the generated __init__
will accept,
and it will be also passed to __post_init__
—if you write that method,
you must add a corresponding argument to the method signature, as shown in Example 5-18.
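The docs snippet above is not runnable as-is because DatabaseType and my_database are not defined. This self-contained sketch substitutes a hypothetical FakeDB class so the InitVar behavior can be observed directly:

```python
from dataclasses import dataclass, fields, InitVar

class FakeDB:  # hypothetical stand-in for the docs' DatabaseType
    def lookup(self, key):
        return 42

@dataclass
class C:
    i: int
    j: int = None                       # mirrors the docs snippet
    database: InitVar[FakeDB] = None    # init-only: not a regular field

    def __post_init__(self, database):  # receives the init-only argument
        if self.j is None and database is not None:
            self.j = database.lookup('j')

c = C(10, database=FakeDB())
print(c.j)                            # 42, computed in __post_init__
print([f.name for f in fields(c)])    # ['i', 'j'] — database is not listed
print('database' in vars(c))          # False: never set as an instance attribute
```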
This rather long overview of @dataclass
covered the most useful features—some of them appeared in previous sections, like “Main features” where we covered all three data class builders in parallel. The dataclasses
documentation and PEP 526 — Syntax for Variable Annotations have all details.
Often, classes built with @dataclass
will have more fields than the very short examples presented so far.
Dublin Core provides the foundation for a more typical @dataclass
example.
The Dublin Core Schema is a small set of vocabulary terms that can be used to describe digital resources (video, images, web pages, etc.), as well as physical resources such as books or CDs, and objects like artworks.
Dublin Core on Wikipedia
The standard defines 15 optional fields; the Resource class in Example 5-19 uses 8 of them.
Example 5-19. dataclass/resource.py: code for Resource, a class based on Dublin Core terms.

from dataclasses import dataclass, field
from typing import List, Optional
from enum import Enum, auto
from datetime import date

class ResourceType(Enum):
    BOOK = auto()
    EBOOK = auto()
    VIDEO = auto()

@dataclass
class Resource:
    """Media resource description."""
    identifier: str
    title: str = '<untitled>'
    creators: List[str] = field(default_factory=list)
    date: Optional[date] = None
    type: ResourceType = ResourceType.BOOK
    description: str = ''
    language: str = ''
    subjects: List[str] = field(default_factory=list)
This Enum will provide type-safe values for the Resource.type field.
identifier is the only required field.
title is the first field with a default. This forces all fields below to provide defaults.
The value of date can be a datetime.date instance, or None.
The type field default is ResourceType.BOOK.
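The field-ordering rule mentioned above is enforced at class creation time: once a field has a default, declaring a field without one after it makes @dataclass raise TypeError. A minimal sketch, with a hypothetical Broken class:

```python
from dataclasses import dataclass

error = None
try:
    @dataclass
    class Broken:
        title: str = '<untitled>'
        identifier: str  # no default after a defaulted field: rejected
except TypeError as exc:
    error = exc

print(error)  # message names the offending non-default argument
```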
Example 5-20 is a doctest to demonstrate how a Resource
record appears in code:
Example 5-20. dataclass/resource.py: doctest for Resource.

>>> description = 'Improving the design of existing code'
>>> book = Resource('978-0-13-475759-9', 'Refactoring, 2nd Edition',
...     ['Martin Fowler', 'Kent Beck'], date(2018, 11, 19),
...     ResourceType.BOOK, description, 'EN',
...     ['computer programming', 'OOP'])
>>> book  # doctest: +NORMALIZE_WHITESPACE
Resource(identifier='978-0-13-475759-9', title='Refactoring, 2nd Edition',
creators=['Martin Fowler', 'Kent Beck'], date=datetime.date(2018, 11, 19),
type=<ResourceType.BOOK: 1>, description='Improving the design of existing code',
language='EN', subjects=['computer programming', 'OOP'])
The __repr__
generated by @dataclass
is OK, but we can do better.
This is the format we want from repr(book)
:
>>> book  # doctest: +NORMALIZE_WHITESPACE
Resource(
    identifier = '978-0-13-475759-9',
    title = 'Refactoring, 2nd Edition',
    creators = ['Martin Fowler', 'Kent Beck'],
    date = datetime.date(2018, 11, 19),
    type = <ResourceType.BOOK: 1>,
    description = 'Improving the design of existing code',
    language = 'EN',
    subjects = ['computer programming', 'OOP'],
)
Example 5-21 is the code of __repr__
to produce the format above.
This example uses dataclasses.fields to get the names of the data class fields.
Example 5-21. dataclass/resource_repr.py: code for the __repr__ method implemented in the Resource class from Example 5-19.

    def __repr__(self):
        cls = self.__class__
        cls_name = cls.__name__
        indent = ' ' * 4
        res = [f'{cls_name}(']
        for f in fields(cls):
            value = getattr(self, f.name)
            res.append(f'{indent}{f.name} = {value!r},')
        res.append(')')
        return '\n'.join(res)
Start the res list to build the output string with the class name and open parenthesis.
For each field f in the class…
Get the named attribute from the instance.
Append an indented line with the name of the field and repr(value)—that's what the !r does.
Append the closing parenthesis.
Build a multiline string from res and return it.
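Pulled together into a self-contained sketch, the same technique works on any dataclass. Here it is applied to a tiny hypothetical Point class instead of Resource:

```python
from dataclasses import dataclass, fields

@dataclass
class Point:
    x: int = 0
    y: int = 0

    def __repr__(self):
        # One indented "name = value," line per field, one field per line.
        cls = self.__class__
        indent = ' ' * 4
        res = [f'{cls.__name__}(']
        for f in fields(cls):
            res.append(f'{indent}{f.name} = {getattr(self, f.name)!r},')
        res.append(')')
        return '\n'.join(res)

print(Point(1, 2))
# Point(
#     x = 1,
#     y = 2,
# )
```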
With this example inspired by the soul of Dublin, Ohio, we conclude our tour of Python’s data class builders.
Data classes are handy, but your project may suffer if you overuse them. The next section explains.
Whether you implement a data class writing all the code yourself or leveraging one of the class builders described in this chapter, be aware that it may signal a problem in your design.
In Refactoring, Second Edition, Martin Fowler and Kent Beck present a catalog of “code smells”—patterns in code that may indicate the need for refactoring. The entry titled Data Class starts like this:
These are classes that have fields, getting and setting methods for fields, and nothing else. Such classes are dumb data holders and are often being manipulated in far too much detail by other classes.
In Fowler’s personal Web site there’s an illuminating post explaining what is a “code smell”. That post is very relevant to our discussion because he uses data class as one example of a code smell and suggests how to deal with it. Here is the post, reproduced in full.8
The main idea of Object Oriented Programming is to place behavior and data together in the same code unit: a class. If a class is widely used but has no significant behavior of its own, it’s possible that code dealing with its instances is scattered (and even duplicated) in methods and functions throughout the system—a recipe for maintenance headaches. That’s why Fowler’s refactorings to deal with a data class involve bringing responsibilities back into it.
Taking that into account, there are a couple of common scenarios where it makes sense to have a data class with little or no behavior.
In this scenario, the data class is an initial, simplistic implementation of a class to jump start a new project or module. With time, the class should get its own methods, instead of relying on methods of other classes to operate on its instances. Scaffolding is temporary; eventually your custom class may become fully independent from the builder you used to start it.
Python is also used for quick problem solving and experimentation, and then it's OK to leave the scaffolding in place.
A data class can be useful to build records about to be exported to JSON or some other interchange format, or to hold data that was just imported, crossing some system boundary. Python’s data class builders all provide a method or function to convert an instance to a plain dict
, and you can always invoke the constructor with a dict
used as keyword arguments expanded with **
. Such a dict
is very close to a JSON record.
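For classes built with @dataclass, that conversion function is dataclasses.asdict. This sketch, using a hypothetical Coordinate record, shows the dict round trip described above:

```python
from dataclasses import dataclass, asdict

@dataclass
class Coordinate:  # hypothetical minimal record
    lat: float
    lon: float

moscow = Coordinate(55.76, 37.62)
d = asdict(moscow)        # {'lat': 55.76, 'lon': 37.62}: nearly a JSON record
clone = Coordinate(**d)   # the dict expands back into the constructor
print(moscow == clone)    # True
```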
In this scenario, the data class instances should be handled as immutable objects—even if the fields are mutable, you should not change them while they are in this intermediate form. If you do, you’re losing the key benefit of having data and behavior close together. When importing/exporting requires changing values, you should implement your own builder methods instead of using the given “as dict” methods or standard constructors.
After reviewing Python’s data class builders, we’ll end the chapter with the struct
module, also used for importing/exporting records, but at a much lower level.
The struct
module provides functions to parse fields of bytes into a tuple of Python objects,
and to perform the opposite conversion, from a tuple into packed bytes.
struct can be used with bytes, bytearray, and memoryview objects.
Suppose you need to read a binary file containing data about metropolitan areas, produced by a program in C with a record defined as in Example 5-22.
struct MetroArea {
    int year;
    char name[12];
    char country[2];
    float population;
};
Here is how to read one record in that format, using struct.unpack
:
>>> from struct import unpack
>>> FORMAT = 'i12s2sf'
>>> data = open('metro_areas.bin', 'rb').read(24)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)
Note how unpack
returns a tuple with four fields, as specified by the FORMAT
string.
The letters and numbers in FORMAT
are Format Characters described in the struct
module documentation.
Table 5-4 explains the elements of the format string from Example 5-23.
part | size | C type | Python type | limits to actual content |
---|---|---|---|---|
i | 4 bytes | int | int | 32 bits; range -2,147,483,648 to 2,147,483,647 |
12s | 12 bytes | char[12] | bytes | length = 12 |
2s | 2 bytes | char[2] | bytes | length = 2 |
f | 4 bytes | float | float | 32 bits; approximate range ±3.4×10^38 |
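The same format string drives the opposite conversion: struct.pack builds a binary record from a tuple of values. A minimal round trip, sketched under the assumption of native alignment (which pads 2 bytes before the float, so calcsize usually reports 24, matching the read(24) in the example above):

```python
from struct import pack, unpack, calcsize

FORMAT = 'i12s2sf'
print(calcsize(FORMAT))  # usually 24: 4 + 12 + 2 + 2 bytes of padding + 4

# pack pads b'Tokyo' with NUL bytes to fill the 12-byte name field
record = pack(FORMAT, 2018, b'Tokyo', b'JP', 43_868_228.0)
year, name, country, pop = unpack(FORMAT, record)
print(year, name, country, pop)
```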
One detail about the layout of metro_areas.bin is not clear from the code in Example 5-22: size is not the only difference between the name and country fields. The country field always holds a 2-letter country code, but name is a null-terminated sequence of up to 12 bytes, including the terminating b'\x00', which you can see in Example 5-23 right after the word Tokyo.9
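Because whatever follows the terminating NUL is garbage, a field like name must be cleaned before use. A hypothetical helper for that, keeping only the bytes before the first NUL:

```python
def decode_cstr(raw: bytes) -> str:
    """Decode a fixed-width, null-terminated C string field."""
    return raw.split(b'\x00', 1)[0].decode('ascii')

print(decode_cstr(b'Tokyo\x00\xc5\x05\x01\x00\x00\x00'))  # Tokyo
print(decode_cstr(b'JP'))                                 # JP (no NUL is fine too)
```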
Now let’s review a script to extract all records from metro_areas.bin
and produce a simple report like this:
$ python3 metro_read.py
2018  Tokyo, JP       43,868,228
2015  Shanghai, CN    38,700,000
2015  Jakarta, ID     31,689,592
Example 5-24 showcases the handy struct.iter_unpack
function.