© Jacob Zimmerman 2018
Jacob Zimmerman, Python Descriptors, https://doi.org/10.1007/978-1-4842-3727-4_7

7. Storing the Attributes

Jacob Zimmerman (New York, USA)

Now that all the preliminaries are out of the way, it is time to see the part of descriptors that is useful: storing the attributes that the descriptor represents. There are a lot of ways to store attributes with a descriptor, and this chapter will go over every option that I’m aware of, starting with the easiest.

Class-Level Storage

Class-level storage is easy; it’s normal storage on the descriptor. As an example, here is a descriptor that creates a basic class-level variable:
class ClassAttr:
    def __init__(self, value):
        self.value = value
    def __get__(self, instance, owner):
        return self.value
    def __set__(self, instance, value):
        self.value = value

This descriptor saves a value on itself as a typical instance attribute, which is simply returned in the __get__() method, ignoring whether instance is provided or not, since it’s a class-level attribute. This attribute can also be accessed through an instance, but making any change to it from the instance will apply the change to every instance of the class. Unfortunately, __set__() is not called when the attribute is assigned on the class itself (for example, SomeClass.attr = value), so the class variable holding the descriptor is simply rebound to the new value, replacing the descriptor, rather than the value being passed to __set__().
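
A short sketch of that behavior may help. The Widget class and its count attribute are hypothetical names used only for illustration, and ClassAttr is repeated so that the snippet runs on its own.

```python
class ClassAttr:
    def __init__(self, value):
        self.value = value
    def __get__(self, instance, owner):
        return self.value
    def __set__(self, instance, value):
        self.value = value

class Widget:  # hypothetical owner class
    count = ClassAttr(0)

a, b = Widget(), Widget()
a.count = 5                        # goes through __set__(), stored on the descriptor
shared = (b.count, Widget.count)   # every instance and the class see the change

Widget.count = 10                  # __set__() is NOT called; the descriptor is replaced
replaced = type(vars(Widget)['count'])   # now a plain int
```

After the last assignment, Widget.count is an ordinary class attribute and the descriptor object is no longer reachable through the class.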

For more details about making class-level descriptors that __set__() and __delete__() can be used on, check out the section at the end of this chapter about metadescriptors.

Descriptors aren’t just for class-level attributes, though; they’re used for instance-level attributes too. There are two broad strategies for storing instance-level attributes with descriptors:
  • On the descriptor

  • In the instance dictionary

Each strategy has some hurdles to clear for a reusable descriptor. When storing it on the descriptor, there are hurdles as to how to store it without memory leaks or hashing issues. As for storing the attributes on the instance dictionary, the difficulty comes from trying to figure out what name to store it under in the dictionary to avoid clashing.

Storing Data on the Descriptor

As shown before, saving a simple value on the descriptor is how a class-level value is stored. What must be done to store a value on a per-instance basis in one place? What is needed is some way to map an instance to its attribute value. Well, another name for a mapping is a dictionary. Maybe a dictionary would work. Here’s what using a dictionary for its storage might look like.
class Descriptor:
    def __init__(self):
        self.storage = {}
    def __get__(self, instance, owner):
        return self.storage[instance]
    def __set__(self, instance, value):
        self.storage[instance] = value
    def __delete__(self, instance):
        del self.storage[instance]

The __get__() method doesn’t deal with the if instance is None case, and in all other examples, it will be ignored for the sake of brevity and removing distractions while reading the code.
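
For reference, handling that case usually looks something like the following sketch. Returning the descriptor itself is a common convention (property does the same), not the only possible choice, and the Point class is a hypothetical name for illustration.

```python
class Descriptor:
    def __init__(self):
        self.storage = {}
    def __get__(self, instance, owner):
        if instance is None:      # accessed as Owner.attr rather than instance.attr
            return self           # common convention: hand back the descriptor
        return self.storage[instance]
    def __set__(self, instance, value):
        self.storage[instance] = value

class Point:  # hypothetical owner class
    x = Descriptor()

p = Point()
p.x = 3
```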

The dict in the code example has solved our first issue of storage per instance. Unfortunately, there are a couple shortcomings to using a plain old dict for the job.

The first shortcoming to address is memory leaks. A typical dict will store the instance used as the key long after the object should have been otherwise garbage collected from lack of use. This is fine for short-lived programs that won’t use a lot of memory and if the instances don’t suffer from the second shortcoming mentioned later, but if this isn’t the case, we need a way to deal with the issue.

Let’s look at how to get around this problem. The descriptor needs a way to stop caring about instances that are no longer in use. The weakref module provides just that. Weak references allow variables to reference an instance as long as there is a normal reference to it somewhere, but allow it to be garbage collected otherwise. They also allow you to specify behavior that will run as soon as the reference is removed.

The module also provides a few collections that are designed to remove items from themselves as the items are garbage collected. Of those, we want to look at a WeakKeyDictionary. A WeakKeyDictionary keeps a weak reference to its key, and therefore once the instance that is used as the key is no longer in use, the dictionary cleans the entire entry out.

So, here’s the example again, this time using the WeakKeyDictionary.
from weakref import WeakKeyDictionary
class Descriptor:
    def __init__(self):
        self.storage = WeakKeyDictionary()
    def __get__(self, instance, owner):
        return self.storage[instance]
    def __set__(self, instance, value):
        self.storage[instance] = value
    def __delete__(self, instance):
        del self.storage[instance]

The only changes from the previous example are the import of WeakKeyDictionary and its use in __init__() in place of the normal dict, which shows that there really isn’t much of a difference. This is a very easy upgrade to make, and many descriptor guides stop here. It works in most situations, so it isn’t a bad solution.
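
To see the difference in isolation, the following sketch puts the same kind of object into a plain dict and into a WeakKeyDictionary and then drops all other references. Thing is a hypothetical class used only here.

```python
import gc
from weakref import WeakKeyDictionary

class Thing:  # hypothetical class
    pass

strong, weak = {}, WeakKeyDictionary()

t1, t2 = Thing(), Thing()
strong[t1] = 'value'
weak[t2] = 'value'

del t1, t2
gc.collect()   # not needed on CPython (reference counting), but harmless elsewhere

remaining = (len(strong), len(weak))   # the dict keeps its key alive
```

The plain dict still holds its entry, and therefore the instance, while the weak dictionary has emptied itself.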

Unfortunately, it still suffers from the other shortcoming that a regular dict does: it doesn’t support unhashable types.

To use an object as a key in a dict, it must be hashable. Several built-in types cannot be hashed, namely the mutable collections (list, set, dict, and bytearray). Any object that is mutable (values inside can be changed) and overrides __eq__() to compare internal values should be unhashable: if the object is changed in a way that changes equality, its hash code changes too, so it can no longer be found as a dictionary key. Thus, such mutable objects are generally advised to mark themselves as unhashable using __hash__ = None. In Python 3, overriding __eq__() without also defining __hash__() does this automatically; overriding __hash__() should therefore be done only if equality is constant.
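
As a quick illustration, the hypothetical Money class below overrides __eq__() and so becomes unhashable, which means it cannot be used as a key in a WeakKeyDictionary either.

```python
from weakref import WeakKeyDictionary

class Money:  # hypothetical mutable value class
    def __init__(self, amount):
        self.amount = amount
    def __eq__(self, other):
        return isinstance(other, Money) and self.amount == other.amount
    # __hash__ is not defined, so Python sets Money.__hash__ to None

storage = WeakKeyDictionary()
try:
    storage[Money(5)] = 'anything'
    outcome = 'stored'
except TypeError:
    outcome = 'TypeError'   # the key has to be hashed, and it cannot be
```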

If it weren’t for Python providing default implementations of __eq__() and __hash__() (equality is the same as identity—an object is equal to itself, and nothing else), most objects wouldn’t be hashable and thus supported for descriptors using a hashing collection. Luckily, this means that types are hashable by default, but there are still many unhashable types out there.

Again, the WeakKeyDictionary is not a bad solution; it just doesn’t cover all possibilities. Much of the time, it is good enough, but it is generally advised not to use it for public libraries, at least not without good warnings in the documentation. After all, the descriptor protocol provides ways to set and delete attributes, so they should support instances of mutable classes.

There needs to be a solution that doesn’t suffer from this problem, and there is. The simplest solution is to use the instance’s ID as the key instead of the instance itself. Hooray! Now the dictionary doesn’t hold onto unused instances anymore, and it doesn’t require the classes to be hashable.

Here’s what that solution would look like.
class Descriptor:
    def __init__(self):
        self.storage = {}
    def __get__(self, instance, owner):
        return self.storage[id(instance)]
    def __set__(self, instance, value):
        self.storage[id(instance)] = value
    def __delete__(self, instance):
        del self.storage[id(instance)]

The example switches back to a normal dict, so it is best compared with the first example rather than with the previous one. The only change is that every access to the storage uses id(instance) as the key instead of instance itself.

This seems like a pretty good solution, since it doesn’t suffer from either of the problems of the previous two solutions. But it’s not a good solution. It doesn’t suffer from exactly the same problems of the previous solutions, but it still suffers from a memory leak. Yes, the dictionary no longer stores the instances, so those aren’t being kept, but there’s no mechanism to clear useless IDs from the dictionary. In fact, a new instance of the class may be created with the same ID as an older, deleted instance (IDs are only unique among objects that exist at the same time, and CPython, which uses the memory address as the ID, reuses them readily), so the new instance has an attribute equal to the old one until it’s changed. That’s assuming it can be changed; what if the descriptor is designed to be read-only (more on that later)? Then the new instance is absolutely stuck with the old value.
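
The leftover entries are easy to observe. In this sketch, which repeats the ID-keyed descriptor and uses a hypothetical Account class, three instances are created and discarded one after another.

```python
class Descriptor:
    def __init__(self):
        self.storage = {}
    def __get__(self, instance, owner):
        return self.storage[id(instance)]
    def __set__(self, instance, value):
        self.storage[id(instance)] = value

class Account:  # hypothetical owner class
    balance = Descriptor()

for _ in range(3):
    acct = Account()
    acct.balance = 100
    del acct                      # the instance is gone...

# ...but its entry is not. The count is between 1 and 3, depending on
# how many of the discarded instances happened to share an ID.
leftover = len(vars(Account)['balance'].storage)
```

A count lower than three is itself a sign of ID reuse: a later instance landed on the entry of an earlier, deleted one.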

So, this still doesn’t solve the on-descriptor storage problem, but it’s leading in the right direction. What is needed is a storage system that works like a dictionary, with instance as the key, but uses id(instance) instead of hash(instance) for storage. It also needs to clean itself out if an instance is no longer in use.

Since such a thing isn’t built in, it will have to be custom-made. Here is that custom dictionary, designed specifically for this book.
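import weakref
class DescriptorStorage:
    def __init__(self, **kwargs):
        self.storage = {}  # maps id(key) -> value
        for k, v in kwargs.items():
            self.__setitem__(k, v)
    def __getitem__(self, item):
        return self.storage[id(item)]
    def __setitem__(self, key, value):
        if id(key) not in self.storage:
            # First time this key is stored: schedule cleanup for when it is
            # garbage collected. pop() with a default never raises, even if
            # the entry was already removed through __delitem__().
            weakref.finalize(key, self.storage.pop, id(key), None)
        self.storage[id(key)] = value
    def __delitem__(self, key):
        del self.storage[id(key)]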

The real version obviously has more methods, such as __iter__, __len__, etc., but the main three uses for storage with a descriptor are implemented here. The rest of the implementation can be found in the descriptor-tools library.

This class is surprisingly simple. The basic idea is a facade class that acts like a dictionary, delegating most functionality to an inner dictionary, but transforming the given keys to their IDs. The only real difference is that, in __setitem__(), this new class creates a weakref.finalize object, which takes the object to watch, a function, and any arguments to send to that function when the watched object is garbage collected. In this case, it removes the item (again, stored using id()) from the internal dictionary.

The keys to how this storage class works are using an ID as the key (which means the instances do not need to be hashable) and weak reference callbacks (which remove unused objects from the dictionary). In essence, this class is a WeakKeyDictionary that internally uses the ID of the given key as the actual key.
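
Here is a hedged sketch of the storage class in use. It carries a trimmed copy of DescriptorStorage so that it runs on its own; the cleanup callback uses dict.pop() with a default, an adjustment made for this sketch so that the callback is harmless if the entry is already gone. Bag is a hypothetical, deliberately unhashable class.

```python
import gc
import weakref

class DescriptorStorage:  # trimmed copy, so the snippet runs on its own
    def __init__(self):
        self.storage = {}
    def __getitem__(self, item):
        return self.storage[id(item)]
    def __setitem__(self, key, value):
        self.storage[id(key)] = value
        # pop() with a default: safe even if the entry was already deleted
        weakref.finalize(key, self.storage.pop, id(key), None)
    def __delitem__(self, key):
        del self.storage[id(key)]

class Descriptor:
    def __init__(self):
        self.storage = DescriptorStorage()
    def __get__(self, instance, owner):
        return self.storage[instance]
    def __set__(self, instance, value):
        self.storage[instance] = value

class Bag:  # hypothetical owner class; overriding __eq__() makes it unhashable
    items = Descriptor()
    def __eq__(self, other):
        return isinstance(other, Bag) and self.items == other.items

bag = Bag()
bag.items = [1, 2]          # works although Bag instances cannot be hashed
stored = bag.items
del bag
gc.collect()
entries_left = len(vars(Bag)['items'].storage.storage)   # cleaned up by finalize
```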

Storing the attribute in the descriptor safely takes a lot more consideration than most people ever actually put into it, but now there is a nice, catch-all solution for doing that. The first two solutions are imperfect, but not useless. If the use case for the descriptor allows for the use of either of those solutions, it wouldn’t hurt to consider them. They are viable enough for many cases and are likely to be slightly more performant than the custom storage system provided here. For public libraries, though, either the custom dictionary or an on-instance solution from the following section should be considered.

Storing on the Instance Dictionary

It’s often better to store the data on the instance instead of within the descriptor, provided that a worthwhile strategy for deriving the key can be found. This is because it doesn’t require an additional object for storage; the instance’s built-in storage dictionary is used. However, some classes will define __slots__, and, as such, will not have the storage dictionary to mess with. This limits the usefulness of on-instance strategies a little bit, but __slots__ is used rarely enough that it’s barely worth considering.

If you want to make a descriptor safe with __slots__ while still defaulting to using the instance dictionary, you may want to create some sort of alternative that uses on-descriptor storage when a Boolean flag is set on creation. There are plenty of ways to implement that, whether using a factory that chooses a different descriptor if the flag is set or the class within has alternate paths based on the flag value. Another, simpler alternative is to document the name that the descriptor stores its values under so that users of the descriptor who want to use __slots__ can prepare a slot for it. This requires that the descriptor does direct instance attribute setting (either with dot notation or with getattr(), setattr(), and delattr()) rather than getting the instance dictionary first.

Another way to go about this (which doesn’t require explicitly asking the user) is to check if the class has the storage dictionary; if it does, then simply use it, but if it doesn’t, you can store it on the descriptor instance directly. Checking for the existence of __slots__ is unreliable as subclasses may not define __slots__ (while the base class does), so they will have both an instance dictionary and __slots__.
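
A minimal sketch of that check might look like the following. FlexibleDescriptor, Slotted, and Child are hypothetical names, the storage name is passed in by hand, and a WeakKeyDictionary stands in for the on-descriptor storage purely to keep the example short.

```python
from weakref import WeakKeyDictionary

class FlexibleDescriptor:  # hypothetical name
    def __init__(self, name):
        self.name = name
        self.fallback = WeakKeyDictionary()  # used only when there is no __dict__
    def __get__(self, instance, owner):
        if hasattr(instance, '__dict__'):
            return vars(instance)[self.name]
        return self.fallback[instance]
    def __set__(self, instance, value):
        if hasattr(instance, '__dict__'):
            vars(instance)[self.name] = value
        else:
            self.fallback[instance] = value

class Slotted:
    __slots__ = ('__weakref__',)   # no __dict__; __weakref__ allows weak references
    attr = FlexibleDescriptor('attr')

class Child(Slotted):              # defines no __slots__, so instances get a __dict__
    pass

s, c = Slotted(), Child()
s.attr = 1    # stored on the descriptor
c.attr = 2    # stored in the instance dictionary
```

Testing for the dictionary itself, rather than for __slots__, is what lets Child take the instance-dictionary path even though its base class is slotted.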

Storing the data on the instance using the instance dictionary is easy (although often verbose, since referencing the attribute as vars(a)['x'] is often needed instead of a.x in order to avoid recursively calling the descriptor), as the following example will show. It’s a simple example with a location of where to store the data being hard-coded as "desc_store".
class InstanceStoringDescriptorBasic:
    name = "desc_store"
    def __get__(self, instance, owner):
        return vars(instance)[self.name]
    def __set__(self, instance, value):
        vars(instance)[self.name] = value
    def __delete__(self, instance):
        del vars(instance)[self.name]

As shown, it is pretty easy to store on the instance. Some of you may not know about vars(), though, so I will explain. Calling vars() on an object returns the instance dictionary. Many of you probably knew about __dict__. The vars() function returns that same dictionary and is the preferred (read “Pythonic”) way of accessing it, though lesser known. It is preferred largely because of the lack of double underscores. Like nearly every other “magic” attribute with double underscores, there is a clean way of using it. Hopefully, now you will inform all of your Python-using buddies about this and it can become a much more widely known function.

But why should the values be accessed via vars() and not simple dot notation? There are actually plenty of situations where using dot notation would work just fine. In fact, it works in most situations. Problems arise only when a data descriptor on the class has the same name that is being used for storage in the dictionary, or when the name being used is not a legal Python identifier. Often, the first case pops up because the descriptor is purposely storing the attribute under its own name, which is almost guaranteed to prevent name conflicts. But it’s still possible that an outside data descriptor has the same name as where the main descriptor is trying to store its data. In order to avoid this, it is preferable to always directly reference the instance’s dictionary. Another good reason is that it makes it more explicit and obvious where the data is being stored.
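
The same-name problem can be shown with two tiny descriptors that store under their own names. Both classes are hypothetical, and DotNotation is broken on purpose.

```python
class DotNotation:  # hypothetical, deliberately broken
    def __init__(self, name):
        self.name = name
    def __set__(self, instance, value):
        setattr(instance, self.name, value)   # finds this descriptor again

class ViaVars:  # hypothetical
    def __init__(self, name):
        self.name = name
    def __set__(self, instance, value):
        vars(instance)[self.name] = value     # bypasses descriptor lookup

class Example:
    broken = DotNotation('broken')
    working = ViaVars('working')

e = Example()
e.working = 1
try:
    e.broken = 1
    result = 'stored'
except RecursionError:
    result = 'RecursionError'   # __set__() kept calling itself
```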

The next thing to be figured out is how the descriptor knows the name to store the attribute under. Hopefully it’s obvious that hard-coding a location is a bad idea; it prevents multiple instances of that type of descriptor from being used on the same class since they will all be contending for the same name.

Asking for the Location

The simplest way to get a location name is to ask for it in the constructor. A descriptor like that would look something like this:
class GivenNameInstanceStoringDescriptor:
    def __init__(self, name):
        self.name = name
    def __get__(self, instance, owner):
        return instance.__dict__[self.name]
    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
    def __delete__(self, instance):
        del instance.__dict__[self.name]

The only real difference between this one and the previous one is that it has an __init__() method that receives the preferred location name from the user instead of hard-coding it. The rest of the code does exactly the same thing; instance.__dict__ is the same dictionary that vars(instance) returns.

Asking for the location to store the attribute value is easy when it comes to creating the descriptor, but is tedious for the user and can even be dangerous in the event that the location is required to have the same name as the descriptor, since the user can mess that up. Such is the case with set-it-and-forget-it descriptors, such as the following descriptor, which is a descriptor used for validating data using the function provided.
class Validated:
    def __init__(self, name, validator):
        self.name = name
        self.validator = validator
    def __set__(self, instance, value):
        if self.validator(value):
            instance.__dict__[self.name] = value
        else:
            raise ValueError("not a valid value for " + self.name)
In this Validated descriptor, __init__() asks for the location to store the real data. Since this is a set-it-and-forget-it descriptor that lets the instance handle retrieval instead of providing a __get__(), the location that the user provides must be the same as the descriptor’s name on the class in order for the descriptor to work as intended. For example, if a class was accidentally written like this:
class A:
    validatedAttr = Validated('validatedAttribute', validatorFunc)

validatedAttr is all screwed up. To set it, the user writes a.validatedAttr = someValue, but retrieving it requires the user to write a.validatedAttribute. This may not seem all that bad since it can be fixed easily, but these are the types of bugs that can often be very difficult to figure out and can take a long time to notice. Also, why should the user be required to write in the location when it can be derived somehow?
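
Running the mismatched class makes the symptom concrete. In this sketch is_positive is a hypothetical stand-in for validatorFunc, and Validated is repeated so that the snippet is self-contained.

```python
class Validated:
    def __init__(self, name, validator):
        self.name = name
        self.validator = validator
    def __set__(self, instance, value):
        if self.validator(value):
            instance.__dict__[self.name] = value
        else:
            raise ValueError("not a valid value for " + self.name)

def is_positive(value):  # hypothetical validator
    return value > 0

class A:
    validatedAttr = Validated('validatedAttribute', is_positive)

a = A()
a.validatedAttr = 5
where_it_went = dict(vars(a))        # stored under the other name
what_comes_back = a.validatedAttr    # the descriptor object itself, not 5
```

No error is raised anywhere, which is exactly why this kind of bug can go unnoticed for a long time.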

Set-It-and-Forget-It Descriptors

Now set-it-and-forget-it descriptors can finally be explained. Of the three methods in the descriptor protocol, these descriptors generally only implement __set__(), as seen in the example. That’s not always the case, though. For example, the following lazy initialization descriptor only uses __get__().
class lazy:
    def __init__(self, func):
        self.func = func
    def __get__(self, instance, owner):
        value = self.func(instance)
        instance.__dict__[self.func.__name__] = value
        return value

This lazy descriptor can also be used as a decorator over a function, which it replaces and uses to do the lazy initialization. In this case, and in the case of other set-it-and-forget-it descriptors, the descriptor sets the value directly onto the instance, using the same name the descriptor is referenced by. This allows the descriptor to either be a non-data descriptor that is never used more than once—as in the case of lazy—or to be a data descriptor that has no need to implement __get__(), which is the case with most set-it-and-forget-it descriptors. In many cases, set-it-and-forget-it descriptors can increase lookup speeds by just looking in the instance or even provide other optimizations, like the lazy descriptor.
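
Used as a decorator, it might look like the sketch below. Report and the calls list are hypothetical and exist only to show that the wrapped function runs once; the descriptor is repeated here with the function name read from self.func.

```python
class lazy:
    def __init__(self, func):
        self.func = func
    def __get__(self, instance, owner):
        value = self.func(instance)
        instance.__dict__[self.func.__name__] = value
        return value

calls = []

class Report:  # hypothetical class
    @lazy
    def total(self):
        calls.append('computed')
        return 42

r = Report()
first_read, second_read = r.total, r.total   # second read comes from vars(r)
```

Because lazy is a non-data descriptor, the value written into the instance dictionary takes precedence on every later lookup.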

Indirectly Asking for the Location

Something else can be noted about the lazy descriptor from the set-it-and-forget-it section, and that’s how it was able to determine where to store the attribute; it pulled it from the function that it decorated.

This is a great way to indirectly ask for the name of the descriptor. Since the descriptor, initialized as a decorator, is provided with the function that it is replacing, it can use that function’s name as the place to store the information on the instance.

Name Mangling

Using the name directly like that, though, can be dangerous for most non-data descriptors, since setting it directly to that location would override its own access (which lazy actually intended to have happen). When building a non-data descriptor that doesn’t want to write over itself (although the chances are probably pretty slim for that situation to come up), it is best to do some “name mangling” when storing the data. To do so, just add an underscore or two to the beginning of the name. One leading underscore simply signals that the attribute is “private” to those using the object. Be aware that Python’s own mangling of names with at least two leading underscores and at most one trailing underscore applies only to identifiers written in the source code of a class body; a string key stored through vars() or setattr() is kept exactly as given. Either way, there’s an incredibly low chance that the name is already taken on the instance.
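
As a rough sketch, here is a hypothetical non-data descriptor that must keep receiving every access (it counts reads) and therefore stores its value under an underscore-prefixed key instead of its own name.

```python
class counting_lazy:  # hypothetical non-data descriptor
    def __init__(self, func):
        self.func = func
        self.store = '_' + func.__name__   # one leading underscore
        self.reads = 0
    def __get__(self, instance, owner):
        self.reads += 1                    # still runs on every access
        data = vars(instance)
        if self.store not in data:
            data[self.store] = self.func(instance)
        return data[self.store]            # key is stored exactly as written

class Report:  # hypothetical class
    @counting_lazy
    def total(self):
        return 42

r = Report()
values = (r.total, r.total)
```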

Next, what can be done if asking the user for the name is a bad idea and the descriptor isn’t also a decorator? How does a descriptor determine its name then? There are several options, and the first one that will be discussed is how a descriptor can try to dig up its own name.

Fetching the Name

It would seem so simple to just look up what a descriptor’s name is, but, like any object, a descriptor could be assigned to multiple variables with different names. No, a more roundabout way of discovering one’s own name is required.

Note

Inspiration for this technique is attributed to “The Zipline Show” on YouTube, specifically their video about descriptors. This technique shows up around 22 minutes in. They may have gotten the technique from the book they mention at the beginning of the video, but I took the idea from them, not the book.

The original version of this technique, which I have adapted a little, used the following code.
def name_of(self, instance):
    for attr in type(instance).__dict__:
        if attr.startswith('__'): continue
        obj = type(instance).__dict__[attr]
        if obj is self:
            self.name = self.mangle(attr)
            break

This method is meant to be added to any descriptor in order to look up its name. If the descriptor’s name attribute isn’t set, the descriptor just runs this method to set it. On the second to last line, it sends the name to a name mangler—which just makes sure it starts with two underscores—instead of using the name as it is. As mentioned in the name mangling section, this may be necessary, but not always.

There’s a problem with this method, though: it doesn’t handle subclasses. If a class with this descriptor is subclassed and an instance of that subclass tries to use the descriptor before an instance of the original class does, it will fail to look up its name. This is because the descriptor is on the original class, not the subclass, but the name_of() method looks in the class’ dictionary for itself. The subclass will not have the descriptor in its dictionary.

Not to worry, though. The version in the library solves this problem by using dir() to get all the names of attributes, including from superclasses, and then it delegates those to a function that digs into the __dict__ of each class on the MRO until it finds what it’s looking for. I also removed the name mangling function, allowing you to use that only as necessary. Lastly, it doesn’t bother with ignoring attributes that start with a double underscore. Such a check may actually be slower than accessing the attribute and comparing identity, but even if it’s not, it largely just clutters the code. Plus, you never know; your descriptor may be used in place of a special method.

The final result looks like this:
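def name_of(descriptor, owner):
    # dir() also lists names inherited from superclasses
    return first(attr for attr in dir(owner)
                 if (get_descriptor(owner, attr) is descriptor))
def first(iterator):
    # First item of the iterator, or None if it is empty
    return next(iterator, None)
def get_descriptor(cls, descname):
    # Find the first class on the MRO that defines descname and return
    # the raw object from its __dict__, without triggering __get__()
    selected_class = first(clss for clss in cls.__mro__
                           if descname in clss.__dict__)
    return selected_class.__dict__[descname]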

Python 3.2 also added a new function in the inspect module called getattr_static(), which works just like getattr() except that it doesn’t activate a descriptor’s __get__() method upon lookup. You could replace the call to get_descriptor() with getattr_static() and it would work the same.
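
A sketch of that replacement follows. The Descriptor, Base, and Sub classes are hypothetical, and this variant of name_of() is only one way the swap could be written.

```python
from inspect import getattr_static

class Descriptor:  # hypothetical descriptor
    def __get__(self, instance, owner):
        return 'value from __get__()'

class Base:
    attr = Descriptor()

class Sub(Base):
    pass

def name_of(descriptor, owner):
    # getattr_static() returns the raw object without calling __get__()
    return next((attr for attr in dir(owner)
                 if getattr_static(owner, attr) is descriptor), None)

found = name_of(vars(Base)['attr'], Sub)   # looked up through the subclass
```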

__set_name__()

In Python 3.6, something else was added that makes fetching the name even easier! Python gained an additional optional method in its protocol: __set_name__(). This new method is called during the creation of a class that contains a descriptor object. Its parameters are self, owner, and name. The first one, self, is super obvious; it’s the same first parameter that all methods have. You should recognize the second one, owner, as the class that the descriptor is on. And the last one, name, should also be evident as the name that we’re looking for: the name of the variable that the descriptor object is stored on.
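
A minimal sketch of a descriptor that relies on __set_name__() is shown below. Named and Person are hypothetical names; keeping owner around is optional and done here only to show what is passed in.

```python
class Named:  # hypothetical descriptor
    def __set_name__(self, owner, name):
        self.owner = owner   # the class being created
        self.name = name     # the variable the descriptor was assigned to
    def __get__(self, instance, owner):
        return vars(instance)[self.name]
    def __set__(self, instance, value):
        vars(instance)[self.name] = value

class Person:  # hypothetical owner class
    first = Named()
    last = Named()

p = Person()
p.first, p.last = 'Ada', 'Lovelace'
```

Note that __set_name__() is only called automatically for descriptors present in the class body; one attached to a class afterwards has to have it called by hand.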

Store the Original and the Mangled

When storing the name used for the descriptor, it’s often best to store both the original name and the mangled name. Keeping the mangled name is obvious, but why in the world would you want to also store the original name? For error messages. If something goes wrong when trying to use your descriptor, you want to at least provide the name of the attribute to the user to get a better idea of where it all went wrong.
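
For instance, a validating descriptor could keep both names along these lines. This is a sketch under assumed names (Positive, Order, storage_name), not a prescribed layout.

```python
class Positive:  # hypothetical validating descriptor
    def __set_name__(self, owner, name):
        self.name = name                  # original name, for messages
        self.storage_name = '_' + name    # mangled name, for storage
    def __get__(self, instance, owner):
        return vars(instance)[self.storage_name]
    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive, got {value!r}")
        vars(instance)[self.storage_name] = value

class Order:  # hypothetical owner class
    quantity = Positive()

order = Order()
order.quantity = 3
try:
    order.quantity = -1
except ValueError as error:
    message = str(error)   # mentions 'quantity', not '_quantity'
```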

Keying on the ID

Another thing that can be done for relatively safe places to store on the instance is to use the id() of the descriptor to generate a location on the instance, somehow. It seems strange, but a non-string can be used as the key in an instance dictionary.

Unfortunately, it can only be accessed directly via vars(instance)[id(desc)] and not via dot notation or get/set/hasattr(). This may actually seem like a plus, since it prevents unwanted access to the attribute, but it also messes up dir(instance), which raises an exception when it finds a non-string key.

On the plus side, it’s impossible for this location to clash with user-defined attributes, since those must be strings, and this is an integer. But causing dir() to fail is undesirable, so a different solution must be found. Defining a __dir__() method would be overkill and inappropriate in most cases. However, the aggressive programmer could call object.__dir__() and remove the id() from the list before returning it. As stated, however, this is overkill.

A simple solution is to change the ID into a string, i.e. str(id(desc)) instead of just id(desc). This fixes the dir() problem and also opens up the use of get/set/hasattr() while still preventing dot notation access, since it’s an invalid Python identifier. The likelihood of name clashes is still extremely low, so this is still an acceptable solution.
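
Both halves of this can be sketched briefly: a hypothetical IdKeyed descriptor that stores under str(id(self)), and a plain object whose dictionary has been given an integer key to show what happens to dir().

```python
class IdKeyed:  # hypothetical descriptor
    def __init__(self):
        self.key = str(id(self))   # computed once, reused on every access
    def __get__(self, instance, owner):
        return vars(instance)[self.key]
    def __set__(self, instance, value):
        vars(instance)[self.key] = value

class Example:
    attr = IdKeyed()

e = Example()
e.attr = 'hello'
key = vars(Example)['attr'].key

class Plain:
    pass

broken = Plain()
vars(broken)[12345] = 'integer key'   # the dictionary accepts it...
try:
    dir(broken)
    dir_result = 'ok'
except TypeError:
    dir_result = 'TypeError'          # ...but dir() cannot sort int and str names
```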

Note

An interesting little twist of str(id(desc)) is to use the hexadecimal value, as hex(id(desc)) instead of the straight string version of the number, preferably removing '0x' at the beginning, such as hex(id(desc))[2:]. The benefit of this is that the hex string will generally be shorter, which shortens the time needed to calculate the hash value (which is done on lookup and assignment in __dict__) by a tiny bit. Yes, the amount of time needed to calculate the hex value is greater than that of calculating the plain string value, but that only needs to be done once (you can save the hex string to be used later), whereas attribute lookup is likely to happen many times. It’s a tiny optimization and may not even be worth noting.

There’s no good reason to add acceptable characters to the front of the key in order to support dot notation, since dot notation requires the user to know what the name is going to be ahead of time, which they can’t know since the name changes every time the program is run when using id() to derive it. A constantly changing key imposes other restrictions as well, one of which is that it makes serialization and deserialization (for example, pickling and unpickling with the pickle module) a little more difficult.

If it’s desirable to be able to derive some sort of information from the save location, additional information can be added to the key. For example, the descriptor’s class name could be added to the front of the key, as in type(self).__name__ + str(id(self)). This gives users who use dir() to look through the names on the instance some clue as to what that name refers to, especially if there are multiple descriptors that base their name on id() on the instance.

Letting the User Take Care Of It

The title of this section may sound like it’s about asking the user for the name in the descriptor’s constructor, but that’s not it at all. Instead, this is referring to the approach property uses.

One could say that property “cheats” by simply assigning functions that you give it to its different methods. It acts as the ultimate descriptor by being almost infinitely customizable, and that’s largely what it is. The biggest descriptor-y thing it can’t do is become a non-data descriptor (since it defines all three methods of the descriptor protocol), which is fine, since that doesn’t work with the intent anyway. Also, the functions fed to the descriptor don’t have easy access to the descriptor’s internals, so there’s a limit to what can be done there.

Interestingly, a large percentage of descriptors could be written using property (and actually work better, since there would be no difficulties in figuring out where to save the data), but it certainly has major drawbacks. The biggest of those is the lack of DRYness when it comes to reusing the same descriptor idea. (Don’t Repeat Yourself; DRYness is the lack of unnecessarily repeated code.) If the same code has to largely be rewritten many times for the same effect with property, it should be turned into a custom descriptor that encapsulates the repeated part. Sadly, it isn’t likely to be a really easy copy-over, because the descriptor then has to work out where to store the value. If the descriptor doesn’t need to figure that out, though, which is sometimes the case, then the conversion is much easier.
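
The repetition is easiest to see side by side. In this sketch the class names and the non-negative rule are hypothetical; the first class spells the rule out once per attribute with property, while the second moves it into a small descriptor.

```python
class WithProperties:          # property version: one block per attribute
    def __init__(self, width, height):
        self.width, self.height = width, height
    @property
    def width(self):
        return self._width
    @width.setter
    def width(self, value):
        if value < 0:
            raise ValueError("width must not be negative")
        self._width = value
    @property
    def height(self):
        return self._height
    @height.setter
    def height(self, value):
        if value < 0:
            raise ValueError("height must not be negative")
        self._height = value

class NonNegative:             # hypothetical descriptor holding the repeated part
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, instance, owner):
        return vars(instance)[self.name]
    def __set__(self, instance, value):
        if value < 0:
            raise ValueError(self.name + " must not be negative")
        vars(instance)[self.name] = value

class WithDescriptors:
    width = NonNegative()
    height = NonNegative()
    def __init__(self, width, height):
        self.width, self.height = width, height
```

With property, the user decided where each value lives (self._width, self._height); the descriptor version has to make that decision itself, which is the storage problem this chapter has been about.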

In summary, property is a highly versatile descriptor, and it even makes some things extremely easy (namely the difficult thing this entire chapter was about), but it’s not easily reusable. Custom descriptors are the best solution for that, which is why this book exists!

There aren’t many use cases out there for recreating “storage” the way that property does it, but there are enough use cases for extending what property does in little ways to make it worthwhile to look into.

Metadescriptors

The restrictions of descriptors and their use with classes can be quite the pain, limiting some of the possibilities that could be wanted from descriptors, such as class constants. It turns out that there is a way around it, and that solution will be affectionately called metadescriptors in this book (hopefully the idea and name spreads throughout the advanced Python community).

The reason they are called metadescriptors is because the descriptors, instead of being stored on the classes, are stored on metaclasses. This causes metaclasses to take the place of owner while classes take the place of instance. Technically, that’s all there really is to metadescriptors. It’s not even required for a descriptor to be specially designed in order for it to be a metadescriptor.
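
As a small sketch of the idea, the ClassAttr descriptor from the start of the chapter can be placed on a metaclass without any changes. Meta and Config are hypothetical names.

```python
class ClassAttr:
    def __init__(self, value):
        self.value = value
    def __get__(self, instance, owner):
        return self.value
    def __set__(self, instance, value):
        self.value = value

class Meta(type):        # hypothetical metaclass that owns the descriptor
    setting = ClassAttr('default')

class Config(metaclass=Meta):
    pass

before = Config.setting
Config.setting = 'changed'    # Config plays the role of instance, so __set__() runs
still_descriptor = isinstance(vars(Meta)['setting'], ClassAttr)
```

One consequence worth knowing is that attributes defined on a metaclass are not visible from instances of the class, so Config().setting raises AttributeError in this sketch.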

While the idea of metadescriptors is actually pretty simple, the restrictions around metaclasses can make using metadescriptors more difficult. The biggest restriction that must be noted is that a class can have only one metaclass, whether that metaclass is specified directly on the class or inherited from its base classes; base classes whose metaclasses are unrelated to each other cannot be combined. Don’t forget that, even if there is no metaclass specified, a class is still being derived from the type metaclass.

Because of this, choosing to use metadescriptors must be done with caution. Luckily, if the codebase is following the guideline of preferring composition to inheritance, this is less likely to be a problem.

For a good example of a metadescriptor, check out the ClassConstant metadescriptor near the end of the next chapter.

Summary

In this chapter, we looked at a bunch of examples of techniques for storing values in descriptors, including options for storing on the descriptor as well as on the instances themselves. Now that we know the basics that apply to a majority of descriptors, we’ll start looking at some other relatively common functionality and how it can be implemented.
