Chapter 10. Memoization

Using pure functions has more advantages than just producing predictable results. Granted, that’s a good thing to have, but there’s another way to use that behavior to our advantage.

Memoization is somewhat like caching, specifically like the GetOrAdd() function from the MemoryCache() class. Memoization takes a key value of some kind, and if that key is already present in the cache, it returns the object. If the key isn’t present, you need to pass in a function that will generate the required value. Unlike with standard caching, you don’t need to be concerned with cache invalidation or updating values already stored. Typically, memoization holds onto values for the duration of a larger calculation and then discards everything it has stored.

Memoization works to the exact same principle as standard caching, except its scope might not extend beyond a single unit of work, which might be simply a single calculation. It isn’t a replacement for standard caching.

Memoization is useful in a multistep calculation that might be recursive, or involve the same calculations being performed multiple times for some reason. Maybe the best way to explain this is with an example.

Bacon Numbers

Have you ever wanted an entertaining way to waste an afternoon or two? Have a look into Bacon numbers. They’re based on the idea that Kevin Bacon is the center of the acting universe, connecting all actors together. Like all roads lead to Rome, all actors somehow connect at some level to Bacon. An actor’s Bacon number is the number of film connections you have to work through in order to reach Kevin Bacon. Let’s work through a few examples:

Kevin Bacon

Easy. He has a Bacon number of 0, because he is the Big Bacon himself.

Tom Hanks

Bacon number of 1. He was with Bacon in one of my personal favorites, Apollo 13. Frequent Hanks collaborator Meg Ryan is also 1, because she appeared with Bacon in In The Cut.

David Tennant

Bacon number of 2. He appeared with Colin Firth in St. Trinian’s 2. Colin Firth appeared in Where the Truth Lies with Bacon. That’s two films before we find a connection, so a Bacon number of 2. Believe it or not, Marilyn Monroe also has a score of 2 because Bacon appeared in JFK with Jack Lemmon, who was also in Some Like It Hot.

Aamir Khan

This Bollywood superstar has a Bacon number of 3. He was with living legend Amitabh Bachchan in Bombay Talkies. Amitabh was in The Great Gatsby with Tobey McGuire. McGuire was in Beyond All Boundaries with Bacon.

My Bacon number is infinity! This is because I’ve never appeared in a film as an actor.1 Also, the holder of the highest Bacon number I’m aware of is William Rufus Shafter, an American Civil War general who also appeared in a nonfiction film made in 1898, which still secures him a Bacon number. It’s a whopping great 10!

Right, hopefully you understand the rules. Let’s imagine we want to work out which of these actors has the lowest Bacon number programmatically. Our code looks like this:

var actors = new []
{
    "Tom Hanks",
    "Meg Ryan",
    "David Tennant",
    "Marilyn Monroe",
    "Aamir Khan"
};

var actorsWithBaconNumber = actors.Select(x => (a: x, b: GetBaconNumber(x)));

var report = string.Join("
", actorsWithBaconNumber.Select(x =>
    x.a+ ": " + x.b);

GetBaconNumber() could be calculated in many ways—most likely, by using a web API of film data of some kind. More advanced “shortest path” algorithms are out there, but for the sake of simplicity I’ll say the process is something like this:

  1. Get all of Kevin Bacon’s films. Assign all actors in these films a Bacon number of 1. If the target actor (e.g., Tom Hanks) is among them, return an answer of 1. Otherwise, continue.

  2. Take each of the actors from the previous step (excluding Bacon himself), get a list of all their films not already checked. Assign all actors from these films not already assigned a value of 2.

  3. Continue in iterations, with each set of actors being assigned progressively higher values until we finally reach the target actor and return their number.

Since an API is working to calculate these numbers, every actor whose filmography we download, or every film whose cast list we download, all have a significant cost in processing time. Furthermore, an awful lot of overlap occurs between these actors and their films, so unless we step in and do something, we’re going to be checking the same film multiple times.

One option is to create a state object to pass into an Aggregate() function. The process is an indefinite loop, so we’d also need to select one of the options for compromising on the functional principles and allowing the loop. The code might look something like this:

public int CalculateBaconNumber(string actor)
{

    var initialState = (
        checkedActors: new Dictionary<string, int>(),
        actorsToCheck: new[] { "Kevin Bacon" },
        baconNumber: 0
    );

    var answer = initialState.IterateUntil(
        x => x.checkedActors.ContainsKey(actor),
        acc => {
            var filmsToCheck =
             acc.actorsToCheck.SelectMany(GetAllActorsFilms);
            var newActorsFound = filmsToCheck.SelectMany(x => x.ActorList)
                .Distinct()
                .ToArray();

            return (
                acc.checkedActors.Concat(acc.actorsToCheck
                    .Where(x => !acc.checkedActors.ContainsKey(x))
                    .Select(x =>
                     new KeyValuePair<string, int>(x, acc.baconNumber)))
                    .ToArray()
                    .ToDictionary(x => x.Key, x => x.Value),

                newActorsFound.SelectMany(GetAllActorsFilms)
                    .SelectMany(x => x.ActorList).ToArray(),

                acc.baconNumber + 1
            );
        });
    return answer.checkedActors[actor];
}
Note

I’ve made up the web API for the sake of this example, so you won’t be able to follow along at home, unless you make your own API first!

This code is fine but could be better. A lot of boilerplate code is concerned with tracking whether an actor has already been checked. For instance, the code contains a lot of uses of Distinct.

With memoization, we get a generic version of the check, like a cache, but it exists within the scope of the calculation we’re performing and doesn’t persist beyond it. If you do want the saved, calculated value to persist between calls to this function, MemoryCache may be a better choice.

We could create a memoized function to get a list of films that the actors listed previously have been in, like this:

var getAllActorsFilms = (String a) => this._filmGetter.GetAllActorsFilms(a);
var getAllFilmsMemoized = getAllActorsFilms.Memoize();

var kb1 = getAllFilmsMemoized("Kevin Bacon");
var kb2 = getAllFilmsMemoized("Kevin Bacon");
var kb3 = getAllFilmsMemoized("Kevin Bacon");
var kb4 = getAllFilmsMemoized("Kevin Bacon");

This calls the same function four times, and by rights it should have gone away to the film data repository and fetched a fresh copy of the data four times. In fact, the function fetched the data only a single time, when kb1 was populated. Every time since then, a copy of the same data was returned.

Note also, by the way, that the memoized version and original version are on separate lines. That’s a limitation of C#. You can’t call an extension method on a function, only on a Func delegate, and the arrow function isn’t a Func until it has been stored in a variable.

Some functional languages have out-of-the-box support for memoization, but F# doesn’t, oddly enough.

This is an updated version of the Bacon number calculation, this time taking advantage of the memoization feature:

public int CalculateBaconNumber2(string actor)
{

    var initialState = (
        checkedActors: new Dictionary<string, int>(),
        actorsToCheck: new[] { "Kevin Bacon" },
        baconNumber: 0
    );

    var getActorsFilms = GetAllActorsFilms();
    var getActorsFilmsMem = getActorsFilms.Memoize();

    var answer = initialState.IterateUntil(
        x => x.checkedActors.ContainsKey(actor),
        acc => {
            var filmsToCheck = acc.actorsToCheck.SelectMany(getActorsFilmsMem);
            var newActorsFound = filmsToCheck.SelectMany(x => x.ActorList)
                .Distinct()
                .ToArray();

            return (
                acc.checkedActors.Concat(acc.actorsToCheck
                        .Where(x => !acc.checkedActors.ContainsKey(x))
                        .Select(x =>
                            new KeyValuePair<string, int>(x, acc.baconNumber)))
                    .ToArray()
                    .ToDictionary(x => x.Key, x => x.Value),

                newActorsFound.SelectMany(getActorsFilmsMem)
                    .SelectMany(x => x.ActorList).ToArray(),

                acc.baconNumber + 1
            );
        });
    return answer.checkedActors[actor];
}

The only difference here is that we’ve created a local version of the call to get film data from a remote resource, then memoized it, and thereafter referenced only the memoized version. Therefore, we’re guaranteed no wasteful repeat requests for data.

Implementing Memoization in C#

Now that you understand some of the basics, this is how you’d make a memoization function for a simple, single parameter function:

public static Func<T1, TOut> Memoize<T1, TOut>(this Func<T1, TOut> @this)
{
    var dict = new Dictionary<T1, TOut>();
    return x =>
    {
        if (!dict.ContainsKey(x))
            dict.Add(x, @this(x));
        return dict[x];
    };
}

This version of memoize expects that the live data function has only a single parameter of any type. To expect more parameters, we’d need further Memoize() extension methods, like this:

public static Func<T1, T2, TOut> Memoize<T1, T2, TOut>(
    this Func<T1, T2, TOut> @this)
{
    var dict = new Dictionary<string, TOut>();
    return (x, y) =>
    {
        var key = $"{x},{y}";
        if (!dict.ContainsKey(key))
            dict.Add(key, @this(x, y));
        return dict[key];
    };
}

Now, to make this work, we assume that ToString() returns something meaningful, meaning that most likely it’ll have to be a primitive type (like a string or int). The ToString() method on a class tends to simply return a description of the type of the class, not its properties.

If we absolutely have to memoize classes as parameters, some creativity is needed. The easiest way to keep it generic is probably to add parameters to the Memoize() function that require the developer to provide a custom ToString() function:

public static Func<T1, TOut> Memoize<T1, TOut>(
 this Func<T1, TOut> @this,
 Func<T1, string> keyGenerator)
{
    var dict = new Dictionary<string, TOut>();
    return x =>
    {
        var key = keyGenerator(x);
        if (!dict.ContainsKey(key))
            dict.Add(key, @this(x));
        return dict[key];
    };
}

public static Func<T1, T2, TOut> Memoize<T1, T2, TOut>(
 this Func<T1, T2, TOut> @this,
 Func<T1, T2, string> keyGenerator)
{
    var dict = new Dictionary<string, TOut>();
    return (x, y) =>
    {
        var key = keyGenerator(x, y);
        if (!dict.ContainsKey(key))
            dict.Add(key, @this(x, y));
        return dict[key];
    };
}

We might call it like this, in that case:

var getCastForFilm((Film x) => this.castRepo.GetCast(x.Filmid);
var getCastForFilmM = getCastForFilm.Memoize(x => x.Id.ToString());

This is possible only if you keep your functions pure. If there are side effects of any kind in your live Func, you might not necessarily get the results you expect (depending on what those side effects are).

Logging that you’ve called the API wouldn’t be done with the cached version, but that’s probably fine. Calling a secondary operation inside the function you’re memoizing that you expect to be called every time is not only likely to be an example of poor coding practice, but also wouldn’t happen when memoizing.

An example would be properties that were expected to be calculated uniquely in every instance of the generated class. If you wanted something like that, you’d probably have to split up the system domains a little more, so that the “get from API” function does literally just that, and another function and/or class handles the conversion of API data to something the rest of your application can understand.

In some cases, you might want the results to persist between calls to the Memoize() extension method. In that case, you’ll also need to provide the Memoize() function with an instance of something like MemoryCache from the outside world.

If you really wanted, you could probably even write a version that uses a database to persist results between instances of the application running. That’s probably defeating the whole purpose of what Memoize() is attempting to achieve, but the choice depends on just how resource intensive your live calls to the function you’re memoizing are. I’ve come across functions that legitimately take half an hour to run; perhaps you might want to memoize and persist those to storage. Alternatively, you might also want to look into a “proper” caching solution. As ever, it’s up to you. This book and I, we’re both nonprescriptive.

Summary

In this chapter, we looked at memoization and how to implement it. This lightweight alternative to caching can be used to drastically reduce the amount of time taken to complete a complex calculation with a lot of repetitive elements.

That’s it for theory now! This isn’t just the end of this chapter, but also of this part of the book.

Part I showed how to use functional ideas and out-of-the-box C# to improve your daily coding. Part II took a deeper dive into the theory behind FP and how to implement it with a little creative hacking about. Part III is a little more philosophical and will give you a few hints as to where to go next with what you’ve learned here with me.

Continue, if you dare, to enter Part III.

1 I wouldn’t say no, though. Anyone know a film director wanting to cast an aging, overweight, British tech-nerd? I probably couldn’t play James Bond, but I’m willing to give it a go!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.175.7