12

Functional Programming for Bioinformatics

Python is a multi-paradigm language that allows you to express computations in several different ways. It is sometimes called an object-oriented (OO) language: you can certainly write code in an OO dialect, but you can also use other styles. Most Python code is actually written in an imperative style: there is no structuring along a class hierarchy, as is typical of the OO paradigm, and most code changes state, meaning that, for example, if you write x = x + 1, you are changing the state of the variable x.

If you write complex code, especially code that requires parallel processing, imperative and OO paradigms will hit some limits. For simple scripts that run on a single machine, imperative and OO styles will do fine, but bioinformatics is a big-data enterprise, and you will often need to scale up and scale out, as there is lots of information to process, and many algorithms are computationally heavy.

Functional programming is quite useful for the complex and parallel problems that are becoming more common within bioinformatics. Many modern architectures for high-throughput data analysis are based on functional programming ideas: for example, the MapReduce paradigm originally developed at Google, and platforms such as Apache Spark. These ideas are also directly applicable to Dask, and even to making sequential code more robust.

Functional programming is based on functions, pure function evaluation, and avoiding mutability. In this chapter, we are going to take a very practical approach to presenting these concepts, with examples of their use cases within typical bioinformatics applications. By the end, you should have a basic conceptual grasp of functional programming, but above all, you will understand its utility and applicability.

If you are using Docker, all the libraries used in this chapter are fundamental for data analysis, so they can all be found in the Docker image tiagoantao/bioinformatics_base.

In this chapter, we will cover the following recipes:

  • Understanding pure functions
  • Understanding immutability
  • Avoiding mutability as a robust development pattern
  • Using lazy programming for pipelining
  • The limits of recursion with Python
  • A showcase of Python’s functools module

Understanding pure functions

A pure function has a couple of important properties: for the same inputs, it produces the same outputs, and it has no side effects (it doesn’t change global variables and doesn’t do I/O). In this recipe, we are going to introduce a few simple examples to make the concept clear. We are mostly – but not exclusively – interested in the second property: the lack of side effects. In later recipes, it will be made clear why pure functions can be quite useful.

We are going to develop a very simple example where we count the genes sequenced per sample. We will have a database in a text file with a count for each gene. For example, we might have sequenced LCT and TP53 in one sample, and LCT, MRAP2, and POMC in another. The total count would then be: TP53: 1, LCT: 2, MRAP2: 1, and POMC: 1. We will be using a CSV file that can be easily read with Pandas or even just the csv module.
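For reference, the database is just a two-column CSV file with a header. A minimal sketch of its contents (the exact genes and counts shown here are illustrative, not the shipped my_genes.csv) might look like this:

    gene,count
    LCT,2
    TP53,1
    MRAP2,1
    POMC,1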

Getting ready

We will be using a simple CSV file as our little database. The code for this recipe can be found in Chapter12/Pure.py. The database file is my_genes.csv and there is a saved original state at my_genes.csv.base in case you need to get back to the original database state.

How to do it...

Take a look at the following steps to get started:

  1. Let’s start by creating two simple functions to load and save the data. We also need a function to restore the original state of the database:

    import shutil
    import pandas as pd

    def restore_db(file_name):
        shutil.copyfile(f'{file_name}.base', file_name)

    def load(file_name):
        df = pd.read_csv(file_name).set_index('gene')
        return dict(df['count'])

    def save(dict_db, file_name):
        pd.Series(dict_db).to_csv(
            file_name, index_label='gene', header=['count'])

We use Pandas to take care of persistence. This is a very simple example; you could also have used just the csv module.
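As a minimal sketch (not part of the recipe’s code, and assuming the gene,count layout described above), here is how the same load and save functionality could be written with only the csv module:

    import csv

    def load_with_csv(file_name):
        # Read gene,count rows into a plain dictionary
        with open(file_name, newline='') as f:
            return {row['gene']: int(row['count'])
                    for row in csv.DictReader(f)}

    def save_with_csv(dict_db, file_name):
        # Write the dictionary back out with the same header
        with open(file_name, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['gene', 'count'])
            writer.writerows(dict_db.items())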

  2. We will consider three alternative functions to report the genes seen in a sample; here is the first:

    def add_sample_csv(gene_list):
        gene_count = load('my_genes.csv')
        for gene in gene_list:
            gene_count[gene] = gene_count.get(gene, 0) + 1
        save(gene_count, 'my_genes.csv')

This function automatically persists the data with a new sample. It has the side effect of reading and writing a file, so it’s not a pure function.

  3. Here is a second alternative that also reports the genes in a sample:

    def add_sample_global_dict(gene_list):
        global gene_count
        for gene in gene_list:
            gene_count[gene] = gene_count.get(gene, 0) + 1

This function uses a global variable from the module and updates it, a side effect. We will not be using this function, but it is provided as a bad example of a function with side effects. We should avoid global variables.

  4. Here is a pure function variant:

    def add_sample_dict(dict_db, gene_list):
        for gene in gene_list:
            dict_db[gene] = dict_db.get(gene, 0) + 1
        return dict_db

This function mutates one of its parameters – dict_db – and as we will see in the next recipe, this is not a best practice from a functional programming perspective. However, it always returns the same result for equal input and has no external side effects such as file I/O, so it is a good first approach.

  5. Imagine that we now run the following:

    add_sample_csv(['MC4R', 'TYR'])

But, if we run it 10 times by mistake, what would be the consequence? We would over-report both genes 9 times, corrupting the content of our database from a logical perspective.

  6. As an alternative, consider the following:

    add_sample_dict(gene_count, ['MC4R', 'TYR'])

If you run it 10 times, what would be the consequence? It would still be an error (as we are changing gene_count), but at least the damage has not yet been committed to disk. This dialect is much more comfortable when doing exploratory data analysis – you can run it knowing that no external data will be damaged. In the next recipe, we are going to see an alternative that makes re-running code even less problematic.

There’s more...

In the next few recipes, we are going to see, with very practical cases, why pure functions can be very useful. But there is one specific case that would be cumbersome to explain with an example and as such, we are going to present the theory here.

Pure functions make parallelizing code much easier. Imagine a framework to execute distributed computing such as Dask or Spark. Such a framework has to deal with the potential case where hardware fails, and code needs to be moved and potentially repeated elsewhere. If your code writes to disk (as an example of a side effect) every time it’s run, it’s much more complicated for distributed frameworks to recover. If your code has no side effects, the distributed framework could repeat your function without any concern for data consistency. Indeed, many distributed frameworks do not allow you to have side effects in your parallelizable code. They also might frown at data structure mutability, something we are going to discuss now.
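As an illustrative sketch (the count_genes helper and the sample lists are hypothetical, not part of the recipe’s code), a pure function can be safely mapped over many samples in parallel with the standard multiprocessing module, precisely because each call depends only on its inputs:

    from collections import Counter
    from multiprocessing import Pool

    def count_genes(gene_list):
        # Pure: same input, same output, no I/O, no shared state
        return Counter(gene_list)

    if __name__ == '__main__':
        samples = [['LCT', 'TP53'], ['LCT', 'MRAP2', 'POMC']]
        with Pool() as pool:
            partial_counts = pool.map(count_genes, samples)
        print(sum(partial_counts, Counter()))  # merge the per-sample counts

If a worker dies, the executor can simply call count_genes again on the same sample without any risk of double counting or corrupted files.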

Understanding immutability

Another common trait of functional programming is that data structures are typically immutable. This is a tough concept to get your head around when you are used to imperative programming: the idea of programming without objects that change state over time. Here, we are going to see a simple example of making a function from the previous recipe work in an immutable way: that is, no existing objects are changed, and if we need to pass on new information, we create new ones.

This recipe will give a short presentation on immutability from a data structure perspective. It will be, in a sense, the standard presentation that you can find in most books. Our main consideration, though, is to discuss mutability as a code design pattern, the topic of the following recipe. But for this, we need to understand immutability first.

We will be looking at two functions: one that mutates data structures and another that doesn’t. This will be done in the context of the example that we followed in the previous recipes of this chapter.

Getting ready

We will be using the same data as in the previous recipe. The code for this recipe can be found in Chapter12/Mutability.py.

How to do it...

Follow these steps:

  1. From the previous recipe, we have a function that mutates the input dictionary:

    def add_sample_dict(dict_db, gene_list):
        for gene in gene_list:
            dict_db[gene] = dict_db.get(gene, 0) + 1

If you call this code with, say, add_sample_dict(gene_count, ['DEPP']), even though nothing is returned, your gene_count dictionary will be mutated by adding 1 to the count of the gene DEPP. If you run this function an excessive number of times – which is typical when doing exploratory data analysis – you will change your dictionary erroneously.

  2. Contrast the previous implementation with the following:

    def add_sample_new_dict(dict_db, gene_list):
        my_dict_db = dict(dict_db)
        for gene in gene_list:
            my_dict_db[gene] = my_dict_db.get(gene, 0) + 1
        return my_dict_db

In this case, we are copying dict_db to a new dictionary, my_dict_db, and adding the extra gene list. There is a memory and time cost of making the copy of the existing dictionary, but the original one, dict_db, is not changed.

  3. If you use this implementation, you are sure that your input parameters are never changed:

    new_gene_count = add_sample_new_dict(gene_count, ['DEPP'])

gene_count is not changed and new_gene_count is a completely new dictionary. You can repeat the execution as many times as you want without worrying about the impact of each execution.

There’s more...

This simple recipe provides an example of a function that doesn’t mutate its parameters. We are now in a situation where we can execute this function as many times as we want – by mistake or on purpose – without any consequences for the rest of the code. This is quite useful when we are testing code or running exploratory data analysis, as we don’t have to be worried about side effects. This also facilitates the work of distributed executors such as Dask or Spark, as they don’t have to worry about checking the state of existing objects: they are fixed.
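If you want Python to actively enforce this discipline, one standard-library option (not used in the recipe’s code, shown here only as a sketch) is to hand out a read-only view of the dictionary with types.MappingProxyType; any attempt to mutate the view raises a TypeError:

    from types import MappingProxyType

    gene_count = {'LCT': 2, 'TP53': 1}
    read_only_view = MappingProxyType(gene_count)

    print(read_only_view['LCT'])   # reading works as usual
    # read_only_view['LCT'] = 3    # raises TypeError: item assignment not supported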

There are also important lessons here for general software design, even if you don’t use distributed executors. That is what we are going to investigate in the following recipe.

Avoiding mutability as a robust development pattern

The previous recipe introduced the concept of immutable data structures. In this recipe, we are going to discuss a design pattern that avoids mutating your persistent database until the very end of your code. In pseudocode, most applications written as a long script work as follows:

Do computation 1
Write computation 1 to disk or database
Do computation 2
Write computation 2 to disk or database
...
Do computation n
Write computation n to disk or database

Here, we are going to present an alternative paradigm and discuss why it is generally better from a resilience point of view:

Do computation 1
Write computation 1 to temporary storage
Do computation 2
Write computation 2 to temporary storage
...
Do computation n
Write computation n to temporary storage
Take all temporary data and write it to definitive disk and database

First, we will show the code for both approaches, and then discuss why, for complex and sophisticated scripts, the latter is a better approach in most cases.

We will use the same example as in the previous recipes: we will report on genes seen in different samples.

Getting ready

We will be using the same data as in the previous recipe. The code for this recipe can be found in Chapter12/Persistence1.py and Chapter12/Persistence2.py.

How to do it...

Let’s set up both solutions along with some shared code.

  1. Both solutions still use the load and save functions:

    import shutil
    import pandas as pd

    def restore_db(file_name):
        shutil.copyfile(f'{file_name}.base', file_name)

    def load(file_name):
        df = pd.read_csv(file_name).set_index('gene')
        return dict(df['count'])

    def save(dict_db, file_name):
        pd.Series(dict_db).to_csv(
            file_name, index_label='gene', header=['count'])

  2. The first alternative, where we save as we go, is the following:

    def add_sample_csv(gene_list):
        gene_count = load('my_genes.csv')
        for gene in gene_list:
            gene_count[gene] = gene_count.get(gene, 0) + 1
        save(gene_count, 'my_genes.csv')

    add_sample_csv(['MC4R', 'TYR'])
    add_sample_csv(['LCT', 'HLA-A'])
    add_sample_csv(['HLA-B', 'HLA-C'])

Every time we add a sample, we update the main database. While this solution is not very efficient from an I/O perspective, that is not the issue at hand here – actually, most design patterns of this kind tend to be more efficient from an I/O perspective than what we are going to look at next.

  3. The second alternative, where we store the definitive data only at the very end, is the following:

    def add_sample_new_dict(dict_db, gene_list):
        my_dict_db = dict(dict_db)  # copy, as in the previous recipe
        for gene in gene_list:
            my_dict_db[gene] = my_dict_db.get(gene, 0) + 1
        return my_dict_db

    gene_count = load('my_genes.csv')
    gene_count = add_sample_new_dict(gene_count, ['MC4R', 'TYR'])
    gene_count = add_sample_new_dict(gene_count, ['LCT', 'HLA-A'])
    gene_count = add_sample_new_dict(gene_count, ['HLA-B', 'HLA-C'])
    save(gene_count, 'my_genes.csv')

In this case, we only update the main database at the very end. We use an in-memory dictionary to maintain the intermediate data.

There’s more...

So, why do this? The solution that delays writing the final data is, if anything, more complex, as we potentially have to store partial results along the way and manage them (which we do here with the gene_count dictionary – a simple example). That is true, but in the real world code has bugs, disk storage fills up, and computers break, so coping with failure is the more important consideration in pragmatic terms.

Imagine that your first solution stops working in the middle of the execution for some reason. The database is in an unknown state, as you don’t know where the code was. You simply cannot restart the code from scratch, as part of it might already have been executed. So, you have to check what the state of the database is, see where the code stopped, and partially run what is missing, or find a way to roll back whatever part was executed. It is both cumbersome and bug-prone.

Now, imagine that there is a problem during the execution of the second approach: you do not have to worry about the intermediate state of the data because it doesn’t affect your final database and it will be completely recomputed in a second execution. You can just re-run the code, without worrying about inconsistent states. The only exception is the very last line, but not only will this be less common but you can also concentrate all your effort on this point to make the process as resilient as possible.

Can this always be done? Or does it only work in some cases? Most code can be structured so that it writes to temporary storage along the way. Also, much of bioinformatics analysis is not real-time transactional, so it is normally easy to update a SQL database only at the very end. You then have a single point of failure that is relevant to your whole code – if there is a problem, you know you only need to concentrate on that very final part.

This approach, where possible, will make the day-to-day maintenance of complex code much easier.
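One way to make that final write itself more resilient (a sketch using only the standard library and Pandas, not code from the book) is to write the definitive file to a temporary path and then atomically replace the old version, so the database file is never left half-written:

    import os
    import tempfile

    import pandas as pd

    def save_atomic(dict_db, file_name):
        # Write the new version to a temporary file in the same directory...
        dir_name = os.path.dirname(os.path.abspath(file_name))
        fd, tmp_name = tempfile.mkstemp(suffix='.tmp', dir=dir_name)
        os.close(fd)
        pd.Series(dict_db).to_csv(
            tmp_name, index_label='gene', header=['count'])
        # ...and then swap it into place in a single atomic step
        os.replace(tmp_name, file_name)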

Using lazy programming for pipelining

Lazy programming is a strategy where we defer computation until it’s really needed. It has many advantages compared with its counterpart, eager programming, where we compute everything as soon as we invoke a computation.

Python provides many mechanisms for lazy programming – indeed, one of the biggest changes from Python 2 to Python 3 is that the language became lazier.

To understand lazy programming, we are going to take our gene database again and do an exercise with it. We are going to check whether we have at least n genes with y reads each (for example, three genes with five reads each). This can serve, say, as a measure of the quality of our database – that is, a measure of whether we have enough genes with a certain number of reads.

We are going to consider two implementations: one lazy and one eager. We will then compare the implications of both approaches.

Getting ready

The code for this recipe can be found in Chapter12/Lazy.py.

How to do it...

Take a look at these two different approaches:

  1. Let’s start with the eager implementation:

    import pandas as pd

    def load(file_name):
        df = pd.read_csv(file_name).set_index('gene')
        return dict(df['count'])

    def get_min_reads(all_data, min_reads):
        return {
            gene: count
            for gene, count in all_data.items()
            if count >= min_reads
        }

    def has_min_observations(subset_data, min_observations):
        return len(subset_data) >= min_observations

    print(has_min_observations(
        get_min_reads(
            load('my_genes.csv'), 4
        ), 3))

Notice that the code loads all the data, keeps all entries with a count of at least 4, and then checks whether at least 3 genes pass that threshold.

  2. Here is an alternative, lazy version:

    def get_rec(file_name):
        with open(file_name) as f:
            f.readline()  # header
            for line in f:
                toks = line.strip().split(',')
                yield toks[0], int(toks[1])

    def gene_min_reads(source, min_reads):
        for gene, count in source:
            if count >= min_reads:
                yield gene

    def gene_min_observations(subset_source, min_observations):
        my_observations = 0
        for gene in subset_source:
            my_observations += 1
            if my_observations == min_observations:
                return True
        return False

    print(gene_min_observations(
        gene_min_reads(
            get_rec('my_genes.csv'), 4
        ), 2))

This code behaves in a completely different way. First, get_rec and gene_min_reads are generators (note the yield), so they will only return results when required, one by one. gene_min_observations will return as soon as it has seen the necessary number of observations. This means that the only data that will be read and processed is the bare minimum to arrive at a result. In a worst-case scenario, this is still the whole file, but in many cases, it can be much less.
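To see the laziness in action, here is a quick illustrative experiment (the CountingSource wrapper is a hypothetical helper, not part of the recipe’s code) that counts how many records are actually pulled from the file before the pipeline stops:

    class CountingSource:
        # Wrap an iterable of records and count how many are pulled from it
        def __init__(self, source):
            self.source = source
            self.pulled = 0

        def __iter__(self):
            for rec in self.source:
                self.pulled += 1
                yield rec

    src = CountingSource(get_rec('my_genes.csv'))
    print(gene_min_observations(gene_min_reads(src, 4), 2))
    print(f'records actually read: {src.pulled}')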

There’s more...

The advantages of this approach can be seen with very large files. While there are no perceivable advantages with our toy file, if the file is too large to fit in memory, the first approach would simply crash! Also, every step of the pipeline will require enough memory to store all the data that it consumes, and all of the steps will need to be in-memory at the same time. So, even for medium-sized files, memory problems can occur with the eager version that would be avoided with the lazy version.

Many times, lazy code also does substantially less processing: as in the previous example, it may stop computation before seeing all the data. This is not always assured but happens with many use cases.

As a historical note, Python 2 was substantially more eager than Python 3. It is actually one of the main changes from Python 2 to Python 3. As a trivial example, consider the behavior of range in Python 2 versus 3.
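A minimal sketch of that difference: in Python 2, range(10**9) eagerly built a full list (and could exhaust memory), whereas in Python 3 it returns a lazy object that produces values only on demand:

    big = range(10**9)       # Python 3: instantaneous, no billion-element list is built
    print(big[500_000_000])  # ranges support indexing without materializing values
    print(500 in big)        # membership tests are computed, not searched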

Part of our lazy implementation can be written in a more succinct way by using some Python built-in modules – we will revisit this in the final recipe.

The limits of recursion with Python

Recursion – a function calling itself – is a fundamental concept in functional programming. Many languages that cater to functional programming are able to recurse ad infinitum. Python is not one of those languages. You can use recursion in Python, but you have to be aware of its limits, namely that recursion incurs quite a high memory cost and that it cannot be used as a general replacement for iteration – a replacement that is typical when writing functional programs.

To see the limits of recursion with Python, we are going to implement the Fibonacci sequence in several ways. As a reminder, Fibonacci can be defined as follows:

Fib(0) = 0
Fib(1) = 1
Fib(n) = Fib(n - 1) + Fib(n - 2)

We are also computing the factorial function, but in this case only in a recursive way:

Fact(1) = 1
Fact(n) = n * Fact(n - 1)

Getting ready

Our code is available in Chapter12/Recursion.py.

How to do it...

Follow these steps to get started:

  1. Let’s start with an iterative version of Fibonacci:

    def fibo_iter(n):
        if n < 2:
            return n
        last = 1
        second_last = 0
        for _i in range(2, n + 1):
            result = second_last + last
            second_last = last
            last = result
        return result

This version is perfectly sufficient and also quite efficient. We are not suggesting that the iterative version is in any way lesser than the recursive version. On the contrary, with Python, it will probably perform better.

  2. Here is a recursive version:

    def fibo_naive(n):
        if n == 0:
            return 0
        if n == 1:
            return 1
        return fibo_naive(n - 1) + fibo_naive(n - 2)

If you run it, you will see that the performance suffers quite substantially when compared with the iterative version. Furthermore, if you ask for fibo_naive(1000) or a similarly large number, the code will not complete in any reasonable time, as the number of recursive calls grows exponentially with n, whereas the iterative version will perform adequately. In the following recipe, we will actually fix part of this. But for now, let’s dig deeper into the issues with recursion and Python.

  3. To make the case quite obvious, let’s implement a factorial function, which is about as simple as a recursive implementation can get (even simpler than Fibonacci):

    def factorial(n):
        if n == 1:
            return 1
        return n * factorial(n - 1)

If you run this function with a small number, such as factorial(5), you will get the correct answer, 120. But if you try a large number, such as factorial(20000), you will get the following:

Traceback (most recent call last):
  File "Recursion.py", line 8, in <module>
    factorial(20000)
  File "Recursion.py", line 4, in factorial
    return n * factorial(n - 1)
  File "Recursion.py", line 4, in factorial
    return n * factorial(n - 1)
  File "Recursion.py", line 4, in factorial
    return n * factorial(n - 1)
  [Previous line repeated 995 more times]
  File "Recursion.py", line 2, in factorial
    if n == 1:
RecursionError: maximum recursion depth exceeded in comparison

This error shows that Python can only recurse up to a certain depth (see sys.setrecursionlimit to change it). While you can raise the recursion limit, there will always be a bound imposed by the memory necessary to maintain the call stack. There are also alternative ways to implement recursive algorithms explicitly, but at the cost of speed. Many languages implement a feature called Tail Call Optimization (TCO), which allows for infinite levels of recursion with high performance, but Python doesn’t implement it.
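As a minimal sketch (the limit of 25000 is just an illustrative value), you can raise the limit enough for this particular call, or rewrite the recursion as a loop with an accumulator, which is essentially what a TCO-capable language would do for you:

    import sys

    sys.setrecursionlimit(25000)   # enough Python frames for factorial(20000);
    factorial(20000)               # may still exhaust the C stack on some platforms

    def factorial_iter(n):
        # Equivalent iterative version: constant stack depth
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    factorial_iter(20000)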

There’s more...

You can use recursion in Python for simple cases that do not recurse very deeply, but in Python, recursion is not a general substitute for iterative solutions. Recursion is probably the least well-supported major feature from the functional world in Python.

A showcase of Python’s functools module

Python has three built-in modules that greatly help when writing code in a functional dialect: functools, operator, and itertools. In this recipe, we are going to briefly discuss the functools module. For example, the fundamental reduce function (where part of the name of MapReduce comes from) is only available if you import functools.

While a detailed exploration of these modules would be too long for a single recipe, we are going to showcase some of their functionality by improving code from the previous recipes with functools and by presenting a few illustrative examples of the module’s utility.

Getting ready

Our code is available in Chapter12/Builtin.py. We will make references to previous recipes.

How to do it...

Let’s look at several illustrative examples:

  1. Remember that our recursive implementation of the Fibonacci function in the previous recipe was not very efficient? Let’s mitigate many of its problems in a very simple way:

    import functools

    @functools.cache
    def fibo(n):
        if n == 0:
            return 0
        if n == 1:
            return 1
        return fibo(n - 1) + fibo(n - 2)

If you remember the recursive code in the previous recipe, it was very slow, even for fairly small numbers – fibo_naive(1000) would not complete in any reasonable time. This function, just by adding the @functools.cache decorator, can process much larger numbers far more quickly. This is due to the cache decorator implementing memoization.

What is memoization?

Memoization is the process through which the computer caches the results of execution and, the next time it is called, instead of computing it again, just returns the stored result. For example, the first time you call fibo(10), to get the result, the function is actually executed, but the second time you call it, the result was cached on the first run and returned without execution. Memoization only works correctly if the function always returns the same output for an equal input and has no side effects. That is, memoization only works with pure functions.
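A hand-rolled version of the same idea (shown only for illustration; in practice you would use functools.cache, or functools.lru_cache on Python versions before 3.9) makes the mechanism explicit:

    def memoize(func):
        cache = {}
        def wrapper(n):
            # Return the stored result if this input was seen before
            if n not in cache:
                cache[n] = func(n)
            return cache[n]
        return wrapper

    @memoize
    def fibo_memo(n):
        if n < 2:
            return n
        return fibo_memo(n - 1) + fibo_memo(n - 2)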

  2. For another example of an alternative to the existing code using a functional approach, take the function from the Using lazy programming for pipelining recipe:

    def gene_min_reads(source, min_reads):
        for gene, count in source:
            if count >= min_reads:
                yield gene

There is already some functional flavor here, as we are using a generator.

  3. This function can be written in a more functional dialect:

    def gene_min_reads(source, min_reads):
        return map(
            lambda x: x[0],
            filter(lambda x: x[1] >= min_reads,
                   source))

There is a lot to unpack here. First, look at the built-in filter function: it applies the function given as its first parameter to each object of the iterator given as its second parameter. In our case, objects whose second element is greater than or equal to min_reads are preserved. Then, the map function takes each remaining object (which is of the type ('GENE', count)) and returns only its first part, the gene name. map and filter are very common in functional programming dialects. Also note the quintessential functional concept of the anonymous function, the lambda: a function definition that is only used in a single place, quite useful for very small functions. In this case, there is no strong reason to rewrite the code in this fashion (especially because the generator in the previous definition already provides the most important feature of a functional dialect for our problem), but you will find many cases where this type of representation is more succinct and expressive.
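Since the introduction to this recipe mentioned functools.reduce, here is a short sketch (not part of the recipe’s code; the sample lists are illustrative) that uses it to fold per-sample gene lists into the same kind of count dictionary we have been using:

    import functools

    def add_sample(dict_db, gene_list):
        # Pure update: returns a new dictionary, as in the immutability recipe
        new_db = dict(dict_db)
        for gene in gene_list:
            new_db[gene] = new_db.get(gene, 0) + 1
        return new_db

    samples = [['MC4R', 'TYR'], ['LCT', 'HLA-A'], ['HLA-B', 'HLA-C']]
    gene_count = functools.reduce(add_sample, samples, {})
    print(gene_count)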

  4. Another important concept that you will probably need when using distributed systems is partial function application – here is a simple example using the most basic of arithmetic functions:

    def multiplication(x, y):
        return x * y

    double = functools.partial(multiplication, 2)
    double(3)

double is defined by partially resolving one of the arguments of multiplication – hence the term partial function application. This is not only a stylistic issue: sometimes, on distributed platforms (or even just with the map function of multiprocessing pools), you are limited to supplying a single parameter, so partial function application becomes a necessity.
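Here is a sketch of that last point (the count_above helper and the threshold are hypothetical, not from the book’s code): Pool.map passes exactly one argument per call, so functools.partial is used to fix the threshold beforehand:

    import functools
    from multiprocessing import Pool

    def count_above(min_reads, counts):
        # How many counts in this chunk reach the threshold?
        return sum(1 for c in counts if c >= min_reads)

    if __name__ == '__main__':
        chunks = [[1, 5, 7], [2, 2, 9], [4, 4, 4]]
        count_above_4 = functools.partial(count_above, 4)
        with Pool() as pool:
            print(pool.map(count_above_4, chunks))  # one argument per call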

There’s more...

There are many more interesting applications in the functools module to check out. The itertools and operator modules also have many functions that are idiomatic from a functional perspective.

See also...

There are several functional programming libraries for Python – for example, see the following:
