Python is a multi-paradigm language that allows you to express computations in several different ways. It is sometimes called an object-oriented (OO) language: you can certainly write code in an OO dialect, but you can also use other styles. Most code in Python is written in an imperative style: there is no structuring along a class hierarchy, as is typical of the OO paradigm, and most code changes state – for example, if you write x = x + 1, you are changing the state of the variable x.
If you write complex code, especially code that requires parallel processing, imperative and OO paradigms will hit some limits. For simple scripts that run on a single machine, imperative and OO styles will do fine, but bioinformatics is a big-data enterprise, and you will often need to scale up and scale out, as there is lots of information to process, and many algorithms are computationally heavy.
Functional programming is quite useful for the complex and parallel problems that are becoming more common within bioinformatics. Many modern architectures for high-throughput data analysis are based on functional programming ideas – for example, the MapReduce paradigm, originally developed at Google, and platforms such as Apache Spark. These ideas are also directly applicable to Dask, and even to making sequential code more robust.
Functional programming is based on functions, pure function evaluation, and avoiding mutability. In this chapter, we are going to take a very practical approach to presenting these concepts, with examples of their use cases within typical bioinformatics applications. By the end, you should have a basic conceptual grasp of functional programming, but above all, you will understand its utility and applicability.
If you are using Docker, and because all these libraries are fundamental for data analysis, they all can be found in the Docker image tiagoantao/bioinformatics_base.
In this chapter, we will cover the following recipes:
A pure function has a couple of important properties: for the same inputs, it produces the same outputs, and it has no side effects (it doesn’t change global variables and doesn’t do I/O). In this recipe, we are going to introduce a few simple examples to make the concept clear. We are mostly – but not exclusively – interested in the second property: the lack of side effects. In later recipes, it will be made clear why pure functions can be quite useful.
We are going to develop a very simple example where we count the genes sequenced per sample. Our database will be a text file with counts per gene. For example, we might have sequenced LCT and TP53 in one sample, and LCT, MRAP2, and POMC in another. The total counts would then be: TP53: 1, LCT: 2, MRAP2: 1, and POMC: 1. We will be using a CSV file that can be easily read with Pandas or even just the csv module.
We will be using a simple CSV file as our little database. The code for this recipe can be found in Chapter12/Pure.py. The database file is my_genes.csv and there is a saved original state at my_genes.csv.base in case you need to get back to the original database state.
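For reference, the database file might look like this (a hypothetical my_genes.csv consistent with the counts described above):

```
gene,count
LCT,2
TP53,1
MRAP2,1
POMC,1
```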
Take a look at the following steps to get started:
import shutil
import pandas as pd
def restore_db(file_name):
    shutil.copyfile(f'{file_name}.base', file_name)
def load(file_name):
    df = pd.read_csv(file_name).set_index('gene')
    return dict(df['count'])
def save(dict_db, file_name):
    pd.Series(dict_db).to_csv(
        file_name, index_label='gene', header=['count'])
We use Pandas to take care of persistence. This is a very simple example; you could also have used just the csv module.
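As a sketch of that alternative, load and save could be written with just the csv module from the standard library (the function names load_csv and save_csv are hypothetical, but they read and write the same file format):

```python
import csv

def load_csv(file_name):
    # Read gene,count rows into a dictionary using only the csv module
    with open(file_name) as f:
        reader = csv.DictReader(f)
        return {row['gene']: int(row['count']) for row in reader}

def save_csv(dict_db, file_name):
    # Write the dictionary back out with a gene,count header
    with open(file_name, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['gene', 'count'])
        for gene, count in dict_db.items():
            writer.writerow([gene, count])
```

The Pandas version is more succinct, but the csv version avoids a heavyweight dependency for such a small task.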
def add_sample_csv(gene_list):
    gene_count = load('my_genes.csv')
    for gene in gene_list:
        gene_count[gene] = gene_count.get(gene, 0) + 1
    save(gene_count, 'my_genes.csv')
This function automatically persists the data with a new sample. It has the side effect of reading and writing a file, so it’s not a pure function.
def add_sample_global_dict(gene_list):
    global gene_count
    for gene in gene_list:
        gene_count[gene] = gene_count.get(gene, 0) + 1
This function uses a global variable from the module and updates it, a side effect. We will not be using this function, but it is provided as a bad example of a function with side effects. We should avoid global variables.
def add_sample_dict(dict_db, gene_list):
    for gene in gene_list:
        dict_db[gene] = dict_db.get(gene, 0) + 1
    return dict_db
This function mutates one of its parameters – dict_db – and as we will see in the next recipe, this is not a best practice from a functional programming perspective. However, it returns the same result for the same input and performs no I/O, so it will be a good first approach.
add_sample_csv(['MC4R', 'TYR'])
But if we run it 10 times by mistake, what would be the consequence? We would over-count both genes nine extra times, corrupting the content of our database from a logical perspective.
If you run it 10 times, what would be the consequence? It would still be an error (as we are changing gene_count), but at least nothing has been committed to disk. This dialect is much more comfortable when doing exploratory data analysis – you can run it knowing that no external data will be damaged. In the next recipe, we are going to see an alternative that makes re-running code less problematic.
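To make the consequence concrete, here is a self-contained sketch of re-running the in-memory version ten times (add_sample_dict is redefined here with the .get call spelled out, and the starting counts are hypothetical):

```python
def add_sample_dict(dict_db, gene_list):
    # Mutates its parameter, but performs no I/O
    for gene in gene_list:
        dict_db[gene] = dict_db.get(gene, 0) + 1
    return dict_db

gene_count = {'LCT': 2, 'TP53': 1}
for _ in range(10):  # running it 10 times "by mistake"
    add_sample_dict(gene_count, ['MC4R', 'TYR'])

# The in-memory counts are now wrong (MC4R and TYR are at 10),
# but my_genes.csv on disk is untouched and can be reloaded.
```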
In the next few recipes, we are going to see, with very practical cases, why pure functions can be very useful. But there is one specific case that would be cumbersome to explain with an example and as such, we are going to present the theory here.
Pure functions make parallelizing code much easier. Imagine a framework to execute distributed computing such as Dask or Spark. Such a framework has to deal with the potential case where hardware fails and code needs to be moved and potentially repeated elsewhere. If your code writes to disk (as an example of a side effect) every time it's run, it's much more complicated for distributed frameworks to recover. If your code has no side effects, the distributed framework can repeat your function without any concern for data consistency. Indeed, many distributed frameworks do not allow you to have side effects in your parallelizable code. They also tend to frown upon mutable data structures, something we are going to discuss next.
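As a toy illustration of why this matters, a pure function can be handed to a process pool and safely re-run on any worker. This is a sketch using the standard library's multiprocessing module rather than Dask or Spark; count_genes is a hypothetical per-sample counter:

```python
from multiprocessing import Pool

def count_genes(gene_list):
    # Pure: no I/O, no global state - safe to retry on any worker
    counts = {}
    for gene in gene_list:
        counts[gene] = counts.get(gene, 0) + 1
    return counts

if __name__ == '__main__':
    samples = [['LCT', 'TP53'], ['LCT', 'MRAP2', 'POMC']]
    with Pool(2) as pool:
        per_sample = pool.map(count_genes, samples)
    # If a worker had failed, the pool could simply re-run
    # count_genes elsewhere with no cleanup or rollback needed.
    print(per_sample)
```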
Another common trait of functional programming is that data structures are typically immutable. This is a tough concept to get around when you are used to imperative programming – the idea of programming without objects that change state over time. Here, we are going to see a simple example of making a function from the previous recipe work in an immutable way: that is, so that no objects are changed, and if we need to pass new information, we create new ones.
This recipe will give a short presentation on immutability from a data structure perspective. It will be, in a sense, the standard presentation that you can find in most books. Our main consideration, though, is to discuss mutability as a code design pattern, the topic of the following recipe. But for this, we need to understand immutability first.
We will be looking at two functions: one that mutates data structures and another that doesn’t. This will be done in the context of the example that we followed in the previous recipes of this chapter.
We will be using the same data as in the previous recipe. The code for this recipe can be found in Chapter12/Mutability.py.
Follow these steps:
def add_sample_dict(dict_db, gene_list):
    for gene in gene_list:
        dict_db[gene] = dict_db.get(gene, 0) + 1
If you call this code with, say, add_sample_dict(gene_count, ['DEPP']), then even though the function returns nothing, your gene_count dictionary will be mutated to add 1 to the count of the gene DEPP. If you run this function an excessive number of times – which is typical when doing exploratory data analysis – you will change your dictionary erroneously.
def add_sample_new_dict(dict_db, gene_list):
    my_dict_db = dict(dict_db)
    for gene in gene_list:
        my_dict_db[gene] = my_dict_db.get(gene, 0) + 1
    return my_dict_db
In this case, we are copying dict_db to a new dictionary, my_dict_db, and adding the extra gene list. There is a memory and time cost of making the copy of the existing dictionary, but the original one, dict_db, is not changed.
new_gene_count = add_sample_new_dict(gene_count, ['DEPP'])
gene_count is not changed and new_gene_count is a completely new dictionary. You can repeat the execution as many times as you want without worrying about the impact of each run.
This simple recipe provides an example of a function that doesn’t mutate its parameters. We are now in a situation where we can execute this function as many times as we want – by mistake or on purpose – without any consequences for the rest of the code. This is quite useful when we are testing code or running exploratory data analysis, as we don’t have to be worried about side effects. This also facilitates the work of distributed executors such as Dask or Spark, as they don’t have to worry about checking the state of existing objects: they are fixed.
There are also important lessons here for general software design, even if you don’t use distributed executors. That is what we are going to investigate in the following recipe.
The previous recipe introduced the concept of immutable data structures. In this recipe, we are going to discuss a design pattern that avoids mutating the persistent database until the very end. In terms of pseudocode, most long analysis scripts work as follows:
Do computation 1
Write computation 1 to disk or database
Do computation 2
Write computation 2 to disk or database
...
Do computation n
Write computation n to disk or database
Here, we are going to present an alternative paradigm and discuss why it is generally better from a resilience point of view:
Do computation 1
Write computation 1 to temporary storage
Do computation 2
Write computation 2 to temporary storage
...
Do computation n
Write computation n to temporary storage
Take all temporary data and write it to the definitive disk and database
First, we will show the code for both approaches, and then discuss why, for complex and sophisticated scripts, the latter is a better approach in most cases.
We will use the same example as in the previous recipes: we will report on genes seen in different samples.
We will be using the same data as in the previous recipe. The code for this recipe can be found in Chapter12/Persistence1.py and Chapter12/Persistence2.py.
Let’s set up both solutions along with some shared code.
import shutil

import pandas as pd
def restore_db(file_name):
    shutil.copyfile(f'{file_name}.base', file_name)
def load(file_name):
    df = pd.read_csv(file_name).set_index('gene')
    return dict(df['count'])
def save(dict_db, file_name):
    pd.Series(dict_db).to_csv(
        file_name, index_label='gene', header=['count'])
def add_sample_csv(gene_list):
    gene_count = load('my_genes.csv')
    for gene in gene_list:
        gene_count[gene] = gene_count.get(gene, 0) + 1
    save(gene_count, 'my_genes.csv')
add_sample_csv(['MC4R', 'TYR'])
add_sample_csv(['LCT', 'HLA-A'])
add_sample_csv(['HLA-B', 'HLA-C'])
Every time we add a sample, we update the main database. While this solution is not very efficient from an I/O perspective, that is not the issue at hand here – actually, most design patterns of this kind tend to be more efficient from an I/O perspective than what we are going to look at next.
def add_sample_new_dict(dict_db, gene_list):
    my_dict_db = dict(dict_db)  # see the previous recipe
    for gene in gene_list:
        my_dict_db[gene] = my_dict_db.get(gene, 0) + 1
    return my_dict_db
gene_count = load('my_genes.csv')
gene_count = add_sample_new_dict(gene_count, ['MC4R', 'TYR'])
gene_count = add_sample_new_dict(gene_count, ['LCT', 'HLA-A'])
gene_count = add_sample_new_dict(gene_count, ['HLA-B', 'HLA-C'])
save(gene_count, 'my_genes.csv')
In this case, we only update the main database at the very end. We use an in-memory dictionary to maintain the intermediate data.
So, why do this? The solution that delays writing the final data is, if anything, more complex, as we have to store and manage partial results along the way (which we do here with the gene_count dictionary – a simple example). That is true, but in the real world code has bugs, disk storage fills up, and computers break, so coping with failure is the more important consideration in pragmatic terms.
Imagine that your first solution stops working in the middle of the execution for some reason. The database is in an unknown state, as you don’t know where the code was. You simply cannot restart the code from scratch, as part of it might already have been executed. So, you have to check what the state of the database is, see where the code stopped, and partially run what is missing, or find a way to roll back whatever part was executed. It is both cumbersome and bug-prone.
Now, imagine that there is a problem during the execution of the second approach: you do not have to worry about the intermediate state of the data, because it doesn't affect your final database and will be completely recomputed on a second execution. You can just re-run the code without worrying about inconsistent states. The only exception is the very last line; not only will problems there be less common, but you can also concentrate all your effort on that single point to make the process as resilient as possible.
Can this always be done? Or does it only work in some cases? Most code can be structured so that it writes to temporary files along the way. Also, much of bioinformatics analysis is not real-time transactional, so it is normally easy to update a SQL database only at the very end. You then have a single point of failure that matters for your whole pipeline – if there is a problem, you know you only need to concentrate on that final part.
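One common way to harden that single final step is to write the new database to a temporary file and then atomically rename it over the old one. This is a sketch with a hypothetical save_atomic helper; os.replace is atomic when source and destination are on the same filesystem:

```python
import os

def save_atomic(dict_db, file_name):
    # Write everything to a temporary file first...
    tmp_name = file_name + '.tmp'
    with open(tmp_name, 'w') as f:
        f.write('gene,count\n')
        for gene, count in dict_db.items():
            f.write(f'{gene},{count}\n')
    # ...then swap it in atomically: a crash before this line
    # leaves the old database intact; a crash after leaves the
    # new one fully written. No half-written state is possible.
    os.replace(tmp_name, file_name)
```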
This approach, where possible, will make the day-to-day maintenance of complex code much easier.
Lazy programming is a strategy where we defer computation until it’s really needed. It has many advantages compared with its counterpart, eager programming, where we compute everything as soon as we invoke a computation.
Python provides many mechanisms for lazy programming – indeed, one of the biggest changes from Python 2 to Python 3 is that the language became lazier.
To understand lazy programming, we are going to take our gene database again and do an exercise with it. We are going to check whether we have at least n genes with y reads each (for example, three genes with five reads each). This can serve, say, as a measure of the quality of our database – that is, a measure of whether we have enough genes with a certain number of reads.
We are going to consider two implementations: one lazy and one eager. We will then compare the implications of both approaches.
The code for this recipe can be found in Chapter12/Lazy.py.
Take a look at these two different approaches:
import pandas as pd
def load(file_name):
    df = pd.read_csv(file_name).set_index('gene')
    return dict(df['count'])
def get_min_reads(all_data, min_reads):
    return {
        gene: count
        for gene, count in all_data.items()
        if count >= min_reads
    }
def has_min_observations(subset_data, min_observations):
    return len(subset_data) >= min_observations
print(has_min_observations(
    get_min_reads(
        load('my_genes.csv'), 4
    ), 3))
Notice that this code loads all the data into memory, keeps all entries with a count of at least 4, and then checks whether at least 3 genes remain.
def get_rec(file_name):
    with open(file_name) as f:
        f.readline()  # skip the header
        for line in f:
            toks = line.strip().split(',')
            yield toks[0], int(toks[1])
def gene_min_reads(source, min_reads):
    for gene, count in source:
        if count >= min_reads:
            yield gene
def gene_min_observations(subset_source, min_observations):
    my_observations = 0
    for gene in subset_source:
        my_observations += 1
        if my_observations == min_observations:
            return True
    return False
print(gene_min_observations(
    gene_min_reads(
        get_rec('my_genes.csv'), 4
    ), 2))
This code behaves in a completely different way. First, get_rec and gene_min_reads are generators (note the yield), so they will only return results when required, one by one. gene_min_observations will return as soon as it has seen the necessary number of observations. This means that the only data that will be read and processed is the bare minimum to arrive at a result. In a worst-case scenario, this is still the whole file, but in many cases, it can be much less.
The advantages of this approach can be seen with very large files. While there are no perceivable advantages with our toy file, if the file is too large to fit in memory, the first approach would simply crash! Also, every step of the pipeline will require enough memory to store all the data that it consumes, and all of the steps will need to be in-memory at the same time. So, even for medium-sized files, memory problems can occur with the eager version that would be avoided with the lazy version.
Many times, lazy code also does substantially less processing: as in the previous example, it may stop computation before seeing all the data. This is not always assured but happens with many use cases.
As a historical note, Python 2 was substantially more eager than Python 3. It is actually one of the main changes from Python 2 to Python 3. As a trivial example, consider the behavior of range in Python 2 versus 3.
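For instance, range in Python 3 is a lazy sequence object, whereas Python 2's range eagerly built a full list (a billion-element list, as below, would exhaust the memory of many machines):

```python
# Instantaneous and near-zero memory in Python 3
big = range(10 ** 9)
print(big[123_456_789])  # indexing is computed; no list is built
print(10 ** 8 in big)    # membership is simple arithmetic for ranges

# map and filter are lazy too: nothing runs until values are pulled
squares = map(lambda x: x * x, big)
print(next(squares))     # only the first square is ever computed
```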
Part of our lazy implementation can be written in a more succinct way by using some Python built-in modules – we will revisit this in the final recipe.
Recursion – a function calling itself – is a fundamental concept in functional programming. Many languages that cater to functional programming can recurse ad infinitum. Python is not one of those languages. You can use recursion in Python, but you have to be aware of its limits: recursion incurs substantial memory costs, and in Python it cannot serve as a general replacement for iteration – a typical pattern when writing functional programs.
To see the limits of recursion with Python, we are going to implement the Fibonacci sequence in several ways. As a reminder, Fibonacci can be defined as follows:
Fib(0) = 0
Fib(1) = 1
Fib(n) = Fib(n - 1) + Fib(n - 2)
We are also computing the factorial function, but in this case only in a recursive way:
Fact(1) = 1
Fact(n) = n * Fact(n - 1)
Our code is available in Chapter12/Recursion.py.
Follow these steps to get started:
def fibo_iter(n):
    if n < 2:
        return n
    last = 1
    second_last = 0
    for _i in range(2, n + 1):
        result = second_last + last
        second_last = last
        last = result
    return result
This version is perfectly sufficient and also quite efficient. We are not suggesting that the iterative version is in any way lesser than the recursive version. On the contrary, with Python, it will probably perform better.
def fibo_naive(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibo_naive(n - 1) + fibo_naive(n - 2)
If you run it, you will see that performance suffers quite substantially compared with the iterative version. Furthermore, if you ask for fibo_naive(1000) or a similarly large number, the code will not perform well at all (it can take many hours), whereas the iterative version will perform adequately. In the following recipe, we will actually fix part of this. But for now, let's dig deeper into the issues with recursion and Python.
def factorial(n):
    if n == 1:
        return 1
    return n * factorial(n - 1)
If you run this function with a small number, such as factorial(5), you will get the correct answer, 120. But if you try a large number, such as factorial(20000), you will get the following:
Traceback (most recent call last):
File "Recursion.py", line 8, in <module>
factorial(20000)
File "Recursion.py", line 4, in factorial
return n * factorial(n - 1)
File "Recursion.py", line 4, in factorial
return n * factorial(n - 1)
File "Recursion.py", line 4, in factorial
return n * factorial(n - 1)
[Previous line repeated 995 more times]
File "Recursion.py", line 2, in factorial
if n == 1:
RecursionError: maximum recursion depth exceeded in comparison
This error shows that there is a limit to how deep you can recurse in Python (see sys.setrecursionlimit to change it). While you can raise the limit, there will always be a ceiling bound by the memory needed to maintain the call stack. There are alternative ways to implement recursion explicitly, but at the cost of speed. Many languages implement a feature called Tail Call Optimization (TCO), which allows infinite levels of recursion with high performance, but Python does not implement it.
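As a sketch of one such alternative, a recursion like factorial can be rewritten as a plain loop, removing the depth limit entirely (factorial_iter is our hypothetical name):

```python
def factorial_iter(n):
    # Same computation as the recursive factorial, but the depth
    # is no longer bound by Python's recursion limit
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

# factorial_iter(20000) completes without a RecursionError
```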
You can use recursion in Python for simple cases without many recurring calls, but in Python, recursion is not a general substitute for iterative solutions. Recursion is probably the least well-implemented major feature from the functional world in Python.
Python has three built-in modules that greatly help when writing code in a functional dialect: functools, operator, and itertools. In this recipe, we are going to briefly discuss the functools module. For example, the fundamental reduce function (where part of the name of MapReduce comes from) is only available if you import functools.
While a detailed exploration of these modules would be too long for a single recipe, we are going to showcase some functionality by improving some of the code of the previous recipes with the functionality from functools and showcase some illustrative examples of the utility of the module.
Our code is available in Chapter12/Builtin.py. We will make references to previous recipes.
Let’s look at several illustrative examples:
import functools
@functools.cache
def fibo(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibo(n - 1) + fibo(n - 2)
If you remember the recursive code in the previous recipe, it was very slow, even for small numbers. Doing fibo_naive(1000) could take hours. This function, just by adding the @functools.cache decorator, can process much higher numbers far more quickly. This is due to the cache decorator implementing memoization.
What is memoization?
Memoization is the process through which the computer caches the results of execution and, the next time it is called, instead of computing it again, just returns the stored result. For example, the first time you call fibo(10), to get the result, the function is actually executed, but the second time you call it, the result was cached on the first run and returned without execution. Memoization only works correctly if the function always returns the same output for an equal input and has no side effects. That is, memoization only works with pure functions.
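A hand-rolled version makes the mechanism explicit. This is a minimal sketch of what @functools.cache does for us, ignoring details such as keyword arguments and thread safety (memoize and fibo_manual are our hypothetical names):

```python
def memoize(func):
    cache = {}  # maps argument -> previously computed result
    def wrapper(n):
        if n not in cache:
            cache[n] = func(n)  # compute only on the first call
        return cache[n]
    return wrapper

@memoize
def fibo_manual(n):
    if n < 2:
        return n
    return fibo_manual(n - 1) + fibo_manual(n - 2)
```

Because the recursive calls go through the memoized wrapper, each Fibonacci number is computed exactly once, turning an exponential algorithm into a linear one.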
def gene_min_reads(source, min_reads):
    for gene, count in source:
        if count >= min_reads:
            yield gene
There is already some functional flavor here, as we are using a generator.
def gene_min_reads(source, min_reads):
    return map(
        lambda x: x[0],
        filter(lambda x: x[1] >= min_reads,
               source.items()))
There is a lot to unpack here. First, look at the built-in filter function: it applies the function given as its first parameter to each object of the iterator given as its second parameter. In our case, objects whose second element is greater than or equal to min_reads are preserved. Then, the map function takes each object (of the form ('GENE', count)) and returns only the first part, the gene name. map and filter are very common in functional programming dialects. Also note the quintessential functional concept of the anonymous function, the lambda: a function definition that is used in a single place – quite useful for very small functions. In this case, there is no strong reason to rewrite in this fashion (especially because the generator version already provides the most important lazy behavior), but you will find many cases where this style is more succinct and expressive.
def multiplication(x, y):
    return x * y

double = functools.partial(multiplication, 2)
double(3)
double is defined as partially resolving one of the arguments of multiplication – hence, we use the term partial function application. This is not only a stylistic issue, as sometimes in distributed platforms (or even just with the map function of multiprocessing pools), you are limited to only supplying a single parameter – hence, a partial function application becomes a necessity.
There are many more interesting applications in the functools module to check out. The itertools and operator modules also have many functions that are idiomatic from a functional perspective.
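To close the loop on the reduce function mentioned at the start of this recipe, here is how it combines with the operator module to total all gene counts (the counts here are hypothetical):

```python
import functools
import operator

gene_count = {'LCT': 2, 'TP53': 1, 'MRAP2': 1, 'POMC': 1}

# reduce folds an iterable into a single value; operator.add
# stands in for lambda x, y: x + y
total = functools.reduce(operator.add, gene_count.values(), 0)
print(total)  # -> 5
```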
There are also several third-party functional programming libraries for Python that extend these ideas and are worth exploring.