2
Deep Learning in Population Genetics: Prediction and Explanation of Selection of a Population

Romila Ghosh1 and Satyakama Paul2

1Department of Statistics, Amity University, Kolkata, India

2Institute of Intelligent Systems, University of Johannesburg, Johannesburg, South Africa

2.1 Introduction

A major aim in population genetic studies is to draw insights into the evolutionary background of a population. For decades after its emergence in the twentieth century, genetic theory was far ahead of the data required to test its statements. Improvements in the capacity to generate genomic data at the end of the twentieth century enabled the analysis of genomic data across related disciplines. Currently, with the advent of whole-genome sequencing data, demographic inference can be carried out with more efficient models [1].

In this chapter, we focus on using deep learning (DL) to infer selection from whole-genome variation data of the Drosophila melanogaster species. Since the availability of genome sequencing data, several studies on demography and selection have been carried out on populations of Drosophila. However, joint inference of demography and selection has been argued by several researchers to be difficult because of the strong influence of selection in shaping the demography of a population [2]. Many previous studies on selection clearly indicate that, for the Drosophila genome, selection is confounded with demography [3] and vice versa. Attempts to infer selection in D. melanogaster populations have been made using maximum-likelihood procedures, where selection has been found difficult to infer in the presence of demographic parameters [4].

2.2 Literature Review

Machine learning (ML) models are applied in various disciplines and applications, from text processing to quantitative modeling. In population genetics, we are mainly interested in supervised learning algorithms and classification models. Notable ML approaches developed to infer selection alone include methods based on support vector machines (SVMs), boosting, and related techniques [5].

Pavlidis et al. [6] provide an approach in which parameters are estimated before a classification algorithm is deployed to classify between neutral and positive selection; adaptations of the ω-statistic are used for the parameter estimation part and the SweepFinder algorithm for classification. An improvement on this approach is the SweeD algorithm, a modification of SweepFinder that allows direct calculation of the neutral site frequency spectrum under a demographic model. SweeD can handle larger samples, thereby determining positive selection sweeps with more precision [7]. Another approach to separating selective sweeps under a demographic model is SFselect, which improves test scores by training SVMs on normalized, scaled site frequency spectrum (SFS) vectors derived from simulated populations [8]. Boosting is a statistical method for improving the predictions of simple learning algorithms. The functional gradient descent (FGD) algorithm [9] has been applied to simulated populations showing neutral and selective sweeps in order to predict positive selection [10]. Another method based on the L2 boosting algorithm [9] infers positive selection from genome-wide single-nucleotide variant (SNV) data using hierarchical classification, in which different boosting functions are applied sequentially to the input data [11].

A few works have addressed selection along with population size changes. A maximum likelihood-based approach to identify events inducing low genetic variability, and to differentiate between demographic bottlenecks and positive selection, is described by Galtier et al. [12] for African population data of D. melanogaster. The results indicated that the occurrence of positive selection is evident over and above any demographic bottleneck. A study by Gossmann et al. [13] estimating the population size of D. melanogaster found that population size is correlated with positive selection. Another study [14] on D. melanogaster genomic data using the background selection (BGS) model assumes a uniform distribution of deleterious mutations among chromosomes. A review of popular models of selection concludes that the BGS framework can be used as a null model to study forms of natural selection [15]. Another study [16] of the patterns and probabilities of selection footprints under rapid adaptation contrasts model predictions against observed adaptation patterns in D. melanogaster populations; it concludes that soft sweeps are frequent in rapid adaptation, but that complex adaptive patterns also play a significant role in rapid evolution. A study to detect incomplete selective sweeps in African populations of D. melanogaster identified sweeping haplotypes (SHs) that carry beneficial alleles that have rapidly increased in frequency [17]. Harris et al. [18] proposed a method to detect hard and soft selective sweeps by identifying multilocus genotypes using the H12 and H1/H2 statistics proposed by Garud et al. [19].

DL is one of the modern approaches being adopted for population genetic inference. Originally inspired by the connections between neurons in the brain [20], a neural network consists of layers of interconnected nodes (neurons), wherein each node is a perceptron that computes a simple output function from weighted incoming information. With a well-chosen number of neurons and an efficient connectivity pattern, and given enough training data or instances, neural networks are very effective at learning the features of the input and providing class discriminations as output, or at capturing the structure of the data [21].
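To make the perceptron computation above concrete, the following minimal Python sketch (not tied to any particular library) shows a single node forming a weighted sum of its inputs and passing it through a simple output function; the weights and inputs are illustrative values only.

```python
# A single perceptron node: weighted sum of incoming information plus a bias,
# passed through a simple output function (here a rectifier). Values are illustrative.
import numpy as np

def perceptron(x, w, b):
    return max(0.0, float(np.dot(w, x) + b))  # weighted sum + bias, then ReLU

x = np.array([0.2, 0.7, 0.1])    # incoming signals from the previous layer
w = np.array([0.5, -0.3, 0.8])   # connection weights learned during training
print(perceptron(x, w, b=0.05))  # the node's output, passed on to the next layer
```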

In genomics, DL models are often used for genomic data classification with the aim of obtaining feature-based discrimination among populations. A DL method named D-GEX is proposed by Chen et al. [22] to infer the expression of target genes from a subset of landmark genes; D-GEX outperforms a linear regression-based approach by a relative improvement of approximately 6.6%. PEDLA, a DL-based algorithmic framework developed to predict enhancers, achieves roughly 95% accuracy [23]. A similar framework is devised to identify enhancer and promoter regions in the human genome: a deep three-layer feedforward neural network is trained on the data, and the model is denoted DECRES (DEep learning for identifying Cis-Regulatory Elements) [24]. An approach provided by Song and Sheehan [25] uses a combination of approximate Bayesian computation [26] and a DL architecture with 3 hidden layers and 60 neurons to infer selection in a D. melanogaster population, where a subset of summary statistics carefully chosen according to their importance is used as the input. A study by Jones et al. [27] gives a brief overview of the advances of DL models in computational biology, describing the scope and potential of DL as a tool in genomics, biological image processing, and medical diagnosis. Lastly, an approach to classify selective sweeps uses a supervised ML algorithm termed diploS/HIC, which generates training data with coalescent simulations, computes a feature vector, and then applies a convolutional neural network (CNN) architecture consisting of three convolution layers followed by a softmax activation layer to obtain the classification results [28].

2.3 Dataset Description

2.3.1 Selection and Its Importance

Selection is the process by which certain heritable traits in a population gain a reproductive advantage over other traits. Natural selection is an explanation of how the diversity of life increases: life forms vary and have different reproduction rates, causing populations to shift toward the most successful variants. Mutation is the phenomenon that generates genetic variation in an individual. However, the genetic variation due to mutation is random, and most mutations change the genetic structure in ways that are not directly visible. Selection, by contrast, acts on that variation in a decidedly nonrandom way: genetic variants that aid survival and reproduction are much more likely to become dominant than variants that do not.

In an evolutionary scenario, selection is the process by which populations evolve by inheriting adaptive changes. Natural selection increases the mean fitness of a population by allowing fitter individuals to contribute more to later generations, thus producing changes in the frequencies of the genetic alleles associated with variation in fitness. Selection therefore plays an important role in shaping demography through adaptation and helps in the study of evolutionary distance between different species' genomes.

In this chapter, we use the simulation procedure followed by Song and Sheehan [25] and Peter et al. [29] and use the simulation data provided as part of the evoNet software. We briefly describe the simulation procedure here for completeness. To measure the impact of selection, each 100 kb region was divided into three smaller regions: (i) close to the selected site (40–60 kb), (ii) mid-range from the selected site (20–40 and 60–80 kb), and (iii) far from the selected site (0–20 and 80–100 kb). The following statistics are calculated within each of these regions:

  1. Selection: If a genomic region is under selection, it occurs at a higher frequency than expected by chance. In other words, mating is not random, and the Darwinian principle of survival of the fittest favors some region. Such a region might code for genes that help in fighting diseases or, more generally, aid survival.
  2. Site frequency spectrum (SFS): The SFS is the distribution of allele frequencies of a given set of loci or single-nucleotide polymorphisms (SNPs) in a population. Allele frequency is the relative frequency of an allele at a particular locus in a population, and a SNP is a substitution of a single nucleotide at a specific position in the genome. In the dataset, the i-th entry of the folded SFS is the number of segregating sites at which the minor allele occurs i times. Enough samples were simulated to obtain 60 haplotypes, and the spectrum was normalized by dividing by the sum of its entries.
  3. Identity by state (IBS) tract length distribution: IBS refers to two identical alleles, identical segments, or similar nucleotide sequences in the genome. An IBS tract is a contiguous genomic region (of length, say, L) shared between a pair of samples and delimited by bases at which they differ. Similar to the length distribution between segregating sites, we count the tract lengths that fall into each of 24 equally spaced bins whose start points range from 0 to 1500. The tract length distribution is given by:
    (2.1)   P_j = #{IBS tracts with length L in bin j} / (total number of IBS tracts),   j = 1, …, 24
  4. Linkage disequilibrium (LD) distribution: LD measures the correlation between segregating sites. It is the deviation from Mendel's law of independent assortment, under which the identities of alleles at two loci on the same chromatid are independent. Alleles at two sites should be nearly independent if there is high recombination between the sites; if there is no recombination, the alleles should be tightly linked. Here, LD is computed between one site in the selected region and one site in each of the three regions (including the selected region). The LD distribution is computed over 10 equally spaced bins, with the last bin starting at 0.2.
  5. Length distribution between segregating sites (BIN): The BIN system clusters gene sequences using the refined single linkage (RESL) algorithm to produce operational taxonomic units (OTUs) that closely correspond to species [30]. In the dataset [25, 29], 16 equally spaced bins are defined from 0 to 300 to compute the distribution of the number of bases between successive segregating sites. Let ℓ_k be the number of bases between the k-th and (k + 1)-th segregating sites. Then, the distribution of these lengths is given by the following expression (a simplified sketch of the SFS and BIN computations follows this list):
    (2.2)   B_j = #{k : ℓ_k falls in bin j} / (total number of lengths ℓ_k),   j = 1, …, 16
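The Python sketch below illustrates, under simplifying assumptions, how two of the above statistics could be computed from a 0/1 haplotype matrix: the normalized folded SFS and the binned distribution of lengths between successive segregating sites. This is not the evoNet code; the array shapes, bin settings, and normalization simply follow the descriptions above.

```python
# Simplified sketch of two summary statistics; not the evoNet implementation.
import numpy as np

def folded_sfs(haplotypes):
    """haplotypes: (n_haplotypes, n_sites) 0/1 matrix of segregating sites."""
    n = haplotypes.shape[0]
    counts = haplotypes.sum(axis=0)
    minor = np.minimum(counts, n - counts)                # minor allele count per site
    sfs = np.bincount(minor, minlength=n // 2 + 1)[1:]    # entries for i = 1 .. n/2
    return sfs / sfs.sum()                                # normalize by the sum of entries

def length_between_segsites(positions, n_bins=16, max_len=300):
    """positions: physical positions (in bases) of segregating sites."""
    lengths = np.diff(np.sort(positions))                 # l_k, bases between successive sites
    bins = np.linspace(0, max_len, n_bins + 1)            # 16 equally spaced bins from 0 to 300
    hist, _ = np.histogram(lengths, bins=bins)
    return hist / max(len(lengths), 1)                    # proportion of lengths in each bin

# toy usage on random data (60 haplotypes, 500 segregating sites in a 100 kb region)
rng = np.random.default_rng(0)
haps = rng.integers(0, 2, size=(60, 500))
pos = np.sort(rng.integers(0, 100_000, size=500))
print(folded_sfs(haps)[:5], length_between_segsites(pos)[:5])
```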

2.4 Objective

A dataset for computing the summary statistics was simulated that is relevant to selection in Drosophila; refer to Elyashiv et al. [31] for a detailed analysis of selection in Drosophila. The program developed by Ewing and Hermisson [32] was used to simulate the datasets, and each dataset corresponded to 100 haplotypes. For a particular selection scenario, a 100 kb region was simulated with the selected site (if present) occurring randomly in the middle 20 kb of the region, using a fixed baseline effective population size, per-base per-generation mutation rate, and per-base per-generation recombination rate r. This process was repeated 10 000 times to generate 10 000 rows or instances.

The variables in the dataset are formed mainly from four statistics: SFS, IBS, LD, and BIN. For each statistic name, Close, Mid, and Far denote the genomic subregion in which the statistic is calculated. The number after the colon refers to the position of the statistic within its distribution or ordering; for example, for the folded SFS statistics the value after the colon is the minor allele count, and for the LD statistics it is the bin of the LD distribution. The dataset contains 87 SFS statistics, 48 BIN statistics, 75 IBS statistics, and 30 LD statistics. In effect, there are 240 independent variables or predictors, each corresponding to a specific statistic in a specific genomic region, and there are no missing values. The dependent variable or response is selection, which comprises two classes: neutral region, corresponding to no selection in the genome, and hard sweep, corresponding to positive selection on a de novo mutation. Figure 2.1 shows the proportional frequency of the two classes of selection, and Figure 2.2 shows the spatial distribution of the two classes across the 240 predictors; t-distributed stochastic neighbor embedding (t-SNE) has been used to compress the 240 predictors into two dimensions.
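As an illustration of how a projection such as Figure 2.2 can be produced, the short sketch below compresses the 240 predictors to two dimensions with scikit-learn's t-SNE and plots the points by selection class. The file name selection_data.csv, the response column name selection, and the class label strings are assumptions made for illustration.

```python
# Minimal t-SNE sketch behind a Figure 2.2-style plot; file and column names are assumed.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

df = pd.read_csv("selection_data.csv")             # 10 000 rows; "selection" is the response
X = df.drop(columns=["selection"]).to_numpy()      # the 240 predictors
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

for cls, marker in [("neutral region", "o"), ("hard sweep", "x")]:
    mask = (df["selection"] == cls).to_numpy()     # points belonging to this class
    plt.scatter(emb[mask, 0], emb[mask, 1], s=4, marker=marker, label=cls)
plt.legend()
plt.show()
```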


Figure 2.1 Proportional frequency distribution of hard sweep and neutral region.

The objective of this work is to create a classification algorithm (from the family of advanced ML and DL algorithms) that best predicts the two classes of selection. The event here is hard sweep and, as shown in Figure 2.1, the event rate is 40.65%.


Figure 2.2 Spatial distribution of the two classes of selection.

2.5 Relevant Theory, Results, and Discussions

As discussed in Section 2.4, our objective is first to decide upon the classification algorithm best suited to our problem. Because there are multiple ML algorithms and DL architectures to choose from, and each implementation is time consuming, we use an automatic machine learning (automl) framework to choose the best classification algorithm for the task.

2.5.1 automl

As the name suggests, automl is a framework that automates various components of an ML pipeline, such as feature engineering, reduction of high-dimensional data, and hyperparameter tuning of various kinds of models. However, in this work, we use one of its specific features – selection of the best algorithm for a classification task. In particular, we use the open source H2O automl framework.2

In H2O, the choice set (from which the best algorithm is chosen) consists of random forest (RF) and extremely randomized trees (XRT), generalized linear models (GLMs), gradient boosting machines (GBMs), extreme gradient boosting machines (XGBoost), deep neural networks (DNNs), and a stacked ensemble (SE) of all the previous models [33]. It might also be noted that each of these models is built using a random set of relevant hyperparameters.

Let A = {A^(1), …, A^(m)} denote the given set of candidate algorithms, and let λ denote the hyperparameters of an algorithm. Let the training set D_train be split into K cross-validation folds, giving training and validation pairs (D_train^(k), D_valid^(k)) for k = 1, …, K. The best algorithm A* is then the one that minimizes the cross-validated loss function or metric

A* = argmin_{A ∈ A} (1/K) Σ_{k=1}^{K} L(A_λ, D_train^(k), D_valid^(k)),

where L(A_λ, D_train^(k), D_valid^(k)) is the loss of algorithm A, run with a random set of hyperparameters λ, trained on the training data D_train^(k) and evaluated on the validation set D_valid^(k) [34].
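The criterion above can be illustrated with a small scikit-learn sketch (synthetic stand-in data and a reduced candidate set; this is not the H2O implementation): each candidate algorithm is scored by fivefold cross-validated AUC, and the one with the smallest loss (1 minus mean AUC) is retained.

```python
# Illustrative algorithm selection via K-fold cross-validation; data and candidates are stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=240, random_state=0)  # stand-in data

candidates = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
    "GLM": LogisticRegression(max_iter=1000),
}

cv_loss = {}
for name, algo in candidates.items():
    auc = cross_val_score(algo, X, y, cv=5, scoring="roc_auc")  # K = 5 folds
    cv_loss[name] = 1.0 - auc.mean()                            # loss = 1 - mean AUC

best = min(cv_loss, key=cv_loss.get)                            # the argmin over algorithms
print(cv_loss, "best:", best)
```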

Following the above concept, we randomly divide our total dataset of 10 000 rows into training and test sets in the ratio 0.8 : 0.2; thus, our training set comprises 8037 rows and the test set 1963 rows. In addition, fivefold cross-validation is used. We use the area under the curve (AUC) as the loss function or metric, and the automl pipeline is run for 90 minutes to extract the best classification algorithm. Table 2.1 shows the five best models in decreasing order of AUC.

Table 2.1 automl results.

Algorithm          AUC
DNN                0.8203902
GBM                0.8193984
GBM                0.8142907
GBM                0.8125305
GBM                0.8124553
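A hedged sketch of this automl run using the H2O Python API is shown below. The file name selection_data.csv and the response column name selection are assumptions; the 0.8 : 0.2 split, fivefold cross-validation, AUC sorting, and 90-minute budget follow the description above.

```python
# Sketch of the automl step with the H2O Python API; file and column names are assumed.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
data = h2o.import_file("selection_data.csv")          # 10 000 rows, 240 predictors + response
data["selection"] = data["selection"].asfactor()      # two classes: neutral region / hard sweep

train, test = data.split_frame(ratios=[0.8], seed=1)  # roughly the 0.8 : 0.2 split used here

aml = H2OAutoML(max_runtime_secs=90 * 60,             # 90-minute budget
                nfolds=5,                             # fivefold cross-validation
                sort_metric="AUC",
                seed=1)
aml.train(y="selection",
          x=[c for c in train.columns if c != "selection"],
          training_frame=train)

print(aml.leaderboard.head())                         # analogous to Table 2.1
print(aml.leader.model_performance(test).auc())       # AUC of the best model on the test set
```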

As can be noted, the DNN and GBM models outperform the rest of the algorithms (RF, GLM, SE, etc.). In addition, from the random set of hyperparameters used by automl, the best model (a DNN) has the following architecture and hyperparameters (a sketch re-creating this network follows the list):

  • number of neurons in the input layer: 240
  • number of hidden layers: 3
  • neurons in each hidden layer: 50
  • activation function in all three hidden layers: rectifier with dropout
  • number of neurons in the output layer: 2
  • activation function used in the output layer: softmax
  • percentage dropout used in all the input and hidden layers: 10
  • L1 and L2 regularization1: 0
  • momentum: 0
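The following sketch re-creates a network with this architecture using H2O's deep learning estimator. It is not the authors' exact code, and it assumes the train and test frames and the selection response column from the automl sketch above.

```python
# Re-creation of the best automl DNN architecture with H2O's deep learning estimator.
from h2o.estimators import H2ODeepLearningEstimator

dnn = H2ODeepLearningEstimator(
    hidden=[50, 50, 50],                      # three hidden layers, 50 neurons each
    activation="RectifierWithDropout",        # rectifier with dropout in the hidden layers
    input_dropout_ratio=0.10,                 # 10% dropout on the input layer
    hidden_dropout_ratios=[0.10, 0.10, 0.10], # 10% dropout in each hidden layer
    l1=0.0, l2=0.0,                           # no L1/L2 regularization
    nfolds=5,
    seed=1,
)
dnn.train(y="selection",
          x=[c for c in train.columns if c != "selection"],
          training_frame=train)
print(dnn.model_performance(test).auc())      # softmax output layer over the two classes
```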

2.5.2 Hypertuning the Best Model

In Section 2.5.1, automl suggests that the best classification algorithm for this task, among the set of six model families, is a DNN. In this subsection, we further hypertune the DNN's parameters to see whether the AUC can be improved above 0.8203902. Our choice set of hyperparameters consists of the following combinations (a sketch of the corresponding random grid search follows the list):

  • number of hidden layers: 3, with the following neuron counts in the first, second, and third hidden layers:
    – 250, 250, 250
    – 500, 500, 500
    – 750, 750, 750
  • number of hidden layers: 4, with the following neuron counts in the first, second, third, and fourth hidden layers:
    – 30, 30, 30, 30
    – 60, 60, 60, 60
    – 90, 90, 90, 90
    – 120, 120, 120, 120
  • number of hidden layers: 5, with neuron counts 240, 120, 60, 30, 15
  • various options of activation function in each neuron of each layer:
    – Rectifier
    – Rectifier with dropout
    – tanh
    – tanh with dropout
    – Maxout
    – Maxout with dropout
  • various options of input dropout ratio: 0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5
  • various options of L1 and L2 regularization values: 0, 1.0e−…, 2.0e−…, …, 1.0e−…
  • various options of learning rates: 0.01, 0.005, 0.001
  • various options of rate of annealing: 1e−…, 1e−…, 1e−…
  • various options of epochs: 50, 100, 200, 500
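A sketch of such a random grid search with AUC-based early stopping, expressed with H2O's grid search API, is shown below. The grid mirrors the options listed above; the regularization and annealing grids, the stopping tolerance, and the search budget are assumed values because the originals are not legible in the source.

```python
# Hedged sketch of the hyperparameter search with H2O's random grid search.
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

hyper_params = {
    "hidden": [[250, 250, 250], [500, 500, 500], [750, 750, 750],
               [30, 30, 30, 30], [60, 60, 60, 60], [90, 90, 90, 90],
               [120, 120, 120, 120], [240, 120, 60, 30, 15]],
    "activation": ["Rectifier", "RectifierWithDropout", "Tanh",
                   "TanhWithDropout", "Maxout", "MaxoutWithDropout"],
    "input_dropout_ratio": [0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
    "l1": [0, 1e-5, 1e-4],              # assumed small grid; exact values not recoverable
    "l2": [0, 1e-5, 1e-4],              # assumed small grid; exact values not recoverable
    "rate": [0.01, 0.005, 0.001],       # learning rates from the list above
    "rate_annealing": [1e-6, 1e-7, 1e-8],  # assumed annealing grid
    "epochs": [50, 100, 200, 500],
}

grid = H2OGridSearch(
    model=H2ODeepLearningEstimator(stopping_metric="AUC",
                                   stopping_tolerance=1e-5,  # assumed early-stopping tolerance
                                   stopping_rounds=5,
                                   adaptive_rate=False,      # manual rate/annealing options apply
                                   nfolds=5, seed=1),
    hyper_params=hyper_params,
    search_criteria={"strategy": "RandomDiscrete",
                     "max_runtime_secs": 3600},              # assumed search budget
)
grid.train(y="selection",
           x=[c for c in train.columns if c != "selection"],
           training_frame=train)
print(grid.get_grid(sort_by="auc", decreasing=True))         # best configurations by AUC
```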

Based on the above wide combination of hyperparameter values and an early-stopping tolerance of 1e−… for improvement in AUC (at which the search stops), the tuned DNN shows an improved AUC of 0.8698312 (compared with 0.8203902 from automl). The architecture of the best DNN model is as follows:

  • number of hidden layers: 4, with 30 neurons in each hidden layer
  • activation function in the four hidden layers: Maxout
  • activation function in the output layer: Softmax
  • dropout percentage in the input layer: 10
  • dropout percentage in the hidden layers: 0
  • L1 regularization in the four hidden layers and the output layer: 1e−…
  • L2 regularization in the four hidden layers and the output layer: 0.000034

Finally, the accuracy metrics on the test set are shown in Table 2.2, where rows correspond to actual classes and columns to predicted classes. Precision, recall, and overall accuracy are 0.996, 0.38, and 0.85, respectively.

Table 2.2 Accuracy on test data (rows: actual class; columns: predicted class).

                   Hard sweep    Neutral region    Error rate
Hard sweep         500           296               0.372
Neutral region     0             1167              0.0
Total              500           1463              0.151

As can be seen from Figure 2.2, there is a high degree of overlap between the hard sweep and neutral region classes in the central cluster, so we expect most misclassifications to occur in that region. Lastly, Figure 2.3 shows the 10 most important predictors.


Figure 2.3 Importance of the predictors.

2.6 Conclusion

This research work is a detailed account of using automl and hyperparameter tuning to find the best DNN architecture for classifying and predicting the selection classes of the D. melanogaster species. To the best of the authors' knowledge, such a detailed account of the implementation of automl and hyperparameter tuning has not been presented before. In the future, we would like to incorporate concepts from explainable AI to further understand the importance of the various predictors in a model-agnostic framework.

References

  1. 1 Pool, J.E., Hellmann, I., Jensen, J.D., and Nielsen, R. (2010). Population genetic inference from genomic sequence variation. Genome Research 20 (3): 291–300.
  2. 2 Li, J., Li, H., Jakobsson, M. et al. (2012). Joint analysis of demography and selection in population genetics: where do we stand and where could we go? Molecular Ecology 21 (1): 28–44.
  3. 3 Sella, G., Petrov, D.A., Przeworski, M., and Andolfatto, P. (2009). Pervasive natural selection in the Drosophila genome? PLoS Genetics 5 (6): 1–13.
  4. 4 González, J., Macpherson, J.M., Messer, P.W., and Petrov, D.A. (2009). Inferring the strength of selection in Drosophila under complex demographic models. Molecular Biology and Evolution 26 (3): 513–526.
  5. 5 Schrider, D.R. and Kern, A.D. (2018). Supervised machine learning for population genetics: a new paradigm. Trends in Genetics 34 (4): 301–312.
  6. 6 Pavlidis, P., Jensen, J.D., and Stephan, W. (2010). Searching for footprints of positive selection in whole‐genome SNP data from nonequilibrium populations. Genetics 185 (3): 907–922.
  7. 7 Pavlidis, P., Živković, D., Stamatakis, A., and Alachiotis, N. (2013). SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Molecular Biology and Evolution 30 (9): 2224–2234.
  8. 8 Ronen, R., Udpa, N., Halperin, E., and Bafna, V. (2013). Learning natural selection from the site frequency spectrum. Genetics 195 (1): 181–193.
  9. 9 Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: regularization, prediction and model fitting. Statistical Science 22 (5): 477–505.
  10. 10 Lin, K., Li, H., Schlötterer, C., and Futschik, A. (2011). Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics 187 (1): 229–244.
  11. 11 Pybus, M., Luisi, P., Dall'Olio, G.M. et al. (2015). Hierarchical boosting: a machine‐learning framework to detect and classify hard selective sweeps in human populations. Bioinformatics 31 (24): 3946–3952.
  12. 12 Galtier, N., Depaulis, F., and Barton, N.H. (2000). Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155 (2): 981–987.
  13. 13 Gossmann, T.I., Woolfit, M., and Eyre‐Walker, A. (2011). Quantifying the variation in the effective population size within a genome. Genetics 189 (4): 1389–1402.
  14. 14 Charlesworth, B. (2009). Background selection and patterns of genetic diversity in Drosophila melanogaster. Genetics Research 68 (2): 131–149.
  15. 15 Comeron, J.M. (2017). Background selection as null hypothesis in population genomics: insights and challenges from Drosophila studies. Philosophical Transactions of the Royal Society B: Biological Sciences 372 (1736): 1–13. (available at https://royalsocietypublishing.org/doi/pdf/10.1098/rstb.2016.0471)
  16. 16 Hermisson, J. and Pennings, P.S. (2017). Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods in Ecology and Evolution 8 (6): 700–716.
  17. 17 Vy, H.M.T., Won, Y.‐J., and Kim, Y. (2017). Multiple modes of positive selection shaping the patterns of incomplete selective sweeps over African populations of Drosophila melanogaster. Molecular Biology and Evolution 34 (11): 2792–2807.
  18. 18 Harris, A.M., Garud, N.R., and DeGiorgio, M. (2018). Detection and classification of hard and soft sweeps from unphased genotypes by multilocus genotype identity. Genetics 210 (4): 1429–1452.
  19. 19 Garud, N.R., Messer, P.W., Buzbas, E.O., and Petrov, D.A. (2015). Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genetics 11 (2): 1–32. (available at https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1005004&type=printable)
  20. 20 Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America 79 (8): 2554–2558.
  21. 21 Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2): 251–257.
  22. 22 Chen, Y., Li, Y., Narayan, R. et al. (2016). Gene expression inference with deep learning. Bioinformatics 32 (12): 1832–1839.
  23. 23 Liu, F., Li, H., Ren, C. et al. (2016). PEDLA: predicting enhancers with a deep learning‐based algorithmic framework. Scientific Reports 6: 28517.
  24. 24 Li, Y., Shi, W., and Wasserman, W.W. (2018). Genome‐wide prediction of cis‐regulatory regions using supervised deep learning methods. BMC Bioinformatics 19: 202.
  25. 25 Song, Y.S. and Sheehan, S. (2016). Deep learning for population genetic inference. PLoS Computational Biology 12 (3): 1–28. (available at https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1004845&type=printable)
  26. 26 Beaumont, M.A., Zhang, W., and Balding, D.J. (2002). Approximate Bayesian computation in population genetics. Genetics 162 (4): 2025–2035.
  27. 27 Jones, W., Alasoo, K., Fishman, D., and Parts, L. (2017). Computational biology: deep learning. Emerging Topics in Life Sciences 1 (3): 257–274.
  28. 28 Kern, A.D. and Schrider, D.R. (2018). diploS/HIC: an updated approach to classifying selective sweeps. G3: Genes Genome Genetics 8 (6): 1959–1970.
  29. 29 Peter, B.M., Huerta‐Sanchez, E., and Nielsen, R. (2012). Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genetics 8 (10): 1–14. (available at https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1003011&type=printable)
  30. 30 Ratnasingham, S. and Hebert, P.D.N. (2013). A DNA‐based registry for all animal species: the barcode index number (BIN) system. PLoS One 8 (8).
  31. 31 Elyashiv, E., Sattath, S., Hu, T.T. et al. (2016). A genomic map of the effects of linked selection in Drosophila. PLoS Genetics 12 (8): 1–24. (available at https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1006130&type=printable)
  32. 32 Ewing, G. and Hermisson, J. (2010). MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26 (16): 2064–2065.
  33. 33 AutoML: Automatic Machine Learning. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html (accessed 01 July 2019).
  34. 34 Zöller, M.A. and Huber, M.F. (2019). Benchmark and Survey of Automated Machine Learning Frameworks. https://arxiv.org/pdf/1904.12054.pdf (accessed 01 July 2019).

Notes

  1. 1 L1 and L2 are common regularization methods used in ML to reduce a model's overfitting.
  2. 2 Other open source automl frameworks are provided by Scikit‐Learn, TPOT, etc.