Chapter 18: Integer period discrete Fourier transform-based algorithm for the identification of tandem repeats in the DNA sequences

Sunil Datt Sharma; Pardeep Garg    Department of Electronics and Communication Engineering, Jaypee University of Information Technology, Solan, India

Abstract

In this modern era, genomics signal processing is playing an important role to understand the genomics data and biological functionalities of living organisms. The deoxyribonucleic acid (DNA) sequences are found in the genome of living organisms. Based on the studies, it has been found that the DNA sequences contain tandem repeats, and expansion of these tandem repeats is accountable for the diseases like Huntington disease, Mytonic Dystrophy, Spinocerebellar Ataxia, Friedreich Ataxia (FRDA), Fragile X syndrome, and other diseases. Tandem repeats are also useful for other studies like DNA fingerprinting, an area population’s study, and DNA forensics analysis, etc. Therefore, we have proposed an integer period discrete Fourier transform (IPDFT)-based algorithm for the detection of the tandem repeats in DNA sequences. The comparison of the proposed algorithm’s performance has also been done with existing methods.

Keywords

Tandem repeats; DNA sequences; Discrete Fourier transform; Genomics signal processing; Health informatics

1: Introduction

Most of the DNA sequences possess specific repeated patterns of particular periods and hence the identification of these repeats plays a significant job to analyze the DNA sequences. There exists an association of repeated patterns with various diseases. It is well known that a string of nucleotides constitute the DNA sequences. These nucleotides are guanine (G), adenine (A), thymine (T), and cytosine (C). The respective periodicities exist in DNA sequences because of various nucleotide repeat patterns. The various biological functionalities of the living organism occur because of these repetitive patterns. Therefore, detection and analysis of repeated patterns in DNA sequences play a significant role in the field of biology, medicine, disease diagnosis, and forensic sciences (Gupta and Prasad, 2018). The categorization of repeats existing in DNA sequences can be done as tandem and interspersed repeats, and it has been shown in Fig. 1A and B, respectively.

Fig. 1
Fig. 1 (A) Tandem repeats, (B) Interspersed repeats.

The tandem repeats (TRs) are the adjacent repeated patterns and interspersed repeats are the nonadjacent repeated patterns. Tandem repeats are also classified as satellite, minisatellite, microsatellite TRs, and these are shown in Fig. 2.

Fig. 2
Fig. 2 Classification of tandem repeats (Leprae, n.d.).

The number of copies, location, and pattern of TRs play an important role in the diagnosis of genetic diseases and cancer identification, etc. (Yin, 2017). TRs are also associated with various human diseases like Frederick’s ataxia, some types of cancer, Fragile-X syndrome, Huntington’s disease, and more than 40 other neuromuscular, neurodegenerative, and neurological diseases (Usdin, 2008). TRs are also applied to know about the human evolutionary history (Butler, 2003), genetic marker in microbial forensics, human identity, and for the investigation of the infectious disease irruptions (Cummings and Relman, 2002). To identify TRs in DNA sequences, periodicity present in DNA sequences has been utilized. Following are the key contributions of this chapter:

  1. (i) Integer period discrete Fourier transform (IPDFT) has been applied to visualize the tandem repeats in DNA sequence.
  2. (ii) Appropriate threshold selection.
  3. (iii) The number of copies of tandem repeats has been enhanced.

2: Related work

Methods for the identification of the TRs have been classified into two categories such as string matching methods and signal processing methods. Lim et al. (2012) reviewed the string matching methods to identify the TRs. Currently, conversion of DNA characters to numerical values using numerical mapping schemes has provided the signal processing techniques a direction to analyze the genomic data (Sharma et al., 2011). Hence, the focus in this work is on the signal processing-based methods reported for the detection of TRs (Sharma et al., 2017). These methods are modified Fourier product spectrum (MFPS) (Tran et al., 2004), short time periodicity transform (STPT) (Buchner and Janjarasjitt, 2003), spectral repeat finder (SRF) (Sharma et al., 2004), Fourier product spectrum-based spectral methods for localization of tandem repeats (Pop, 2007; Pop and Lupu, 2008), exactly periodic subspace decomposition (EPSD) (Jiang and Yan, 2011), optimized moving window spectral analysis (OMWSA) (Du et al., 2007), parametric spectral estimation (PSE) (Zhou et al., 2009), S-transform (Zribi et al., 2019), and adaptive S-transform (AST) (Sharma et al., 2014). Tran et al. proposed an MFPS for the identification of the approximate tandem repeats (Tran et al., 2004). Buchner et.al proposed a STPT to properly localize the periodicity of tandem repeat and the location of these periodicities using a periodogram. Periodicity transform is the basis of this method and multiple periodicities is the drawback of this method (Buchner and Janjarasjitt, 2003). As it is well known that Fourier analysis is a powerful tool to extract unknown periodicities in the DNA sequences in the proximity of deletion, substitution and insertion, a computer program-based method known as spectral repeat finder (SRF) has been developed by Sharma et al. the basis of which is Fourier transformation. In this method, firstly the pattern length of repeats which is obtained from the power spectrum of a DNA sequence is computed, and consequently, the identification of the location of repeats pattern has been performed using a sliding window approach (Sharma et al., 2004). Pop et al. have proposed an algorithm to detect the TRs in spectrogram using windowed discrete Fourier transform, the basis of which is the Fourier product spectrum (Pop, 2007; Pop and Lupu, 2008). As per the study from literature, it is observed that both the FT- and STPT-based methods have the drawback of multiple periodicities. Gupta et al. proposed EPSD method for the detection of TRs to find the solution to the problem of multiple periodicities (Jiang and Yan, 2011). Liping et al. suggested an OMWSA to detect the DNA repeats, this method being perfect and more robust than Fourier transform-based spectral analysis in the proximity of deletion, substitution, and insertion; however, its drawback is that location of repeats is identified by inspecting the spectrogram (Du et al., 2007). Zhou et al. proposed the parametric spectral estimation (PSE) technique to detect the location of TRs and its pattern automatically (Zhou et al., 2009). However, PSE has limitations of order selection and selection of the appropriate window length. The S-transform-based method (Zribi et al., 2019), which employs p-nuc coding scheme, has been proposed to overcome the limitations of PSE. An adaptive S-transform-based algorithm (Sharma et al., 2014) for the TRs detection has been proposed by Sharma et al. In this chapter, an integer period discrete Fourier transform (Epps, 2009)-based algorithm for the TRs detection has been proposed. The remaining parts of this chapter have been described in the later sections.

3: Algorithm for detection of TRs

In this section, an algorithm based on the Integer period discrete Fourier transform (IPDFT) has been proposed to detect the tandem repeats (TRs) in DNA sequences and it has been depicted in Fig. 3. An example sequence with accession no. X64775 (NCBI, n.d) has been selected to explain the steps used in the algorithm for the TRs detection.

Fig. 3
Fig. 3 Flow graph for the proposed algorithm.

3.1: DNA sequences

DNA sequences contain the nucleotides or characters adenine (A), thymine (T), guanine (G) and cytosine (C), and these characters are considered inputs.

3.2: Numerical mapping

Characters of the DNA sequences received from the previous step have been converted into numerical sequences using the electron-ion-interaction pseudo-potential (EIIP) mapping scheme. The EIIP mapping is preferred over other mapping schemes because it has less computational complexity (Sharma et al., 2011). The character to numerical conversion with an example has been shown in Fig. 4.

Fig. 4
Fig. 4 Numerical conversion scheme.

3.3: Short time integer period discrete Fourier transform

The IPDFT of a signal x(n) is defined (Epps, 2009) using the following equation.

xIPp=n=0N1xnej2πnp,p=1,2,3,,P<N,

si1_e  (1)

where P is the maximum period. IPDFT is linearly related to periodicity “p,” whereas the discrete Fourier transform (DFT) is linear with respect to the frequency.

To localize the TRs present in the DNA sequences, the short time IPDFT (STIPDFT) has been calculated using the following equation.

xIPpm=n=0N1xnwnmej2πnp,

si2_e  (2)

where w(n) is a hamming window, which is centered at nucleotides position m = 0 initially, and further it is shifted by one nucleotide up to the end of the sequences. The 20*p window length has been selected for the experiment. Finally, nucleotides position versus periodicity plot has been plotted in Fig. 5.

Fig. 5
Fig. 5 Nucleotide position versus periodicity plot.

3.4: Thresholding

Threshold has been calculated to identify the location of TRs of a particular periodicity and it is calculated using the following equation.

Th=meanx5pmaxx5p.

si3_e  (3)

where, x5 is the value of the sum of power spectrum along with position axis and it is given by –

x5p=m=0MxIPpm,

si4_e  (4)

BIPpm=1,ifxIPpmTh0,ifxIPpmTh

si5_e  (5)

Using Eq. (5), we have plotted the nucleotide position vs periodicity plot obtained postthresholding and it is shown in Fig. 6.

Fig. 6
Fig. 6 Nucleotide position versus periodicity post-thresholding.

After thresholding, the periodicity 3, 7, 10, and 11 have been detected as candidate TRs along with their nucleotide positions 1–182 and 234–303, 234–289, 225–284, and 44–139, respectively.

3.5: Verification of the detected candidate TRs

The method proposed by Boeva et al. (2006) has been used to verify the candidate TRs detected after thresholding. Candidate TRs are the segments highlighted in Fig. 3, and these are located at the nucleotide positions 1–182 and 234–303, 234–289, 225–284, and 44–139. These candidate TRs after verification have been tabulated in Table 1.

Table 1

Verification of candidate TRs detected after thresholding.
Sr. No.PeriodicityDetected candidate TRs after thresholdingVerification of detected candidate TRs
Nucleotide positionPatternsLocationPatternsNo. of copies
131–182ATGGAGAGCGACTGCCAGTTCTTGGTGG
CGCCGCCGCAGCCGCACATGTACT
ACGACACGGCGGCGG
CGGCGGTGGACGAGGCGCAGTTCTT
GCGGCAGATGGTGGC
CGCGGCGGATCACCACGC
GGCCGCCGCTGGGAG
AGGAGGCGGCGACGGCGA
CGGCGGCGGCGGCG
GCGGCGGCG
19–24GTT
CTT
02
25–30GGT
GGC
02

31–45

GCC
GCC
GCA
GCC
GCA

05

50–58

TAC
TAC
GAC

03

60–77

CGG
CGG
CGG
CGG
CGG
TGG

06

78–83

ACG
AGG

02

89–94

TTC
TTG

02

102–107

TGG
TGG

02

108–116

CCG
CGG
CGG

03

117–123

ATC
ACC

02

125–136

GCG
GCC
GCC
GCT

04

142–183

AGG
AGG
CGG
CGA
CGG
CGA
CGG
CGG
CGG
CGG
CGG
CGG
CGG
CG

14




234–303
AGACGCGTTCCACGCGCG
GCGGGCCAAGCTGGA
GCCGCGGGAGAAGGCG
GACGTGGCGCGGGAG
CTCGGG




250–255

CGG
CGG

02

268–273

CCG
CGG

02

274–279

GAG
AAG

02
27234–289AGACGCGTTCCACG
CGCGGCGGGCCAAGCTG
GAGCCGCGGGAGAA
GGCGGACGTGG
RejectedRejectedRejected
310225–284GGTCGCTGGAGACGC
GTTCCACGCGCGGCGGGC
CAAGCTGGAGCCGCGGGA
GAAGGCGGA
RejectedRejectedRejected

4

11

11–139
GACTGCCAGTTCTTGGTGG
CGCCGCCGCAGCCGCACATGTACTACGACA
CGGCGGCGGCGGCGGT
GGACGAGGCGCAGTT
CTTGCGGCAGATGGTGG
CCGCGGCGGATCACC
ACGCGGCCGCCGCTGGG

Rejected

Rejected

Rejected

Table 1

In Table 1, it has been presented that the proposed algorithm has detected true TRs of periodicity 3 bps, and other falsely detected candidate TRs of periodicity 7, 10, and 11 have been removed after verification.

4: Performance analysis of the proposed algorithm

To assess the performance of the proposed algorithm for the TRs detections, DNA sequence having the accession number X64775 (NCBI, n.d) has been used in this study. The proposed algorithm for TRs detection has been implemented in MATLAB. The system used for simulation has 8 GB RAM, 64 bit operating system, and core i5 processor. The performance comparison of the proposed algorithm has been done with the reported methods. The comparison of results has been shown in Table 2.

Table 2

Comparison with reported methods.
PeriodicityMethodNucleotide position after thresholdingNucleotides positions after verificationConsensus patternCopiesTotal copies
3Proposed method1–18219–24GTT
CTT
0253

25–30

GGT
GGC

02

31–45

GCC
GCC
GCA
GCC
GCA

05

50–58

TAC
TAC
GAC

03

60–77

CGG
CGG
CGG
CGG
CGG
TGG

06

78–83

ACG
AGG

02

89–94

TTC
TTG

02

102–107

TGG
TGG

02

108–116

CCG
CGG
CGG

03

117–123

ATC
ACC

02
125–136GCG
GCC
GCC
GCT
04

142–183

AGG
AGG
CGG
CGA
CGG
CGA
CGG
CGG
CGG
CGG
CGG
CGG
CGG
CGG

14

250–255

CGG
CGG

02

268–273

CCG
CGG

02

274–279

GAG
AAG

02
AST (Sharma et al., 2014)19–4420–25TTC
TTG
0248
25–42GCC
06
61–8661–79GGC07

89–104

89–94

TTC(TTG)

02

94–99

GCG
GCA

02

108–122

108–116

CGG

03
117–122ATC
ACC
02
125–135125–135CCG03

141–149

141–149

GAG

03

160–186

160–186

CGG

09

194–207

194–199

AGG
AAG

02

199–204

GCG

02

211–223

211–219

GGA

03

274–283

274–279

GAG
AAG

02
blank cellEMWD (Pop and Lupu, 2008)57–7257–72CGG5.521

140–187

140–187

GGC

15.5
blank cellParametric spectral estimation (Du et al., 2007)
45–90

49–57

TAC

03

24.7

59–76

CGG

06

140–200

141–188

GGC

15.7

Table 2Table 2

From Table 2 and Fig. 7, it has been found that the proposed method has detected more number of copies as compared to other reported methods.

Fig. 7
Fig. 7 Comparison of methods with w.r.t. total number of copies detected.

5: Conclusion

The proposed approach presented in this chapter has been applied successfully to identify TRs detection in the DNA sequences. It has been concluded that the performance of proposed method is better as compared to AST, EMWD, PSE, in terms of number of copies. Also, it has been concluded that fixed window length is the limitation of the proposed method and it may be overcome in future work. This chapter will also attract and motivate scientists and researchers of computer science, bioinformatics, medical sciences, and Big Data to do research in the area of genomics signal processing.

References

Boeva V., Regnier M., Papatsenko D., Makeev V.Short fuzzy tandem repeats in genomic sequences identification and possible role in regulation of gene expression. Bioinformatics. Jan. 2006;22(6):676–684.

Buchner M., Janjarasjitt S. Detection and visualization of tandem repeats in DNA sequences. IEEE Trans. Signal Process. 2003;51(9).

Butler J.Forensic DNA Typing: Biology and Technology behind STR Markers. London: Academic Press; 2003.

Cummings C.A., Relman D.A.Microbial forensics—cross-examining pathogens. Science. 2002;296:1976–1979.

Du L.P., Zhou H.X., Yan H. OMWSA: detection of DNA repeats using moving window spectral analysis. Bioinformatics. 2007;23(5):631–633 pp. 2280–2287, Sep. 2003.

Epps J. A hybrid technique for the periodicity characterization of genomic sequence data. EURASIP J. Bioinform. Syst. Biol. 2009;924601.

Gupta S., Prasad R.Searching exact tandem repeats in DNA sequences using enhanced suffix array. Curr. Bioinform. 2018;13(2):216–222.

Jiang R., Yan H. Detection and 2-dimensional display of short tandem repeats based on signal decomposition. Int. J. Data Min. Bioinform. 2011;5(6):661–690.

Lim K.G., Kwoh C.K., Hsu L.Y., Wirawan A.Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief. Bioinform. 2012 Page 1 of 15.

Leprae, n.d., Available at: https://www.jalma-icmr.org.in/LEPStr/leprae/microsatellites/leprae_micro.html (Accessesd May 2020).

NCBI, n.d., National Centre for Biotechnology Information (2020). https://www.ncbi.nlm.nih.gov (Accessed May 2020).

Pop P.G. Tandem Repeats Localization Using Spectral Techniques [C]. Romania: IEEE Intelligent Computer Communication and Processing; 2007.

Pop G.P., Lupu E. DNA repeats detection using BW spectrograms. In: IEEE-TTTC International Conference on Automation, Quality and Testing, Robotics, AQTR 2008; Romania: Tome III; 2008:408–412.

Sharma D., Issac B., Raghava G.P.S., Ramaswamy R. Spectral repeat finder (SRF): identification of repetitive sequences using Fourier transforms. Bioinformatics. 2004;20(9):1405–1412.

Sharma S.D., Shakya D.K., Sharma S.N. Advanced Numerical representation of DNA sequences. In: IEEE Proceeding of ICCET. 2011.

Sharma S.D., Saxena R., Sharma S.N. Identification of microsatellites in DNA using adaptive S-transform. IEEE J. Biomed. Health Inform. 2014;19(3):1097–1105.

Sharma S.D., Saxena R., Sharma S.N.Tandem repeats detection in DNA sequences using Kaiser window based adaptive S-transform. Bio-Algorith. Med. Syst. 2017;13(3):167–173. doi:10.1515/bams-2017-0014.

Tran T.T., Emanuele II V.A., Zhou G.-T. Techniques for detecting approximate tandem repeats in DNA. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. (ICASSP 2004). 449–452. 2004;vol. 5.

Usdin K.The biological effects of simple tandem repeats: lessons from the repeat expansion diseases. Genome Res. 2008;18:1011–1019.

Yin C.Identification of repeats in DNA sequences using nucleotide distribution uniformity. J. Theor. Biol. 2017;412:138–145.

Zhou H., Liping D., Yan H. Detection of tandem repeats in DNA sequences based on parametric spectral estimation. IEEE Trans. Inf. Technol. Biomed. 2009;13(5).

Zribi S., Messaoudi I., Oueslati A.E., Lachiri Z. Microsatellite’s detection using the S -transform analysis based on the synthetic and experimental coding. Int. J. Adv. Comput. Sci. Appl. 2019;10(3).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.233.43