Chapter 12

Building Strings

Strings are useful for more than representing sentences and words. For example, genetic information is usually represented by character strings.

DNA is a double helix of two chains of nucleotides. Each nucleotide base can be represented by a single letter, and so a chain of nucleotides can be thought of as a string. Even though DNA has two chains (also called strands), the two are closely related: given one, it is easy to calculate the other. Thus, DNA is generally described by one string of characters representing the nucleotide bases of one of its strands.

The four bases that make up DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

The second strand of DNA is always the reverse complement of the first. The complement of a strand of DNA swaps each base with its complementary base (A with T, and C with G), and the reverse complement reverses the order of that complementary sequence. For example, the complement of AGGTC is TCCAG, and the reverse complement is GACCT.

Listing 12.1: DNA Sequences

 1 # dna.py
 2
 3 from random import choice
 4
 5 def complement(dna):
 6     result = ""
 7     for c in dna:
 8         if c == "A":
 9             result += "T"
10         elif c == "T":
11             result += "A"
12         elif c == "C":
13             result += "G"
14         elif c == "G":
15             result += "C"
16     return result
17
18 def reversecomp(dna):
19     return complement(dna)[::-1]
20
21 def random_dna(length=30):
22     fragment = ""
23     for j in range(length):
24         fragment += choice("ACGT")
25     return fragment
26
27 def main():
28     for i in range(10):
29         dna = random_dna()
30         print(dna + " " + reversecomp(dna))
31         print(complement(dna) + "\n")
32
33 main()

This program generates random strings of DNA and displays each with its reverse complement. On the left side of the output, you will see the random strand with its complement below it. This is how the two strands are tied together in the double helix: each base binds with its complement across the helix. However, while the top strand is read from left to right, the second strand is read from right to left, so the reverse complement, printed on the right, is more useful than the complement.
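To make the output layout concrete, here is a minimal sketch that builds the same arrangement for the fixed strand AGGTC from the earlier example instead of a random one. The names COMPLEMENT and layout() are hypothetical, and str.translate() is used only as a compact stand-in for the complement() function of Listing 12.1.

```python
# Sketch only: COMPLEMENT and layout() are illustrative names,
# not part of Listing 12.1.
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # A<->T, C<->G

def layout(dna):
    """Return the strand, its reverse complement, and its complement,
    arranged the same way main() in Listing 12.1 prints them."""
    comp = dna.translate(COMPLEMENT)
    return dna + " " + comp[::-1] + "\n" + comp + "\n"

print(layout("AGGTC"))
# AGGTC GACCT
# TCCAG
```

The strand and its complement line up vertically on the left, and the reverse complement appears to the right of the strand.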

String Accumulators

Both the complement() and random_dna() functions use string accumulation loops to build their return values. These follow the same pattern as numeric accumulation loops (see Chapter 4):

<accumulator> = <starting value>
loop:
 <accumulator> += <string to add> # adds on the right

The reason these are so similar is that concatenation is analogous to addition for strings, and Python uses “+” to represent both.
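As an illustration of the parallel, the following sketch places a numeric accumulator next to a string accumulator. The function names sum_to() and digits_to() are hypothetical and chosen only for this example; str() is the type conversion that turns each integer into a string so it can be concatenated.

```python
def sum_to(n):
    """Numeric accumulator: add up 1 + 2 + ... + n."""
    total = 0                  # starting value for a sum
    for i in range(1, n + 1):
        total += i             # addition
    return total

def digits_to(n):
    """String accumulator: concatenate "1", "2", ..., up to n."""
    result = ""                # starting value for a string
    for i in range(1, n + 1):
        result += str(i)       # concatenation, adds on the right
    return result

print(sum_to(5))       # 15
print(digits_to(5))    # 12345
```

Both functions have the same shape: initialize before the loop, update inside the loop, and return after the loop.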

There are two main differences between numeric and string accumulators:

The starting value for string accumulators is usually the empty string "", denoted in Python by two quotation marks with nothing between them.

Concatenation is not commutative. Adding on the right is usually different from adding on the left (see Exercise 12.2):

s = s + t    Add t to s on the right.

s = t + s    Add t to s on the left.

Most of the time (as in Listing 12.1), you will want to add on the right, and you can use the shorthand “+=” to do that. However, if you need to add on the left, you will have to write out the full statement instead of using the shorthand.

Loops over Strings

There is another new feature in the complement() function of Listing 12.1: the for loop in line 7. In fact, the syntax is not new:

for <variable> in <string>: # loop over each character
 <body>

Compare this with the syntax given in Chapter 4: they are identical except that the earlier version wrote <sequence> instead of <string>. Python treats strings as sequences of characters, and so when a string is used in a for loop, the variable takes on the value of each character in the string, one at a time from left to right.
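A short sketch of a loop over a string is shown here. The function spaced() is a hypothetical example, not part of Listing 12.1; it combines a string loop with a string accumulator.

```python
def spaced(s):
    """Return s with a space added after every character."""
    result = ""
    for c in s:              # c takes on each character of s in turn
        result += c + " "
    return result

for c in "ACGT":
    print(c)                 # prints A, C, G, and T on separate lines

print(spaced("GATTACA"))     # G A T T A C A (with a trailing space)
```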

Escape Sequences

The main() function in Listing 12.1 uses one other new feature: an escape sequence inside of a string. Escape sequences begin with a backslash “\” and are used to insert special characters into a string:

\n    Newline.

\t    Tab.

\"    To get " inside a double-quoted string.

\'    To get ' inside a single-quoted string.

\\    If you need a backslash itself.

Escape sequences and concatenation give us more control over printing than we have had to this point.
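The following sketch shows the common escape sequences in use. The function table_row() is a hypothetical helper, introduced only to show an escape sequence combined with concatenation.

```python
def table_row(label, count):
    """Build one line of output: label, a tab, the count, then a newline."""
    return label + "\t" + str(count) + "\n"

print("first line\nsecond line")    # \n starts a new line
print("She said \"hello\".")        # \" inside a double-quoted string
print('It\'s a palindrome.')        # \' inside a single-quoted string
print("one backslash: \\")          # \\ produces a single backslash
print(table_row("A", 7) + table_row("C", 12))

# Each escape sequence is stored as a single character.
print(len("\n"), len("\\"))         # 1 1
```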

Exercises

  12.1 Use Listing 12.1 to answer these questions:
    (a) Identify the accumulator variables in both the complement() and random_dna() functions. Do these accumulate on the right or left? How do you know?
    (b) Find the code that reverses the (complement) string.
    (c) Modify the reversecomp() function to call a separate reverse(s) function to reverse the complement. Write the reverse() function, as well.
    (d) Look up the choice() function from the random module and explain what it does in the random_dna() function.
  12.2
    (a) Give example strings to show that concatenation on the right may produce different results from concatenation on the left.
    (b) Give example strings to show that concatenation on the right may produce the same result as concatenation on the left.
  12.3 Use a string accumulator to write a reverse(s) function that returns the string s in reverse order. Do not use a slice. Modify Listing 12.1 to use your function.
  12.4 Use a string loop to write a function is_dna(s) that returns True if s is a string of DNA nucleotides and otherwise returns False. Test your function on random strings of DNA, as well as other, non-DNA strings.
  12.5 Write a function random_rna(length) that returns a random fragment of RNA of the given length. RNA uses the same bases as DNA, except that uracil (U) replaces thymine (T). Include a main() function that prints 10 random strings of RNA of length 30.
  12.6 Write a function is_rna(s) that returns True if s is a string of RNA nucleotides and otherwise returns False. Test your function on random strings of RNA and DNA.
  12.7 DNA transcription transforms a strand of DNA to RNA by replacing every thymine (T) with uracil (U). Write a function transcription(dna) that takes a DNA string and returns the corresponding RNA. Use a string accumulator. Test your function on random strings of DNA.
  12.8 A fragment of DNA is a palindrome if it is the same as its reverse complement. (Note that this is different from the definition of English palindromes.) Write a function ispalindrome(dna) that returns True if the given dna is palindromic, and otherwise returns False. Use your function to write a program that finds a palindrome of length 10 by testing randomly generated strings of DNA until it finds one.
  12.9 Write a function countbases(dna) that counts the number of each of the four bases in the given string of dna. Return all four counts, separated by commas. Test your function on random strings of DNA.
  12.10 Write a function dec_to_bin(n) that takes a nonnegative integer n and returns the corresponding binary string (without the "0b" prefix). Do not use the built-in bin() function. Instead, use a string accumulator:

    repeat while n is positive:
      concatenate the bit n % 2 to the left end of the result
      integer-divide n by 2

    You may need a type conversion from Chapter 3. Write a main() that tests your function on all integers between 0 and 100.

  12.11 Write a function bin_to_dec(s) that takes a binary string (without the "0b") and returns the corresponding decimal integer. You may need a type conversion from Chapter 3. Test your function on the output from the previous exercise.