Life scientists today urgently need training in bioinformatics skills. Too many bioinformatics programs are poorly written and barely maintained, usually by students and researchers who've never learned basic programming skills. This practical guide shows postdocs, bioinformatics professionals, and students how to exploit the best parts of Python to solve problems in biology while creating documented, tested, reproducible software.

Ken Youens-Clark, author of Tiny Python Projects (Manning), demonstrates not only how to write effective Python code but also how to use tests to write and refactor scientific programs. You'll learn the latest Python features and tools including linters, formatters, type checkers, and tests to create documented and tested programs. You'll also tackle 14 challenges in Rosalind, a problem-solving platform for learning bioinformatics and programming.

  • Create command-line Python programs to document and validate parameters
  • Write tests to verify and refactor programs and confirm they're correct
  • Address bioinformatics ideas using Python data structures and modules such as Biopython
  • Create reproducible shortcuts and workflows using makefiles
  • Parse essential bioinformatics file formats such as FASTA and FASTQ
  • Find patterns of text using regular expressions
  • Use higher-order functions in Python like filter(), map(), and reduce()

Table of Contents

  1. Preface
    1. Who Should Read This?
    2. Programming Style: Why I Avoid OOP and Exceptions
    3. Structure
    4. Test-Driven Development
    5. Using the Command Line and Installing Python
    6. Getting the Code and Tests
    7. Installing Modules
    8. Installing the new.py Program
    9. Why Did I Write This Book?
    10. Conventions Used in This Book
    11. Using Code Examples
    12. O’Reilly Online Learning
    13. How to Contact Us
    14. Acknowledgments
  2. I. The Rosalind.info Challenges
  3. 1. Tetranucleotide Frequency: Counting Things
    1. Getting Started
    2. Creating the Program Using new.py
    3. Using argparse
    4. Tools for Finding Errors in the Code
    5. Introducing Named Tuples
    6. Adding Types to Named Tuples
    7. Representing the Arguments with a NamedTuple
    8. Reading Input from the Command Line or a File
    9. Testing Your Program
    10. Running the Program to Test the Output
    11. Solution 1: Iterating and Counting the Characters in a String
    12. Counting the Nucleotides
    13. Writing and Verifying a Solution
    14. Additional Solutions
    15. Solution 2: Creating a count() Function and Adding a Unit Test
    16. Solution 3: Using str.count()
    17. Solution 4: Using a Dictionary to Count All the Characters
    18. Solution 5: Counting Only the Desired Bases
    19. Solution 6: Using collections.defaultdict()
    20. Solution 7: Using collections.Counter()
    21. Going Further
    22. Review
  4. 2. Transcribing DNA into mRNA: Mutating Strings, Reading and Writing Files
    1. Getting Started
    2. Defining the Program’s Parameters
    3. Defining an Optional Parameter
    4. Defining One or More Required Positional Parameters
    5. Using nargs to Define the Number of Arguments
    6. Using argparse.FileType() to Validate File Arguments
    7. Defining the Args Class
    8. Outlining the Program Using Pseudocode
    9. Iterating the Input Files
    10. Creating the Output Filenames
    11. Opening the Output Files
    12. Writing the Output Sequences
    13. Printing the Status Report
    14. Using the Test Suite
    15. Solutions
    16. Solution 1: Using str.replace()
    17. Solution 2: Using re.sub()
    18. Benchmarking
    19. Going Further
    20. Review
  5. 3. Reverse Complement of DNA: String Manipulation
    1. Getting Started
    2. Iterating Over a Reversed String
    3. Creating a Decision Tree
    4. Refactoring
    5. Solutions
    6. Solution 1: Using a for Loop and Decision Tree
    7. Solution 2: Using a Dictionary Lookup
    8. Solution 3: Using a List Comprehension
    9. Solution 4: Using str.translate()
    10. Solution 5: Using Bio.Seq
    11. Review
  6. 4. Creating the Fibonacci Sequence: Writing, Testing, and Benchmarking Algorithms
    1. Getting Started
    2. An Imperative Approach
    3. Solutions
    4. Solution 1: An Imperative Solution Using a List as a Stack
    5. Solution 2: Creating a Generator Function
    6. Solution 3: Using Recursion and Memoization
    7. Benchmarking the Solutions
    8. Testing the Good, the Bad, and the Ugly
    9. Running the Test Suite on All the Solutions
    10. Going Further
    11. Review
  7. 5. Computing GC Content: Parsing FASTA and Analyzing Sequences
    1. Getting Started
    2. Get Parsing FASTA Using Biopython
    3. Iterating the Sequences Using a for Loop
    4. Solutions
    5. Solution 1: Using a List
    6. Solution 2: Type Annotations and Unit Tests
    7. Solution 3: Keeping a Running Max Variable
    8. Solution 4: Using a List Comprehension with a Guard
    9. Solution 5: Using the filter() Function
    10. Solution 6: Using the map() Function and Summing Booleans
    11. Solution 7: Using Regular Expressions to Find Patterns
    12. Solution 8: A More Complex find_gc() Function
    13. Benchmarking
    14. Going Further
    15. Review
  8. 6. Finding the Hamming Distance: Counting Point Mutations
    1. Getting Started
    2. Iterating the Characters of Two Strings
    3. Solutions
    4. Solution 1: Iterating and Counting
    5. Solution 2: Creating a Unit Test
    6. Solution 3: Using the zip() Function
    7. Solution 4: Using the zip_longest() Function
    8. Solution 5: Using a List Comprehension
    9. Solution 6: Using the filter() Function
    10. Solution 7: Using the map() Function with zip_longest()
    11. Solution 8: Using the starmap() and operator.ne() Functions
    12. Going Further
    13. Review
  9. 7. Translating mRNA into Protein: More Functional Programming
    1. Getting Started
    2. K-mers and Codons
    3. Translating Codons
    4. Solutions
    5. Solution 1: Using a for Loop
    6. Solution 2: Adding Unit Tests
    7. Solution 3: Another Function and a List Comprehension
    8. Solution 4: Functional Programming with the map(), partial(), and takewhile() Functions
    9. Solution 5: Using Bio.Seq.translate()
    10. Benchmarking
    11. Going Further
    12. Review
  10. 8. Find a Motif in DNA: Exploring Sequence Similarity
    1. Getting Started
    2. Finding Subsequences
    3. Solutions
    4. Solution 1: Using the str.find() Method
    5. Solution 2: Using the str.index() Method
    6. Solution 3: A Purely Functional Approach
    7. Solution 4: Using K-mers
    8. Solution 5: Finding Overlapping Patterns Using Regular Expressions
    9. Benchmarking
    10. Going Further
    11. Review
  11. 9. Overlap Graphs: Sequence Assembly Using Shared K-mers
    1. Getting Started
    2. Managing Runtime Messages with STDOUT, STDERR, and Logging
    3. Finding Overlaps
    4. Grouping Sequences by the Overlap
    5. Solutions
    6. Solution 1: Using Set Intersections to Find Overlaps
    7. Solution 2: Using a Graph to Find All Paths
    8. Going Further
    9. Review
  12. 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search
    1. Getting Started
    2. Finding the Shortest Sequence in a FASTA File
    3. Extracting K-mers from a Sequence
    4. Solutions
    5. Solution 1: Counting Frequencies of K-mers
    6. Solution 2: Speeding Things Up with a Binary Search
    7. Going Further
    8. Review
  13. 11. Finding a Protein Motif: Fetching Data and Using Regular Expressions
    1. Getting Started
    2. Downloading Sequences Files on the Command Line
    3. Downloading Sequences Files with Python
    4. Writing a Regular Expression to Find the Motif
    5. Solutions
    6. Solution 1: Using a Regular Expression
    7. Solution 2: Writing a Manual Solution
    8. Going Further
    9. Review
  14. 12. Inferring mRNA from Protein: Products and Reductions of Lists
    1. Getting Started
    2. Creating the Product of Lists
    3. Avoiding Overflow with Modular Multiplication
    4. Solutions
    5. Solution 1: Using a Dictionary for the RNA Codon Table
    6. Solution 2: Turn the Beat Around
    7. Solution 3: Encoding the Minimal Information
    8. Going Further
    9. Review
  15. 13. Locating Restriction Sites: Using, Testing, and Sharing Code
    1. Getting Started
    2. Finding All Subsequences Using K-mers
    3. Finding All Reverse Complements
    4. Putting It All Together
    5. Solutions
    6. Solution 1: Using the zip() and enumerate() Functions
    7. Solution 2: Using the operator.eq() Function
    8. Solution 3: Writing a revp() Function
    9. Testing the Program
    10. Going Further
    11. Review
  16. 14. Finding Open Reading Frames
    1. Getting Started
    2. Translating Proteins Inside Each Frame
    3. Finding the ORFs in a Protein Sequence
    4. Solutions
    5. Solution 1: Using the str.index() Function
    6. Solution 2: Using the str.partition() Function
    7. Solution 3: Using a Regular Expression
    8. Going Further
    9. Review
  17. II. Other Programs
  18. 15. Seqmagique: Creating and Formatting Reports
    1. Using Seqmagick to Analyze Sequence Files
    2. Checking Files Using MD5 Hashes
    3. Getting Started
    4. Formatting Text Tables Using tabulate()
    5. Solutions
    6. Solution 1: Formatting with tabulate()
    7. Solution 2: Formatting with rich
    8. Going Further
    9. Review
  19. 16. FASTX grep: Creating a Utility Program to Select Sequences
    1. Finding Lines in a File Using grep
    2. The Structure of a FASTQ Record
    3. Getting Started
    4. Guessing the File Format
    5. Solution
    6. Going Further
    7. Review
  20. 17. DNA Synthesizer: Creating Synthetic Data with Markov Chains
    1. Understanding Markov Chains
    2. Getting Started
    3. Understanding Random Seeds
    4. Reading the Training Files
    5. Generating the Sequences
    6. Structuring the Program
    7. Solution
    8. Going Further
    9. Review
  21. 18. FASTX Sampler: Randomly Subsampling Sequence Files
    1. Getting Started
    2. Reviewing the Program Parameters
    3. Defining the Parameters
    4. Nondeterministic Sampling
    5. Structuring the Program
    6. Solutions
    7. Solution 1: Reading Regular Files
    8. Solution 2: Reading a Large Number of Compressed Files
    9. Going Further
    10. Review
  22. 19. Blastomatic: Parsing Delimited Text Files
    1. Introduction to BLAST
    2. Using csvkit and csvchk
    3. Getting Started
    4. Defining the Arguments
    5. Parsing Delimited Text Files Using the csv Module
    6. Parsing Delimited Text Files Using the pandas Module
    7. Solutions
    8. Solution 1: Manually Joining the Tables Using Dictionaries
    9. Solution 2: Writing the Output File with csv.DictWriter()
    10. Solution 3: Reading and Writing Files Using pandas
    11. Solution 4: Joining Files Using pandas
    12. Going Further
    13. Review
  23. A. Documenting Commands and Creating Workflows with make
    1. Makefiles Are Recipes
    2. Running a Specific Target
    3. Running with No Target
    4. Makefiles Create DAGs
    5. Using make to Compile a C Program
    6. Using make for a Shortcut
    7. Defining Variables
    8. Writing a Workflow
    9. Other Workflow Managers
    10. Further Reading
  24. B. Understanding $PATH and Installing Command-Line Programs
  25. Epilogue
  26. Index