How it works...

The first part of this recipe is where we predict the ORFs. We start by loading the DNA sequence as a DNAStringSet object using readDNAStringSet() from Biostrings. The predORF() function from systemPipeR takes this object as input and predicts open reading frames (ORFs) according to the options set. Here, we return all ORFs on both strands.
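As a rough sketch of this step, the following runs predORF() on a small random toy sequence rather than the recipe's input file, so it can be tried without any data. The toy sequence, its name, and the seed are assumptions for illustration; the options n = "all" and strand = "both" mirror the "all ORFs on both strands" setting described above, and type = "gr" asks for a GRanges result.

```r
library(Biostrings)
library(systemPipeR)

# Toy input (an assumption): in the recipe, dna_object comes from
# readDNAStringSet() applied to the input FASTA file instead.
set.seed(1)
toy_seq <- paste0(sample(c("A", "C", "G", "T"), size = 3000, replace = TRUE),
                  collapse = "")
dna_object <- DNAStringSet(toy_seq)
names(dna_object) <- "toy_sequence"

# Predict every ORF on both strands and return them as a GRanges object.
predicted_orfs <- predORF(dna_object, n = "all", type = "gr",
                          mode = "orf", strand = "both")
predicted_orfs
```

Each row of the result describes one ORF: its coordinates, its strand, and the metadata columns that predORF() adds.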

For the chloroplast sequence used in this recipe, this results in the following output:

## GRanges object with 2501 ranges and 2 metadata columns:
##           seqnames        ranges strand | subject_id inframe2end
##              <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast   86762-93358      + |          1           2
##   1162 chloroplast     2056-2532      - |          1           3
##      2 chloroplast   72371-73897      + |          2           2
##   1163 chloroplast   77901-78362      - |          2           1
##      3 chloroplast   54937-56397      + |          3           3

We receive a GRanges object in return, describing 2,501 open reading frames. This is far too many, so we need to filter some of them out; in particular, we want to discard the ORFs that could have occurred in the sequence by chance. To do this, we need to run a small simulation, and that's what happens in the next section of code.

To estimate the length that random ORFs can reach, we create a series of random genomes with the same length and the same base proportions as our input sequence, and find the longest ORF that can be predicted in each. After a few iterations of this, we have an idea of how long an ORF occurring by chance could be. This length serves as a cut-off that we can use to reject predicted ORFs in the real sequence.

Achieving this needs a bit of setup and a custom function. First, we define the bases we will use as a simple character vector. Then, we get a character vector of the original DNA sequence by splitting the as.character() version of dna_object. We use this information to work out the proportion of each base in the input sequence by first counting the occurrences of each base (giving counts), then dividing each count by the sequence length (giving probs). In both steps, we use lapply() to loop over the vector bases and then the list counts, applying an anonymous function to each element to give a list of results. unlist() is used on the final list to reduce it to a simple vector.
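The setup can be sketched in base R as follows. This is a minimal illustration on a short made-up sequence, not the recipe's exact code: toy_sequence stands in for as.character(dna_object), and the names raw_seq and seq_length are assumptions introduced here.

```r
# Toy sequence standing in for as.character(dna_object) (an assumption).
toy_sequence <- "ATGGCGTACGTTAGCATGCCGTAA"
bases <- c("A", "C", "T", "G")

# Split the string into a character vector with one base per element.
raw_seq <- strsplit(toy_sequence, "")[[1]]
seq_length <- length(raw_seq)

# Count the occurrences of each base, then divide by the sequence length.
counts <- lapply(bases, function(base) sum(raw_seq == base))
probs <- unlist(lapply(counts, function(base_count) base_count / seq_length))

# Label the proportions so they are easier to read when printed.
names(probs) <- bases
probs
```

Because every position in the toy sequence is one of the four bases, the proportions in probs sum to 1, which is what sample() expects when they are later used as sampling probabilities.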

Once the setup is done, we can build our get_longest_orf_in_random_genome() simulation function. This generates a random genome by sampling length characters from the selection in bases with the probabilities given in probs. The resulting vector is collapsed into a single string with paste0() and then converted into a DNAStringSet object for the predORF() function. This time, we ask for only the longest ORF using n = 1 and return its length.
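A function along these lines might look like the sketch below. The function name and the length, probs, bases, and n = 1 settings come from the description above; the default values, the unused first argument x (there so the function can be driven by lapply()), and the sequence name are assumptions for illustration.

```r
library(Biostrings)
library(systemPipeR)

get_longest_orf_in_random_genome <- function(x, length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "C", "T", "G")) {
  # x is an unused iteration index (an assumption) so lapply() can call this.
  # Sample 'length' bases with the given proportions and join them into one string.
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs),
                          collapse = "")

  # Wrap the string in a named DNAStringSet so predORF() can process it.
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- "random_dna_string"

  # Ask for only the longest ORF (n = 1), searching both strands.
  orfs <- predORF(random_dna_object, n = 1, type = "gr",
                  mode = "orf", strand = "both")

  # Return the length of the longest ORF found.
  max(width(orfs))
}
```

Calling get_longest_orf_in_random_genome(1, length = 2000) returns a single number, which differs from run to run unless a seed is set with set.seed().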

Running the simulation and filtering the predicted ORFs, as described next, results in the following output:

## GRanges object with 10 ranges and 2 metadata columns:
##         seqnames        ranges strand | subject_id inframe2end
##            <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##    1 chloroplast   86762-93358      + |          1           2
##    2 chloroplast   72371-73897      + |          2           2
##    3 chloroplast   54937-56397      + |          3           3
##    4 chloroplast   57147-58541      + |          4           1

Now, we can run the function, which we do 10 times using lapply() with the length, probs, and bases information we calculated before. unlist() turns the result into a simple vector, and we extract the longest ORF length from the 10 runs with max(). Finally, we subset our original predicted_orfs GRanges object to keep only the ORFs that are longer than the longest one generated by chance.
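The final filtering step can be illustrated with a small hand-built example. Everything here is made up for illustration: the three toy ranges stand in for the real predicted_orfs object, and the values in random_lengths stand in for the simulated lengths; the names longest_random_orf, keep, and orfs_to_keep are assumptions.

```r
library(GenomicRanges)

# Toy stand-in for the predicted ORFs (widths 150, 61 and 1101).
predicted_orfs <- GRanges(
  seqnames = "toy",
  ranges = IRanges(start = c(1, 200, 900), end = c(150, 260, 2000)),
  strand = c("+", "-", "+")
)

# Made-up simulation results. In the recipe, this vector would instead be
# built by calling get_longest_orf_in_random_genome() 10 times through
# lapply() and flattening the result with unlist().
random_lengths <- c(210, 180, 333, 240)

# The longest chance ORF is the cut-off.
longest_random_orf <- max(random_lengths)

# Keep only the ORFs that are strictly longer than the cut-off.
keep <- width(predicted_orfs) > longest_random_orf
orfs_to_keep <- predicted_orfs[keep]
orfs_to_keep
```

Using the maximum over all runs is a conservative choice: an ORF is kept only if it is longer than anything the random genomes produced, so more simulation runs tend to give a stricter cut-off.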
