How it works...

This is a long and involved pipeline with a few complicated steps. After loading the libraries, the first four lines set up the files we need from the dataset directory. Note that we need a .bam file and a FASTA file. Next, we create a GmapGenome object using the gmapR::GmapGenome() function with the fasta object; this describes the genome to the later variant-calling function. The next two functions we use, TallyVariantsParam() and VariantCallingFilters(), are vital for the correct calling and filtering of candidate SNPs. These are the functions in which you set the parameters that define an SNP or indel. The options here are deliberately very poor: as the output shows, only 6 SNPs are called, when we created 64.

Once the parameters are defined, we use the callVariants() function with all of the information we set up to get a VRanges object of variants.

This results in the following output:

## VRanges object with 6 ranges and 17 metadata columns:
##           seqnames    ranges strand         ref              alt
##              <Rle> <IRanges>  <Rle> <character> <characterOrRle>
##   [1] NC_000017.10        64      *           G                T
##   [2] NC_000017.10        69      *           G                T
##   [3] NC_000017.10        70      *           G                T
##   [4] NC_000017.10        73      *           T                A
##   [5] NC_000017.10        77      *           T                A
##   [6] NC_000017.10        78      *           G                T
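
The variant-calling steps described above can be sketched roughly as follows. This is a minimal sketch rather than the recipe's exact code: the helper name call_snps(), the file paths, and the threshold values are assumptions chosen for illustration, and it assumes the gmapR, rtracklayer, and VariantTools packages are installed.

```r
# Hypothetical helper: goes from a BAM + FASTA pair to a VRanges object of variants.
call_snps <- function(bam_path, fasta_path,
                      min_base_quality = 23L,  # illustrative threshold
                      min_read_count = 19L,    # illustrative threshold
                      p_lower = 0.01) {        # illustrative threshold
  # Describe the reference genome to the variant caller
  fasta_file <- rtracklayer::FastaFile(fasta_path)
  genome <- gmapR::GmapGenome(fasta_file, create = TRUE)

  # How bases are tallied: base-quality cutoff and whether indels are counted
  tally_params <- VariantTools::TallyVariantsParam(
    genome,
    high_base_quality = min_base_quality,
    indels = TRUE
  )

  # How candidate variants are filtered: supporting reads and p-value bound
  calling_filters <- VariantTools::VariantCallingFilters(
    read.count = min_read_count,
    p.lower = p_lower
  )

  # Returns a VRanges object of called variants
  VariantTools::callVariants(bam_path, tally_params,
                             calling.filters = calling_filters)
}

# Usage (placeholder paths, not files shipped with the recipe):
# called_variants <- call_snps("path/to/reads.bam", "path/to/reference.fa")
```

Loosening or tightening the three thresholds is the main lever for changing how many candidate SNPs survive filtering.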

We can then set up the GRanges object of annotations from the GFF file (I have also provided a function for getting annotations from BED files).

The final step is to use the powerful overlapping and subsetting capability of the XRanges objects. We use GenomicRanges::findOverlaps() to find the actual overlaps; the returned Hits object contains, for each overlap, the indices of the overlapping ranges in each input object.

This results in the following output:

## Hits object with 12684 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]     35176           1
##       [2]     35176           2
##       [3]     35176           3
##       [4]     35177           1

Hence, we can use subjectHits(overlaps) to directly subset the annotation and keep only the features with SNPs inside. A feature is returned once for every hit it takes part in, so wrap the indices in unique() if each feature should appear only once.

This results in the following output:

## GRanges object with 12684 ranges and 20 metadata columns:
##               seqnames      ranges strand |   source       type     score
##                  <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric>
##       [1] NC_000017.10 64099-76866      - |   havana ncRNA_gene      <NA>
##       [2] NC_000017.10 64099-76866      - |   havana    lnc_RNA      <NA>
##       [3] NC_000017.10 64099-65736      - |   havana       exon      <NA>
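
The overlap-and-subset step can be tried on a toy example. The sketch below uses made-up coordinates and gene names (snps, genes, geneA and so on are illustrative stand-ins, not the recipe's data) and assumes the GenomicRanges package is installed.

```r
library(GenomicRanges)

# Toy stand-ins for the called variants and the annotation (made-up coordinates)
snps <- GRanges("chr1", IRanges(start = c(5, 15, 50), width = 1))
genes <- GRanges(
  "chr1",
  IRanges(start = c(1, 10, 100), end = c(20, 30, 120)),
  gene_id = c("geneA", "geneB", "geneC")
)

# One hit per overlapping (SNP, gene) pair
overlaps <- findOverlaps(snps, genes)

# Indexing with subjectHits() returns a gene once per hit,
# so geneA (overlapped by two SNPs) appears twice here
genes_per_hit <- genes[subjectHits(overlaps)]

# unique() on the indices keeps each overlapped gene only once
genes_with_snps <- genes[unique(subjectHits(overlaps))]
genes_with_snps$gene_id
```

Here geneA and geneB each contain at least one SNP, while geneC contains none and is dropped from both results.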
