How it works...

After loading the library, the first step sets up a variable called sqlfile that holds the path to a local SQLite file. The file contains all of the information about the studies on SRA. In our example, we are using a small version shipped within the package itself (hence, we're locating it with the system.file() function); the real file is >50 GB in size, so we won't use it now, but it can be retrieved using this replacement code: sqlfile <- getSRAdbFile(). Once we have a sqlfile object, we can create a connection to the database with the dbConnect() function. We save the connection in the object named sra_con for reuse.
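The connection setup described above can be sketched as follows. This is a minimal sketch that assumes the SRAdb package (and with it RSQLite) is installed and that the bundled demo database is named SRAmetadb_demo.sqlite, as in the SRAdb vignette. The dbListTables() call is an addition for checking the connection and is not part of the recipe.

```r
library(DBI)
library(SRAdb)  # attaches RSQLite, which provides SQLite()

# Path to the small demo database shipped inside the SRAdb package
# (the file name is an assumption taken from the SRAdb vignette)
sqlfile <- file.path(system.file("extdata", package = "SRAdb"),
                     "SRAmetadb_demo.sqlite")

# For the full metadata file (a very large download) the replacement would be:
# sqlfile <- getSRAdbFile()

# Open the connection and keep it for reuse
sra_con <- dbConnect(SQLite(), sqlfile)

# Sanity check: list the tables available in the database
tables <- dbListTables(sra_con)
print(tables)

# Close the connection once you have finished with it
dbDisconnect(sra_con)
```

Closing the connection with dbDisconnect() is only needed once all of the queries are done; in the recipe, sra_con stays open across the later steps.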

We then perform a query on the database using the dbGetQuery() function. The first argument is the database connection (sra_con), and the second is a full query in SQL format. The query is fairly self-explanatory: it returns study_accession and study_description for every study whose description contains the term coli. Much more complicated queries are possible if you're prepared to write them in SQL. A tutorial on that is far beyond the scope of this recipe, but there are numerous books dedicated to the subject; you could try SQL for Data Analytics by Upom Malik, Matt Goldwasser, and Benjamin Johnston, Packt Publishing: https://www.packtpub.com/big-data-and-business-intelligence/sql-data-analysis. The query returns a dataframe object that looks like this:

## study_accession    study_description
## ERP000350 Transcriptome sequencing of E.coli K12 in LB media in early exponential phase and transition to stationary phase
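To see how a query of this shape behaves without touching the SRA database at all, the following sketch builds a toy study table in an in-memory SQLite database. The table contents are an assumption made for illustration: the first row reuses the result shown above, and the second row (accession XXP000001) is hypothetical and exists only so that the filter has something to exclude.

```r
library(DBI)
library(RSQLite)

# In-memory database standing in for the SRA metadata file
con <- dbConnect(SQLite(), ":memory:")

# Toy 'study' table; the second row is hypothetical
toy_study <- data.frame(
  study_accession = c("ERP000350", "XXP000001"),
  study_description = c(
    "Transcriptome sequencing of E.coli K12 in LB media in early exponential phase and transition to stationary phase",
    "Hypothetical yeast growth study"
  ),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "study", toy_study)

# Same shape of query as in the recipe: match descriptions containing 'coli'
hits <- dbGetQuery(
  con,
  "SELECT study_accession, study_description
     FROM study
    WHERE study_description LIKE '%coli%'"
)

# Equivalent query with the search pattern passed as a bound parameter
hits_param <- dbGetQuery(
  con,
  "SELECT study_accession, study_description
     FROM study
    WHERE study_description LIKE ?",
  params = list("%coli%")
)

print(hits)
dbDisconnect(con)
```

Binding the pattern with params keeps the SQL text fixed and avoids pasting user input into the query string, which is worth doing once the search term comes from a variable.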

Step 3 uses the accession number we extracted to get all of the submission, sample, experiment, and run information related to the study with the sraConvert() function. This returns something like the following table, in which we can see the run IDs for this study; the runs identify the actual files containing the sequence:

##    study submission    sample experiment       run
## 1 ERP000350 ERA014184 ERS016116 ERX007970 ERR019652
## 2 ERP000350 ERA014184 ERS016115 ERX007969 ERR019653
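The conversion step can be sketched as follows, again assuming that SRAdb is installed and that the demo database file name matches the SRAdb vignette. Pulling the run column out into a run_ids vector is an addition for convenience and is not part of the recipe.

```r
library(SRAdb)  # attaches RSQLite, which provides SQLite()

# Connect to the demo database shipped with SRAdb (file name is an assumption)
sqlfile <- file.path(system.file("extdata", package = "SRAdb"),
                     "SRAmetadb_demo.sqlite")
sra_con <- dbConnect(SQLite(), sqlfile)

# Expand the study accession into its related accessions
conversion <- sraConvert(c("ERP000350"), sra_con = sra_con)
print(conversion)

# Keep just the run accessions for the later steps
run_ids <- unique(conversion$run)

dbDisconnect(sra_con)
```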

In Step 4, we use the listSRAfile() function to get the actual FTP address on the server for the specific sequences in a run. This provides the address of the SRA format file, which is a compact, compressed format:

##         run     study    sample experiment ftp
## 1 ERR019652 ERP000350 ERS016116 ERX007970 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019652/ERR019652.sra
## 2 ERR019653 ERP000350 ERS016115 ERX007969 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019653/ERR019653.sra

In Step 5, however, we use the getSRAfile() function with the fileType argument set to fastq to get the data in the standard fastq format instead. The files are downloaded into the folder specified in the destDir argument.
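Steps 4 and 5 can be sketched together as follows. This assumes that SRAdb is installed and that the demo database file name matches the SRAdb vignette. The do_download flag and the use of tempdir() as the destination are hypothetical additions so that the example does not start a large download by accident, and whether the listed server paths still resolve is not guaranteed.

```r
library(SRAdb)  # attaches RSQLite, which provides SQLite()

# Connect to the demo database shipped with SRAdb (file name is an assumption)
sqlfile <- file.path(system.file("extdata", package = "SRAdb"),
                     "SRAmetadb_demo.sqlite")
sra_con <- dbConnect(SQLite(), sqlfile)

runs <- c("ERR019652", "ERR019653")

# Step 4: list the server addresses of the SRA format files for each run
sra_files <- listSRAfile(runs, sra_con, fileType = "sra")
print(sra_files)

# Step 5: download the reads as fastq into a chosen folder.
# Guarded by a hypothetical flag because it needs network access and disk space.
do_download <- FALSE
if (do_download) {
  getSRAfile(runs, sra_con, fileType = "fastq", destDir = tempdir())
}

dbDisconnect(sra_con)
```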
