How it works...

The key to this approach is finding out whether the database we're using actually carries the PFAM domain information. That's what we do in step 1: we use the keytypes() function to list the search keys available. This will result in the following output:

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"     
## [25] "UNIGENE"      "UNIPROT"

PFAM can be seen in the results. Once we've verified that we can use this database for the information we want, we can follow a fairly standard procedure:

Get a list of keys to query with, such as gene names. Here, we pull them from the database directly, but they could come from anywhere.

Query the database with the select() function, which pulls data for the provided keys. The columns argument tells it which data to pull. The expression here is going to get PFAM IDs for our genes of interest. This will result in the following output. Note that because we're pulling data from an external database, changes in that database could be reflected here:

##           ENSEMBL    PFAM
## 1 ENSG00000121410 PF13895
## 2 ENSG00000175899 PF01835
## 3 ENSG00000175899 PF07678
## 4 ENSG00000175899 PF10569
## 5 ENSG00000175899 PF07703
## 6 ENSG00000175899 PF07677
## 7 ENSG00000175899 PF00207
## 8 ENSG00000256069    <NA>

Make a list of all PFAM IDs and descriptions. We load the PFAM.db package and use the PFAMDE object it provides to get a mapping between IDs and descriptions.

We then use the mappedkeys() function to get the IDs that actually have a description, and store them in the all_ids object.
Next, we use all_ids to extract the matching descriptions from the mapping and convert them to a data frame.
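
The steps up to this point can be sketched as follows. This is a minimal sketch, not the recipe's exact code: it assumes the annotation database is org.Hs.eg.db (the section does not name it) and that both it and PFAM.db are installed. The names db, k, result, descriptions, and id_description_mapping are illustrative only.

```r
# Assumption: org.Hs.eg.db is the annotation database in use.
# Substitute your own annotation package if it differs.
library(org.Hs.eg.db)
library(PFAM.db)

db <- org.Hs.eg.db

# Step 1: confirm that PFAM is among the available key types.
# stopifnot() halts with an error here if it is not.
stopifnot("PFAM" %in% keytypes(db))

# Pull a few Ensembl gene IDs straight from the database to query with.
k <- head(keys(db, keytype = "ENSEMBL"), n = 3)

# Fetch the PFAM IDs for those genes. The AnnotationDbi:: prefix avoids
# clashes with other packages that also export a select() function.
result <- AnnotationDbi::select(db, keys = k, keytype = "ENSEMBL",
                                columns = "PFAM")

# Build the ID-to-description table from PFAM.db.
descriptions <- PFAMDE                  # mapping: PFAM accession -> description
all_ids <- mappedkeys(descriptions)     # accessions that have a description
id_description_mapping <- as.data.frame(descriptions[all_ids])

head(result)
head(id_description_mapping)            # expected columns: ac and de
```

Checking the key types first matters because they differ between databases and releases, so a query that works against one version may fail against another.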

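And finally, we join the descriptions of the PFAM domains to the PFAM IDs we got earlier, using the columns with common data: PFAM in the query result and ac in the description data frame. Rows without a matching description show <NA> in the de column. This will result in the following output: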

##           ENSEMBL    PFAM                                       de
## 1 ENSG00000121410 PF13895                    Immunoglobulin domain
## 2 ENSG00000175899 PF01835                               MG2 domain
## 3 ENSG00000175899 PF07678               A-macroglobulin TED domain
## 4 ENSG00000175899 PF10569                                     <NA>
## 5 ENSG00000175899 PF07703 Alpha-2-macroglobulin bait region domain
## 6 ENSG00000175899 PF07677  A-macroglobulin receptor binding domain
## 7 ENSG00000175899 PF00207             Alpha-2-macroglobulin family
## 8 ENSG00000256069    <NA>                                     <NA>
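
To see the join in isolation, here is a small self-contained sketch in base R that uses only the rows shown in the output above. It uses merge() as a stand-in, so the recipe's own join call may differ. The data frame names result, id_description_mapping, and joined are illustrative.

```r
# PFAM IDs per gene, copied from the select() output shown earlier.
result <- data.frame(
  ENSEMBL = c("ENSG00000121410", rep("ENSG00000175899", 6), "ENSG00000256069"),
  PFAM = c("PF13895", "PF01835", "PF07678", "PF10569", "PF07703",
           "PF07677", "PF00207", NA),
  stringsAsFactors = FALSE
)

# A small stand-in for the ID-to-description table (columns ac and de).
# PF10569 is left out on purpose: it has no description in the output above.
id_description_mapping <- data.frame(
  ac = c("PF13895", "PF01835", "PF07678", "PF07703", "PF07677", "PF00207"),
  de = c("Immunoglobulin domain", "MG2 domain", "A-macroglobulin TED domain",
         "Alpha-2-macroglobulin bait region domain",
         "A-macroglobulin receptor binding domain",
         "Alpha-2-macroglobulin family"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of result, matching PFAM against ac.
joined <- merge(result, id_description_mapping,
                by.x = "PFAM", by.y = "ac", all.x = TRUE)
joined
```

Setting all.x = TRUE keeps genes and domains that have no description, which is why they appear with <NA> rather than being dropped. Unlike the output above, merge() puts the key column first and may reorder the rows.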
