The key to this approach is finding out whether the database we're using actually carries the PFAM domain information. That's what we do in step 1: we use the keytypes() function to list the search keys available. This will result in the following output:

    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"
    ##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"
    ##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"
    ## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"
    ## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"
    ## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"
    ## [25] "UNIGENE"      "UNIPROT"

PFAM can be seen in the results. Once we've verified that we can use this database for the information we want, we can follow a fairly standard procedure:
- Get a list of keys to query with, such as gene names. Here, we pull them from the database directly, but they could come from anywhere.
- Query the database with the select() function, which pulls data for the provided keys. The columns argument tells it which data to pull. The expression here gets the PFAM IDs for our genes of interest. This will result in the following output. Note that because we're pulling data from an external database, changes in that database could be reflected here:

    ##           ENSEMBL    PFAM
    ## 1 ENSG00000121410 PF13895
    ## 2 ENSG00000175899 PF01835
    ## 3 ENSG00000175899 PF07678
    ## 4 ENSG00000175899 PF10569
    ## 5 ENSG00000175899 PF07703
    ## 6 ENSG00000175899 PF07677
    ## 7 ENSG00000175899 PF00207
    ## 8 ENSG00000256069    <NA>

- Make a list of all PFAM IDs and descriptions. We load the PFAM.db package and use the PFAMDE object it provides to get a mapping between IDs and descriptions.
- We can then use the mappedkeys() function to get the PFAM IDs that actually have a description mapped to them, and store them in the all_ids object.
- Next, we extract the descriptions for the IDs in all_ids and convert them to a data frame.
- And finally, we join the descriptions of the PFAM domains to the PFAM IDs we got earlier, using the columns with common data: PFAM and ac. This will result in the following output:

    ##           ENSEMBL    PFAM                                       de
    ## 1 ENSG00000121410 PF13895                    Immunoglobulin domain
    ## 2 ENSG00000175899 PF01835                               MG2 domain
    ## 3 ENSG00000175899 PF07678               A-macroglobulin TED domain
    ## 4 ENSG00000175899 PF10569                                     <NA>
    ## 5 ENSG00000175899 PF07703 Alpha-2-macroglobulin bait region domain
    ## 6 ENSG00000175899 PF07677  A-macroglobulin receptor binding domain
    ## 7 ENSG00000175899 PF00207             Alpha-2-macroglobulin family
    ## 8 ENSG00000256069    <NA>                                     <NA>
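
The final join step can be tried without the annotation packages installed. The following is a minimal sketch under stated assumptions, not the recipe's own code: the two data frames are toy stand-ins rebuilt from the values printed above, the names result, id_description_mapping, and annotated are hypothetical, and base R's merge() is swapped in for whichever join function the recipe uses. It shows why a PFAM ID with no entry in the description mapping (PF10569 here) keeps its row but ends up with NA in the de column.

```r
# Toy stand-in for the select() result, copied from the output shown above.
result <- data.frame(
  ENSEMBL = c("ENSG00000121410", rep("ENSG00000175899", 6), "ENSG00000256069"),
  PFAM = c("PF13895", "PF01835", "PF07678", "PF10569",
           "PF07703", "PF07677", "PF00207", NA),
  stringsAsFactors = FALSE
)

# Hypothetical excerpt of the ID-to-description mapping: 'ac' holds the PFAM
# accession and 'de' the description. PF10569 is deliberately left out to
# mirror the <NA> seen in the joined output above.
id_description_mapping <- data.frame(
  ac = c("PF13895", "PF01835", "PF07678", "PF07703", "PF07677", "PF00207"),
  de = c("Immunoglobulin domain",
         "MG2 domain",
         "A-macroglobulin TED domain",
         "Alpha-2-macroglobulin bait region domain",
         "A-macroglobulin receptor binding domain",
         "Alpha-2-macroglobulin family"),
  stringsAsFactors = FALSE
)

# Left join on PFAM == ac. all.x = TRUE keeps every row of 'result' and
# fills 'de' with NA where no matching accession exists. merge() is a base R
# stand-in here and may return the rows in a different order than shown above.
annotated <- merge(result, id_description_mapping,
                   by.x = "PFAM", by.y = "ac", all.x = TRUE)

# Restore the column order used in the text.
annotated <- annotated[, c("ENSEMBL", "PFAM", "de")]
print(annotated)
```

Because this is a left join, the number of rows in annotated matches the number of rows in result as long as each accession appears at most once in the mapping; genes with no PFAM ID at all, such as ENSG00000256069, are carried through with NA in both columns.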