Visualizing distributions of peptide hit counts to find thresholds can be done using the following steps:
- Load the libraries and data:
library(MSnID)
library(data.table)
library(dplyr)
library(ggplot2)
msnid <- MSnID()
msnid <- read_mzIDs(msnid, file.path(getwd(), "datasets", "ch6", "HeLa_180123_m43_r2_CAM.mzid.gz"))
peptide_info <- as(msnid, "data.table")
- Remove the decoy rows and count how many times each peptide sequence appears:
per_peptide_counts <- peptide_info %>%
  filter(isDecoy == FALSE) %>%
  group_by(pepSeq) %>%
  summarise(count = n()) %>%
  # constant label so every peptide falls in a single x-axis category when plotted
  mutate(sample = "peptide_counts")
- Create a violin and jitter plot of the hit counts:
per_peptide_counts %>%
  ggplot() +
  aes(sample, count) +
  geom_jitter() +
  geom_violin() +
  scale_y_log10()
- Create a plot of cumulative hit counts for peptides sorted by hit count:
per_peptide_counts %>%
  arrange(count) %>%
  # running total of hits, with peptides ranked from fewest to most hits
  mutate(cumulative_hits = cumsum(count), peptide = row_number()) %>%
  ggplot() +
  aes(peptide, cumulative_hits) +
  geom_line()
- Filter out peptides with very low or very high hit counts and then replot the rest:
filtered_per_peptide_counts <- per_peptide_counts %>%
  filter(count >= 5, count <= 2500)
filtered_per_peptide_counts %>%
  ggplot() +
  aes(sample, count) +
  geom_jitter() +
  geom_violin() +
  scale_y_log10()
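To check the counting, cumulative-sum, and filtering logic without the mzid file, the same steps can be run on a small made-up table. This is only a sketch: the `toy_peptide_info` data frame, its peptide sequences, and the thresholds of 5 and 7 are hypothetical values chosen for illustration, not taken from the HeLa dataset.

```r
library(dplyr)

# Hypothetical stand-in for peptide_info; only pepSeq and isDecoy are needed here
toy_peptide_info <- data.frame(
  pepSeq  = c(rep("AAAK", 6), rep("CCCR", 2), rep("DDDK", 8), "EEEK", rep("FFFR", 3)),
  isDecoy = c(rep(FALSE, 17), rep(TRUE, 3)),
  stringsAsFactors = FALSE
)

# Drop decoys, count hits per peptide, then rank peptides and accumulate hits
toy_counts <- toy_peptide_info %>%
  filter(isDecoy == FALSE) %>%
  group_by(pepSeq) %>%
  summarise(count = n()) %>%
  arrange(count) %>%
  mutate(cumulative_hits = cumsum(count), peptide = row_number())

# Hypothetical thresholds that suit the toy data only
toy_filtered <- toy_counts %>%
  filter(count >= 5, count <= 7)

print(toy_counts)
print(toy_filtered)
```

On real data, the thresholds would instead be read off the violin and cumulative plots, as in the steps above.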