In step 1, we use Rsamtools pileup() function with default settings to get a per-base coverage dataframe. Each row represents a single nucleotide in the reference and the count column gives the depth of coverage at that point. The result is stored in the pileup_df dataframe.
In step 2, we use the bumphunter clusterMaker() function on pileup_df, which simply groups reads within a certain distance of each other into clusters. We give it the sequence names, positions, and a maximum distance parameter (maxGap). The function returns a vector of cluster numbers of equal length to the dataframe, indicating the cluster membership of each row in the dataframe. If we tabulate with table, we can see the cluster sizes (number of rows) in each cluster:
table(clusters)
## clusters ## 1 2 3 ## 1486 1552 1520
In step 3, we refine our approach; we use regionFinder(), which applies a read depth cutoff to ensure a minimum read depth for the clusters. We pass it similar data as in step 2, adding the cluster membership vector clusters and a minimum read cutoff—here, we set to 1 for use with this very small dataset. The result of step 3 is the regions that are clustered together, but in a useful table:
## chr start end value area cluster indexStart indexEnd L ## 3 Chr1 4503 5500 10.401974 15811 3 3039 4558 1520 ## 1 Chr1 502 1500 9.985868 14839 1 1 1486 1486 ## 2 Chr1 2501 3500 8.657216 13436 2 1487 3038 1552
In these region predictions, we can clearly see the three regions containing reads that are in that data, give or take a nucleotide or two.