The previously computed TermDocumentMatrix can also be used to identify associations between the cleaned terms found in the corpus. Here, association means the correlation coefficient computed on the joint occurrence of term pairs in the same documents, which can be queried easily with the findAssocs function.
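To make the idea of a term-to-term correlation concrete, the following base R sketch computes it by hand on a tiny term-document matrix. The matrix m, its counts, and the variable names are invented for illustration and are not part of the corpus analyzed here; the sketch also assumes the plain Pearson correlation and an inclusive lower limit, so treat it as an approximation of what findAssocs reports rather than its actual implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are made up for illustration only.
m <- matrix(c(1, 0, 2, 1, 0,
              1, 0, 1, 1, 0,
              0, 1, 0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("data", "big", "forest"),
                            paste0("doc", 1:5)))

# Pearson correlation of the 'data' row with every other term row
target <- "data"
others <- setdiff(rownames(m), target)
assoc  <- sapply(others, function(term) cor(m[target, ], m[term, ]))

# Keep the terms reaching the (assumed inclusive) lower limit,
# sorted in decreasing order of correlation
corlimit <- 0.1
assoc <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
print(round(assoc, 2))
```

In this toy example only big is kept: forest occurs exactly in the documents where big does not, so its correlation with data is negative and falls below the limit.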
Let's see which words are associated with data:
> findAssocs(tdm, 'data', 0.1)
             data
set          0.17
analyzing    0.13
longitudinal 0.11
big          0.10
Only four terms have a correlation coefficient of at least 0.1, and it's not surprising at all that analyzing is among the top associated words. We can probably ignore the set term, but it seems that longitudinal data and big data are pretty frequent expressions in package descriptions. So, which other terms are associated with big?
> findAssocs(tdm, 'big', 0.1)
               big
mpi           0.38
pbd           0.33
program       0.32
unidata       0.19
demonstration 0.17
netcdf        0.15
forest        0.13
packaged      0.13
base          0.12
data          0.10
Checking the original corpus reveals that there are several R packages starting with pbd, which stands for Programming with Big Data. The pbd packages are usually tied to Open MPI, which explains the high association between these terms pretty well.