Analyzing the associations among terms

The previously computed TermDocumentMatrix can also be used to identify associations between the cleaned terms found in the corpus. Here, the association between two terms is simply the correlation coefficient computed on their joint occurrence in the same documents, which can be queried easily with the findAssocs function.
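To make the idea of a correlation on joint occurrences concrete, here is a minimal sketch in base R. It does not call the tm package at all: the toy matrix `tdm_toy` and the helper `assoc` are hypothetical names introduced only for illustration, and the helper merely mimics the kind of result findAssocs reports, assuming a plain Pearson correlation between term rows and an inclusive lower limit.

```r
# Toy term-document matrix (illustrative data only):
# rows are terms, columns are documents, cells are term counts.
tdm_toy <- matrix(
  c(1, 1, 0, 0, 1,    # data
    1, 0, 0, 0, 1,    # big
    0, 1, 1, 0, 0,    # set
    0, 0, 1, 1, 0),   # forest
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Terms = c("data", "big", "set", "forest"),
    Docs  = paste0("doc", 1:5)
  )
)

# Hypothetical helper: correlate one term's row with every other row,
# round to two digits and keep the values at or above the limit.
assoc <- function(m, term, corlimit) {
  x      <- m[term, ]
  others <- m[rownames(m) != term, , drop = FALSE]
  r      <- round(apply(others, 1, function(y) cor(x, y)), 2)
  sort(r[r >= corlimit], decreasing = TRUE)
}

res <- assoc(tdm_toy, "data", 0.1)
res
```

In this toy matrix only big tends to appear in the same documents as data, so it is the only term that passes the limit; set and forest are negatively correlated with data and are dropped.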

Let's see which words are associated with data:

> findAssocs(tdm, 'data', 0.1)
             data
set          0.17
analyzing    0.13
longitudinal 0.11
big          0.10

Only four terms have a correlation coefficient of at least 0.1, and it's not surprising at all that analyzing is among the top associated words. We can probably ignore the set term, but it seems that longitudinal data and big data are pretty frequent expressions in package descriptions. So, which other terms are associated with big?

> findAssocs(tdm, 'big', 0.1)
               big
mpi           0.38
pbd           0.33
program       0.32
unidata       0.19
demonstration 0.17
netcdf        0.15
forest        0.13
packaged      0.13
base          0.12
data          0.10

Checking the original corpus reveals that there are several R packages whose names start with pbd, which stands for Programming with Big Data. The pbd packages are usually tied to Open MPI, which explains the high association between these terms pretty well.
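A check of this kind can be sketched with a regular expression on the package names. The `pkgs` vector below is a small hand-picked stand-in, not the corpus used in the text, so treat it as an assumption made for illustration.

```r
# Hypothetical sample of package names (not the full list behind the corpus)
pkgs <- c("pbdBASE", "pbdDEMO", "pbdMPI", "pbdNCDF4", "ggplot2", "data.table")

# Keep only the names that start with the pbd prefix
pbd_pkgs <- grep("^pbd", pkgs, value = TRUE)
pbd_pkgs

# Strip the prefix and lower-case the remainder for comparison with the terms
pbd_suffix <- tolower(sub("^pbd", "", pbd_pkgs))
pbd_suffix
```

The lower-cased suffixes can then be compared by eye with terms such as mpi and base in the findAssocs output above.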
