Chapter 6. Social Media Mining – Case Studies

The importance of examples should not be understated: they help us understand, and better understanding often leads to improved skills. This chapter makes up only a sizable minority of the book, which mirrors the share of time typically spent on modeling itself. The previous chapters have established the key concepts and foundational knowledge that readers need to responsibly digest, comprehend, and execute the case studies discussed in this chapter. This pivotal chapter provides accessible material and tangible examples, including lexicon-based, supervised, and unsupervised approaches to sentiment analysis.

Introductory considerations

As promoted often throughout this book, social data mining can be about more than mere product reviews. This is not to suggest that using these methods for marketing or business analyses is unimportant. It is merely an acknowledgment that our goals are largely about investigating socially critical issues, such as abortion, gun control, and immigration, or parts of issues broadly set within the health, economic, and political categories. These are perspectives that remain intrinsically important to the human condition and societal progress. To that end, we have intentionally chosen topics that are hard-hitting. Also, while we promote social media data, specifically Twitter, we find these methods useful on varied datasets, both big and small. Consequently, the examples include data from the Web that is not from Twitter, and we highlight nuances to consider when working with varied datasets. Furthermore, social data mining is often about Big Data. R does a good job with large datasets, but because it reads data into memory by default, size can become a consideration when working with real-world datasets.

When working with larger files, in general, you should consider the following:

  • How large is your dataset?

    Generally, the number of rows and columns, together with the type of content in each cell (numeric versus character), gives a good estimate of its size.

  • How much memory does your system have?

    You may want to avoid reading in data with file sizes greater than the memory available to you. As a rule of thumb, reading data into memory, which is the default behavior of R, requires about double the size of the data. Therefore, if you estimate your data to be 3 GB, then the memory required is roughly 6 GB. Many computers now have 8 GB or even 16 GB of RAM, but if you do not have enough system memory, then some social media mining applications may be intractable.

  • How many open applications do you have, and what are they?

    If you think you will be approaching the limits of your system, then you may want to consider closing applications or reading your data into memory at a later time.

  • What is your operating system? Is it 32-bit or 64-bit?

    Some operating systems manage memory more efficiently than others, and a 64-bit system can address far more memory than a 32-bit one.
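The considerations above can be turned into a quick back-of-the-envelope check. The following is a minimal sketch, assuming a purely numeric table whose dimensions (n_rows and n_cols) are hypothetical values chosen for illustration. It applies the doubling rule of thumb described above and reports whether the running R build is 32-bit or 64-bit.

```r
# Hypothetical dimensions, chosen only for illustration.
n_rows <- 1e6
n_cols <- 10

# A numeric (double) value occupies 8 bytes, so a purely numeric
# table needs roughly rows * columns * 8 bytes. Character columns
# are harder to estimate because their cost depends on string content.
est_bytes <- n_rows * n_cols * 8
est_gb    <- est_bytes / 1024^3

# Rule of thumb from the text: allow about double the data size.
needed_gb <- 2 * est_gb

cat(sprintf("Estimated data size: %.3f GB\n", est_gb))
cat(sprintf("Suggested free memory: %.3f GB\n", needed_gb))

# Pointer size in bytes tells us whether this R build is 32- or 64-bit.
bits <- 8 * .Machine$sizeof.pointer
cat(sprintf("This R build is %d-bit\n", bits))

# For an object already in memory, object.size() reports its footprint.
x <- numeric(1000)
print(object.size(x), units = "Kb")
```

The estimate is deliberately rough. It is meant to flag datasets that are clearly too large for the machine at hand, not to predict memory use precisely.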

We suggest you read the help page for read.table and read.csv; both functions offer simple mechanisms to handle larger datasets gracefully. One argument that should be considered is colClasses, which takes a vector whose length is equal to the number of columns in your table. By specifying it instead of relying on the default, we allow R to load the data much faster, since R knows the class of each column in advance and does not have to infer it. Specifying the nrows argument, in turn, helps tune the internal memory usage. When R doesn't know how many rows it has to read, it makes some rather crude estimates. When it underestimates the memory demands, it has to allocate more memory, and these repeated allocations take time. When it overestimates the amount of memory it needs, your computer may run out of memory. Even a mild overestimate for nrows is better than none at all.
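To make these two arguments concrete, here is a minimal sketch. It first writes a small throwaway CSV file so that the example is self-contained. The column names (id, score, and text) and the values are invented for illustration and do not come from any dataset used in this book.

```r
# Write a small throwaway CSV so the example runs on its own.
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(
  id    = 1:5,
  score = c(0.1, 0.5, 0.9, 0.3, 0.7),
  text  = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
write.csv(demo, tmp, row.names = FALSE)

# One class per column, in column order, so R does not have to guess.
col_classes <- c("integer", "numeric", "character")

# nrows is a mild overestimate of the true row count (5);
# read.csv simply stops when the file ends.
dat <- read.csv(tmp, colClasses = col_classes, nrows = 10)

str(dat)

# Remove the temporary file.
unlink(tmp)
```

With a file this small the speed difference is negligible. The same call pattern is what matters when the file has millions of rows.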
