The importance of examples cannot be overstated: they help us understand, and deeper understanding in turn improves our skills. While this chapter represents a sizable minority of the overall book, that share mirrors the proportion of a project actually spent modeling, which is likewise only a sizable minority. The previous chapters have established a solid groundwork of key concepts and foundational knowledge so that readers can now responsibly digest, comprehend, and execute the case studies discussed in this chapter. This pivotal chapter provides accessible material and tangible examples, including lexicon-based, supervised, and unsupervised approaches to sentiment analysis.
As promoted often throughout this book, social data mining can be about more than mere product reviews. This is not to suggest that these methods are used exclusively for marketing or business, nor that using these methods for such analyses is unimportant. It is merely an acknowledgment that our goals are largely about investigating socially critical issues, such as abortion, gun control, and immigration, or parts of issues broadly set within the health, economic, and political categories. These are perspectives that remain intrinsically important to the human condition and societal progress. To that end, we have intentionally chosen topics that are hard-hitting. Also, while we promote social media data, specifically Twitter, we also find utility in these methods on varied datasets, both big and small. Consequently, examples include data from the Web that is not from Twitter, and we simultaneously highlight nuances to consider when working with varied datasets. Furthermore, social data mining is often about Big Data, and R does a good job with large datasets, but size can become a consideration when working with real-world datasets.
When working with larger files, in general, you should consider the following:
Generally, you can estimate a dataset's size from its number of rows and columns together with the content of each cell, that is, numeric versus character.
You may want to avoid reading in data whose file size exceeds the memory available to you. As a rule of thumb, the overhead associated with reading data into memory, which is the default behavior of R, is about double the data's size. Therefore, if you estimate your data to be 3 GB, the memory required is roughly 6 GB. Most computers now have 8 GB or even 16 GB of RAM, but if you do not have enough system memory, some social media mining applications may be intractable.
If you think you will be broaching the limits of your system, then you may want to consider closing applications or reading your data into memory at a later time.
Some operating systems manage memory more efficiently than others, and a 64-bit machine running a 64-bit build of R can address far more memory than a 32-bit one.
We suggest you read the help pages for read.table and read.csv; both offer simple mechanisms for handling larger datasets gracefully. The colClasses argument deserves particular attention. It takes a vector whose length equals the number of columns in your table; specifying it instead of accepting the default lets R load data much faster, because R knows each column's class in advance rather than having to infer it. Specifying the nrows argument likewise tunes internal memory usage. When R does not know how many rows it must read, it makes rough estimates, and each time it underestimates the memory required, it must allocate more; these repeated allocations take time. A gross overestimate, on the other hand, wastes memory that a constrained machine may not have. Even a mild overestimate for nrows is better than none at all.
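The size-estimation rule of thumb above can be sketched in a few lines of R; the row and column counts here are hypothetical, chosen only to illustrate the arithmetic.

```r
# Back-of-the-envelope memory estimate for a largely numeric table.
# 1.5 million rows and 120 columns are illustrative values, not from
# any dataset in this chapter.
rows  <- 1.5e6
cols  <- 120
bytes <- rows * cols * 8    # a numeric (double) cell occupies 8 bytes
gb    <- bytes / 2^30       # convert bytes to gigabytes
gb                          # roughly 1.34 GB at rest
gb * 2                      # rule of thumb: about double while reading
```

If the doubled figure approaches your machine's RAM, consider the mitigations above (closing applications, or deferring the read) before calling read.csv.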
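The colClasses and nrows advice can be combined in a common two-pass idiom: read a handful of rows to learn each column's class, then re-read the full file with the classes fixed and a mild row-count overestimate. The sketch below writes a throwaway sample file only so the example is self-contained; the column layout is hypothetical.

```r
# Write a tiny sample file so this sketch is runnable as-is.
tf <- tempfile(fileext = ".csv")
writeLines(c("id,score,text",
             "1,0.53,great service",
             "2,-0.20,slow shipping"), tf)

# Pass 1: read a few rows to learn each column's class.
initial <- read.csv(tf, nrows = 100, stringsAsFactors = FALSE)
classes <- sapply(initial, class)

# Pass 2: re-read with colClasses fixed and a mild nrows overestimate,
# so R neither re-infers classes nor repeatedly reallocates memory.
dat <- read.csv(tf, colClasses = classes,
                nrows = 10, stringsAsFactors = FALSE)
```

For a genuinely large file, the first pass stays cheap because nrows caps it, while the second pass gains the speed and memory benefits described above.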