Case study | Description | Scraping and information extraction via... | Main packages | Important functions |
Collaboration Networks in the U.S. Senate | Scraping of bill cosponsorship data from the US Senate at thomas.loc.gov, assessment of collaboration network structure | URL manipulation, regular expressions | RCurl, stringr, igraph | getURL(), str_extract(), graph.edgelist(), get.adjacency() |
Parsing Information from Semi-Structured Documents | Scraping of climate data from Californian weather stations (ftp.wcc.nrcs.usda.gov), construction of a regex-based parser | FTP download, regular expressions and string manipulation tools | RCurl, stringr | getURL(), str_extract(), str_replace() |
Predicting the 2014 Academy Awards using Twitter | Collection of tweets from the Twitter API (dev.twitter.com/docs/api/streaming), frequency-based prediction of Oscar winners | Persistent connection to the Streaming API via streamR, regular expressions | streamR, twitteR, lubridate, stringr, plyr | filterStream(), parseTweets(), str_detect(), agrep() |
Mapping the Geographic Distribution of Names | Scraping phone book data from dastelefonbuch.de, extraction of zip codes and matching with geo-coordinates, creation of family name maps | HTML forms, XPath and regular expressions, R geographic functionality | RCurl, stringr, XML, maptools, maps, rgdal | getForm(), htmlParse(), xpathSApply(), str_extract(), function() |
Gathering Data on Mobile Phones | Scraping of mobile phone product data from amazon.com, data storage in SQLite database | URL manipulation, XPath and regular expressions | RCurl, stringr, XML, RSQLite | htmlParse(), xpathSApply(), str_extract(), dbGetQuery() |
Analyzing Sentiments of Product Reviews | Extension of SQLite database with customer reviews of mobile phones at amazon.com, dictionary and classification-based sentiment analysis | XPath, data preparation with tm functionality, sentiment dictionary, maximum entropy and SVM | RCurl, stringr, XML, RSQLite, tm, RTextTools, textcat | dbReadTable(), dbGetQuery(), tm_map(), TermDocumentMatrix(), classify_model() |