Automated data collection with R
by Dominic Nyhuis, Peter Meissner, Christian Rubba, Simon Munzert
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
Preface
What you won't learn from reading this book
Why R?
Recommended reading to get started with R
Typographic conventions
The book's website
Disclaimer
Acknowledgments
Note
Chapter 1: Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.4 Structure of the book
Notes
Part One: A Primer on Web and Data Technologies
Chapter 2: HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.3 Tags and attributes
2.4 Parsing
Summary
Further reading
Problems
Notes
Chapter 3: XML and JSON
3.1 A short example XML document
3.2 XML syntax rules
3.3 When is an XML document well formed or valid?
3.4 XML extensions and technologies
3.5 XML and R in practice
3.6 A short example JSON document
3.7 JSON syntax rules
3.8 JSON and R in practice
Summary
Further reading
Problems
Notes
Chapter 4: XPath
4.1 XPath—a query language for web documents
4.2 Identifying node sets with XPath
4.3 Extracting node elements
Summary
Further reading
Problems
Notes
Chapter 5: HTTP
5.1 HTTP fundamentals
5.2 Advanced features of HTTP
5.3 Protocols beyond HTTP
5.4 HTTP in action
Summary
Further reading
Problems
Notes
Chapter 6: AJAX
6.1 JavaScript
6.2 XHR
6.3 Exploring AJAX with Web Developer Tools
Summary
Further reading
Problems
Chapter 7: SQL and relational databases
7.1 Overview and terminology
7.2 Relational Databases
7.3 SQL: a language to communicate with Databases
7.4 Databases in action
Summary
Further reading
Problems
Pokemon problems
ParlGov problems
Notes
Chapter 8: Regular expressions and essential string functions
8.1 Regular expressions
8.2 String processing
8.3 A word on character encodings
Summary
Further reading
Problems
Notes
Part Two: A Practical Toolbox for Web Scraping and Text Mining
Chapter 9: Scraping the Web
9.1 Retrieval scenarios
9.2 Extraction strategies
9.3 Web scraping: Good practice
9.4 Valuable sources of inspiration
Summary
Further reading
Problems
Notes
Chapter 10: Statistical text processing
10.1 The running example: Classifying press releases of the British government
10.2 Processing textual data
10.3 Supervised learning techniques
10.4 Unsupervised learning techniques
Summary
Further reading
Notes
Chapter 11: Managing data projects
11.1 Interacting with the file system
11.2 Processing multiple documents/links
11.3 Organizing scraping procedures
11.4 Executing R scripts on a regular basis
Notes
Part Three: A Bag of Case Studies
Chapter 12: Collaboration networks in the US Senate
12.1 Information on the bills
12.2 Information on the senators
12.3 Analyzing the network structure
12.4 Conclusion
Notes
Chapter 13: Parsing information from semistructured documents
13.1 Downloading data from the FTP server
13.2 Parsing semistructured text data
13.3 Visualizing station and temperature data
Notes
Chapter 14: Predicting the 2014 Academy Awards using Twitter
14.1 Twitter APIs: Overview
14.2 Twitter-based forecast of the 2014 Academy Awards
14.3 Conclusion
Notes
Chapter 15: Mapping the geographic distribution of names
15.1 Developing a data collection strategy
15.2 Website inspection
15.3 Data retrieval and information extraction
15.4 Mapping names
15.5 Automating the process
Summary
Notes
Chapter 16: Gathering data on mobile phones
16.1 Page exploration
16.2 Scraping procedure
16.3 Graphical analysis
16.4 Data storage
Note
Chapter 17: Analyzing sentiments of product reviews
17.1 Introduction
17.2 Collecting the data
17.3 Analyzing the data
17.4 Conclusion
Notes
References
General index
Package index
Function index
End User License Agreement
List of Tables
Chapter 2
Table 2.1
Table 2.2
Table 2.3
Chapter 3
Table 3.1
Table 3.2
Table 3.3
Table 3.4
Chapter 4
Table 4.1
Table 4.2
Table 4.3
Table 4.4
Chapter 5
Table 5.1
Table 5.2
Table 5.3
Table 5.4
Chapter 7
Table 7.1
Table 7.2
Table 7.3
Table 7.4
Table 7.5
Table 7.6
Table 7.7
Table 7.8
Table 7.9
Table 7.10
Table 7.11
Chapter 8
Table 8.1
Table 8.2
Table 8.3
Table 8.4
Table 8.5
Chapter 9
Table 9.1
Chapter 10
Table 10.1
Chapter 11
Table 11.1
Table 11.2
Chapter 12
Table 12.1
Table 12.2
Table 12.3
Table 12.4
Table 12.5
Chapter 13
Table 13.1
Chapter 14
Table 14.1
Table 14.2
Chapter 15
Table 15.1
List of Illustrations
Chapter 1
Figure 1.1 Location of UNESCO World Heritage Sites in danger (as of March 2014). Cultural sites are marked with triangles, natural sites with dots
Figure 1.2 Distribution of years when World Heritage Sites were put on the list of endangered sites
Figure 1.3 Distribution of time spans between year of inscription and year of endangerment of World Heritage Sites in danger
Figure 1.4 Technologies for disseminating, extracting, and storing web data
Chapter 2
Figure 2.1 Browser view of a simple HTML document
Figure 2.2 Source view of a simple HTML document
Figure 2.3 Inspect elements view of a simple HTML document
Figure 2.4 Source code of OurFirstHTML.html
Figure 2.5 A tree perspective on OurFirstHTML.html (see Figure 2.4)
Chapter 3
Figure 3.1 An XML code example: James Bond movies
Figure 3.2 Tree perspective on an XML document
Figure 3.3 How RSS works
Figure 3.4 SVG code example: R logo
Figure 3.5 The R logo as SVG image from code in Figure 3.4
Figure 3.6 XML example document: stock data
Figure 3.7 DTD of stock data XML file (see Figure 3.6)
Figure 3.8 R code for event-driven parsing
Figure 3.9 JSON code example: Indiana Jones movies
Chapter 4
Figure 4.1 A tree perspective on parsed_doc
Figure 4.2 Visualizing node relations. Descriptions are presented in relation to the white node
Chapter 5
Figure 5.1 User–server communication via HTTP
Figure 5.2 HTTP request schema
Figure 5.3 HTTP response schema
Figure 5.4 The principle of web proxies
Figure 5.5 The principle of HTTPS
Chapter 6
Figure 6.1 JavaScript-enriched fortunes1.html: (a) Initial state, (b) After a click on “Robert Gentleman”
Figure 6.2 The user–server communication process using the XMLHttpRequest. Adapted from Stepp et al. (2012)
Figure 6.3 View on fortunes2.html from the Elements panel
Figure 6.4 View on fortunes2.html from the Network panel
Figure 6.5 Information on quotes.html from the Network panel: (a) Preview, (b) Headers
Chapter 7
Figure 7.1 How users, R, SQL, DBMS, and databases are related to each other
Figure 7.2 Database scheme
Figure 7.3 SQL example database scheme
Chapter 9
Figure 9.1 Screenshot of HTTP authentication mask at http://www.r-datacollection.com/materials/solutions
Figure 9.2 The Federal Contributions database
Figure 9.3 Initializing the Selenium Java Server
Figure 9.4 The mechanics of web APIs
Figure 9.5 An R wrapper function for Yahoo's Weather Feed
Figure 9.6 Scraping with regular expressions
Figure 9.7 Scraping with XPath
Figure 9.8 Data collection with APIs
Figure 9.9 R code for parsing robots.txt files
Figure 9.10 An etiquette manual for web scraping
Figure 9.11 Helper functions for handling HTTP If-Modified-Since header field
Chapter 10
Figure 10.1 Output of hierarchical clustering of UK Government press releases
Figure 10.2 Output of Correlated Topic Model of UK Government press releases
Chapter 11
Figure 11.1 Time-series of Apple stock values, 2003–2013
Figure 11.2 Trigger selection on Windows platform
Figure 11.3 Action selection on Windows platform
Chapter 12
Figure 12.1 R procedure to collect list of bill sponsors
Figure 12.2 Cosponsorship network of senators
Chapter 13
Figure 13.1 Excerpt from a text file on temperature data from Californian weather stations, accessible at ftp://ftp.wcc.nrcs.usda.gov/data/climate/table/temperature/history/california/
Figure 13.2 R-based parsing function for temperature text files
Figure 13.3 Weather station locations on an OpenStreetMap map
Figure 13.4 Overall monthly temperature means for selected weather stations. Lines present average monthly temperatures in degree Celsius for all years in the dataset. Small gray dots are daily temperatures for all years within the dataset.
Chapter 14
Figure 14.1 Tweets per hour on the 2014 Academy Awards
Chapter 15
Figure 15.1 Excerpt from the robots.txt file on www.dastelefonbuch.de
Figure 15.2 Geographic distribution of “Feuersteins”
Figure 15.3 Generalized R code to scrape entries from www.dastelefonbuch.de
Figure 15.4 Generalized R code to parse entries from www.dastelefonbuch.de
Figure 15.5 Generalized R code to map entries from www.dastelefonbuch.de
Figure 15.6 Results of three calls of the namesPlot() function
Chapter 16
Figure 16.1 Amazon's search form
Figure 16.2 Prices, customer rating, and best seller positioning of mobile phones. Black dots mark placement of individual products and white dots with horizontal and vertical lines mark the five best selling items per plot
Chapter 17
Figure 17.1 Violin plots of estimated sentiment versus product rating in Amazon reviews
Figure 17.2 Estimated sentiment in Amazon review titles versus product rating. The data are jittered on both axes.
Figure 17.3 Maximum entropy classification results of Amazon reviews
Figure 17.4 Support vector machine classification results of Amazon reviews
Preface
Figure 1 The research process not using R—stylized example
Figure 2 The research process using R—stylized example