Scraping a dwarf name

As silly as it is, I am not moving forward until I have a dwarf name of my own. I would rather not get this one straight from the web browser, either; we can do it using the R console alone, and this section shows how. The first thing to do is find a website that will generate dwarf names for us. I picked the following one (http://www.rdinn.com/generators/1/dwarven_name_generator.php?):

Figure 4.1: The Red Dragon Inn's dwarven name generator

The Red Dragon Inn will generate dwarf names, given four parameters:

  • Name, surname, or both
  • Total number of names to generate
  • Gender
  • Realistic or fantasy

Browsers usually talk to web servers through a well-known protocol (HTTP). All we have to do is open the browser's developer tools while requesting a name, so we can see what kind of request is sent and how. On Windows, Chrome's developer tools can be opened by pressing Ctrl + Shift + I. With my developer tools wide open, I asked for a single male fantasy name, first name and surname both. Here is what I got:

Figure 4.2: POST request

The first thing to look for is the request that leads us to the dwarf name. There, we can check the Request URL and the Request Method (POST). The POST method sends some information to the server, and we can see that information if we scroll down past the headers:

Figure 4.3: Form data

As you may have already imagined, this information tells the server what kind of names we want, and how many. We can make a very similar POST request through R, as the following code does:

if(!require(httr)){install.packages('httr')}
library(httr)

form_dt <- list(
  nametype = 2,
  numnames = 1,
  gender = 1, # use 2 for a girl's name
  surnametype = 2,
  namegenraceid = 1
)

url <- 'http://www.rdinn.com/generators/1/dwarven_name_generator.php'
post <- POST(url, body = form_dt, verbose())
page <- content(post, as = 'text')

Now I have ended up with a very long string called page. If we can find patterns around the generated name, we can easily extract it using the substr() function. The following code uses two such patterns to extract the dwarf name from this very long string:

pattern_1 <- '<td width="25%">'
pattern_2 <- '</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;'

dwarven_name <- substr(page,
                       start = regexpr(pattern_1, page) + nchar(pattern_1),
                       stop = regexpr(pattern_2, page) - 1)
dwarven_name
# I got:
# [1] "Thorru Steelmaul"

That is my new data miner name, Thorru Steelmaul. Don't you dare create a Tibia character named Thorru Steelmaul. Here is what all this code has done. First, we made sure the httr package was installed and ready for action using an if statement. After we loaded the package using library(), a list named form_dt was created carrying everything we wanted in a dwarf name (as I correctly guessed, the Submit variable displayed by Chrome's developer tools was kind of useless).

Using the httr::POST() function, we were able to make a POST request to the website. The result was stored in a variable named post. This variable was then passed to httr::content(), which extracts the content of the response. Notice that we asked for a text output with the as = 'text' argument. We basically got a string with the HTML code of the page returned in response to the POST request.
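The request step can be made a little more defensive. The following is only a sketch: fetch_name_page() is a hypothetical helper of my own naming, encode = 'form' assumes the site accepts an ordinary URL-encoded form (when body is a list, httr::POST() defaults to multipart encoding), and the UTF-8 encoding is likewise an assumption about the page.

```r
# Sketch only: fetch_name_page() is a hypothetical helper, not part of httr.
fetch_name_page <- function(form, url) {
  # Assumption: the site accepts a URL-encoded form, like a plain HTML form
  resp <- httr::POST(url, body = form, encode = 'form')
  # Turn 4xx/5xx responses into an R error instead of parsing an error page
  httr::stop_for_status(resp)
  # Assumption: the page is UTF-8; adjust if the response says otherwise
  httr::content(resp, as = 'text', encoding = 'UTF-8')
}

# Usage (needs network access and the site to still be online):
# page <- fetch_name_page(form_dt, url)
```

Stopping on a failed request early means the extraction code never runs against an error page, which would otherwise produce an empty or nonsensical name.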

Next, I looked for patterns around the generated name; it's easier to do this inside the developer tools (Response tab). The substr() function took care of extracting a substring from our huge string, page. To do this properly, we had to give the function the big string, the character position at which to start the subsetting, and the character position at which to stop it.

Here lies a little trick. After I found the HTML surrounding the dwarf name, I used the regexpr() function to return the position at which each pattern begins. The number of characters in the start pattern was added to its position, so that the substring begins right after the start pattern rather than including it. One was subtracted from the position of the stop pattern, so that the substring ends just before the stop pattern begins.
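The position arithmetic is easier to follow on a small string that needs no network access. The sketch below uses a made-up HTML fragment (sample_page) and a hypothetical helper, extract_between(); neither comes from the website. It also passes fixed = TRUE so that the markers are matched literally rather than as regular expressions, and returns NA when a marker is missing.

```r
# Hypothetical helper: return the text between two literal markers, or NA.
extract_between <- function(text, start_marker, stop_marker) {
  # fixed = TRUE treats the markers as plain strings, not regular expressions
  start_pos <- as.integer(regexpr(start_marker, text, fixed = TRUE))
  stop_pos  <- as.integer(regexpr(stop_marker, text, fixed = TRUE))
  # regexpr() returns -1 when a marker is not found
  if (start_pos == -1L || stop_pos == -1L) return(NA_character_)
  substr(text,
         start = start_pos + nchar(start_marker), # first character after the start marker
         stop  = stop_pos - 1L)                   # last character before the stop marker
}

# Made-up fragment imitating the structure around the generated name
sample_page <- paste0('<tr><td width="25%">Thorru Steelmaul',
                      '</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>')

found <- extract_between(sample_page,
                         '<td width="25%">',
                         '</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;')
found
```

In this fragment the start marker begins at position 5 and is 16 characters long, so the name starts at position 21; the stop marker begins at position 37, so the name ends at position 36.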

You can try source(url('http://bit.do/dwarven_name')) and then dwarven_name(gender = 1) if you want a boy's name or dwarven_name(gender = 2) if you seek a girl's name. Let me stress that these numbers make reference to an evolution scale. Women are way more evolved than men (two times more evolved at least). Also, as cool as it sounds to name your kids using this function, please don't.

This section was absolutely necessary. It was painful to write, but totally worth it. A bunch of sailors will tell you that it is bad luck to do data mining without a proper dwarf name. From now on, you can call me Thorru Steelmaul (sounds badass). Jokes aside, this section introduced the httr package, which can be used to retrieve text from the web. Retrieving text from the web is essential to social media text mining, and it's also useful for practicing other kinds of text mining, because the web gives you a lot of data to work with. The next section deals with it.
