Fetching the data

Luckily for us, the team behind stackoverflow provides most of the data behind the StackExchange universe, to which stackoverflow belongs, under a CC Wiki license. While writing this, the latest data dump can be found at http://www.clearbits.net/torrents/2076-aug-2012. Most likely, this page will contain a pointer to an updated dump by the time you read this.

After downloading and extracting it, we have around 37 GB of data in XML format, as summarized in the following table:

File              Size (MB)   Description
badges.xml        309         Badges of users
comments.xml      3,225       Comments on questions or answers
posthistory.xml   18,370      Edit history
posts.xml         12,272      Questions and answers; this is what we need
users.xml         319         General information about users
votes.xml         2,200       Information on votes

As the files are more or less self-contained, we can delete all of them except posts.xml; it contains all the questions and answers as individual row tags within the root tag posts. Refer to the following code:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4572748" PostTypeId="2" ParentId="4568987" CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount="" Body="&lt;p&gt;IANAL, but &lt;a href=&quot;http://support.apple.com/kb/HT2931&quot;rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you cannot use the loops in your application:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...however, individual audio loops may
  not be commercially or otherwise
  distributed on a standalone basis, nor
  may they be repackaged in whole or in
  part as audio samples, sound effects
  or music beds.&quot;&lt;/p&gt;
  
  &lt;p&gt;So don't worry, you can make
  commercial music with GarageBand,you
  just can't distribute the loops as
loops.&lt;/p&gt;
&lt;/blockquote&gt;
" OwnerUserId="203568" LastActivityDate="2011-01-01T00:01:03.387" CommentCount="1" />

The attributes of a row tag that are relevant to us are summarized in the following table:

Name               Type                Description
Id                 Integer             Unique identifier of the post
PostTypeId         Integer             Category of the post; only 1 (question) and 2 (answer)
                                       are of interest to us, other values will be ignored
ParentId           Integer             Unique identifier of the question to which this answer
                                       belongs (missing for questions)
CreationDate       DateTime            Date of submission
Score              Integer             Score of the post
ViewCount          Integer or empty    Number of user views for this post
Body               String              The complete post, encoded as HTML text
OwnerUserId        Id                  Unique identifier of the poster; -1 denotes a wiki
                                       question
Title              String              Title of the question (missing for answers)
AcceptedAnswerId   Id                  ID of the accepted answer (missing for answers)
CommentCount       Integer             Number of comments for the post
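Since all attribute values arrive as strings when the XML is parsed, a small helper can make the types in the preceding table explicit. The following parse_row function is a hypothetical sketch, not part of the accompanying scripts; it tolerates missing or empty entries such as ViewCount="":

def parse_row(elem):
    # hypothetical helper: convert a parsed row element's string
    # attributes into Python values; missing or empty entries
    # (such as ViewCount="") become None
    def as_int(name):
        value = elem.get(name, "")
        return int(value) if value else None

    return {
        "Id": as_int("Id"),
        "PostTypeId": as_int("PostTypeId"),
        "ParentId": as_int("ParentId"),
        "CreationDate": elem.get("CreationDate"),
        "Score": as_int("Score"),
        "ViewCount": as_int("ViewCount"),
        "Body": elem.get("Body", ""),
        "OwnerUserId": as_int("OwnerUserId"),
        "Title": elem.get("Title"),
        "AcceptedAnswerId": as_int("AcceptedAnswerId"),
        "CommentCount": as_int("CommentCount"),
    }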

Slimming the data down to chewable chunks

To speed up our experimentation phase, we should not try to evaluate our classification ideas on the 12 GB file. Instead, we should think of how we can trim it down so that we can quickly test our ideas while still keeping a representative snapshot of the data. If we filter the XML for row tags that have a CreationDate of 2011 or later, we still end up with over 6 million posts (2,323,184 questions and 4,055,999 answers), which should be enough training data for now. We also do not want to keep operating on the XML format, as it would slow us down. The simpler the format, the better. That's why we parse the remaining XML using Python's cElementTree and write it out to a tab-separated file.
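To give a rough idea of this filtering step, here is a minimal sketch using cElementTree's iterparse. The file name and the exact filtering condition are illustrative assumptions; the real conversion lives in so_xml_to_tsv.py:

try:
    from xml.etree import cElementTree as etree
except ImportError:  # cElementTree was removed in Python 3.9
    from xml.etree import ElementTree as etree

def filter_rows(in_name="posts.xml"):
    # iterparse streams the file, so the 12 GB dump never has to fit
    # into memory at once
    for event, elem in etree.iterparse(in_name):
        if elem.tag == "row":
            # CreationDate looks like "2011-01-01T00:01:03.387", so a
            # plain string comparison is enough to filter by year
            if elem.get("CreationDate", "") >= "2011":
                yield dict(elem.attrib)
            elem.clear()  # release the element's content to keep memory flat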

Preselection and processing of attributes

We should also keep only those attributes that we think could help the classifier in separating the good answers from the not-so-good ones. Certainly, we need the identification-related attributes to assign the correct answers to the questions. Let's go through the attributes one by one:

  • The PostTypeId attribute, for example, is only necessary to distinguish between questions and answers. Since we can make that distinction later by checking for the ParentId attribute, we drop PostTypeId and keep ParentId, setting it to -1 for questions.
  • The CreationDate attribute could be interesting to determine the time span between posting the question and posting the individual answers, so we keep it.
  • The Score attribute is, of course, important as an indicator of the community's evaluation.
  • The ViewCount attribute, in contrast, is most likely of no use for our task. Even if it is able to help the classifier distinguish between good and bad, we will not have this information at the time when an answer is being submitted. We will ignore it.
  • The Body attribute obviously contains the most important information. As it is encoded in HTML, we will have to decode it to plain text (see the sketch after this list).
  • The OwnerUserId attribute is useful only if we will take the user-dependent features into account, which we won't. Although we drop it here, we encourage you to use it (maybe in connection with users.xml) to build a better classifier.
  • The Title attribute is also ignored here, although it could add some more information about the question.
  • The CommentCount attribute is also ignored. Similar to ViewCount, it could help the classifier with posts that were posted a while ago (more comments are equal to more ambiguous posts). It will, however, not help the classifier at the time that an answer is posted.
  • The AcceptedAnswerId attribute is, similar to the Score attribute, an indicator of a post's quality. However, it is stored with the question, whereas we want this information per answer. Therefore, instead of keeping this attribute, we will create a new attribute, IsAccepted, which will be 0 or 1 for answers and ignored for questions (ParentId = -1).
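The following sketch shows one way to implement the Body decoding and the IsAccepted flag described in the list above. Both helpers are illustrative assumptions; in particular, BeautifulSoup is just one convenient option for stripping HTML, and so_xml_to_tsv.py may do this differently:

from bs4 import BeautifulSoup  # third-party; any HTML-to-text helper would do

def body_to_text(body):
    # the XML parser has already unescaped entities such as &lt;p&gt;,
    # so body arrives as regular HTML and we only need to strip the tags
    return BeautifulSoup(body, "html.parser").get_text()

def is_accepted(question_attrs, answer_id):
    # 1 if this answer's Id matches the question's AcceptedAnswerId
    # (which may be missing), 0 otherwise
    return int(question_attrs.get("AcceptedAnswerId") == str(answer_id))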

We end up with the following format:

Id <TAB> ParentId <TAB> IsAccepted <TAB> TimeToAnswer <TAB> Score <TAB> Text

For concrete parsing details, please refer to so_xml_to_tsv.py and choose_instance.py. Suffice it to say that, in order to speed up processing, we will split the data into two files. In meta.json, we store a dictionary that maps a post's Id to its other data (except Text) in JSON format so that we can read it in the proper format. For example, the score of a post would reside at meta[Id]['Score']. In data.tsv, we store Id and Text, which we can easily read with the following method:

def fetch_posts():
    for line in open("data.tsv", "r"):
        post_id, text = line.split("\t")
        yield int(post_id), text.strip()
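A brief usage sketch, assuming that meta.json was written with Python's json module (which stores the integer Ids as string keys):

import json

meta = json.load(open("meta.json", "r"))

for post_id, text in fetch_posts():
    # json turns integer dictionary keys into strings, hence str(post_id)
    score = meta[str(post_id)]['Score']
    print(post_id, score)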

Defining what is a good answer

Before we can train a classifier to distinguish between good and bad answers, we have to create the training data. So far, we have only a bunch of data. What we still have to do is to define labels.

We could, of course, simply use the IsAccepted attribute as a label. After all, it marks the answer that answered the question. However, that is only the opinion of the asker. Naturally, the asker wants a quick answer and tends to accept the first answer that looks good enough. If more answers are submitted over time, some of them will tend to be better than the one that has already been accepted. The asker, however, seldom gets back to the question to change his or her mind. So we end up with many questions whose accepted answers are not the highest-scored ones.

At the other extreme, we could take the best and worst scored answer per question as positive and negative examples. However, what do we do with questions that have only good answers, say, one with two and the other with four points? Should we really take the answer with two points as a negative example?

We should settle somewhere between these extremes. If we take all answers that are scored higher than zero as positive and all answers with 0 or fewer points as negative, we end up with quite reasonable labels, as follows:

>>> all_answers = [q for q,v in meta.iteritems() if v['ParentId']!=-1]
>>> Y = np.asarray([meta[aid]['Score']>0 for aid in all_answers])