Challenges

Data is the main asset for any social network, yet StackExchange does a wonderful job of exposing its data for exploration and analysis. Unlike other social networks, which mostly expose their data through restrictive APIs and withhold many details, StackExchange not only provides multiple access channels (data dumps and the Data Explorer, in addition to APIs), but also grants access to an almost complete set of its public information.

That being said, there are challenges when working with a platform such as StackExchange. The following are a few of them:

  • Data dumps expose the data in the form of XML files. Though parsers for XML data are available in R, most of them load the entire document into memory, which imposes an inherent limit when the files are huge (StackOverflow's XML files amount to about 30 GB). This limitation can be overcome by first loading the data into a local database such as MySQL and then working on only the required subset of the data; a small parsing sketch follows this list.
  • The Data Explorer imposes a row limit on the data extracted through it (the current limit is 50,000 rows per query). Also, frequent querying may not be feasible when the query results are large, because of the heavy network load involved.
  • The data, once extracted, requires multiple preprocessing steps to bring it into the required shape. Since most of the information on StackExchange is transactional in nature, certain nested attributes, such as the tags packed into a single string on each question, complicate this process even further; see the tag-splitting sketch after this list.
  • StackExchange is a user-driven platform that usually has high-quality posts. Even so, data quality can be a challenge and requires additional diligence from the end users analyzing the data. For instance, optional attributes like age and location may contain implausible values or be missing altogether. Similarly, even though moderators do a great job, tag quality can still be a cause for concern while analyzing the data; a small cleaning sketch follows this list as well.
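
The following is a minimal sketch of reading a dump file in R using the xml2 package, assuming a local Posts.xml downloaded from a StackExchange data dump; the attributes pulled out here are illustrative. It loads the whole file into memory, so it is suitable only for the smaller sites; for StackOverflow-scale files, load the data into MySQL first, as noted above.

    library(xml2)

    # Each <row> element in a dump file carries one record as XML attributes
    posts_xml <- read_xml("Posts.xml")          # loads the whole file into memory
    rows      <- xml_find_all(posts_xml, "//row")

    # Pull a few attributes into a data frame for analysis
    posts <- data.frame(
      Id           = xml_attr(rows, "Id"),
      CreationDate = xml_attr(rows, "CreationDate"),
      Score        = as.integer(xml_attr(rows, "Score")),
      Tags         = xml_attr(rows, "Tags"),    # present on questions only
      stringsAsFactors = FALSE
    )
    head(posts)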
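As an example of the preprocessing effort, the tags on each question arrive packed into a single string such as "<r><regression>", and most tag-level analyses first need one row per (post, tag) pair. A minimal sketch, assuming the posts data frame built in the previous snippet:

    # Tags are present on questions only; drop rows without them
    questions <- posts[!is.na(posts$Tags), ]

    # Extract the individual "<tag>" tokens, then strip the angle brackets
    tag_list <- regmatches(questions$Tags, gregexpr("<[^>]+>", questions$Tags))
    tag_list <- lapply(tag_list, function(x) gsub("[<>]", "", x))

    # One row per (post, tag) pair
    post_tags <- data.frame(
      Id  = rep(questions$Id, lengths(tag_list)),
      Tag = unlist(tag_list),
      stringsAsFactors = FALSE
    )
    head(post_tags)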
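Finally, a minimal data-quality guard for the optional user attributes, assuming a Users.xml file from the same dump that carries the optional Age and Location attributes; the plausible age range used here is purely illustrative:

    users_xml <- read_xml("Users.xml")
    urows     <- xml_find_all(users_xml, "//row")

    users <- data.frame(
      Id       = xml_attr(urows, "Id"),
      Age      = as.integer(xml_attr(urows, "Age")),  # optional: NA when absent
      Location = xml_attr(urows, "Location"),         # optional free text
      stringsAsFactors = FALSE
    )

    # Keep only ages in a plausible range before any age-based analysis
    clean_users <- users[!is.na(users$Age) & users$Age >= 13 & users$Age <= 100, ]
    summary(clean_users$Age)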