Summary

Although we were impressed with many of the overall model consistencies, we appreciate that we certainly did not build the most accurate classification system ever. Crowdsourcing this task to millions of users was ambitious, and by far not the easiest way to obtain clearly defined categories. However, this simple proof of concept demonstrates a few important things:

  1. It technically validates our Spark Streaming architecture.
  2. It validates our assumption of bootstrapping GDELT using an external dataset.
  3. It made us lazy, impatient, and proud.
  4. It learns without any supervision and improves with every batch.
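
The fourth point, unsupervised learning that improves batch by batch, can be sketched with a minimal online k-means update in plain Python. This is an illustration of the general technique, not our actual pipeline: the one-dimensional points, the two hypothetical clusters, and the learning rate are all made up for the example.

```python
import random

def update_centroids(centroids, batch, lr=0.1):
    """One online k-means step over a micro-batch: assign each point
    to its nearest centroid, then nudge that centroid toward the point.
    No labels are needed, and each batch refines the previous state."""
    for x in batch:
        # index of the nearest centroid (1-D points for simplicity)
        i = min(range(len(centroids)), key=lambda j: abs(centroids[j] - x))
        centroids[i] += lr * (x - centroids[i])
    return centroids

random.seed(42)
# deliberately poor initial guesses; the stream corrects them over time
centroids = [0.0, 10.0]
# simulate 50 micro-batches drawn from two hypothetical clusters (2 and 8)
for _ in range(50):
    batch = [random.gauss(2, 0.5) for _ in range(10)] + \
            [random.gauss(8, 0.5) for _ in range(10)]
    centroids = update_centroids(centroids, batch)
```

After enough batches the centroids settle near the true cluster centers, which is the same incremental behavior a streaming learner exhibits inside a Spark Streaming job.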

No data scientist can build a fully functional and highly accurate classification system in just a few weeks, especially not on dynamic data; a proper classifier needs to be evaluated, trained, re-evaluated, tuned, and retrained for at least the first few months, and then re-evaluated at least every six months. Our goal here was to describe the components involved in a real-time machine learning application and to help data scientists sharpen their creative minds (out-of-the-box thinking is the #1 virtue of a modern data scientist).

In the next chapter, we will focus on article mutation and story de-duplication: how likely is a topic to evolve over time, and how likely is a clique of people (or a community) to mutate over time? By de-duplicating articles into stories, and stories into epics, can we predict possible outcomes based on previous observations?
