Although we were impressed with the model's overall consistency in many cases, we recognize that we certainly did not build the most accurate classification system ever. Crowdsourcing this task to millions of users was ambitious, and far from the easiest way of getting clearly defined categories. However, this simple proof of concept shows us a few important things:
No data scientist can build a fully functional and highly accurate classification system in just a few weeks, especially not on dynamic data. A proper classifier needs to be evaluated, trained, re-evaluated, tuned, and retrained for at least the first few months, and then re-evaluated at least every six months after that. Our goal here was to describe the components involved in a real-time machine learning application and to help data scientists sharpen their creative minds (out-of-the-box thinking is the #1 virtue of a modern data scientist).
In the next chapter, we will focus on article mutation and story de-duplication: how likely is a topic to evolve over time, and how likely is a clique of people (or a community) to mutate over time? By de-duplicating articles into stories, and stories into epics, can we predict the possible outcomes based on previous observations?