Summary

Data science is not just about machine learning. In fact, machine learning is only a small portion of it. In our view of modern data science, the science often happens right here, in the data enrichment process. The real magic occurs when one can transform a meaningless dataset into a valuable set of information and draw new insights from it. We have described how to build a fully functional data insight system using nothing more than a simple collection of URLs (and a bit of elbow grease).

In this chapter, we demonstrated how to create an efficient web scraper with Spark using the Goose library, and how to extract and de-duplicate features from raw text using NLP techniques and the GeoNames database. We also covered some useful design patterns, such as mapPartitions and Bloom filters, which are discussed further in Chapter 14, Scalable Algorithms.
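
As a reminder of how these two patterns fit together, here is a minimal sketch in plain Python. It is not the code used in this chapter: the class BloomFilter, the function dedupe_partition, and the default sizes are hypothetical names and values chosen for illustration. The idea is that a function with the shape expected by mapPartitions receives a whole partition as an iterator, so a costly or stateful object (here a Bloom filter) can be built once per partition instead of once per record.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Illustrative fixed-size Bloom filter (hypothetical helper).

    Membership tests never give false negatives, but may give false
    positives at a rate that depends on num_bits and num_hashes.
    """

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing) instead of running several hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def dedupe_partition(urls: Iterable[str]) -> Iterator[str]:
    """Takes an iterator over one partition and yields an iterator,
    which is the shape a mapPartitions function is expected to have."""
    seen = BloomFilter()  # built once per partition, not once per record
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

With PySpark, a function of this shape could be passed as rdd.mapPartitions(dedupe_partition). Two caveats apply to this sketch: duplicates are only removed within a partition, and a false positive would silently drop a URL that was never seen, so the filter has to be sized for the expected volume.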

In the next chapter, we will focus on the people we were able to extract from all these news articles. We will describe how to create connections among them using simple contact chaining techniques, how to efficiently store and query a large graph from a Spark context, and how to use GraphX and Pregel to detect communities.
