We are almost at the end of the book, and this last chapter draws on all the tricks and knowledge we covered in the previous chapters. We showed you how to use the power of Spark for data manipulation and transformation, and we showed you different methods for data modeling, including linear models, tree models, and model ensembles. Essentially, this chapter will be the kitchen sink of chapters: we will deal with many problems all at once, ranging from data ingestion, manipulation, preprocessing, outlier handling, and modeling all the way to model deployment.
One of our main goals is to provide a realistic picture of a data scientist's daily life: start with almost raw data, explore the data, build a few models, compare them, find the best model, and deploy it into production. If only it were this easy all the time! In this final chapter, we will borrow a real-life scenario from Lending Club, a company that provides peer-to-peer loans. We will apply all the skills you have learned to see if we can build a model that determines the riskiness of a loan. Furthermore, we will compare the results with actual Lending Club data to evaluate our process.