Chapter 8. Learning Analytics on Spark

To continue our machine learning on Spark, we will extend our application to serve the educational sector in this chapter and the government sector in the next chapter. Specifically, in this chapter, we will serve learning organizations, such as universities and training institutions, by applying machine learning to improve learning analytics for a real case of predicting student attrition. In the next chapter, we will use Apache Spark machine learning to serve city governments, demonstrated with a real use case of predicting service requests.

Following the structure and process established in previous chapters, we will first review machine learning methods and related computing for the real case of predicting student attrition, and then discuss how Apache Spark makes them easy. By working through this real-life example, we will illustrate our machine learning process for predicting attrition with Big Data, step by step, in the following sections:

  • Spark for attrition prediction
    • Processing Big Data fast and easy with Spark
  • Methods for attrition prediction
    • Regression and decision trees
  • Feature preparation
    • Feature extraction and data merging
  • Model estimation
    • Distributed model estimation
  • Model evaluation
    • Confusion matrix and false positive ratio
  • Results explanation
    • Significant features and impacts
  • Model deployment
    • Rules and scoring

Spark for attrition prediction

In this section, we will start with a real use case and then describe how to prepare Apache Spark for this attrition prediction project.

The use case

NIY University is a private university that wants to improve its student retention using predictive modeling with Big Data. According to ACT's research (refer to http://www.act.org/research/policymakers/pdf/retain_2015.pdf), the average retention rate for American colleges was only about 68% in 2015, and it is even lower for two-year public colleges at 54.7% and for two-year private colleges at 63.4%. That is, about 32% of students left school before graduation, and attrition is even greater for two-year public colleges at 45.3% and for two-year private colleges at 36.6%. As student attrition is costly for both colleges and students, using Big Data to predict student attrition and to design interventions that prevent it has a lot of value.

The university has a lot of information about student demographics and its students' past test scores. It has also collected data on its students' online behavior on university websites, along with some social media data and some data on campus social activity. In particular, the university has collected a lot of data from its learning management system, as it uses MOODLE as its main learning platform. The goal of this project is to build a model that lets the university identify students at risk, understand how some of its interventions affect students' academic performance, and then work on student retention.

To sum up, for this project, we have a target variable of student performance measured by test scores and a categorical variable of attrition, along with a lot of data on demographics, behavior, performance, and interventions.

Through some preliminary analysis, the university has identified the following data challenges:

  • The data is not ready to use, especially the web log data, and some of the learning management system data needs to be developed into useful features ready for machine learning
  • Students come from various backgrounds and major in various subjects with various career goals, so attrition patterns differ widely from one group to another

To deal with these challenges, for this real project, we will use some feature development techniques plus some of the distributed computing techniques discussed in the previous chapter. We will especially focus on organizing our computations in notebooks and then implementing them in an integrated environment for distributed computing.
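As a rough illustration of the first challenge, the following sketch shows what developing raw log data into features can look like. It is a minimal, single-machine example in plain Python, not the university's actual pipeline: the record layout (student ID, timestamp, action), the action names, and the feature names are all assumptions made for illustration, and real MOODLE or web log exports are richer.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified log records: (student_id, ISO timestamp, action).
# This layout and these action names are assumptions for illustration only.
SAMPLE_LOG = [
    ("s001", "2015-09-01T09:00:00", "login"),
    ("s001", "2015-09-01T09:05:00", "view_course"),
    ("s001", "2015-09-03T20:15:00", "submit_assignment"),
    ("s002", "2015-09-02T11:30:00", "login"),
    ("s002", "2015-09-02T11:31:00", "view_course"),
]


def extract_features(log_records):
    """Aggregate raw events into one feature row per student."""
    counts = defaultdict(lambda: defaultdict(int))  # student -> action -> count
    active_days = defaultdict(set)                  # student -> distinct dates

    for student_id, timestamp, action in log_records:
        counts[student_id][action] += 1
        active_days[student_id].add(datetime.fromisoformat(timestamp).date())

    features = {}
    for student_id, action_counts in counts.items():
        features[student_id] = {
            "n_events": sum(action_counts.values()),
            "n_logins": action_counts.get("login", 0),
            "n_submissions": action_counts.get("submit_assignment", 0),
            "n_active_days": len(active_days[student_id]),
        }
    return features


features = extract_features(SAMPLE_LOG)
for student_id in sorted(features):
    print(student_id, features[student_id])
```

Each student ends up with one row of numeric features that can later be merged with demographic and performance data. On Spark, the same per-student aggregation would typically be expressed as a group-by on the student ID so that it runs in a distributed fashion over the full logs.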

Spark computing

After learning about Spark computing in the previous seven chapters, you should be familiar with setting up Spark computing projects by now. There are a few options, including the Databricks platform, IBM DataScientistWorkbench, SPSS on Spark, and Apache Spark with MLlib alone.

Any one of the preceding approaches should work well for this learning analytics project. In the following section, we will touch on one of the four approaches but will focus more of our effort on the Zeppelin notebook, as this approach was only briefly discussed in Chapter 5, Risk Scoring on Spark. The Zeppelin notebook is widely used, and it is similar to the Jupyter notebook used in IBM DataScientistWorkbench. Both Zeppelin and Jupyter have a similar coding style, can embed images, and can run different programming languages.

The Jupyter notebook is more mature in terms of abilities and utility, but its Scala support is weak. With Zeppelin, it's easier to mix languages in the same notebook: you can write some SQL and Scala, and then use Markdown to document it all together. You can also easily convert your notebook into a presentation style, perhaps to present it to management or to use it in dashboards.

Also, for practical use, you may take the code developed in this chapter, put it in a different notebook, and then run that notebook with any of the other approaches mentioned earlier, so that you are not limited to our Zeppelin with Spark approach.

Data uploading

The following screenshot shows how the Zeppelin starting page looks:

[Screenshot: the Zeppelin starting page]

To start, users can click on Create new note, the first line under Notebook in the left-hand column. A box will then open where users can type in the notebook's name and click on Create Note to create the new notebook:

[Screenshot: the box for creating a new note]