Inside the data analytics process

Data analytics applications involve more than just analyzing data. Before any analysis can begin, time and effort must be invested in collecting, integrating, and preparing the data, checking its quality, and then developing, testing, and revising the analytical methodology. Once the data is deemed ready, data analysts and data scientists can explore and analyze it using statistical tools such as SAS or machine learning frameworks such as Spark ML. The data itself is prepared by data engineering teams, while a data quality team checks what has been collected. Data governance also comes into play, ensuring the proper collection and protection of the data. A less commonly known role is that of the data steward, who specializes in understanding the data down to the byte: exactly where it comes from, every transformation it undergoes, and what the business really needs from each column or field of data.

Various parts of the business might record addresses differently, for example 123 N Main St as opposed to 123 North Main Street. Our analytics depends on a consistent address field; otherwise, the two addresses above will be treated as different values and the analysis will lose accuracy.
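
The following is a minimal sketch of this kind of standardization using Spark SQL functions. The `addresses` DataFrame, its `address` column, and the specific abbreviation rules are assumptions for illustration, not a complete address-cleansing solution:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AddressNormalization").getOrCreate()
import spark.implicits._

// Hypothetical sample data; in practice this comes from the source systems
val addresses = Seq("123 N Main St", "123 North Main Street").toDF("address")

// Upper-case, trim, and collapse common variants so both forms compare equal
val normalized = addresses.withColumn("address_norm",
  regexp_replace(
    regexp_replace(
      regexp_replace(upper(trim(col("address"))), "\\bNORTH\\b", "N"),
      "\\bSTREET\\b", "ST"),
    "\\s+", " "))

normalized.show(false)   // both rows now yield "123 N MAIN ST"
```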

The analytics process starts with data collection, driven by what the analysts need from the data warehouse and drawing on all sorts of data across the organization (sales, marketing, employee, payroll, HR, and so on). Data stewards and the governance team are important here to make sure the right data is collected and that any information deemed confidential or private is not accidentally exported, even if the end users are all employees.

Including Social Security numbers or full addresses in analytical datasets is usually a bad idea, as exposing such information can cause serious problems for the organization.
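
As a sketch of how such fields might be kept out of an analytical extract (the `warehouse/customers` path and the `ssn` and `full_address` column names are assumptions), sensitive columns can be dropped or replaced with a one-way hash before the data leaves the governed environment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("PrivacyFilter").getOrCreate()

// Hypothetical extract from the warehouse
val customers = spark.read.parquet("warehouse/customers")

// Keep only a one-way hash of the identifier so records can still be joined,
// and drop fields that should never leave the source system
val safeExtract = customers
  .withColumn("customer_key", sha2(col("ssn"), 256))
  .drop("ssn", "full_address")

safeExtract.write.mode("overwrite").parquet("analytics/customers_safe")
```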

Data quality processes must be established to make sure the data being collected and engineered is correct and matches the needs of the data scientists. At this stage, the main goal is to find and fix data quality problems that could affect the accuracy of the analysis. Common techniques are profiling and cleansing the data to make sure that the information in a dataset is consistent and that errors and duplicate records are removed.
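
Below is a minimal sketch of profiling and cleansing with Spark. The `analytics/raw_sales` path and the `order_id` and `amount` column names are assumptions made for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("DataQuality").getOrCreate()

// Hypothetical raw extract
val sales = spark.read.parquet("analytics/raw_sales")

// Profiling: summary statistics and per-column null counts give a first picture of quality
sales.describe().show()
sales.select(sales.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show()

// Cleansing: remove exact duplicates and rows missing mandatory fields
val cleansed = sales
  .dropDuplicates()
  .na.drop(Seq("order_id", "amount"))

cleansed.write.mode("overwrite").parquet("analytics/clean_sales")
```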

Data from disparate source systems may need to be combined, transformed, and normalized using various data engineering techniques, such as distributed computing (for example, MapReduce programming), stream processing, or SQL queries, and then stored on Amazon S3, a Hadoop cluster, NAS or SAN storage devices, or a traditional data warehouse such as Teradata. Data preparation, or data engineering, involves techniques to manipulate and organize the data for the planned analytics use.
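
The sketch below illustrates the SQL-query flavor of this integration with Spark. The S3 paths, table layouts, and column names are assumptions for the example rather than a prescribed design:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataIntegration").getOrCreate()

// Hypothetical sources: a CSV export from the CRM and a Parquet extract from the sales system
val crmCustomers = spark.read.option("header", "true").csv("s3a://raw-zone/crm/customers.csv")
val orders = spark.read.parquet("s3a://raw-zone/sales/orders")

crmCustomers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

// Combine and normalize the two sources with a SQL query
val combined = spark.sql("""
  SELECT c.customer_id,
         UPPER(TRIM(c.country))   AS country,
         CAST(o.amount AS DOUBLE) AS amount,
         TO_DATE(o.order_ts)      AS order_date
  FROM customers c
  JOIN orders o ON o.customer_id = c.customer_id
""")

// Store the integrated dataset back on S3 for the analytics teams
combined.write.mode("overwrite").parquet("s3a://analytics-zone/customer_orders")
```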

Once the data has been prepared, checked for quality, and made available to the data scientists or analysts, the actual analytical work starts. A data scientist can now build an analytical model using predictive modeling tools and languages such as SAS, Python, R, Scala, Spark, H2O, and so on. The model is initially run against a partial dataset to test its accuracy in the training phase. Several iterations of the training phase are common and expected in any analytical project. After adjustments at the model level, or sometimes going all the way back to the data steward to obtain or fix some of the data being collected or prepared, the model output tends to get better and better. Finally, a stable state is reached where further tuning does not change the outcome noticeably; at this point, the model can be considered ready for production use.
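
As a sketch of this train-and-evaluate loop with Spark ML (the `analytics/churn_features` dataset, its column names, and the choice of logistic regression are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val spark = SparkSession.builder().appName("ChurnModelTraining").getOrCreate()

// Hypothetical prepared dataset with numeric features and a binary label column
val data = spark.read.parquet("analytics/churn_features")

// Hold part of the data back so accuracy is measured on records the model has not seen
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42)

val assembler = new VectorAssembler()
  .setInputCols(Array("tenure_months", "monthly_spend", "support_calls"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("churned")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// Evaluate on the held-out split; iterate on features and parameters until this stabilizes
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("churned")
  .evaluate(model.transform(test))
println(s"Test AUC = $auc")

model.write.overwrite().save("models/churn_lr")
```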

Now the model can be run in production mode against the full dataset, generating outcomes or results that reflect how we trained it. The choices made in building the analysis, whether statistical or machine learning, directly affect the quality and the purpose of the model. You cannot look at grocery sales alone and figure out whether Asian customers buy more milk than Mexican customers, as that requires additional elements from demographic data. Similarly, an analysis focused on customer experience (returns or exchanges of products) relies on different techniques and models than one focused on revenue or up-selling to customers.
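
A minimal sketch of this production scoring step, assuming the pipeline model saved in the previous sketch and a hypothetical full feature set at `analytics/churn_features_full`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.PipelineModel

val spark = SparkSession.builder().appName("ChurnScoring").getOrCreate()

// Load the model trained earlier and score the full, current dataset
val model = PipelineModel.load("models/churn_lr")
val fullData = spark.read.parquet("analytics/churn_features_full")

val scored = model.transform(fullData)
  .select("customer_id", "prediction", "probability")

// Persist the results where downstream reports or applications can pick them up
scored.write.mode("overwrite").parquet("analytics/churn_predictions")
```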

You will see various machine learning techniques in later chapters.

Analytical applications can thus be realized using several disciplines, teams, and skillsets, and they range from generating reports all the way to automatically triggering business actions. For example, you can simply create daily sales reports to be emailed to all managers at 8 a.m. every day. But you can also integrate with business process management applications or a custom stock-trading application to take action, such as buying, selling, or alerting on activity in the stock market. You can even take in news articles or social media information to further influence the decisions being made.

Data visualization is an important piece of data analytics; it is hard to make sense of the numbers when you are staring at a lot of metrics and calculations. Instead, there is an increasing reliance on Business Intelligence (BI) tools, such as Tableau and QlikView, to explore and analyze data. Of course, large-scale visualization, such as showing all Uber cars in the country or heat maps of the water supply in New York City, requires more custom applications or specialized tools to be built.

Managing and analyzing data has always been a challenge for organizations of all sizes across all industries. Businesses have always struggled to find a pragmatic approach to capturing information about their customers, products, and services. When a company had only a handful of customers who bought a few of its items, this was not much of a challenge. But over time, companies and their markets grew and things became more complicated. Now there is branding information and social media; things are sold and bought over the Internet; and different solutions are needed. Web development, organizations, pricing, social networks, and segmentation all produce different kinds of data, and that brings far more complexity to handling, managing, and organizing the data and trying to gain insight from it.
