Chapter 8


Implementing data science – analytics, algorithms and machine learning

‘The formulation of a problem is often more essential than its solution … To raise new questions, new possibilities, to regard old problems from a new angle requires creative imagination and marks real advances in science.’ —Albert Einstein in The Evolution of Physics

Four types of analytics

It’s quite possible that your biggest initial wins will be from very basic applications of analytics. Analytics can be extremely complex, but it can also be very simple, and the most basic of applications are sometimes the most valuable. The preliminary tasks of collecting and merging data from multiple sources, cleaning the data and summarizing the results in a well-designed table or graph can produce substantial business value, eliminating fatal misconceptions and clearly highlighting performance metrics, costs, trends and opportunities.

Gartner has developed a useful framework for classifying application areas of analytics, shown below. Their Analytics Ascendancy Model (Figure 8.1) divides analytic efforts into four categories: descriptive, diagnostic, predictive and prescriptive, and I find it a helpful structure for discussing analytics.

Figure 8.1 Gartner’s Analytics Ascendancy Model.

Descriptive analytics

As you consider your pain points and work to raise KPIs and to reach your strategic goals, many of the issues that surface will be as simple as ‘We don’t know X about our customers’ (behaviour, demographics, lifetime value, etc.) or ‘We don’t know X about ourselves’ (costs, inventory movements, marketing effectiveness, etc.). Getting data-driven answers to such factual questions is what we call ‘descriptive analytics’. It is the process of collecting, cleaning and presenting data to get immediate insights.

You’ll look to descriptive analytics for three applications:

  • Operational requirements. Each of your departments will need data to operate effectively. Your finance team, in particular, cannot function without regular, accurate, and consolidated figures. This is why companies often place their business intelligence (BI) teams within the finance division.
  • Data insights. Put as much data as possible into the hands of decision makers. You may choose to have your data (BI) teams situated close to your strategy team and have analysts embedded within your various business units (more on this in Chapter 10).
  • Damage control. If you’ve been operating without visibility into your data, you may be blindsided by a drop in one or more KPIs. You need to quickly determine what happened and do damage control. The less data-driven your company, the more likely it is that the crisis will be directly related to profit or revenue (otherwise you would have detected a change in leading indicators). You’ll first need to make up ground in descriptive analytics, then move quickly to diagnostic analytics.

To do descriptive analytics well, you’ll need the following:

  • Specially designed databases for archiving and analysing the data. These databases are most commonly referred to as data warehouses, although you may use similar technologies with other names.
  • A tool for constructing and delivering regular reports and dashboards. Very small companies start with a tool such as MS Excel, but most should be using a dedicated BI tool.
  • A system that allows your business users to do self-service analytics. They should be able to access data tables and create their own pivot tables and charts. This greatly accelerates the process of data discovery within your organization. To implement a self-service system, you’ll need to establish additional data access and governance policies (see Chapter 11).

Diagnostic analytics

Diagnostic analytics are the problem-solving efforts, typically ad hoc, that bring significant value and often require only minimal technical skills (typically some Structured Query Language (SQL) and basic statistics). The diagnostic effort consists largely of bringing together potentially relevant source data and teasing out insights, either by creating graphs that visually illuminate non-obvious trends or through feature engineering (creating new data fields from your existing data, such as calculating ‘time since last purchase’ from your customer sales records).
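
To make this concrete, here is a minimal sketch of such feature engineering in Python using the pandas library; the table, column names and dates are invented purely for illustration.

  import pandas as pd

  # Hypothetical customer sales records (in practice, pulled from your data warehouse).
  sales = pd.DataFrame({
      'customer_id': [101, 101, 102, 103, 103, 103],
      'purchase_date': pd.to_datetime(['2017-01-05', '2017-03-20', '2017-02-11',
                                       '2016-12-01', '2017-01-15', '2017-04-02']),
      'amount': [25.0, 40.0, 15.5, 60.0, 22.5, 18.0],
  })
  today = pd.Timestamp('2017-07-01')

  # Engineer per-customer features: recency, frequency and average spend.
  features = sales.groupby('customer_id').agg(
      last_purchase=('purchase_date', 'max'),
      purchase_count=('purchase_date', 'count'),
      avg_purchase=('amount', 'mean'),
  )
  features['days_since_last_purchase'] = (today - features['last_purchase']).dt.days
  print(features)

A handful of engineered fields like these, dropped into a pivot table or a well-chosen graph, is often all the diagnostic effort needs.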

A well-designed graph can be surprisingly useful in providing immediate insights. Consider the graphs presented in Figure 8.2 from Stephen Few’s book.60 Start by looking at the simple data table below to see if any insights or trends stand out.

Table: employee job satisfaction by age group and category (the eight data points charted in Figure 8.2).

Now consider two different charts constructed from these same eight data points:

Figure 8.2 Employee job satisfaction.60

Notice that the second graph immediately provides a visual insight which does not stand out from the bar chart or the data table: that job satisfaction decreases with age in only one category.

Visuals are powerful tools for diagnostic analytics, and they demonstrate how analytics can be both an art and a science. You’ll need creativity and visualization skills to discover and communicate insights through well-designed tables and graphs. As we’ll discuss more in Chapter 10, it’s important to have someone on your analytics team who is trained in designing graphs and dashboards that can bring your analysis to life.

Predictive analytics

Predictive analytics help you understand the likelihood of future events, such as providing revenue forecasts or the likelihood of credit default.

The border between diagnostic analytics and predictive analytics is somewhat vague, and it is here that we see techniques that require more advanced analytics. Consider customer segmentation, in which an organization divides its customer base into a relatively small number of segments (personas), allowing them to customize marketing and products.

Your initial segments probably used customer demographics (age, gender, location, income bracket, etc.), but you should also incorporate more refined data, such as expressed preferences and habits. RFM (Recency, Frequency, Monetary value) is an example of a traditional customer segmentation approach based on purchase history, but the data that informs today’s segmentations should also include big data sources: the online customer journey and sometimes even audio and visual data. To form these advanced segments, you’ll bring in a lot of data and should use specialized statistical and algorithmic skills (such as principal component analysis, support vector machines, clustering methods and neural networks).
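
To give a feel for the mechanics, here is a minimal sketch of an RFM-style segmentation using k-means clustering from scikit-learn; the customer table is randomly generated and the choice of four segments is an illustrative assumption, not a recommendation.

  import numpy as np
  import pandas as pd
  from sklearn.preprocessing import StandardScaler
  from sklearn.cluster import KMeans

  # Hypothetical customer-level RFM table (one row per customer).
  rng = np.random.default_rng(0)
  rfm = pd.DataFrame({
      'recency_days': rng.integers(1, 365, 500),
      'frequency': rng.integers(1, 50, 500),
      'monetary': rng.gamma(2.0, 50.0, 500),
  })

  # Standardize so that no single feature dominates the distance metric.
  scaled = StandardScaler().fit_transform(rfm)

  # Cluster into four illustrative segments (personas).
  kmeans = KMeans(n_clusters=4, random_state=0).fit(scaled)
  rfm['segment'] = kmeans.labels_

  # The average profile of each segment is a starting point for naming personas.
  print(rfm.groupby('segment').mean())

In practice you would enrich the feature table with behavioural and big data sources before clustering, and you would compare several cluster counts and algorithms.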

You’ll be able to do much more powerful analysis in predictive analytics. You may be predicting events several years in the future, such as credit default; several weeks or months in the future, when forecasting revenue, supply or demand; several days in the future, when forecasting product returns or hardware failures; or even just a moment in the future, when predicting cart abandonment or the likely response to a real-time advertisement.

The abilities to accurately forecast supply and demand, credit default, system failure or customer response can each have significant impact on your top and bottom lines. They can greatly increase customer satisfaction by enabling adequate inventory levels, relevant advertising, minimal system failures and low likelihood of product return.

Prescriptive analytics

Prescriptive analytics tell you what should be done. Think here of optimal pricing, product recommendations, minimizing churn (including cart abandonment), fraud detection, minimizing operational costs (travel time, personnel scheduling, material waste, component degradation), inventory management, real-time bid optimization, etc.

You’ll often use these four analytic layers in succession to reach a goal. For example:

  1. descriptive analytics would flag a revenue shortfall;
  2. diagnostic analytics might reveal that it was caused by a shortage of key inventory;
  3. predictive analytics could forecast future supply and demand; and
  4. prescriptive analytics could optimize pricing, based on the balance of supply and demand, as well as on the price elasticity of the customer base.

Keep in mind

You can realize tremendous value simply by organizing your data and making it available across the organization. More advanced projects can come later.

Models, algorithms and black boxes

As you move into more advanced analytics, you’ll need to choose analytic models. Models are sets of formulas that approximately describe events and interactions around us. We apply models using algorithms, which are sequences of actions that we instruct a computer to follow, like a recipe. As you employ an analytic model to solve your business problem (such as predicting customer churn or recommending a product), you’ll need to follow three steps:

  1. Design the model;
  2. Fit the model to the data (also known as ‘training’ or ‘calibrating’ the model); and
  3. Deploy the model.

Designing the model

Your business problems can typically be modelled in multiple ways. You’ll find examples of how your application has traditionally been modelled in textbooks or online, but you may also experiment with applying models in creative ways. A description of the various models is beyond the scope of this book, but to get a quick overview of commonly used analytic models, pull up the documentation of a well-developed analytic tool such as RapidMiner, KNIME or SAS Enterprise Miner, or check out the analytics libraries of the programming languages Python or R. These won’t provide exhaustive lists of models, but they will give you a very good start. You are more likely to find a broader range of (academic) algorithms in the R language, but you may need to use Python for cutting-edge big data applications. More on this later.

When speaking about models, we sometimes use the term model transparency to indicate the ease with which a model can be explained and understood intuitively, particularly for a non-technical audience. An example of a transparent model would be one that computes insurance risk based on age and geography, and that is calibrated using historical insurance claims. It’s easy to understand the factors that influence such insurance premiums. A model that is completely non-transparent is called a black-box model, because the end user will not be able to understand its inner workings.
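
As an illustration of what such a transparent model might look like in code, the sketch below fits a logistic regression on just age and region to estimate claim risk; the data and column names are invented, and a real insurer would of course use far richer inputs.

  import pandas as pd
  from sklearn.linear_model import LogisticRegression

  # Hypothetical historical claims: age, region and whether a claim was filed.
  claims = pd.DataFrame({
      'age': [23, 35, 47, 52, 29, 61, 44, 38],
      'region': ['urban', 'rural', 'urban', 'rural', 'urban', 'rural', 'urban', 'rural'],
      'claimed': [1, 0, 1, 0, 1, 0, 0, 0],
  })

  # One-hot encode region so the model remains a simple weighted sum of factors.
  X = pd.get_dummies(claims[['age', 'region']], columns=['region'], drop_first=True)
  y = claims['claimed']
  model = LogisticRegression().fit(X, y)

  # The fitted coefficients are directly interpretable: sign and size show how
  # each factor pushes the estimated claim risk up or down.
  print(dict(zip(X.columns, model.coef_[0])))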

Choose simple, intuitive models whenever possible. A simple model, such as a basic statistical model, is easier to develop, to fit to the data, and to explain to end users than is a more complicated model, such as non-linear support vector machines or neural networks. In addition, transparent models allow end users to apply intuition and suggest modelling improvements, as we saw previously in the University of Washington study, where model improvements suggested by non-technical staff raised accuracy by nearly 50 per cent.

Model transparency is particularly important for applications where outcomes must be explained to healthcare patients, government regulators or customers (e.g. rejection of a loan application).

Black box models from big data

In our daily lives, we draw certain conclusions from concrete facts and others from intuition, an invaluable skill, which is nearly impossible to explain. A thermometer gives a clear indication of fever, but we can simply look at someone we know and perceive when they aren’t feeling well. Through years of experience looking at many healthy people and sick people, combined with our knowledge of what this person typically looks like, we ‘just know’ that the person is not feeling well. Remarkably, models such as neural networks work in a very similar way, trained by millions of sample data points to recognize patterns. Big data enables much stronger computer ‘intuition’ by providing a massive training set for such machine learning models.

These models, which improve by consuming masses of training data, have recently become significantly better than any transparent algorithm for certain applications. This makes us increasingly dependent on black-box models. Even for classic business applications such as churn prediction and lead scoring, for which we already have reasonably effective and transparent models, data scientists are increasingly testing the effectiveness of black-box models, particularly neural networks (deep learning).

Although an ML method such as neural networks can recognize patterns through extensive training, it is not able to ‘explain’ its pattern recognition skills. When such a model wrongly labels a dog as an ostrich (as illustrated in Chapter 2), it is very difficult to detect and explain what went wrong.

This lack of transparency is a significant problem in areas such as insurance, law enforcement and medicine. Consider an article published in Science magazine in April 2017,61 describing recent results in the prediction of cardiovascular disease (heart attacks, strokes, etc.). Currently, many doctors evaluate patient risk in this area using eight risk factors, including age, cholesterol level and blood pressure. Researchers at the University of Nottingham recently trained a neural network to detect cardiovascular events, using as inputs the medical records of nearly 400,000 patients. Their model achieved both a higher detection rate (+7.6 per cent) and a lower false alarm rate (−1.6 per cent) than the conventional eight-factor method.

Such a model can potentially save millions of lives every year, but being a neural network it is a black box and hence raises several concerns.

  • If patients ask why they were labelled as high risk, the black-box model will not provide an answer, leaving them with little guidance on how to reduce their risk.
  • If insurance companies use this more accurate method to calculate insurance premiums, they will be unable to justify those premiums to their customers. At this point, we also run into legal implications, as some countries are introducing so-called ‘right to explanation’ legislation, mandating that customers be given insight into the decision processes that impact them.

There has been recent progress in adding transparency to black box models. The University of Washington has recently developed a tool they call LIME – Local Interpretable Model-Agnostic Explanations, which they describe as ‘an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.’62 Their tool works by fitting local linear approximations, which are easy to understand, and it can be used for any model with a scoring function (including neural networks).
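
For reference, here is a minimal sketch of how the open-source lime package is typically applied to tabular data; the random forest and the iris data set are placeholders standing in for your own black-box model, and the API details may differ between versions.

  from lime.lime_tabular import LimeTabularExplainer
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  # Train any 'black box' classifier that exposes a scoring function (predict_proba).
  iris = load_iris()
  model = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

  # Build an explainer around the training data.
  explainer = LimeTabularExplainer(iris.data,
                                   feature_names=iris.feature_names,
                                   class_names=iris.target_names,
                                   mode='classification')

  # Explain a single prediction with a local, interpretable approximation.
  explanation = explainer.explain_instance(iris.data[0], model.predict_proba, num_features=2)
  print(explanation.as_list())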

Fitting the model to the data

After a trained analyst has selected a model (or several candidate models), they’ll fit the model to your data, which involves:

  1. choosing the exact structure of the model; and
  2. calibrating the parameters of the model.

Choosing the structure of the model involves deciding which features in your data are most important. You’ll need to decide how to create binned categories (where numbers are grouped into ranges) and whether an engineered feature such as ‘average purchase price’ or ‘days since last purchase’ should be included. You might start with hundreds of potential features but use only a half dozen in your final model.

When using neural networks (including deep learning), you won’t need to walk through this variable selection or feature engineering process, but you will need to find a network architecture that works for your problem (some sample architectures are illustrated in Chapter 2). You’ll follow a trial-and-error approach to choosing the types and arrangements of nodes and layers within the network.

As you calibrate the model parameters, you’ll tune the structured model to best fit the training data. Each model will have one or more associated algorithms for adjusting parameters to maximize some target score, which itself describes how well the model fits the training data. Take care to avoid over-fitting the training data, for example by using a method called cross-validation or by applying goodness-of-fit tests.

During the modelling process, try several possible model structures or architectures, using specialized tools and programs to optimize the parameters for each. Assess the effectiveness of each structure or architecture and select one that seems best.
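
Here is a minimal sketch of that select-and-calibrate loop, using cross-validated grid search from scikit-learn; the decision tree family, the parameter grid and the synthetic data are all illustrative choices.

  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV
  from sklearn.tree import DecisionTreeClassifier

  # Placeholder training data standing in for your prepared feature table.
  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

  # Candidate structures: here, decision trees of varying depth and leaf size.
  grid = {'max_depth': [2, 4, 8, None], 'min_samples_leaf': [1, 5, 20]}

  # Cross-validation guards against over-fitting while the parameters are tuned.
  search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
  search.fit(X, y)
  print(search.best_params_, round(search.best_score_, 3))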

Deploying the model

Develop your models in a programming language and environment suitable for rapid prototyping, using a limited and cleaned data set. Bring in more data after you’ve demonstrated the effectiveness of the model. Don’t spend time making the model fast and efficient until it has proven its worth, perhaps not even until after it has been deployed to production in a limited capacity.

Many data scientists prototype in the Python or R languages, testing on laptops or a company server. When the code is ready to be deployed to production, you may want to completely re-write it in a language such as C++ or Java and deploy it to a different production system. You’ll connect it to production data systems, requiring additional safeguards and governance.

Work with your IT department in selecting:

  • hardware for deployment, including memory and processor (typically a central processing unit (CPU), but a graphics processing unit (GPU) or tensor processing unit (TPU) for neural networks);
  • architecture that will satisfy requirements for speed and fault-tolerance;
  • choice of programming language and/or third-party tool (such as SAS or SPSS);
  • scheduling of the computational workload, such as whether it should be run centrally or locally, possibly even at the point of data input (recall our discussion of fog computing in Chapter 5).

Work with IT to cover the normal operational processes: backups, software updates, performance monitoring, etc.

Work with your company’s privacy officer to make sure you are satisfying requirements for data governance, security and privacy. Your model may be accessing a database with personal information even though the model results do not themselves contain personal information, and this may put you at risk. A government regulation such as Europe’s General Data Protection Regulation (GDPR) may allow use of personal data for one analytic purpose but not for another, depending on the permissions granted by each individual. I’ll talk more about this in Chapter 11, when I discuss governance and legal compliance.

Artificial intelligence and machine learning

I introduced the concepts and recent developments within artificial intelligence in Chapter 2. I’ll approach the topic from a more practical angle in this section, putting AI into context within the broader set of analytic tools. Recall that deep learning is a form of neural networks, reflecting the fact that we now have the technology to run networks with many more layers, hence ‘deep’. When I use the term artificial neural networks (ANNs), I’m including deep learning.

You’ve probably noticed that much of the recent focus within AI has been on ANNs, which are doing a great job solving several classes of problems. But don’t expect ANNs to be the best or even a suitable alternative for all problems. They have the advantage of working in domain-agnostic contexts, but if you have specific domain knowledge to bring to bear on your problems, you can incorporate that knowledge in the construction of a more transparent and often more accurate model. In addition, the black-box nature of ANNs generally dictates that you avoid using them except in cases where they provide a clear advantage over more transparent models.

ANNs are especially effective for problems with very large data sets and in which the complexity of the data makes it difficult to apply domain knowledge in your model. Problems involving images are prime candidates for ANNs. Problems involving large amounts of unstructured data with hundreds of features, particularly text, are also good candidates, as ANNs do not require the feature engineering of traditional machine learning techniques. ANNs can work well for natural language processing, but are not always better than alternative methods (such as information retrieval combined with XGBoost).63 Problems with few features and relatively small amounts of data are generally not good candidates for ANNs.

Dozens of network architectures have been developed,64 with perhaps the most important being convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Some models use a hybrid mix of architectures. CNNs are particularly useful for image and video analysis and were also used by AlphaGo. RNNs work well with sequential data, such as EEGs and text.

Although building an ANN lets you save effort in structuring your data and engineering features, you’ll need to select a network architecture and then train the model. Training is an iterative process of adjusting the model parameters to optimize the accuracy of the model against the available data (the training data), which itself must be labelled. This training process is probably the most complex part of utilizing an ANN.

It’s becoming much easier to write code to build and deploy ANNs. Google has recently open-sourced TensorFlow, their internal machine learning library for developing and deploying ANNs (as well as other ML applications). TensorFlow is one of several such libraries automating the work required to build, train, and deploy models to various target platforms. Your choice of deployment platform for your ANN is important, since they run much faster on certain types of processors.

You can utilize additional software tools to speed the development process. The Python library Keras, for example, can operate on top of TensorFlow.
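
To give a feel for how little code these libraries require, here is a minimal Keras sketch of a small feed-forward network; the layer sizes, training data and import paths are placeholders and will depend on the Keras/TensorFlow versions you have installed.

  import numpy as np
  from keras.models import Sequential
  from keras.layers import Dense

  # Placeholder training data: 1,000 samples with 20 features and binary labels.
  X = np.random.rand(1000, 20)
  y = (X.sum(axis=1) > 10).astype(int)

  # A small feed-forward network; real architectures are chosen by trial and error.
  model = Sequential([
      Dense(16, activation='relu', input_shape=(20,)),
      Dense(8, activation='relu'),
      Dense(1, activation='sigmoid'),
  ])
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  # Training iteratively adjusts the weights to fit the labelled data.
  model.fit(X, y, epochs=5, batch_size=32, verbose=0)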

AI models such as ANNs are not silver bullets and are still only one part of a larger analytic toolkit. AlphaGo beat the world champion at Go by combining neural networks with the Monte Carlo simulation methods traditionally used for Go.65 When Apple added deep learning technology to their Siri AI in the summer of 2014, they retained some of their previous analytic models to work alongside it.66

Keep in mind

AI models such as deep learning are not silver bullets and are still only one part of a larger analytic toolkit. Don’t rush onto the AI bandwagon until you’ve considered the business benefits and alternative solutions.

In considering what analytic tools to use in your business and whether you should use AI or ML, start by considering the models with proven value for your business problem, taking into consideration available data and resources as well as model complexity, transparency and accuracy.

You can often match your specific business challenges to common classes of data science applications, each of which will have an associated set of algorithms with strong track records of success in addressing those challenges. For example, if you are optimizing personnel schedules, you’ll typically use a technique called integer programming; if you are pricing financial instruments, you’ll solve financial equations or run Monte Carlo simulations; but if you are building customer segments, you may choose from several viable models, such as logistic regression, support vector machines, decision trees, or a host of other available algorithms, including neural networks.
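
As one small example of the Monte Carlo approach mentioned above, this sketch prices a European call option by simulating terminal prices under geometric Brownian motion; all parameter values are illustrative.

  import numpy as np

  # Illustrative inputs: spot price, strike, risk-free rate, volatility, maturity.
  S0, K, r, sigma, T = 100.0, 105.0, 0.02, 0.2, 1.0
  n_paths = 100_000

  # Simulate terminal prices under geometric Brownian motion.
  rng = np.random.default_rng(0)
  Z = rng.standard_normal(n_paths)
  ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * Z)

  # The discounted average payoff approximates the option price.
  payoff = np.maximum(ST - K, 0.0)
  print(round(np.exp(-r * T) * payoff.mean(), 2))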

What remains, even with these advances in hardware and software tools, is the human resource challenge of implementing such applications. According to Gartner, there were 41,000 open deep-learning positions at the beginning of 2017, an astounding number when you consider that there were almost no such positions in 2014.67 The opportunities afforded by integrating recent advances in neural networks into business applications will therefore continue to strain your ability to build data science teams that cover the increasingly broad span of relevant analytic skills. I’ll return to this topic in Chapter 10, when I talk about building an analytics team.

Analytic software

Databases

As you become more ambitious in the types and quantities of data you use, you’ll need to look beyond traditional methods of storing and retrieving data. Recent innovations in non-traditional databases for diverse data types form a fundamental component of the big data ecosystem. These databases are often referred to as NoSQL (‘not only SQL’) databases, since data can be retrieved from them in ways beyond the Structured Query Language (SQL). A key feature of these new databases is that the structure (the schema) can be defined on the fly, so we call them ‘schema-less databases’ and talk about ‘schema-on-read’. They are typically designed for efficient horizontal scaling, so we can grow their capacity by adding more machines rather than buying more expensive ones.

You’ll need to choose the traditional and non-traditional databases that are most helpful for your applications. I’ll now briefly review some of the primary types of databases within the big data ecosystem. To give an idea of the extent of industry activity around each database type, I’ll add in parentheses the category ranking scores from db-engines.com as at July 2017.68

Relational databases (80 per cent)

These have been the standard databases for operational use for the past 30–40 years. They sit within a relational database management system (RDBMS) and consist of individual tables containing data rows with pre-determined columns, such as first name, last name, customer ID, phone number, etc. The tables are related when they share columns containing the same information. For example, if there is a customer ID column in both the customer details table and the sales table, then you can cross-reference the two tables to compute sales grouped by customer postal code. The same relational database can be designed for operational use or for use in analytics and reporting (as a data warehouse).
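
As a small illustration of that cross-referencing, the sketch below performs the equivalent join and aggregation with pandas; in a real RDBMS you would express the same logic in SQL, and the tables here are invented.

  import pandas as pd

  # Two related tables sharing a customer_id column.
  customers = pd.DataFrame({'customer_id': [1, 2, 3],
                            'postal_code': ['1011', '1011', '2022']})
  sales = pd.DataFrame({'customer_id': [1, 1, 2, 3],
                        'amount': [20.0, 35.0, 10.0, 50.0]})

  # Cross-reference the tables and group sales by postal code,
  # the same logic a SQL JOIN ... GROUP BY would express.
  result = (sales.merge(customers, on='customer_id')
                 .groupby('postal_code')['amount'].sum())
  print(result)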

Document-oriented databases (7 per cent)

These are designed for large-scale storage and retrieval of documents, typically containing data stored in flexible XML or JSON formats. The most commonly used document-oriented database is MongoDB, which is open-source. Document-oriented databases serve well as gateway noSQL solutions, since they can quickly provide general functionality.

Search engine databases (4 per cent)

These are used to power onsite search on many websites, returning search results over vast amounts of inventory using customizable logic to match results to user search queries. With such fundamental functionality, they are often the first foray of websites into the big data ecosystem and are designed to address both the velocity and the variety challenges of big data, particularly for search. These databases are sometimes used for general data storage and analysis, although care should be taken here. Some of the most commonly used search engine databases are Elasticsearch, Solr and Splunk.

Key-value stores (3 per cent)

Entries in these databases are simply key-value pairs. They can get many simple results very quickly, which is particularly useful for online, customer-facing applications. Key-value stores address the velocity challenge of big data.

Wide column stores (3 per cent)

Similar to relational databases in functionality, but providing the flexibility to add data fields on the fly, wide column stores address the variety challenge of big data. For example, a relational database might have 20 pre-defined customer data columns, whereas a wide column store would allow on-the-fly creation of any column type for any customer. If you started a new initiative after several years, such as a premium membership class, you could simply add the required additional columns, such as membership number or total membership points, to a selection of customer records. The data rows for non-members would not change.

Graph databases (1 per cent)

These databases store data in the structure of a graph (a network of nodes and edges). They allow you to query data based on attributes and relationships. For example, you could easily find all of a customer’s third-degree connections with a given postal code and membership status. Graph databases take advantage of sparsity and structural features to enable very fast execution of queries that would involve tragically slow multiple inner joins on a traditional relational database. In Chapter 6, we saw an example of using a graph database to de-duplicate customers.
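
To illustrate the kind of query described above, here is a sketch using the networkx library on a toy customer graph; a production graph database would run an equivalent query natively in its own query language, and the nodes and attributes here are invented.

  import networkx as nx

  # Toy customer graph: nodes carry attributes, edges represent connections.
  G = nx.Graph()
  G.add_nodes_from([
      ('alice', {'postal_code': '1011', 'member': True}),
      ('bob', {'postal_code': '2022', 'member': False}),
      ('carol', {'postal_code': '1011', 'member': True}),
      ('dave', {'postal_code': '1011', 'member': True}),
  ])
  G.add_edges_from([('alice', 'bob'), ('bob', 'carol'), ('carol', 'dave')])

  # Find everyone within three hops of 'alice' with a given postal code
  # and membership status.
  distances = nx.single_source_shortest_path_length(G, 'alice', cutoff=3)
  matches = [node for node, dist in distances.items()
             if 0 < dist <= 3
             and G.nodes[node]['postal_code'] == '1011'
             and G.nodes[node]['member']]
  print(matches)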

Choosing a database

You may be overwhelmed by the several hundred databases available in the market today. When selecting a database appropriate to your use case, consider not only the type and the cost of the database, but also its place within your current technology stack, the breadth of its adoption within the industry (which impacts staffing, maintenance and future capabilities), its scalability and concurrency, and the trade-off between consistency, availability and partition tolerance (according to Brewer’s CAP theorem, proven in 2002, a distributed database can fully guarantee at most two of these three). Some of these factors may be critical for your application, while others may be less important.

You can find an ordered list of the currently popular databases for different categories at db-engines.com, which also shows recent trends (see Figure 8.3). At the time of writing, time series databases have been gaining interest faster over the past 12 months than any other type of database (possibly due to their use in IoT), but they are still very much overshadowed by the other types mentioned above. The market research and advisory firms Gartner and Forrester regularly publish detailed analysis of the databases offered by many larger vendors in their publications known as Gartner Magic Quadrants and Forrester Waves.

Figure 8.3 Number of listed database systems per category, July 2017.68

Programming languages

When developing your analytic models, choose a programming language that fits within your broader IT organization, has well-developed analytics libraries, and integrates well with other data and analytic tools you are likely to use. There is no single best language for analytics, but the top two contenders in online forums are R and Python, at least for the initial stages of development.

In addition to personal preferences, check constraints of the IT environment in which you are working and of third-party software you might use. For example, Python is typically one of the first languages supported by open-sourced big data projects (as was the case for TensorFlow and Hadoop streaming), but many analysts come out of academia with extensive experience in R. Those from banking environments are typically familiar with SAS, which itself has an extensive ecosystem, including the powerful (and relatively expensive) SAS Enterprise Miner.

Some companies allow analysts to choose their own language for prototyping models, but require that any model deployed to a production environment first be coded in a compiled language such as C++ or Java and be subjected to the same rigorous testing and documentation requirements as all other production code. Some deploy the analytic models as REST services, so that the code runs separately from other production code.
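
As a sketch of the REST approach, the snippet below wraps a trained model in a small Flask service; the endpoint name, the toy model and the input format are assumptions, and a production deployment would add authentication, input validation, logging and monitoring.

  from flask import Flask, request, jsonify
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression

  # Train (or load) the model once, when the service starts.
  iris = load_iris()
  model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

  app = Flask(__name__)

  @app.route('/predict', methods=['POST'])
  def predict():
      # Expect a JSON body such as {"features": [5.1, 3.5, 1.4, 0.2]}.
      features = request.get_json()['features']
      prediction = model.predict([features])[0]
      return jsonify({'prediction': int(prediction)})

  if __name__ == '__main__':
      app.run(port=5000)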

Analytic tools

You can create your models from the ground up, but it is often faster (and less error prone) to use third-party analytic software, whether this be application components such as SAS’s Enterprise Miner or IBM’s SPSS, standalone tools such as RapidMiner or KNIME, cloud-based services such as Azure/Amazon/Google ML engine, or open-sourced libraries such as Python’s scikit-learn, Spark’s MLlib, Hadoop’s Mahout, Flink’s Gelly (for graph algorithms), etc. You’ll get pre-built algorithms, which typically work well alongside custom-built R or Python code.

Choose a good visualization tool for creating charts. Excel will get you started, and languages such as R and Python have standard plotting libraries, but you’ll want to progress to tools with more functionality. Specialized BI systems such as Tableau, Microsoft Power BI, Qlik (plus a few dozen others) will integrate easily with your data sources, and technical tools such as D3.js will allow you to create even more impressive and responsive charts within web browsers. Most companies use the plotting functionality of an off-the-shelf BI tool, which also provides the data integration, concurrency, governance and self-service required within an enterprise environment. Self-service capabilities are very important in BI tools, as they empower your users to explore data on their own, so choose a tool with a low learning curve.

The market for visualization software is booming, and there are rapid changes in market leadership. Vendors are improving their presentation, self-service capabilities, accessibility of diverse data sources and the analytic capabilities that they provide as value-adds. But your visualization tools will only come into their own in the hands of specialized experts. Form your analytics teams with this skill in mind. We’ll return to this topic in Chapter 10.

Agile analytics

There are two main methods for project planning: waterfall and agile. Waterfall is a traditional approach in which the project is first planned in its entirety and then built from that planning. Agile is a more innovative approach in which small, multi-functional teams deliver incremental products, which eventually grow into the full solution.

The short delivery cycles in agile reduce the risk of misaligned delivery and force teams to build with an eye to modularity and flexibility. In addition, the focus within agile on cross-functional teams helps ensure that analytic initiatives are undergirded by necessary data, infrastructure and programming support, and are continuously re-aligning with business goals and insights.

Agile project management is gaining popularity, particularly in technology companies. It is especially helpful in big data analytics projects, where challenges and benefits are less clearly understood and more difficult to anticipate, and where underlying tools and technology are changing rapidly. Agile is designed for innovation, and it pairs well with big data projects, which themselves focus on agility and innovation.

In IT and analytics, agile methodologies are most often carried out using the framework called scrum (think rugby), which is employed in one of its forms at least five times as often as other agile frameworks.69 Even departments outside of IT work with scrum, and it is not uncommon to see HR or marketing teams standing around scrum planning boards.

Agile methodologies are being embraced even at the highest levels within corporations, as companies adopt the principles of ‘fail fast’ and ‘nail it, then scale it.’ In the context of their recent digital initiative, General Electric (GE) has been developing what they call a ‘culture of simplification’: fewer layers, fewer processes and fewer decision points. They’ve adapted lean principles in what they call ‘Fast Works’ and have broken away from many traditional annual operating cycles. As their (former) CEO Jeff Immelt said, ‘in the digital age, sitting down once a year to do anything is weird; it’s just bizarre.’70

Keep in mind

Business feedback is key to making agile work. Work in short delivery cycles and solicit frequent feedback from your stakeholders.

It’s important to emphasize this once more. Don’t try to solve your full problem at once. Don’t try to assemble a complete, cleaned data set before starting your analysis. Spend two weeks building a 60 per cent solution using 10 per cent of the data, then get feedback on the results. Spend the next two weeks making a few improvements and then collect more feedback.

There are several advantages to such a short-cycled approach over trying to build the solution in one shot. First, you’ll demonstrate to your stakeholders after just a few days that you have indeed been working and that you are still alive. Second, if you happen to be headed down the wrong path with your analysis, either because the data doesn’t mean what you thought or because the problem wasn’t communicated clearly, then you can correct the misunderstanding before wasting more time. Third, it’s all too likely that the business priorities will change before you’ve completed the full project. Your short delivery cycles will allow you to cash in on the deliverable while it is still appreciated, before you start work on a now more relevant project.

Keep your analytics agile by following these basic principles:

  • Start with a minimum viable product (MVP). Make it cheap and quick, because once you get feedback from your initial results, it will almost certainly need to change.
  • Learn and change quickly. Get feedback from end users as often as you can. Gain their trust and support by listening closely to their input.
  • Build modular components that are fault tolerant. Consider a microservice architecture, where components are built independently and communicate through a well-defined, lightweight process. This architecture will have some cost in speed and efficiency but will improve fault tolerance and usability.

There are many books, certifications and trainings on the topics of lean, agile and scrum, as well as at least one book written entirely about lean analytics. I touch on the topics here only briefly, to emphasize the importance of working in an agile manner to effectively derive business value from big data.

Takeaways

  • Analytics can be divided into four levels of increasing complexity, but even basic analytics can be extremely valuable. Start by getting your data in order and doing some spreadsheet analysis.
  • A well-designed graph can give insights you won’t get from a table.
  • When you have a choice of analytic models, use the simplest and most intuitive.
  • AI and machine learning have promises and pitfalls. Weigh the value, the risks, the costs and the alternatives.
  • Analytics projects are best carried out using agile approaches.
  • Leverage existing tools and technologies as far as possible, but consider the factors discussed above before making your choices.

Ask yourself

  • Which of the four types of analytics does your organization utilize effectively? For those you are not already utilizing, are you hindered by lack of skills, use-cases or priority?
  • Think of times when an insight jumped out at you from a graph. What data are you regularly reviewing in tables that might benefit from a graphical representation?
  • Where in your organization are you using analytic models but not yet incorporating business intuition within the modelling process? Are you satisfied with the output of those models? You may need to push to bring more business insight into those models.
  • How frequently do you review deliverables from your analytics projects? Which end users are testing out the intermediate deliverables for those projects?