1 INTRODUCTION TO DATA ANALYSIS

This chapter aims to give a general overview of the current state of affairs in the data analysis discipline. Some of the areas covered in this chapter will be dealt with in greater detail in later chapters of this book.

Unleashing the value of data is still at the heart of many organizations’ strategy, and data analysts and scientists are becoming crucial in creating an innovative, digital and customer-centric organisation.

(Patrick Maeder, Partner, PwC)

We begin with a description of data analysis and comment on the growing role that data has in our society. Then we introduce the technical advances relating to data analysis that have been made in computer science, data storage, data processing and statistical/machine learning during the last decade. These are the building blocks of the technical tools that enable the processing and analysis of ‘Big Data’.

There are an increasing number of regulatory and legal requirements about how to deal with data, with very severe penalties for non-compliance. The subsection ‘Legal and ethical considerations’ in this chapter will introduce this area.

The chapter ends with a section on what the IT industry is doing to address some of the challenges that relate to the data analysis discipline.


WHAT IS DATA?

The existence of data and information predates our computers and the internet. In fact, data has existed in many forms, such as oral stories, wooden carvings, paintings, written records, storybooks, newspapers and so on, for thousands of years. However, this book is only concerned with data that is stored in electronic formats and can be processed by computers.

The Merriam-Webster dictionary defines data as ‘factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation’.1 Wikipedia defines data as ‘the values of subjects with … qualitative or quantitative variables’.2

Both of these definitions capture important aspects of data that is used for analysis. It is interesting that the Wikipedia definition draws attention to the difference between qualitative and quantitative data. Quantitative values are numerical and categorical, such as those typically found in a table or list, and they have been used for statistical analysis for many decades. Qualitative values are usually text descriptions, such as documents, emails or social media posts, and are rapidly gaining in importance for data analysis.

WHAT IS DATA ANALYSIS?

Wikipedia’s definition of data analysis is particularly relevant to this book: ‘a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making’.3

It draws attention to the fact that the process of analysing data includes the tasks of manipulating the data by cleaning and transforming it as well as the task of discovering useful information from it. Manipulating data is typically classed as a computer science skill, whereas the discovery of information from data is a statistical or machine learning skill. Successful data analysis requires both of these skill sets.
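The four tasks named in the definition can be illustrated with a very small sketch. It is only an illustration under stated assumptions: the customer records, the function names and the choice of an average as the 'model' are all hypothetical and are not taken from any of the process models mentioned in this chapter.

```python
from statistics import mean

# Hypothetical raw records: (customer id, monthly spend stored as text).
raw_records = [
    ("c1", "120.50"),
    ("c2", " 80 "),
    ("c3", ""),      # missing value
    ("c4", "n/a"),   # unusable value
    ("c5", "99.50"),
]


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def cleanse(records):
    """Cleansing: drop records whose spend is not a usable number."""
    return [(cid, spend.strip()) for cid, spend in records if is_number(spend.strip())]


def transform(records):
    """Transforming: convert the text values into numbers keyed by customer."""
    return {cid: float(spend) for cid, spend in records}


def model(spend_by_customer):
    """Modelling: a deliberately trivial model, the average spend."""
    return mean(spend_by_customer.values())


# Inspecting: count how many records cannot be used as they stand.
bad_count = sum(1 for _, spend in raw_records if not is_number(spend.strip()))

clean = cleanse(raw_records)
spend = transform(clean)
average_spend = model(spend)

print(f"{bad_count} unusable records, average spend {average_spend:.2f}")
```

The first three functions are data manipulation, while the last is a (very simple) act of discovery, which mirrors the two skill sets described above.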

There are a number of standard process models for data analysis, most notably CRISP-DM, SEMMA and KDD.4

The role of data in society

Data analysis is rapidly becoming one of the most important and challenging activities to drive the improvement of business performance, public services and other important aspects of society. This is happening because the volumes of data we have available for analysis continue to expand, the hardware that is accessible to us grows in processing power and the algorithms that we can apply to the data become more and more advanced. These factors mean that we can get better insight from data than ever before.

This insight is used by businesses to develop products and services that more precisely suit their customers and increase business profits. These developments in data analysis also benefit other areas, such as the public sector, where the data is used to find the most cost-effective solutions to benefit all social groups.

There are continual new announcements in the press, commercial trade magazines and academic literature of novel applications of algorithms, automated decision-making and more efficient strategic decisions based on data. These news stories remind us how important data analysis is in transforming political debate, the provision of public services, commercial performances and solving research questions.

But these achievements are only possible because there are skilled data analysts able to work with the technology, apply the algorithms, explain the analysis and communicate the results to decision-makers. The growing importance of data in decision-making is creating an increasing demand for the education of new data analysts and the advancement of skills of experienced data analysts. To be successful and to keep up with all these developing areas, data analysts need a combination of skills that enable them to extract, process and manipulate data using programming languages and databases as well as statistical skills and acumen in communication. The role of data analyst is detailed further in Chapter 2.

Management expectations of data

Much of modern society is being transformed by a changing attitude to the role of data and the insight that we can get from its analysis. This change comes from decision-makers’ growing expectations of a scientific approach to data analysis and from the increasing use of automated decision-making.

Managers and administrators expect data analysts to calculate accurately and report on past performance data in their areas in addition to making projections and predictions of what the future will hold for their business. This isn’t new.

In the past, the collection of data was somewhat expensive, and it was often collected for only one purpose. Now, leaders of businesses and public services are realising that the price of collecting data has become much cheaper and there is much more of it available in their organisations. The falling price of data collection also means that it is easier for organisations to procure data (ethically) from external vendors and there is a growing amount of free public data available for use, such as the British government’s data.gov.uk initiative.5

Business leaders are now increasingly demanding that data analysts make use of all this data to help answer business questions to a greater extent than ever before.

Cross-disciplinary cooperation

Data analysis is a broad discipline and analysts cannot be experts in all the specialist skills required, so it’s often necessary to draw on the competencies of other professionals such as database specialists, statisticians, machine learning experts, business analysts and project managers (see Figure 1.1). These professions need to work effectively alongside data analysts, appreciate the challenges faced and understand the results good data analysis can produce. Data analysts should ensure that they have at least a basic understanding of these disciplines in order to effectively engage with these specialists in multidisciplinary teams, as well as help to educate them in data analysis.

Figure 1.1 Data analysis disciplines


One way of addressing knowledge gaps is to reach out and engage in training, communication and knowledge sharing with these communities.

ADVANCES IN COMPUTER SCIENCE

In the past few decades various areas of computer science have given us great developments in technological capabilities to store, process and analyse data. However, many of these developments demand new knowledge and skills to enable analysts to efficiently utilise them.

Emergence of Big Data

The data that is available for data analysis continues to grow rapidly and the term ‘Big Data’ has emerged to describe this development. The term is often associated with Doug Laney (2001), who drew attention to three Vs that characterise the challenges that this new data poses.

Volume: the amount of data collected is becoming too big for traditional storage solutions.

Velocity: the speed with which data has to be processed exceeds traditional processing capabilities.

Variety: the lack of an information structure required by traditional data analysis methods.

All these factors are continually challenging the traditional tools that have been used for data analysis. To meet these challenges, over recent decades there have been significant developments in new tools and techniques that can efficiently address the challenges of Big Data. However, many of these tools and techniques require new specialist knowledge by the data analysts that operate them.

The Internet of Things and digitalisation

The emergence of Big Data challenges comes partly from a growing trend popularly referred to as the Internet of Things (IoT), wherein previously mechanical equipment is turned into computer-controlled internet devices, such as phones, white goods, cars, watches and so on. These devices are able to generate a large amount of data that is increasingly easy to capture and store for analysis. The volume of data generated by these devices, together with its speed and lack of traditional structure, all contribute to the Big Data challenges faced by analysts.

Analytical technology and tools are also challenged by the growing amount of unstructured data that is available, such as natural language text.


WHAT IS STRUCTURED AND UNSTRUCTURED DATA?

Structured data is organised in a predetermined format. It is often stored in database tables or other systems that ensure it conforms to a specific arrangement that typically suits computer analysis.

Unstructured data is not arranged so that it suits computer analysis. It can be natural language text, images or other formats that are typically aimed at humans and difficult for computers to analyse.

Natural language text information comes partly from the digitalisation of previously paper-based records and partly from collected emails, social media records, blogs and so on. The information contained in this data is very accessible to a human reader, but it’s challenging to design algorithms that enable computers to access this information.
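The difference can be sketched with a small example. The order records and the email text below are invented for illustration, and the regular expression is just one possible way of recovering the amounts; it would miss amounts that are written in any other style.

```python
import re

# Structured: every record has the same named fields, so analysis is direct.
structured_orders = [
    {"order_id": 1, "customer": "Ann", "amount": 25.0},
    {"order_id": 2, "customer": "Bob", "amount": 40.0},
]
total_structured = sum(order["amount"] for order in structured_orders)

# Unstructured: similar information, but written for a human reader.
email_text = (
    "Hi, this is Ann. I paid £25.00 for my order last week, "
    "and my colleague Bob paid £40.00 for his."
)

# A pattern has to be designed before the computer can find the amounts.
amounts = [float(m) for m in re.findall(r"£(\d+(?:\.\d+)?)", email_text)]
total_unstructured = sum(amounts)

print(total_structured, total_unstructured)
```

Both totals agree here, but only because the pattern happens to match the way this particular email is written. A human reader would also know who paid which amount, something the pattern does not capture.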

ADVANCES IN DATA STORAGE

The physical cost of data storage has dramatically decreased over the past few decades and this has enabled the storage of ever-increasing amounts of data. The use of outsourced data storage solutions, also known as cloud storage, and the use of distributed data storage solutions have been expanding to accommodate this demand.


WHAT IS DISTRIBUTED STORAGE AND CLOUD?

Distributed storage is when data is stored across several computers in a network. These systems can increase the amount of data that can be stored and the speed with which it can be accessed. A common form of distributed storage is a cloud solution, where data is accessed via the internet. This can also bring the benefit of being able to access data from any machine connected to the internet and means that the data is still available even after the failure of a local computer.

The changing nature of the data used for data analysis has both been facilitated by and required advancements in our data storage solutions. The traditional relational databases, which are characterised by the use of Structured Query Language (SQL), have been challenged by the rise of non-SQL (or ‘not only SQL’; NoSQL) alternatives. These solutions are designed to be more efficient and intuitive to use on unstructured textual data. However, they often require the use of different query languages to extract and process the data. Relational databases, SQL and NoSQL alternatives are covered in more detail in Chapter 3.
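The contrast between the two styles of storage can be sketched as follows. This is a simplified illustration: the relational half uses the SQLite database that ships with Python, while the second half uses plain Python dictionaries as a stand-in for a document store and is not the interface of any particular NoSQL product. The customer data is hypothetical.

```python
import json
import sqlite3

# Relational storage: a fixed table structure, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "Leeds"), ("Bob", "York"), ("Cat", "Leeds")],
)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Leeds",)
    )
]
conn.close()

# Document-style storage: each record describes itself and may carry
# fields that other records do not have (here 'tags' and 'notes').
documents = [
    json.loads('{"name": "Ann", "city": "Leeds", "tags": ["vip"]}'),
    json.loads('{"name": "Bob", "city": "York"}'),
    json.loads('{"name": "Cat", "city": "Leeds", "notes": "Prefers email"}'),
]
doc_names = sorted(d["name"] for d in documents if d.get("city") == "Leeds")

print(sql_names, doc_names)
```

The same question is answered in both cases, but the way it is expressed differs, which is why moving between the two styles demands different extraction skills.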

FREE, PUBLIC AND OPEN DATA

As briefly mentioned earlier, there is a growing amount of free, open and public data available for data analysis from a variety of online sources. The UK government is leading this trend with its data.gov.uk website initiative where thousands of free data sets are provided about public services, demographics and democracy. This creates immense opportunities for analysts to gain deeper insights.

However, the integration of such data sets also creates a great challenge. They need to be organised so that they can be integrated into the structure of existing data and reports that the organisation has access to. This requires analysts to understand how this data has been accumulated and summarised.

ADVANCES IN DATA PROCESSING

There have also been great improvements in the ability to process the growing volumes of data and to handle the increasing complexity of this data. These improvements have come both from progress in the ability to process data in-memory, as opposed to reading and writing data to disk between processing steps, as well as the emergence of parallel processing capabilities.

Parallel hardware improvements

Using parallel processing to increase the speed with which data can be processed is nothing new, but recent years have seen the maturity and wider commercial use of a number of advanced techniques. One of these is the switch from exclusively using central processing units (CPUs) to using units designed for graphical processing (GPUs). These processing units are specifically built to perform calculations on many data points at once and, with the right techniques, can help significantly in processing Big Data volumes.

Parallel cooperating clusters

The parallel processing trend that has emerged involves the use of collections of low specification, relatively cheap, commodity computers that each process a small subset of a larger calculation. In these setups, the individual computers are referred to as nodes and a collection of them is referred to as a cluster. The most successful system for managing such clusters is the open source Apache Hadoop system and its closely related relatives. However, this system has a new programming paradigm and requires an understanding of the techniques that it employs, thus demanding more specialist knowledge from data analysts.
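The idea of splitting a calculation into subsets, processing each subset separately and then combining the partial results can be sketched on a single machine. The example below is not the Hadoop programming interface: threads stand in for the nodes of a cluster, and the text lines and function names are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide the items into n_chunks parts, one part per (simulated) node."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def count_words(lines):
    """Work done by one node: count the words in its own subset only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


lines = [
    "big data needs new tools",
    "new tools need new skills",
    "data analysts need data skills",
    "big clusters process big data",
]

# Each simulated node receives a small subset of the larger calculation.
chunks = split_into_chunks(lines, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results from the nodes are combined into the final answer.
total = sum(partial_counts, Counter())

print(total.most_common(3))
```

The analyst has to express the problem as independent pieces of work plus a combining step, which is the change of programming paradigm referred to above.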

ADVANCES IN STATISTICAL AND MACHINE LEARNING

The past few decades have also seen rapid developments in statistical analysis techniques, which have both been enabled by developments in computer science and driven the demand for further developments. Traditional statistical analysis methods have required data to be in a tabular structure where each row is an independent observation. We are now seeing new techniques to analyse data that does not adhere to this strict format.

These techniques are being expanded with machine learning algorithms that are often inspired by biological processes and apply a pragmatic approach to finding hidden information. The implementation of these algorithms poses a challenge to data analysts, who need to both understand and employ the new techniques associated with them.

Machine learning

The field of data mining has brought in new techniques that use computers to get insights from very large data sets. These are often referred to as machine learning techniques, and include both supervised and unsupervised methods.

Supervised learning methods use a data set with a number of characteristics and a target variable that the system is trained to predict. When the system is subsequently presented with the characteristics, it will make a prediction as to the expected value of the target variable. Such methods are popular when companies want to predict a future or unknown outcome, such as whether an applicant is likely to repay a loan.
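A minimal sketch of this idea, using the loan example, is shown below. The training records are invented, and the nearest-neighbour vote is only one simple technique chosen for illustration; a real credit model would be built and validated very differently.

```python
import math

# Hypothetical training data. Characteristics are (annual income, existing
# debt), both in thousands of pounds; the target is whether the loan was repaid.
training_data = [
    ((55, 5), True),
    ((60, 10), True),
    ((48, 2), True),
    ((20, 15), False),
    ((25, 20), False),
    ((18, 12), False),
]


def predict(characteristics, training, k=3):
    """Predict the target by majority vote among the k most similar examples."""
    by_distance = sorted(
        training, key=lambda example: math.dist(example[0], characteristics)
    )
    nearest = by_distance[:k]
    votes_for_repaid = sum(1 for _, repaid in nearest if repaid)
    return votes_for_repaid > k / 2


# New applicants: only the characteristics are known, not the outcome.
likely_to_repay = predict((52, 6), training_data)
unlikely_to_repay = predict((22, 16), training_data)

print(likely_to_repay, unlikely_to_repay)
```

The system never sees a rule for what makes a good applicant; it only generalises from the examples it was given, so the quality of the prediction depends on the quality of the training data.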


WHAT IS A VARIABLE?

A variable is a single value that is stored, manipulated and used in a computer program. It can contain information such as a number, a text string, a date and so on.

Unsupervised learning methods are used to find patterns in a large data set where the existence or nature of these patterns is not known. There are many areas where these techniques have been used, which range from finding out if a data set contains groups of similar records, to detection of anomalies.
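Anomaly detection can be sketched in a few lines. The readings and the threshold of two standard deviations are assumptions made for this illustration, and a simple standard score is used here in place of the more sophisticated techniques found in practice.

```python
from statistics import mean, pstdev

# Hypothetical daily readings; no label says which, if any, are unusual.
readings = [100, 102, 98, 101, 99, 103, 97, 250]

centre = mean(readings)
spread = pstdev(readings)

# Flag any reading that lies more than two standard deviations from the mean.
THRESHOLD = 2.0
anomalies = [r for r in readings if abs(r - centre) / spread > THRESHOLD]

print(anomalies)
```

No target variable is supplied: the method only looks at how the records relate to one another, which is what distinguishes it from the supervised methods described earlier.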

Natural language processing

Within the last decade or two we have seen a rapid increase in the amount of unstructured textual data that is being captured and stored. Natural language processing has emerged as a very popular discipline for the analysis of textual data. The techniques that are being deployed for this type of analysis depend on the computational processing and storage capabilities that are available to data analysts. This is another analysis task that data analysts need to get to grips with.

Visualisation

In data analysis, the ability to display information in an easily understandable manner has always been important for efficient exploration. However, with growing volumes of data and the more complex formats that data takes, the ability to graphically explore and deliver information is business-critical. This has led to the development of a variety of data visualisation tools and techniques that are able to convey data in an intuitive and visually appealing way. These tools often require some knowledge on the part of the analyst about how best to effectively communicate complex ideas to non-technical people. Visualisation is a very useful, often a necessary, skill for data analysts to have.

The emergence of data science

The role ‘data scientist’ was touted as the ‘sexiest’ job of the 21st century in 2012 (Davenport and Patil 2012). Not surprisingly, this has led to a large growth of educational programmes that cater to data science and an even bigger increase in people having this job title or self-describing as such. However, the definition of the discipline of data science and the skills that the practitioners in this field need to have continue to be the subject of intense debate.

Opinions on what defines a data scientist range from covering only people that employ the scientific method when working with data, to everyone that works with data exploration. The attitude of applying the scientific method of making observations and testing hypotheses when working with large data sets might be an important attitude, but not one that describes the skills needed for data science.

Most business leaders consider the successful application of the data science discipline to require skills in manipulating data with programming, drawing inferences with statistics or machine learning techniques and the ability to influence decision-makers by effectively communicating the analytical results. All of these skills are needed to effectively extract, manipulate and communicate data, and experts in unstructured techniques often need to cooperate with experts in structured techniques.

There is a significant overlap between data science and data analysis, although it is more common to use the title ‘data science’ to describe people that apply machine learning instead of statistics, and work with unstructured rather than ‘old-fashioned’ data sources.

LEGAL AND ETHICAL CONSIDERATIONS FOR DATA ANALYSIS

Organisations have for a long time been aware of the value that can be gained from using the data they own. The data that these organisations own and have responsibility for will need to be managed stringently in order to comply with the growing number of laws and regulations as well as increasing public focus on how data is used. Due to privacy concerns related to mishandling or misuse of data, organisations are now facing increasingly tighter restrictions on how they can use the data they gather.

Data analysts will need to have knowledge of data management and governance so that they can comply with these requirements. This area is increasingly complex and will be discussed in much greater depth in Chapter 4. This section will merely introduce the subject in a broader context.

General Data Protection Regulation

There have been regulations in the UK to protect the use of data since the 1980s and these have grown considerably over the years. The legal and regulatory complexities come not only from the breadth and depth of the laws within different countries, but also from the differences that exist between them. There have been a number of initiatives to harmonise laws and regulations between countries, with the latest being the General Data Protection Regulation (GDPR), which has been created by and applies to the European Union (EU). This lists six principles for data protection that apply to personal data attributed to a single individual:

lawful, fair and transparent processing;

specific, explicit and legitimate purpose;

adequate, relevant and limited data;

accurate data;

kept no longer than necessary;

processed and stored securely.

This regulation defines a separate classification for information containing sensitive personal data that can only be processed in very limited and specific circumstances.

Data privacy concerns

There is an ongoing and intense debate about who owns data, what can be done with it and how we can ensure that it is kept safe. This debate has sprung from the realisation that data presents an incredible amount of knowledge, much of it personal, and gives a great deal of influence to the people that have access to it and the ability to analyse it.

There are a growing number of organisations, both public and private, that are able to make automated personalised services, such as purchase suggestions and targeted advertising, based on algorithms that have access to detailed data about individuals. These personalised services are popular with some groups but also very unpopular with others, and it can be difficult to satisfy both sides.

There is a tension between the progress that can be made by implementing new, advanced data analysis and the expectations of privacy that the public has. It is important that data analysts and the IT community adhere to ethical guidelines and working practices, alongside relevant laws and regulations, that will accommodate both of these attitudes.

Security

The secure storage and processing of data is a growing concern, which is in part driven by the increasing amount of data protection regulation that imposes severe penalties in cases of non-compliance, but also by the implementation of cloud solutions. Implementation of cloud solutions results in organisational data being geographically spread out across many locations, and these locations might not be controlled by the organisations themselves, which makes the protection of the data they hold more complex.

When an organisation controls the location where the data is secured, then they can focus on securing its ‘border’ with the outside world. With cloud solutions, this ‘border’ perimeter becomes much less defined, and organisations increasingly rely on internal security measures for data protection.6

These security measures often involve a combination of encrypting data so that it is not able to be read outside the organisation and providing various forms of access control to databases and systems.

Although the design and maintenance of information security systems is normally outside the responsibilities of data analysts, their work will often be significantly influenced by these systems. They will therefore greatly benefit from having a good understanding of the techniques used in this field and from familiarising themselves with the security policies in place in their organisation.

Decision transparency

There is a growing trend for regulators and the public to demand information about the reasons behind a decision that affects them. There is often consternation among customers when they are given, or notice, an unexplained analytical decision.

This is exemplified by a comedy sketch in the TV show Little Britain in which a member of the public approaches a customer service agent with a request, which is then keyed into a computer. The agent subsequently turns to the customer and, expressionless, responds ‘computer says no’.

To meet these expectations, data analysts need to consider carefully how they design the models that they are using. When models are used to make a recommendation, these may need to be accompanied by the data used to make the decision and enough explanation for a customer or member of staff to interpret it.
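One way of approaching this is for a model to return its reasons and the data it used alongside the recommendation itself. The sketch below assumes a hypothetical rule-based loan assessment; the rules, thresholds and field names are invented for illustration and are not drawn from any real lending policy.

```python
def assess_application(income, debt, missed_payments):
    """Return a recommendation together with the reasons and the data used."""
    reasons = []
    # Hypothetical rules; each one records a plain-language explanation.
    if debt > 0.5 * income:
        reasons.append("existing debt is more than half of annual income")
    if missed_payments > 2:
        reasons.append("more than two missed payments on record")

    decision = "declined" if reasons else "approved"
    if not reasons:
        reasons.append("no rule for declining was triggered")

    return {
        "decision": decision,
        "reasons": reasons,
        "data_used": {
            "income": income,
            "debt": debt,
            "missed_payments": missed_payments,
        },
    }


result = assess_application(income=30000, debt=20000, missed_payments=0)
print(result["decision"], "because", "; ".join(result["reasons"]))
```

A member of staff can pass the reasons on to the customer and check the data that was used, rather than simply reporting that the application was declined.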

This can pose specific problems with regard to types of models that are particularly difficult to interpret, such as some modern machine learning techniques. Data analysts will therefore need to be aware and knowledgeable about how to explain their results to non-analytical colleagues, customers and, occasionally, members of the public.

HOW THE IT INDUSTRY CAN ADDRESS THE CHALLENGES OF DATA ANALYSIS

As this chapter has explained, there are many challenges that the IT industry faces with respect to data analysis. This section will merely mention five important areas; these should by no means be taken as an exhaustive list.

Making IT (and data analysis) good for society

BCS, The Chartered Institute for IT, is committed to the vision of ‘Making IT good for Society’, which is written into its charter.

The institute is focused on addressing four key challenges, to each of which data analysis has a crucial contribution to make.

Challenge: education

In education, the institute’s goal is to ensure that every child has the opportunity to learn computer science. With the role of data in society becoming ever more prevalent, it is essential that children learn and understand how data is analysed. Some children will grow up to perform this analysis, but all children will be affected by data and should therefore have a fundamental understanding of its principles.

Challenge: health and social care

In health and social care, the institute wants to bring people together and create an environment that focuses on individuals. The amount of data that the health and social care industry records about people is growing very rapidly. It is easy for data analysts to become captivated by the endless possibilities to improve conditions through sophisticated analytical methods, but it is vital that we do not lose sight of the individual people behind the mountains of data that is collected.

Challenge: personal data

In personal data, the institute aims to ensure that individual data works not just for organisations, but also for people and society. The data that is collected about us and our daily lives has great value for organisations and their ability to provide us with a personalised service. This can bring benefits to both the organisations and to the people they serve, but it is important that this data is shared legally, ethically and fairly.

Challenge: capability

In capability, the institute works to provide the skills that individuals need to enable them to do great work, in the right roles and as part of strong teams. Analysis of data requires both advanced individual technical skills and for us to work in diverse interdisciplinary teams. It is therefore equally important that, as data analysts, we both improve our technical competencies and engage and improve our relationship with other professions.

Technical challenges

There are many technical challenges faced by the IT and data analysis communities to keep up with the increasing amounts of data and changes in the way that we relate to our data. This section cannot describe them all, but will merely draw attention to one that is of specific importance to the changing skill set of data analysts.

The IT profession cannot and should not be alone in addressing all the challenges that face data analysis. Data analysis is a cross-discipline approach involving professionals in related fields such as statistics, machine learning, project management, business analysis and communication. Data analysts need to work together with these specialists to address challenges such as those mentioned above.

Traditionally, data has been stored and analysed in a tabular format, with rows representing unrelated observations. This is usually done using a relational database management system (RDBMS) which is accessed with SQL. There have been many decades of research into the efficient handling of such data and development of commercial technical solutions.

However, we are seeing a rising amount of data and specialised analytical solutions that do not fit this format, such as textual data and social media network data. This creates technical challenges for the IT profession to develop ways of storing, processing and analysing data that are as efficient and effective as the traditional methods.

This challenge is being met by development and research into alternative storage solutions, which are often known as NoSQL databases. Data analysts need to be aware of these systems as they often require different data extraction and manipulation skills from those for traditional systems. Chapter 3 provides an expanded introduction to NoSQL systems.

Ethics of data analysis

BCS, The Chartered Institute for IT sets out four ethical standards in its Code of Conduct, which describe the minimum expectations for its members. These standards have a high relevance to data analysis, whether one is a member of BCS or not, and can be used as a guideline for data analysis professionals. The descriptions below are not a comprehensive coverage of the code, but merely show that the issues it raises are highly pertinent to the activities of data analysis.7

Public interest

It is important that data analysts are aware of how to keep data, especially personal information about members of the public, secure from unauthorised access. They also need to know how to ensure that the results of the analysis do not negatively impact those about whom the data is collected. These are some of the important principles of the GDPR, which is described in more detail in Chapter 4.

Professional competence and integrity

The area of data analysis is broad and covers many skills and competencies. It is important that analysts keep improving their knowledge and engage in continuous professional development in order to stay relevant.

Duty to relevant authority

Data analysts are often in possession of important knowledge that can be very valuable both to the organisation in which they are employed and to people who do not have legitimate access to that data. It is important that analysts are vigilant about data protection and are aware of the obligations that bodies such as the Information Commissioner’s Office in the UK put on organisations that use personal data.

Duty to the profession

It is important for data analysis as a profession, as well as for the reputation of data analysis as a discipline, for data analysts to maintain a high level of professional integrity and responsibility at all times.

Another code of conduct that is relevant to data analysts is that used by the Royal Statistical Society.8

Democratisation of data access

The insights that come from analysis of the data generated by public institutions have always had great importance for the democratic debate. Political interest groups that are able to demonstrate their agenda with economic or demographic data (such as unemployment, health care, the use of libraries, roadwork schedules, etc.) command more attention in newspapers, TV broadcasts and social media than those that are not so able. It is not surprising, therefore, that the data these institutions hold is increasingly seen as an asset that belongs to the public and should therefore be freely available.

The UK government is leading the way in making large data sets available via the internet with the data.gov.uk initiative, where thousands of detailed data sets are available. These data sets cover a wide range of areas concerning local councils, national agencies and governmental departments. The data is available to freely download to improve knowledge about how these institutions work, but also for use in commercial enterprises.

Explaining data analysis to the public

As data analysts, we have an important role to play in ensuring public and political debate stays informed and factual. The insights that come from the analysis of data can reveal complex information, but are sometimes counter-intuitive to both data analysts and other stakeholders, although they may contain very valuable information. This can be difficult enough to communicate to professional managers, but may present a real challenge when the audience is politicians or the public. Such audiences could have a significant interest in how the results can affect them, but might not have extensive, or any, experience of the topic and background of the analysis. When the results are controversial, it can present a major challenge to present them clearly, objectively and with impact in a debate, especially one dominated by social media and which demands simplicity and brevity.

To ensure that data continues to inform an objective public debate, the analytical community needs to continue to improve the ability to communicate both the results and the methods that are used in data analysis. To help us in achieving this, we need to work with journalists and technical educators as well as improve our own skills in these subjects. It is encouraging to see the rising trend of journalists that specialise in scrutinising and communicating analytical knowledge, part of a discipline known as data journalism. As data analysts, it is important that we encourage this trend and engage with this community of journalists.

SUMMARY

This chapter has provided an overview of the current state of the data analysis discipline. This is a very big discipline that requires knowledge of statistics, machine learning and data manipulation, among other areas. It is also a discipline that touches business, non-profit organisations and public services alike.

The wide-ranging skills and the complex specialities involved in data analysis are continually challenging those in this profession. It is therefore necessary that data analysts have a broad and constantly developing skill set. This skill set, and the mentality needed to keep developing it, is the subject of the next chapter.

1 See www.merriam-webster.com/dictionary/data

2 See https://en.wikipedia.org/wiki/Data

3 See https://en.wikipedia.org/wiki/Data_analysis

4 You can see a comparison of these models in this paper by Ana Azevedo and M.F. Santos: https://recipp.ipp.pt/bitstream/10400.22/136/3/KDD-CRISP-SEMMA.pdf

5 See https://data.gov.uk/

6 For further details see Sutton 2017.

7 You can see the full BCS Code of Conduct here: www.bcs.org/category/6030

8 See www.rss.org.uk/RSS/Join_the_RSS/Code_of_conduct/RSS/Join_the_RSS/Code_of_conduct.aspx
