Big Data

Introduction to Big Data

This is a very short introduction to a very topical issue in business statistics, namely big data. Consider the following cases:
  • A company like Visa gathers data on every swipe of every Visa card in the world on a continuous basis. This is a massive, constant stream of transactions.
  • A company like Walmart gathers data on every purchase at every till in real time. Imagine the volume of purchases registered by each till, and the opportunity this data may provide to customize merchandising or the like.
  • Social media through multiple channels like Facebook, Twitter and others has become a strategic way for companies to reach customers and to get feedback. However, if you are a big firm with a massive digital presence, how do you systematically tap into the constant flow of unstructured, textual social media comments and feedback by customers? Stopping to analyze it carefully will immediately leave you days, if not weeks, behind the real trends.
  • Large logistics companies marshal immense movements of products every day. For instance, Davenport & Dyché (2013: 4) report the following: “UPS is no stranger to big data, having begun to capture and track a variety of package movements and transactions as early as the 1980s. The company now tracks data on 16.3 million packages per day for 8.8 million customers, with an average of 39.5 million tracking requests from customers per day. The company stores more than 16 petabytes of data.” They go on to discuss an even bigger logistics system problem involving big data for UPS.
In short, big data is that which traditional means of data analysis will struggle to analyze. As the name suggests, this is related to certain growth phenomena in modern datasets. These include the following:
  • The growth in measurement. Businesses and other generators of data have simply found more and more ways to measure and capture data. From computers and networks that naturally gather data to sensors in machines and in many physical spaces, the growing volume of data offers the opportunity for more and more business research. These increased means of gathering data have led to an explosion in the amount and breadth of data collected. The speed of data collection has also increased, especially with networking and mobile and wireless connectivity.
  • The growth in storage. The growth in measurement has been matched with exponentially greater storage. Storing data used to be prohibitively expensive. The storage costs have dramatically declined, allowing for more and more data to be stored.
  • The growth in computing power and speed. The ability to move, process and analyze data has also increased dramatically. This has, in turn, made it practical to work with ever larger datasets.
  • The growth in unstructured data. As suggested by the social media example above, unstructured data has become a far more regular feature of business information. Unstructured data is that which has no immediate, inherent meaning. Often, it is textual in nature, like the flows of social media commentary that need interpretation before they can be given even the simplest “good/neutral/bad” coding. Unstructured data includes more than text, however. It includes images (for instance, photographs), sounds, and the like.
The next section expands on the characteristics of big data, after which I briefly describe some solutions for analyzing big data.

Characteristics of Big Data

What does big data look like? Many writers have suggested characteristics based on the growth trends discussed earlier, including the following (e.g. Vorhies, 2013):
  • Volume. As discussed in the previous section, increasing means of measuring and other factors have led to massive growth in the volume of data. The volumes are truly staggering. For instance, IDC (2014) states, “Like the physical universe, the digital universe is large – by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe – the data we create and copy annually – will reach 44 zettabytes, or 44 trillion gigabytes.”
  • Velocity. It’s not just about the amounts of data. Data is also being generated at faster and faster rates. Computer processing and networking, as well as increasingly fast instrumentation (such as sensors in cars or appliances), are again allowing the speed of data streaming to increase exponentially. Traditional business statistical methods allowed us to look at historical data at our leisure. In business, however, we are increasingly called on to analyze real-time data as it is being produced. Banks, for instance, wish to detect fraud as it is occurring, not afterwards, so predictive models that sift through huge volumes of transactions per second must quickly identify possible cases of fraud and report them to the fraud department for immediate follow-up.
  • Variety. The diversity of data has increased substantially. Whereas businesses used to have relatively small, stable and controllable datasets, many factors have broadened the variety. Consider the increase in unstructured data mentioned in the introduction. Examples include social media text, images such as photographs, audio data such as recorded telephonic conversations in a call center, and the like. Business data used to be largely structured, where the data had native meaning (numerical or qualitative data forced into categories). Increased ability to gather and analyze unstructured data has led to its explosion and to a commensurate increase in the variety of data. There has, therefore, also been an increase in the variety of sources from which data is gathered.
  • Value. Value speaks to the usefulness of given parts of the data being gathered. One challenge associated with increased measurement and storage capacity is the temptation to gather everything, whether useful or not. There are two views on this. The first is “why not gather everything? Storage is cheap these days and we might find trends or findings that we did not know were there.” Another view is that excessive data gathering makes for problems in merely sorting through the information. This point of view might argue for gathering data based on specific business needs.
  • Veracity. Along with volume, velocity and variety come issues with data accuracy, integrity, abnormalities, and the like. This can be a real challenge, especially when you do not have the luxury of time to sit and analyze a dataset.
There are other characteristics that have been suggested, but these seem to have achieved something of a consensus.
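To make the scale in the Volume bullet concrete, the short Python sketch below checks the unit arithmetic behind the IDC quotation and shows what "doubling every two years" implies. It assumes decimal (SI) units, and the function names are illustrative only.

```python
# Decimal (SI) storage units, in bytes.
# Assumption: the quoted IDC figure uses SI prefixes (1 GB = 10**9 bytes).
GIGABYTE = 10**9
ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zb):
    """Convert whole zettabytes to gigabytes using decimal (SI) units."""
    return zb * ZETTABYTE // GIGABYTE


def projected_size(start_size, years, doubling_period=2):
    """Size after `years` if it doubles every `doubling_period` years."""
    return start_size * 2 ** (years / doubling_period)


gb = zettabytes_to_gigabytes(44)
print(f"{gb:,} GB")            # 44,000,000,000,000 GB, i.e. 44 trillion
print(projected_size(44, 4))   # 176.0 (two doublings over four years)
```

The second call is purely hypothetical arithmetic: it shows what a further four years of doubling would imply, not a forecast from any source.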

Solutions for Big Data

Improved Storage & Processing through Distributed Computing

There are various solutions for big data storage and analysis. This section and the two that follow discuss just a few.
Distributed computing refers to data storage and processing achieved through many smaller, networked computers rather than on single supercomputers or very large servers. Apache™ Hadoop® is a leading example of open-source software that uses distributed computing to achieve big data storage and analysis. The interested reader can find many articles on Hadoop at www.sas.com.
There are many advantages to using distributed computing:
  • Distributed computing can achieve dramatic improvements in storage efficiency over other solutions. Efficiency is essential when dealing with big data.
  • Distributed computing allows for faster processing speed and power, which means it can more easily deal with the velocity of big data.
  • Scalability – the ability to grow or shrink your processing capacity quickly and easily – is facilitated by systems similar to Hadoop, because you can add or subtract smaller machines, with ease, as required.
  • You do not have to force the data into particular shapes in distributed systems like Hadoop. Therefore, you can store and potentially analyze completely unstructured data. This is not the case with structured data warehouses, as discussed next.
  • Cost is lower. Networking and using smaller commodity machines is cheaper than buying fewer, bigger machines, and usually easier to upgrade.
  • Data integrity and safety are far better. Because the tasks are spread over many (potentially thousands of) small machines, the framework can deal with an individual machine failure simply by excluding the failed machine. Backups are usually kept.
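The core idea of splitting one large job across many small workers and then combining their partial results can be sketched in a few lines of Python. This is only a toy illustration on a single computer, with threads standing in for networked machines. It is not Hadoop code, and the data and function names are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Divide the records into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """'Map' step: a worker counts purchases per store in its own chunk."""
    return Counter(store for store, _amount in chunk)


def reduce_counts(partials):
    """'Reduce' step: merge the partial counts from all workers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_by_store(records, n_workers=4):
    """Count purchases per store by spreading the work over n_workers."""
    chunks = split_into_chunks(records, n_workers)
    # Threads simulate separate machines here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Hypothetical till data: (store, purchase amount).
purchases = [("store_a", 12.5), ("store_b", 3.0), ("store_a", 7.25),
             ("store_c", 40.0), ("store_b", 9.99), ("store_a", 1.0)]
print(count_by_store(purchases, n_workers=3))
```

Because each worker only ever sees its own chunk, adding capacity amounts to raising the number of workers, which is the scalability point made in the list above.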
SAS has positioned itself as an excellent partner for Hadoop in particular, providing an end-to-end solution for processing the big data, passing it to Hadoop, and analyzing the processed data.

Improved Processing through In-Memory Processing

Many modern data processing systems can improve storage and analysis speed by extracting, processing and loading the raw data to the warehouse “in memory.” This means processing and analyzing the data in the main memory of a computer (RAM/DRAM), directly accessible to the processor, without storing it on disk in the initial phases. SAS® LASR™ is a cutting-edge example of this technique. For instance, in fraud detection, the transactions can be analyzed in memory as they arrive, rather than having to first be stored on disk, then retrieved, then analyzed.
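As a rough illustration of the fraud example, the Python sketch below examines each transaction once, as it arrives, and keeps only small running totals in memory. The flagging rule and its thresholds are hypothetical, chosen for the example. They are not how SAS LASR or any real fraud system works.

```python
from collections import defaultdict


def flag_suspicious(transactions, ratio=5.0, min_history=3):
    """Yield unusual transactions in a single pass, with no disk storage.

    Hypothetical rule: flag a transaction if its amount exceeds `ratio`
    times the card's running average, once at least `min_history`
    earlier transactions have been seen for that card.
    """
    totals = defaultdict(float)  # running sum of amounts per card
    counts = defaultdict(int)    # number of transactions seen per card
    for card, amount in transactions:
        if counts[card] >= min_history:
            average = totals[card] / counts[card]
            if amount > ratio * average:
                yield card, amount
        # Update the in-memory state after checking the transaction.
        totals[card] += amount
        counts[card] += 1


# A made-up stream of (card, amount) pairs arriving one at a time.
stream = iter([("card_1", 20.0), ("card_1", 25.0), ("card_2", 300.0),
               ("card_1", 15.0), ("card_1", 900.0), ("card_2", 310.0)])
for card, amount in flag_suspicious(stream):
    print(f"review {card}: {amount}")  # prints: review card_1: 900.0
```

The function is a generator, so a flagged transaction can be passed on for follow-up immediately rather than after the whole stream has been stored and re-read.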

Textual Analysis

Increasingly, algorithms are being written that facilitate smart analysis of text. For instance, you can identify the conditions (“rules”) that will cause a Twitter post by a customer to be flagged as problematic, for example, a co-occurrence of certain key words within a certain proximity of each other or in a given order (e.g. the word “hate” within five words of the company’s name or the name of one of its products). SAS® Text Miner is one example of an analysis product that can help you analyze textual big data.
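The proximity rule described above can be sketched in Python as follows. This is a deliberately simple version that assumes single-word, exact-match terms. The brand name "acme" and the function names are invented for the example, and the sketch does not represent how SAS Text Miner works internally.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_problematic(post, keyword, brand, max_distance=5):
    """Return True if `keyword` occurs within `max_distance` words of `brand`.

    Both terms are matched as single, exact, lowercase words (an
    assumption of this sketch), so "acme's" would not match "acme".
    """
    tokens = tokenize(post)
    keyword_positions = [i for i, t in enumerate(tokens) if t == keyword]
    brand_positions = [i for i, t in enumerate(tokens) if t == brand]
    return any(abs(k - b) <= max_distance
               for k in keyword_positions
               for b in brand_positions)


# "hate" is three words away from the hypothetical brand "acme".
print(is_problematic("I really hate the new Acme app", "hate", "acme"))
# Here "hate" is six words away from "acme", outside the five-word window.
print(is_problematic(
    "Acme support was great, although I hate waiting on hold",
    "hate", "acme"))
```

A real rule set would also need to handle multi-word product names, misspellings and negation ("I don't hate Acme"), which is part of why dedicated text-mining tools exist.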

Conclusion on Big Data

Big Data has been one of the most prominent business themes of the past five years. It remains a nascent topic in practice, with relatively few companies really tapping fully into their big data potential. Genuine skills in the area are very scarce. Where you have genuine big data, a competitive advantage may be open to you in the future.
Last updated: April 18, 2017