CHAPTER 4

How Big Is Big Data?

Big Data is big. It is so big that we reduce the term to a singular noun rather than its proper plural counterpart. It is big enough to create a myth surrounding itself.

Myths are notoriously self-referential. Although Big Data is just an input to AI, it is viewed as having a life of its own. The myth is exemplified by a few statistics.

The Big Data analytics market is set to reach $103 billion by 2023.

Internet users generate about 2.5 quintillion bytes of data each day.

A person needs 181 million years to download all Internet data.

By 2020, there will be around 40 trillion gigabytes of data.

What statistics reveal is suggestive and what they conceal is essential. Size numbers are impressive. Many come from Internet usage. In 2012, there were 2.5 billion active Internet denizens, rising to over half the world’s population by 2019. Facebook alone accounted for 2.45 billion active users that year, even though a new generation considers it a site for the elderly. Photo viewing on Instagram reached 112.5 million U.S. users in 2020. eMarketer predicts the network will attract 117.2 million U.S. users in 2021. Instagram citizens constitute 13 percent of the world’s population, and the share is rising.

Impressive but misleading.

Most Internet data are in the form of animal videos on YouTube or kids exchanging messages about the next Marvel movie. IDC’s Digital Universe Study reports only 0.5 percent of data are analyzed. They discovered only 22 percent of all the data had the potential for analysis. Their educated guess is the percentage of useful data might jump as high as 37 percent in 2020.1

Wishful thinking. Cat videos will be with us for a long time, superseded only by puppies.

New Vantage published its sixth Executives Survey with a focus on Big Data and artificial intelligence in 2018.2 The study recorded executives’ answers from 60 Fortune 1000 companies, including biggies such as Motorola, American Express, and NASDAQ. Responses indicated a prevalence of Big Data, and the New Vantage study asked the question: how much do companies spend on data analytics? A lot.

Of the organizations surveyed, 97.2 percent reported investing in Big Data and artificial intelligence initiatives. Only 12.7 percent of participants said their companies invested more than $500 million, but that number is for deep pockets and represents the tip of the proverbial iceberg. Over 25 percent of participants said their companies’ cumulative investments in Big Data fall between $50 million and $550 million. To put the finding in perspective, my experience suggests a Big Data overhaul for a small global operation can be done for under $50 million.

Growth numbers are equally impressive. According to Wikibon, the Big Data analytics market is expected to increase at a compounded annual growth rate of 11 percent.

Everyone treats the potential business opportunities as though there is a scarce resource in play. Never fear. In 2017, the Economist claimed data replaced oil as the world’s most valuable resource. Data are more easily extracted, and supplies are endless.

Unlike oil, we can use data multiple times. Unsaid is that we may not get new insights from the practice. The comparison between oil and data suggests we should collect and store as much data as possible. If we do so without labeling the information, its value will be far less than that of oil.

My colleagues called the labeling exercise timing and tagging. Events are ordered chronologically, but data contain many events for which the timing overlaps. The event itself must be labeled in some fashion. As much as half the time spent on data was devoted to tagging it. The other half was spent scrubbing data for inadequacies despite the fact the data never left a computer. They were generated by computerized trading, funneled directly into databases and on to us. This brings us to another Big Data statistic: Forbes informs us job listings for data science reached 2.7 million in 2020, and demand overwhelms supply.

It Is Strange to Be Known So Universally and Yet to Be So Lonely

Adaptation involves the assignment of roles in complex systems. Roles serve as a means of establishing order as the company adapts to its environment. Technological advance induces an elaborate division of labor, and an increasingly elaborate organization follows. Differentiation of functions complicates role assignment since it entails a need for micro coordination, which must develop at the same time. Henry Ford taught us this lesson in 1913 with the moving assembly line.

One of the repercussions of AI within the firm is the restructuring of occupational roles. New techniques create new roles, and old roles are redefined with respect to technical content. Only in recent years did such a thing as a data scientist exist. That does not mean the role is exactly new. Novel technical roles develop by extension of familiar ones. The role of professor existed long before there were any researchers in gender studies, and the latter were quickly assimilated to the wider category in order to legitimize the field. But the interdependence between the function of a role and role expectations is sufficiently close that adjustments are necessary as technical content evolves. There are many different respects in which the role of a professor of economics differs from that of a professor of gender studies in the same university with the same social structure and cultural traditions. The economists’ teaching and research are different. They also dress, talk, and play differently than scholars in gender studies.

The flipside of role creation is the rendering of old roles and their content obsolete. This is the phenomenon of technological unemployment. It is difficult for the same personnel to take over new knowledge and techniques. They have a vested interest in their ways of operating, manifested in their status and its compensation. Incumbents of threatened roles resist the introduction of changes. A society experiencing rapid technological progress shows signs of strain centering on this process. A company sees defensive behavior on the part of the groups that are threatened.

Adaptation of AI to the environment brings the newish role of data scientist. When Bell Labs existed, there were information theorists. You don’t need to know what they do these days. Information theorist sounds so … technical. Data scientists come from several areas in business and are more easily identified.

Programmers who once were database developers are now data scientists. So are statisticians. Workflow-oriented product managers are pressed into service. Mathematicians pour data into calculus and filter answers out the other side. If the answer is wrong, they know how to stir the symbols until the answer comes out right. As we cure MBAs of an allergy to technical detail, management consultants are another source for the role. They are accustomed to twisting data around. The professor of economics is joined by that of empirical gender studies as other resources. Both groups are accustomed to spinning data.

If you know your way around Spark, Hadoop, Hive, Pig, SQL, Neo4J, MySQL, Python, R, Scala, A/B Testing, NLP, and anything else data-related imaginable, you are hereby christened data scientist. Congratulations.

Role renovation is not limited to the rank and file. In the New Vantage study, 62.5 percent of respondents said their organization appointed a Chief Data Officer, a fivefold increase in the job category since 2012. Back then, we had Chief Technology Officers and Chief Information Officers and the two definitions were often the same.

Roughly 32,000 new jobs with the title of data scientist were listed on Glassdoor in early 2020. Average annual compensation was $125,000. What do these guys do? A few conversations with people offering courses in the art provide a theoretical guide.

First comes language training. Python is the lingua franca of data science. The budding data scientist learns to program and follow best coding practices. Next is civics class. A data scientist is expected to write software. Some companies want their data scientists to contribute directly to the code base, while others have engineers around to help translate prototype code to production. The data scientist learns how to be a good citizen of the code base with a focus on testing and working with production systems.
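
What a good citizen of the code base looks like is easier to show than to tell. The sketch below is purely illustrative and assumes nothing about any real company’s systems: a small, hypothetical data function and the unit test expected to travel with it toward production.

```python
# A minimal, hypothetical sketch of code-base citizenship: a small data
# function plus the unit test that travels with it (run with pytest).

def deduplicate_trades(trades):
    """Return trades with exact duplicates removed, preserving order."""
    seen = set()
    unique = []
    for trade in trades:
        key = (trade["id"], trade["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(trade)
    return unique


def test_deduplicate_trades():
    rows = [
        {"id": 1, "timestamp": "09:30:00"},
        {"id": 1, "timestamp": "09:30:00"},  # an exact duplicate
        {"id": 2, "timestamp": "09:31:00"},
    ]
    assert len(deduplicate_trades(rows)) == 2
```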

The data scientist now must shake off high school horrors of applied mathematics. Statistics is the foundation of data science. Inferential techniques identify trends and characteristics of a data set. Unfortunately, lessons on the adverse consequences of excessive data mining often are missed in this segment.
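
Since the lesson on excessive data mining is the one most often skipped, here is a minimal sketch of the trap, using invented data. It simulates pure noise and counts how many comparisons still come out "statistically significant."

```python
# Pure noise, tested a thousand times: roughly 5 percent of the comparisons
# look "significant" even though there is nothing to find.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)
n_tests = 1_000
false_positives = 0

for _ in range(n_tests):
    group_a = rng.normal(size=100)   # noise
    group_b = rng.normal(size=100)   # more noise
    _, p_value = ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests on noise were 'significant'")
```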

The machine comes next. Machine learning is advertised as combining aspects of computer science and statistics to extract useful insights and predictions from data. The bait of real AI is dangled at this juncture. Follow-up curricula reinforce the temptation to bite, just like reinforcement learning. Natural language processing and deep learning for self-driving cars make the syllabus.

The student is finally ready to meet Big Data. The rite of passage is called data science at scale. Lots of jargon is introduced at this stage, ranging from MapReduce to NoSQL and Spark. The interviewees on my list start to lose me here.
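
For readers who want to see what the jargon points at, here is a minimal sketch of the MapReduce idea in plain Python with invented documents; a real Spark or Hadoop job runs the same two steps, map then reduce, across many machines.

```python
# The MapReduce idea in miniature: map each document to partial word counts,
# then reduce the partial counts into one global tally.
from collections import Counter
from functools import reduce

documents = [
    "cats and more cats",
    "dogs chase cats",
    "data about cats and dogs",
]

mapped = [Counter(doc.split()) for doc in documents]          # map step
totals = reduce(lambda left, right: left + right, mapped)     # reduce step

print(totals.most_common(3))   # cats lead, naturally
```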

One of them wakes me up with the introduction of stories into the data science toolkit. Create a story out of a data set. The data will drive interesting questions and reveal insights to create a narrative. This sounds like extra credit but certainly will extend language training past Python.

We finally come to what an advanced degree does not recognize but constitutes the biggest part of the job. To make this bitter pill go down smoothly, data science has given it the moniker data wrangling. A dictionary tells you wrangling is engagement in a long and complicated dispute. The curriculum is clear. You are definitely going to engage in a long argument with the data. Corralling the raw stuff, cleaning it, and getting it into a format useful for analysis outlines the debate. If confused, there are books on cat wrangling to help out. Yes, the animal, although it does bring the Consolidated Audit Trail of the Securities and Exchange Commission to mind. For the uninitiated, CAT is Big Data. For all, the design and construction of CAT have taken over a decade, and people still argue about how to pay for it. Wrangling is a common subject of debate.
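
What the corralling and cleaning looks like in practice is easiest to show with a minimal sketch. The file and column names below are hypothetical, invented purely for illustration.

```python
# A minimal wrangling sketch: corral a raw extract, clean it, and reshape it
# into something an analyst can use. File and column names are hypothetical.
import pandas as pd

# Corral: read the raw file as delivered, everything as text.
raw = pd.read_csv("trades_raw.csv", dtype=str)

# Clean: drop exact duplicates, fix types, and cope with the inevitable gaps.
trades = (
    raw.drop_duplicates()
       .assign(
           trade_time=lambda df: pd.to_datetime(df["trade_time"], errors="coerce"),
           quantity=lambda df: pd.to_numeric(df["quantity"], errors="coerce"),
       )
       .dropna(subset=["trade_time", "quantity"])
)

# Format for analysis: shares traded per institution per day.
daily = (
    trades.groupby(["institution", trades["trade_time"].dt.date])["quantity"]
          .sum()
          .reset_index(name="shares_traded")
)
print(daily.head())
```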

The curriculum omits a few things about business life.

Adaptation involves role assignment, but role interpretations vary. As a data scientist, you are expected to know everything vaguely data related. The front office views you as a resource capable of answering complicated questions for which one-line answers are desired. The answer is supposed to be provided instantaneously. After all, it’s just running a computer program, right? Wrong. Failure to answer quickly creates tension.

The tension is nothing compared to that expected by introducing the new role into a company. Much of this has nothing to do with personal interactions.

The role demands infrastructure for data analysis. The foundation does not exist if the role is new. The data scientist attempts to install various tools, only to find them incompatible with the company’s computer architecture. In that case, we are back to doing one-off parsing of data logs to answer every question. The engineering team doesn’t feel it’s safe to give the data scientist access to the production system, so they provide an offline copy of the database. Except … data in an offline database are not structured in a way that makes them easy to combine, let alone analyze. The only data scrubbing evident is whatever was important to the operations team at some point. Missing values abound. Queries take forever to run. Back to the tension with the front office.

The front office is not the only frustrated entity in the house. The CEO is annoyed. Months on the job, and the data scientist did not even produce a decent customer service dashboard. The CEO expects a magical machine that learns on its own.

Scientists seem like a bad cultural fit as well. The engineering team is frustrated. The data scientist takes cycles away from their work to do thankless chores. Pattern maintenance strikes. The scientist appears to be sitting around doing nothing useful. This is anathema in a company with a strong work ethic. The data scientist complicates the problem by constantly complaining the data are not good enough. The CFO sees nothing but red ink and is having trouble amortizing the investment.

We are left with the scientist’s frustration.

Data scientists want to work on machine learning. They expect to put time into gathering and cleaning data, but the process is messier than pessimistic expectations suggest. And what about all the time spent in meetings? The time is spent on an endless stream of questions about how the data were gathered and what, exactly, is in <insert your favorite gripe here>. Scientists do not expect the rest of the company to care so little about how each tweak in software infrastructure messes up month-to-month information comparisons. They do not understand missing data due to a change in some user interface. The scientist does not see AI training data, rather just training.

A company playbook includes a warning to the scientist that they will spend 80 percent of their time cleaning data. Managers always manage expectations. Dream on.

The 80 percent is spent on begging for data to be created, accessed, moved, or explained. The other 20 percent is spent lobbying for data science tools, security policies, and infrastructure. Internet searches for new employment are a natural result.

Anyone Who Says Size Doesn’t Matter Sees Too Many Small Knives

You might be expecting real-world examples of Big Data accompanied by size statistics. They are numerous, often misleading, and add little when asking how big is Big. Practical application is all about slicing data into interpretable chunks. Big as fallacy is best illustrated by the process followed in all such examples. Climate change is a case in point.

Machine learning introduced itself to climate science at the 31st Conference on Neural Information Processing Systems in 2017.3 The data consisted of visual imagery comprising 78,840 pictures of 16 weather characteristics over a period of years. The resolution of the photos was much like that of your TV if purchased in the last decade. All in, the data set consisted of roughly 1,120 billion pixels. That is 3.36 trillion bytes of data in living color. It also is the type of number encountered in public discussions about Big Data. Impressive but misleading.
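
The headline numbers are easy to reproduce. The sketch below assumes a 768 by 1152 grid per image, in line with the TV-like resolution just mentioned; everything else follows from the figures quoted above.

```python
# Reproducing the headline figures, assuming a 768 x 1152 grid per image.
images = 78_840            # pictures in the data set
channels = 16              # weather characteristics per picture
height, width = 768, 1152  # assumed grid, roughly an older HD television

pixels = images * channels * height * width
bytes_in_color = pixels * 3   # about 3 bytes per pixel "in living color"

print(f"{pixels / 1e9:,.0f} billion pixels")           # ~1,116 billion
print(f"{bytes_in_color / 1e12:.2f} trillion bytes")   # ~3.35 trillion
```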

One chooses a time period for analysis. These are called benchmarking levels and are chosen in order to zero in on a particular phenomenon thought to occur during the period. Categories of interesting events come next. The researchers chose tropical cyclones, extratropical cyclones, tropical depressions, and U.S. atmospheric rivers. The event decision is made based on information extraneous to AI, and a split between data used to train a model and those employed in testing must be specified.

Once all the slicing and dicing was done, the researchers were left with 3,190, 3,510, 433, and 404 images to estimate a neural network model for each of the four categories. Having started my professional career in statistics, I can tell you—these are small numbers. The advertised size of the data set masks ground truth in terms of applicability.

You can do this for yourself with something more familiar like a spreadsheet. Pick a size; any size will do. I’ll choose a data set reminiscent of my own work. The database consists of roughly one million rows.

Each row is an event which I call a trade. Associated with each trade are characteristics that make up the entries. They are time of day, the institution that initiated the trade, 5 traders per institution who may have participated, another 10 possible brokers involved, security ID, buy versus sell, foreign versus domestic security, principal versus agency transaction, and 10 phases of the moon. Over 200 institutions contribute to trades in a universe of 5,000 stocks, yielding roughly 6.24 billion observations.

With all due respect to the astrology buff who formatted the data, we throw away the moon data. In data-speak, we are imposing a Bayesian prior but others call it common sense. The moon phases are only rounding error, but we are down to 6.23 billion.

Ask a question of the data: what does the head trader’s deal pattern at the largest investment company look like on the last day of the month? The client asking this question is only interested in sell orders done without committing capital. The client uses only five brokers, and the portfolio consists of the Dow Jones index.

I am not making this up. This type of query is commonplace. The restriction to Dow Jones alone brings us down to 1.26 billion pieces of data. The last day of the month is a portfolio rebalance event which yields only 63 million observations on the assumption events are more or less evenly distributed across days. Following similar assumptions of uniformity, a focus on the largest institution brings us to 63,000. Since only one of the traders is under the microscope, make it 12,600. Buys and sells are balanced in the normal course of events, so we really have only 6,300 events. Agency trades make up half the amount, say. The result is 3,150 trades.
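
The whittling-down is easier to follow as arithmetic. The sketch below starts from the 6.23 billion figure and applies reduction factors inferred from the successive totals above; they are rough, back-of-the-envelope assumptions rather than measured proportions.

```python
# Restating the back-of-the-envelope reduction, starting at 6.23 billion.
# Each factor is inferred from the successive totals quoted in the text.
observations = 6.23e9

filters = [
    ("Dow Jones stocks only",          1 / 5),
    ("last trading day of the month",  1 / 20),
    ("largest institution",            1 / 1000),
    ("the head trader alone",          1 / 5),
    ("sell orders only",               1 / 2),
    ("agency trades only",             1 / 2),
]

for label, factor in filters:
    observations *= factor
    print(f"{label:32s} -> {observations:,.0f}")

# The final line lands near the 3,150 trades quoted above.
```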

The figure is reminiscent of the climate change example and is hardly impressive. Yet not too long ago the full data set was the largest on the planet relating to institutional trading activity. There are reasonable questions leading to queries that yield only a few hundred observations in a desired category.

If we were looking at a company such as Amex, the total number of observations would be much larger. The data are pumped up by many individuals and categories. The same principle applies, however. Narrow down the search to something actionable such as personalization and Big Data is not noticeably big.

The Discrepancy Between the Expected and the Observed Has Grown

AI has an insatiable appetite for Big Data, and it is time to have a little fun. The data scientist in me wonders if the slicing, dicing, and massaging of Internet data is really all there is. Do we reach a point at which AI chokes? How much information is there in the known universe? One answer begins with dark matter.

In 1933, a Swiss astronomer by the name of Fritz Zwicky wondered how galaxies in the Coma Cluster were kept together. There was not enough mass to keep them from flying apart. He speculated unobservable matter was the glue. The speculation was dismissed for lack of evidence until 1968, when Vera Rubin discovered stars in the Andromeda Galaxy moving in ways that violated Newton’s laws of motion. Logic dictated there was more matter in the universe than previously thought, albeit undetectable. Physicists now believe dark matter comprises 27 percent of the universe and dark energy another 68 percent. The latter is all about the rate at which the universe is expanding. Dark matter influences how observable matter comes together. And that’s where we come in.

There are theories, of course. MACHOs are one; think of black holes. WIMPs are another, centered on particles that interact with ordinary matter only weakly, much like neutrinos. Enter Big Data. Melvin Vopson claims information is the fundamental building block of the universe. Through an equivalence of mass, energy, and information, the puzzle of dark matter disappears.

Information generates about a quarter of the known universe.

Nothing is new in the galaxy. Information theory dates back to the late 1940s. Claude Shannon was one of my heroes in school. We know him for the design of electrical systems such as telephone switching circuits. He was the first to define a unit of information as a bit. We are all about bits and bytes these days. A contemporary, John Wheeler, proclaimed that everything is information. In design-thinking terms, the statement was a frame within which quantum mechanics could be connected to information theory as a principle. Wheeler coined the phrase “it from bit.” Every particle comes from the information locked inside it. The fabric of space-time derives existence from actions depending on binary choices, otherwise known as bits. He was a man filled with radical notions, but Einstein and Bohr deeply appreciated him.

Vopson echoes the basis of Wheeler’s principle. Information is the basic unit. One step further, information is energy and has mass. Not unique yet. In 1961, Rolf Landauer envisioned that erasing even a single bit of information would release heat. He calculated how much.

Uniqueness of thought follows by connecting information theory and thermodynamics. Applied to digital systems, the combination suggests that once information is created, it has quantifiable mass. Vopson proposes a means to measure it.
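
For the curious, both quantities fit on the back of an envelope. A minimal sketch follows, using Landauer’s bound of k·T·ln 2 joules of heat per erased bit and, dividing by c squared, the mass per stored bit that Vopson’s equivalence implies; room temperature is assumed.

```python
# Landauer's heat per erased bit and the corresponding mass per stored bit
# under the mass-energy-information equivalence. Room temperature assumed.
import math

k_B = 1.380649e-23   # Boltzmann constant, joules per kelvin
T = 300.0            # room temperature, kelvin
c = 2.99792458e8     # speed of light, metres per second

energy_per_bit = k_B * T * math.log(2)   # ~2.9e-21 joules
mass_per_bit = energy_per_bit / c**2     # ~3.2e-38 kilograms

print(f"heat released per erased bit: {energy_per_bit:.2e} J")
print(f"mass of one stored bit:       {mass_per_bit:.2e} kg")
```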

We are back to dark matter. In 2008, M. Paul Gough worked out the number of bits of information the known universe must contain to account for all the missing matter, and Vopson agrees on the quantity. I personally am missing something in the physics jargon, but here it is anyway: once stars began forming, there was a constant information energy density as the increasing proportion of matter at high stellar temperatures exactly compensated for the expanding universe. Never mind. His information equation of state was close to calculated dark energy values over half of cosmic time.

The conclusion? A reasonable universe information content of 10^87 bits is sufficient for information energy to account for all dark energy. The quantity is enough to link dark energy and matter, and becomes the basis of thinking about star formation. Your color TV contains 40 times this amount of information in one screenshot.

The information energy contribution to dark energy is determined by the extent of stellar formation, possibly answering the “cosmic coincidence” question—why now? The climate change study asks the same question. The researchers are contributing directly to the climate change they desire to slow down.

I have a theory. The explosion of information in the last few years is the driver of climate change. Decentralized finance in the form of bitcoin variants adds fuel. Information is heat. The upward trend in data correlates with climate metrics as well as carbon emissions do. Data have no location, so no single country is more or less responsible. Explode the Internet; problem solved. For the conservatively inclined, you can blow up only the 78 percent devoted to cat videos, since the remainder is all that is of use anyway.

There is always a catch. The Landauer Principle says erasing even a single bit of information would release heat.

Our climate change researchers alone contributed 140 billion bits of data. Those bits encode 4.48 trillion tonal levels of imagery. In the end, they could usefully analyze a tiny fraction.

Think of all the heat. Please don’t delete the database.

Chapter Notes

I am thankful to Christo Petrov for his March 2019 summary of data statistics and references to sources in Big Data Statistics 2020, https://techjury.net/stats-about/big-data-statistics/. The Wikibon reference is from their 2018 Big Data Analytics Trends and Forecast. Other sources include Physics.org and Statista.

“It is strange to be known so universally and yet to be so lonely” is due to Albert Einstein. The small knives saying is a rewrite of a phrase of the horror fiction author, Laurell Hamilton. I couldn’t squeeze her vampire themes into the chapter, but the saying is cool.

In discussing the role of data scientists, I was inspired by Monica Rogati. You can taste a sample by looking at “How not to hire your first data scientist,” Hackernoon, February 2017. She is much funnier than I am.

The climate change piece is “Extreme Weather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events,” by a team led by Evan Racah. It was published as part of the proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach. I may have made it sound as though the climate change article referenced is the only attempt at introducing AI to climate change. Climate scientists do use basic machine learning techniques in the form of statistics such as principal component analysis for dimensionality reduction, and k-means clustering algorithms. The climate science community primarily relies on expert engineering systems and ad hoc rules for characterizing climate and weather patterns, however. An example is TECA (Toolkit for Extreme Climate Analysis) using heuristic methods.

M. Paul Gough’s paper is “Information Equation of State,” in Entropy, August 2008. For Vopson’s contribution, check out “The mass-energy-information equivalence principle,” August 2019, in AIP Advances. If you’ve got technical chops, go for it. It doesn’t take a specialist to realize the two together suggest a vastly different way of looking at the role of information within the universe.
