Since the Hollerith machine census of the American population in 1890, mentioned in Chapter 1, data have changed considerably in quantity, representation and diversity. But what do we mean by “data”?
A bit can be used to represent the numbers 0 and 1, or the logical values true and false: very little, in other words. We therefore combine several bits in order to represent, or code, more substantial data.
A piece of data, a set of bits, has no meaning in itself. It must be interpreted in a specific context to derive information from it. Is it a color code, a first name or an instruction in an assembly language?
The accumulation of coherent data leads to knowledge in a particular domain. It allows the interpretation of information for decision-making and action initiation. To take a simple example, the set of couples “dish, price” gives knowledge of the menu offered by a restaurant to order from it.
At the most elaborate stage, the analysis of a body of knowledge leads to intelligence, since it is a question of exploiting this knowledge with a precise objective.
This chapter attempts to take a look at what we generally call data.
We will start from the bit, because the basis of computing is a digital world. It was the American mathematician Claude Shannon, already mentioned, who in 1948 popularized the use of the word bit as an elementary measure of digital information.
Data are the working material of data processing. A piece of data is a raw element that has not yet been interpreted or put into context. It can be recorded, stored and transmitted.
Information is interpreted data. A number can represent a price, a temperature, etc. In other words, putting data in context creates added value to constitute information.
As we saw in Chapter 1, data must be digitized to be understandable and manipulable by computers. Digitization is a process that allows objects (text, images, sounds, signals, etc.) to be represented by sequences of binary numbers (0 and 1). Let us see some examples of digitization.
The ASCII code (American Standard Code for Information Interchange) is one of the oldest codes used in computing. Initially based on 7 bits, it had to be extended to take into account accented characters among others. The need to be able to code multiple alphabets has led to the definition of new codes such as Unicode. The coexistence of these standards still presents problems, and we can see it in the e-mails we receive when an accented character is represented in a surprising way!
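The "surprising" accented characters mentioned above can be reproduced in a few lines of Python: the same bytes, decoded with the wrong character encoding, yield the familiar mojibake.

```python
# The same bytes interpreted under two different character encodings.
text = "résumé"
data = text.encode("utf-8")         # the bytes actually transmitted

print(data.decode("utf-8"))         # correct interpretation: résumé
print(data.decode("latin-1"))       # wrong interpretation: rÃ©sumÃ©
```

Each accented letter occupies two bytes in UTF-8; read one byte at a time under Latin-1, it becomes two unrelated characters.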
A number is not a sequence of characters; it is a value that will enter into mathematical operations. The representation of numbers, in binary, depends on the type of number, and we will summarize the main representations.
The case of natural numbers is the simplest. The conversion to a bit string was explained in Chapter 1. If we code a number on 32 bits (4 bytes), the maximum recordable integer (all bits set to 1) is 2³² − 1 = 4,294,967,295.
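These values are easy to check in Python, whose built-in `bin` function produces the binary representation of an integer.

```python
# Maximum unsigned integer on 32 bits: all 32 bits set to 1.
n = 2**32 - 1
print(n)            # 4294967295

# Binary representation of a small natural number: 13 = 8 + 4 + 1.
print(bin(13))      # '0b1101'
```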
The coding of signed (relative) integers can be done by reserving one bit for the sign (+ or −). This method has drawbacks; the two's complement representation offers a better solution. To apply it, we change each 0 into 1 and each 1 into 0 in the binary number, then add 1 to the result. For example, since 6 is 0110 in binary on 4 bits, its two's complement is 1010, which represents −6. The result of the addition x + (−x) is then zero. The most significant bit of a negative number is always equal to 1, and subtractions reduce to additions.
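The two steps described above (invert every bit, then add 1) can be sketched directly in Python, working on 4 bits as in the example:

```python
def twos_complement(x, bits=4):
    """Two's complement of x on the given width:
    invert every bit, then add 1 (arithmetic modulo 2**bits)."""
    mask = (1 << bits) - 1
    return ((x ^ mask) + 1) & mask

neg6 = twos_complement(6)            # 0110 -> 1001 -> 1010
print(format(neg6, "04b"))           # 1010

# x + (-x) wraps around to zero on 4 bits: the carry out is discarded.
print((6 + neg6) & 0b1111)           # 0
```

Applying the operation twice returns the original number, as expected of a negation.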
A real number is a number that can be represented by a whole part and a sequence of decimals, for example, −37.5 in decimal form. On a computer, floating-point numbers are generally used. In decimal form, −37.5 can be written as −0.375 × 10², the general form being s × m × bᵉ, where s is the sign, m is the mantissa and e is the exponent of the base b. To represent a real number over 32 bits, we can have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa.
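The 1 + 8 + 23 layout described above is that of the IEEE 754 single-precision format. A short Python sketch, using the standard `struct` module, extracts the three fields from the 32 bits of −37.5 (the stored exponent carries a bias of 127):

```python
import struct

# Pack -37.5 as a 32-bit IEEE 754 float and recover its raw bits.
bits = struct.unpack(">I", struct.pack(">f", -37.5))[0]

sign     = bits >> 31               # 1 bit
exponent = (bits >> 23) & 0xFF      # 8 bits, stored with a bias of 127
mantissa = bits & 0x7FFFFF          # 23 bits (the leading 1 is implicit)

value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(sign, exponent - 127, value)  # 1 5 -37.5
```

Indeed −37.5 = −1.171875 × 2⁵, which is what the three fields encode.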
A sound is a continuous analog quantity represented by a curve varying with time. Digitization is carried out in two stages: sampling and quantization.
In sampling, the sound is cut into slices or samples. The sampling frequency corresponds to the number of samples per second. In order to translate the analog signal as accurately as possible, as many measurements as possible should be taken per second.
Each sample is then quantized, which assigns a numerical value to it. The greater the sampling frequency, the closer the digital signal will be to the analog signal and therefore the better the digitization.
A digital image is composed of elementary units (called pixels), each of which represents a portion of the image. An image is defined by the number of pixels that constitute it in width and height, each pixel being a triplet of numbers (r, g, b) encoding the luminous intensity of red, green and blue in a color coding system. Digitizing a document consists of assigning a numerical value to each of these points.
We will return to images in Chapter 5.
Data can be used as a variable in a computer program, and many types have emerged.
Metadata is data used to define or describe another piece of data (e.g. associating the date of capture and GPS coordinates with a photo).
A Java program can handle simple (or primitive)-type variables and complex-type variables (objects).
The Web has also brought a new approach to the notion of data.
All digital cameras on the market have sensors with several million pixels. Each pixel takes up space in memory: for example, in 16 million color mode, the pixel sometimes occupies 32 bits (4 bytes). The storage of a large number of photos therefore poses volume problems.
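The storage arithmetic is easy to check. The 12-megapixel sensor below is an assumed figure for illustration, not taken from the text; only the 4 bytes per pixel comes from the paragraph above.

```python
MEGAPIXELS = 12          # assumed sensor size for the example
BYTES_PER_PIXEL = 4      # 32-bit pixels in 16-million-color mode

size_bytes = MEGAPIXELS * 10**6 * BYTES_PER_PIXEL
print(size_bytes / 10**6, "MB per uncompressed photo")   # 48.0 MB
```

At 48 MB per photo, a thousand uncompressed photos would already occupy 48 GB, which is why compression matters.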
If we look at a photo, we note that many contiguous pixels are identical. Moreover, it is not vital to preserve every little detail that will not be perceived by the human senses anyway. We can therefore, using specific algorithms, compress the data corresponding to the image.
Many compression algorithms exist, each with its own particularity and especially a target data type. A distinction is made between lossless compression algorithms (which, after decompression, restore a sequence of bits that is strictly identical to the original) and lossy compression algorithms (the sequence of bits obtained after decompression is more or less close to the original, depending on the desired quality). The former are mainly used for the compression of text and certain types of images (medicine), and the latter for speech, music, images and video.
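Lossless compression is easy to demonstrate with Python's standard `zlib` module: a byte sequence with long runs of identical values (like the contiguous identical pixels mentioned above) shrinks dramatically, and decompression restores it bit for bit.

```python
import zlib

# A "photo-like" byte sequence: long runs of identical pixels.
raw = b"\x00" * 5000 + b"\xff" * 5000

packed = zlib.compress(raw)
print(len(raw), "bytes ->", len(packed), "bytes")

# Lossless: decompression gives back a strictly identical sequence.
assert zlib.decompress(packed) == raw
```

A lossy algorithm such as JPEG goes further by also discarding details the eye cannot perceive, which is why it cannot restore the original exactly.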
For still images, JPEG is a lossy compression mode, but it can reduce a file to one tenth of its original size with little visible change in quality. PNG, by contrast, is a lossless mode that produces larger files. MPEG is a lossy compression format for video sequences, and MP3 is a lossy compression format for sound.
In this approach to digitization, we have cited elemental data and more complex entities such as images or sound sequences. The structuring of data is important; it is an organization that makes it possible to create information. It also simplifies and improves the efficiency of information processing and exchange.
Structured data are data decomposed according to a formal schema, with constraints on data types.
An array T is a data structure that allows us to store a determined number of elements T[i] marked by an index i. All the elements of the array generally have the same basic type (integer, for example). A program will be able to manipulate an array more simply than a sequence of elements.
Unlike an array, whose length must be fixed in advance, the elements of a linked list are distributed in memory and connected to each other by pointers. We can add and remove elements of a linked list at any place and at any time, without having to recreate the whole list.
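A minimal linked list can be sketched in a few lines of Python. Note how a new element is inserted after the head simply by redirecting one pointer, without touching the rest of the list:

```python
class Node:
    """One element of a singly linked list: a value plus a pointer."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Build the list 1 -> 2 -> 3.
head = Node(1, Node(2, Node(3)))

# Insert 99 after the head: only one pointer changes.
head.next = Node(99, head.next)

# Traverse the list by following the pointers.
values, node = [], head
while node:
    values.append(node.value)
    node = node.next
print(values)                        # [1, 99, 2, 3]
```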
A record is a logical grouping of several variables into a single variable consisting, for example, of a person’s name, address and telephone number. It can be treated as information.
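In Python, such a record can be expressed as a dataclass grouping the three variables from the example (the values below are of course hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Person:
    """A record: several variables grouped into a single logical unit."""
    name: str
    address: str
    phone: str

p = Person("Ada Lovelace", "12 St James's Square, London", "+44 20 0000 0000")
print(p.name)        # fields are accessed by name, not by position
```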
Numerous other data structures exist and are used in programming: stacks, trees, graphs, etc.
More generally, structured data can be controlled by repositories and presented in boxes that allow their interpretation and processing by humans and machines. Databases, which we will present in section 4.5, use structured data.
We have more and more heterogeneous information, without a formal description scheme.
Electronic messaging, social networks and the Web carry large amounts of data that are impossible to interpret because they do not have a clear structure.
If I send a message to a friend that contains the data:
Victor Hugo Notre Dame de Paris François Rabelais Gargantua
he will only understand what it is about if he knows the subject of my message and French literature. Of course, a computer cannot do much without proper instructions.
If, on the other hand, I use tags to say what each of these data represents, I can transform them into information:
<table>
  <author>
    <firstname>Victor</firstname>
    <surname>Hugo</surname>
    <title>Notre Dame de Paris</title>
  </author>
  <author>
    <firstname>François</firstname>
    <surname>Rabelais</surname>
    <title>Gargantua</title>
  </author>
</table>
A program will be able to decode and process this information, which can be said to be semi-structured and can therefore be processed by many applications.
XML (Extensible Markup Language) is a standard adopted by the W3C to complement HTML for data exchange on the Web. The strength of XML lies in its ability to describe any data domain through its extensibility. It makes it possible to structure data and to set the vocabulary and syntax of the data a document will contain.
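A program can indeed decode such a tagged fragment. The sketch below parses a repaired version of the example with Python's standard `xml.etree.ElementTree` module; the tag names are illustrative, not a fixed vocabulary.

```python
import xml.etree.ElementTree as ET

doc = """<table>
  <author><firstname>Victor</firstname><surname>Hugo</surname>
          <title>Notre Dame de Paris</title></author>
  <author><firstname>François</firstname><surname>Rabelais</surname>
          <title>Gargantua</title></author>
</table>"""

root = ET.fromstring(doc)
pairs = [(a.findtext("surname"), a.findtext("title"))
         for a in root.findall("author")]
for surname, title in pairs:
    print(surname, "-", title)
# Hugo - Notre Dame de Paris
# Rabelais - Gargantua
```

The tags turn opaque strings into named fields that any application can query.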
Data can be created by a user, generated by devices such as sensors (e.g. temperature), automatically generated by a program itself or simply an object from the Web. They are all ultimately represented by 0s and 1s (binary representation). But a sequence of 0s and 1s does not tell us what they represent; it is the format that will allow it, and we have just seen formats for numbers and images. A format is therefore a particular way of representing data in a form that can be understood by a computer, whether the data is structured or semi-structured.
A file is a record of data stored in a computer. The most common is the file containing the text we have just entered and saved. We can reread it, modify it, send it, etc. It is an entity in its own right.
Consistent collections of data, such as those relating to people working in a company, can be recorded and stored in a memory, in the form of a file, just as one could do on a notebook or a set of sheets.
A file has a name that is used to identify and access its contents, often with a suffix (the extension, i.e. the last letters of the file name after the “.”), which provides information about the file type and therefore the software used to manipulate it. There are hundreds, if not thousands, of file types, which differ by the nature of the content, the format, the software used to manipulate the content, and the use the computer makes of it. The file format is the convention by which information is digitized and organized in the file.
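The "last letters after the dot" rule can be seen with Python's `pathlib` module (the file name is a made-up example):

```python
from pathlib import Path

name = Path("report.final.pdf")
print(name.suffix)       # '.pdf'  -- the extension: everything after the last "."
print(name.stem)         # 'report.final'  -- the name without the extension
```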
Because of its flexibility, the notion of format has allowed the development of a large number of so-called proprietary formats (their specifications are not public or their use is restricted by their owner), which has posed many problems. To solve them, standard formats have been proposed and validated worldwide, and there are currently hundreds of standard formats for all file types. Table 4.1 provides a short list of common formats.
Table 4.1. Common file formats
|TXT|File extension for a plain-text file (data that represent only characters of readable material). Its advantage is that it can be read on all platforms.|
|DOC|File extension associated with the Microsoft Office Word application.|
|SXW|File created with the word processor of the OpenOffice suite.|
|XLS|File created with the Excel spreadsheet.|
|HTML|A file containing text with HTML tags that govern the presentation of the text by a web browser on a screen.|
|PDF|File readable by the Acrobat Reader software available on all platforms.|
|ZIP|File containing compressed files made with compression software (WinZip, PowerArchiver, etc.).|
|MP3|Coding format for digital audio, allowing a large reduction in file size compared to uncompressed audio.|
|MPEG|Format that compresses videos using the fact that some scenes are fixed or not very animated. There are several MPEG standards.|
|JPEG|Compressed image format. Most image processing software gives the user a choice of compression ratio, allowing the user to see the effect of compression on the image.|
|TIFF|Universal image format recognized by all computer platforms. This format is more sophisticated than JPEG and has more options.|
We have moved from isolated information to knowledge, with these consistent data collections providing an essential additional level of information. But we are going to progress further in the level of knowledge.
For years, files have been the most common way to organize and store data.
As more and more data become available, managing and using them can pose many problems. Let us take the example of a company’s human resources department. It needs to have a complete list of employees with their identity, their employment contract, the unit to which each one is assigned, their index (to establish the salary), etc. The data can be used to establish the salary of each employee. Some information changes (family situation, unit of assignment, etc.). The unit manager needs additional information (e.g. holiday). The department in charge of training needs to know the development of each person’s career. The multiplicity of files results in redundant data, which makes it difficult to update. In addition, the needs of the users of these data are diverse and subject to change. A solution must therefore be found.
A database (DB) is an entity in which it is possible to store data in a structured way and with as little redundancy as possible. These data, which are described in the database itself, must be able to be used by different programs and users. The distinction between the physical level and the logical level is essential, and a last level provides the views that users can have of these data.
A database makes it possible to make data available to users for consultation, entry or updating, while ensuring the rights granted to the latter. The standard approach in databases is based on a client–server architecture: the client (person or program) sends a request to the server, this request is compiled and executed, and the response is sent to the client by the server.
DBMS (database management systems) are software programs that manage databases. A DBMS supports the structuring, storage, updating and maintenance of a database. It is the only interface between computer scientists and data (definition of schemas, programming of applications), as well as between users and data (consultation and updating).
A DBMS must allow the data to be defined, consulted, updated and protected: a DBMS is in charge of a lot of tasks!
There are three main DBMS models, differentiated according to the representation of the data contained in the database.
The data are classified hierarchically, in a top-down tree structure. This model uses pointers between the different records. This is the first generation of DBMS, which appeared in the late 1960s.
The early work on DBs is generally attributed to Ted Codd (1970), an IBM researcher. In this model, data are stored in two-dimensional tables (rows and columns) called relations. The data are manipulated by relational algebra operators such as intersection, join, or Cartesian product. This is the second generation of DBMS (1980s).
Relational databases are the most widespread databases. The most used relational DBMS are Oracle, MySQL, Microsoft SQL Server, PostgreSQL and DB2.
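The relational model can be illustrated with Python's built-in `sqlite3` module (SQLite is a lightweight relational DBMS). The two tables and their contents are hypothetical; the join combines rows of both tables through a shared key, in the spirit of the relational algebra operators mentioned above.

```python
import sqlite3

con = sqlite3.connect(":memory:")    # a throwaway in-memory database

con.executescript("""
  CREATE TABLE unit     (id INTEGER PRIMARY KEY, label TEXT);
  CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, unit_id INTEGER);
  INSERT INTO unit     VALUES (1, 'Sales'), (2, 'HR');
  INSERT INTO employee VALUES (10, 'Durand', 1), (11, 'Martin', 2);
""")

# A relational join: pair each employee with the label of their unit.
rows = con.execute("""
  SELECT employee.name, unit.label
  FROM employee JOIN unit ON employee.unit_id = unit.id
  ORDER BY employee.name
""").fetchall()
print(rows)          # [('Durand', 'Sales'), ('Martin', 'HR')]
```

Note that the client only states *what* it wants; the DBMS decides how to execute the query.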
The data are stored as objects, that is, structures called classes. Object databases make sense when it comes to modeling very rich and complex data with many variations: multimedia documents, geographical data, etc. This is the third generation (1990s). But the complexity of their implementation and the place already taken by RDBMS have severely limited the deployment of ODBMS.
Designing and setting up a database is a major task that can take weeks or even months, depending on the size of the database, before the data are even entered into the database. The work of analyzing the needs of the organization and the users involved is essential because it supports the overall success of the operation. Figure 4.9 summarizes the main steps.
For a long time, companies and other organizations have set up several databases, each concerning one of the major functions: personnel, financial management, production and stock, customers, etc. However, these different databases contained some common data and it was difficult to ensure consistency. It was necessary to have a global vision of the organization’s data in order to manage it efficiently.
The concept of an information management system (IMS) appeared in the mid-1960s in the United States and a few years later in France. However, this concept has evolved considerably up to the present day. Information management systems are influenced by research into system structures and the conceptualization of decision support at the IT level.
ERP (Enterprise Resource Planning) is a subset of information management systems. An ERP is a software solution aimed at unifying a company’s information system by integrating the various functional components around a single database. Entering or modifying data in any of the modules (human resources, sales management, inventory management, production, etc.) impacts all the other modules: the database is updated and applies the change to the entire company.
An ERP has specific characteristics.
More than a simple piece of software, an ERP is a real project requiring a total integration of a software tool within an organization, and therefore has significant engineering costs. However, its implementation leads to significant changes in the work habits of a large number of staff. The organization concerned therefore often calls upon a specialized consulting firm.
We especially mentioned the databases dedicated to the management of companies, administrations and other organizations, large or medium. However, a large number of organizations have developed and are still developing the databases necessary for an efficient management and use of the information for which they are responsible. Here are a few examples.
The purpose of a knowledge base is to model and store a set of knowledge, ideas, concepts or data in a computerized manner and to allow their consultation/use.
Bringing together all documents, instructions, business processes and other elements in one place, the business knowledge base can be accessed and used by all employees at any time. These databases can be set up by companies to improve their internal operations and/or their relations with their customers, by various organizations to improve the relationship with their users, etc.
They are often associated with an expert system that will allow the user to use them: the user simply enters the information at his or her disposal, and the expert system gives him or her the answer. Troubleshooting is a fairly classic example of an application.
Some chatbots, or conversational agents, have a knowledge base in which all the information they need to answer users’ questions is recorded.
A geographic information system (GIS) is an information system designed to collect, store, process, analyze, manage and present all types of spatial and geographic data. GIS offers all the possibilities of databases (such as querying and statistical analysis) through the unique visualization and geographic analysis specific to maps.
Many fields are closely related to geography: environment, demography, public health, territorial organization, network management, civil protection, etc. They have a direct interest in the power of GIS to create maps, to integrate all types of information, to better visualize various scenarios, to better present ideas and to better understand the scope of possible solutions.
There are two types of data in a GIS: geometric (spatial) data and attribute (thematic) data.
For example, the CARTO database of the French IGN (Institut géographique national) contains a homogeneous vector description of the different elements of the country with decametric precision. It also offers a wealth of thematic information: road (more than 1 million km of roads) and rail networks, administrative units, hydrographic network, land use, etc.
Today, GIS represents a market worth several billion euros worldwide and employs several hundred thousand people.
There are a large number of scientific databases in various fields. Here are three examples.
The base de données publique des medicaments (BDPM), the French public drug database, was opened in 2013. This administrative and scientific database on the treatment and proper use of health products is implemented by the Agence nationale de sécurité du medicament et des produits de santé (ANSM). It is intended for healthcare professionals, as well as for the general public.
SIMBAD (Set of Identifications, Measurements and Bibliography for Astronomical Data) is an astronomical database of objects outside the solar system. Created in 1980, it is maintained by the Centre de données astronomiques in Strasbourg and allows astronomers around the world to easily know the properties of each of the objects listed in an astronomical catalog. As of February 2017, SIMBAD contains more than 9 million objects with 24 million different names, and more than 327,000 bibliographical references have been entered.
The Observatoire Transmedia is a research platform that enables the analysis of large volumes of transmedia data (TV, radio, Web, AFP, Twitter) that are multimodal, heterogeneous and related to French and Francophone news. The consortium of this project led by the Ina (Institut national de l’audiovisuel) has brought together technological partners as well as partners in the human and social sciences, and has enabled the acquisition of know-how and tools for mass media processing. This platform has been renamed OTMedia+.
The importance of safety was emphasized in Chapter 2 of this work. Attacks coming from outside (networks) are not the only attacks, and the protection of data in a database must also take into account the problems, accidental or malicious, that may occur.
The fundamental principles are the confidentiality, integrity and availability of the data.
And all other risks must be analyzed according to the specific context. For example, when an employee leaves the company, we have to make sure that his or her access rights to the database are removed.
Here we reach the ultimate level, where intelligence is combined with data.
Anything and everything can emit data: web browsing, e-mails, SMS, phone conversations, GPS, radios, bank cards, sensors, satellites, connected objects, etc. These data can be recorded for later reuse (or not). The volume of stored data was estimated in zettabytes (1 ZB = 10²¹ bytes) in 2020.
Unlike an organization’s data, which are structured and can be stored in a database, there are a wide variety of formats (digital data, text, images, sound, video, etc.) and each has a low density of information. It is mainly a flow of data, like a fountain that flows continuously and from which we would take only a small part for our use.
All these data are stored and made available in huge storage spaces, the data centers introduced in Chapter 1.
Big Data refers to a very voluminous set of data that no conventional database management or data management tool can handle. Traditional computing, including business intelligence, is model-based; Big Data instead uses mathematics to find patterns in the data.
Big Data is characterized by “3 Vs”: its unprecedented Volume; its Velocity, which represents the frequency at which data are simultaneously generated, captured, shared and updated; its Variety: data are raw, semi-structured or even unstructured, coming from multiple sources.
Data only make sense with the algorithms that process them. To process this phenomenal amount of data, two types of means must be combined: very powerful computers and efficient algorithms capable of extracting useful information.
In data mining applications, raw information is collected and placed in a giant database, then analyzed and summarized to extract potentially useful information. Data mining is at the crossroads of several scientific and technological fields (machine learning, pattern recognition, databases, statistics, artificial intelligence, data analysis, data visualization, etc.), which are presented in Figure 4.10.
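A classic data mining task is finding associations between items, such as products frequently bought together. The sketch below counts item pairs over hypothetical purchase "baskets" with only the standard library; real systems use dedicated algorithms (e.g. Apriori) on far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets (the raw data collected).
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "butter", "coffee"},
]

# Count every pair of items appearing in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

pair, count = pair_counts.most_common(1)[0]
print(pair, count)       # ('bread', 'butter') 3
```

The summary (bread and butter co-occur in 3 of 4 baskets) is the "potentially useful information" extracted from the raw collection.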
Data mining is part of the Knowledge Discovery in Databases (KDD) process, which is summarized in Figure 4.11.
Data mining applications are used in many areas.
The use of these vast amounts of data can be worrisome, and several types of risks to privacy and fundamental rights have been reported in the literature.
Data mining is one of the new research and development priorities of many organizations, companies, universities, research centers, etc. According to IDC (International Data Corporation), the Big Data market is expected to reach $203 billion in revenue in 2020, compared to $130 billion in 2016. That is how significant this field is becoming in the global economy!
This deluge of digital data poses a number of problems, including its quality and the legal possibility of using it. Do I have the right to use, for commercial purposes in particular, any data I find on the Web? Does its use not infringe on a legal right?
Let us recall that intellectual property is the domain comprising all the exclusive rights granted on intellectual creations. It comprises two branches: literary and artistic property, which applies to works of the mind (copyright), and industrial property, which mainly concerns patents.
Data ownership has been the subject of legal debate for years and is approached in different ways in different countries. In France, as in many countries, the notion of data ownership has no legal status as such.
No intellectual property rights are generally attached to raw data (my date of birth, for example). By contrast, someone who takes a picture of me on a Brazilian beach without my knowledge has no right to publish it without my agreement, even if I am not a movie star! That is the right to one's image.
More seriously, the images acquired by the SPOT satellite constellation constitute a cost-effective source of information, suitable for the knowledge, monitoring, forecasting and management of resources and human activities on our planet. They have a significant value and are therefore protected and commercialized.
Things continue to get more complicated, for example, with the connected objects already mentioned: Should the property rights on captured data be assigned to the manufacturers of connected objects, or to those who use these connected objects to capture data?
The debates, which concern more than lawyers alone, will undoubtedly continue for many years to come.
Our personal data are captured by GAFA (e-mails, cell phones, social networks, Internet browsing, shopping carts, etc.) and no doubt by many others. The data concerning our tastes, our travels or our loves are broken down, collected, aggregated and often resold.
In addition, they are used for a variety of purposes: commercial (I experience it every day, like everyone else, especially when I surf the Web and receive advertising pop-ups), political (the recent scandal surrounding the Cambridge Analytica company is just one example), all kinds of proselytism, etc. And nobody asks me if I agree!
Our personal data are subject to a fundamental right, the right to privacy. Each individual should be able to freely decide on the use of the data that concern him or her. This is not the case because the legal vagueness is great and no one is going to attack a Web giant for using his or her personal data in an analysis involving enormous amounts of data. The question for the individual is therefore not so much one of ownership of his or her data as it is of keeping control of the data flows concerning him or her.
However, in 2018, we witnessed debates aimed at allowing each person to monetize his or her personal data, which currently enriches the giants of the Internet. Is this realistic and even useful?
But better regulation is needed, and the European Union’s General Data Protection Regulation (GDPR), which is the reference text for personal data protection, came into force in 2018. The main objectives of the GDPR are to increase both the protection of persons concerned by the processing of their personal data and the accountability of those involved in such processing.
This text integrates the right to erasure for any person in Article 17, which states that the person “has the right to obtain from the data controller the erasure of personal data concerning him/her without undue delay, and the controller shall have the obligation to erase personal data without undue delay”; this is what is called the right to digital oblivion.
Open Data refers to a (worldwide) movement to open up data to meet the growing need of economic, academic and cultural players to collect and analyze masses of data. Open Data is digital data that is freely accessible and usable. It can be of public or private origin, produced in particular by a community, a public service or a company. It is disseminated in a structured manner according to a method and an open license guaranteeing its free access and reuse by all without technical, legal or financial restrictions.
Since 2003, the European Union has been working on this subject and has produced several directives. On April 25, 2018, the European Commission adopted a proposal to revise the Public Sector Information (PSI) Directive, including various measures to create a European data space. The reuse of Open Data is of great interest in addressing societal challenges, in the development of new technologies, etc., and it is a key element in the development of a European data area.
The European Data Portal collects metadata of public sector information available on the public data portals of the different European countries. In addition, the European Union Open Data Portal provides access to Open Data published by EU institutions and bodies.
In France, the Etalab mission coordinates the policy of opening up and sharing public data. It develops and runs the Open Data platform, which is designed to collect and make freely available all the public information held by the State, its public institutions and, if they so wish, local authorities and public or private sector bodies responsible for public service missions.