2 Symbolic Data: Basics

In this chapter, we describe what symbolic data are, how they may arise, and their different formulations. Some data are naturally symbolic in format, while others arise as a result of aggregating much larger data sets according to some scientific question(s) that generated the data sets in the first place. Thus, section 2.2.1 describes non‐modal multi‐valued or lists of categorical data, with modal multi‐valued data in section 2.2.2; lists or multi‐valued data can also be called simply categorical data. Section 2.2.3 considers interval‐valued data, with modal interval data more commonly known as histogram‐valued data in section 2.2.4. We begin, in section 2.1, by considering the distinctions and similarities between individuals, classes, and observations. How the data arise, such as by aggregation, is discussed in section 2.3. Basic descriptive statistics are presented in section 2.4. Except when necessary for clarification purposes, we will write “interval‐valued data” as “interval data” for simplicity; likewise, for the other types of symbolic data.

It is important to remember that symbolic data, like classical data, are just different manifestations of sub‐spaces of the $p$‐dimensional space $\mathbb{R}^p$, always dealing with the same random variables. A classical datum is a point in $\mathbb{R}^p$, whereas a symbolic value is a hypercube or a Cartesian product of distributions in $\mathbb{R}^p$. Thus, for example, the two‐dimensional random variable measuring height and weight (say) can take a classical value, i.e., a single height in inches and a single weight in kg, or it may take a symbolic value in which both height and weight are interval values, which together form a rectangle (a hypercube in the plane). That is, the random variable itself is unchanged, but the realizations of that random variable differ depending on the format. However, it is also important to recognize that, since classical values are special cases of symbolic values, a symbolic analysis applied to classical values should produce the same results as the corresponding classical analysis, regardless of analytical technique.

2.1 Individuals, Classes, Observations, and Descriptions

In classical statistics, we talk about having a random sample of images observations images as outcomes for a random variable images. More precisely, we say images is the observed value for individual images, images. A particular observed value may be images, say. We could equivalently say the description of the imagesth individual is images. Usually, we think of an individual as just that, a single individual. For example, our data set of images individuals may record the height images of individuals, Bryson, Grayson, Ethan, Coco, Winston, Daisy, and so on. The “individual” could also be an inanimate object such as a particular car model with images describing its capacity, or some other measure relating to cars. On the other hand, the “individual” may represent a class of individuals. For example, the data set consisting of images individuals may be images classes of car models, Ford, Renault, Honda, Volkswagen, Nova, Volvo, … , with images recording the car's speed over a prescribed course, etc. However individuals may be defined, the realization of images for that individual is a single point value from its domain images.

If the random variable $Y$ takes quantitative values, then the domain (also called the range or observation space) of $Y$ consists of values on the real line $\mathbb{R}$, or on a subset of $\mathbb{R}$ such as the non‐negative half‐line if $Y$ can only take positive or zero values. When $Y$ takes qualitative values, a classically valued observation takes one of two possible values such as {Yes, No}, or coded to {0, 1}, for example. Typically, if there are several categories of possible values, e.g., a list of bird colors as the domain, a classical analysis will include a different random variable for each category and then record the presence (Yes) or absence (No) of each category. When there are $p$ random variables, the domain of the $p$‐dimensional random variable is the Cartesian product of the domains of the individual variables.

In contrast, when the data are symbolic‐valued, the observations are typically realizations that emerge after aggregating observed values for the random variable across some specified class or category of interest (see section 2.3). Thus, for example, observations may now refer to classes, or categories, of age × income, or to species of dogs, and so on. For instance, the class Boston (say) has a June temperature range recorded as an interval of values in °F. In the language of symbolic analysis, the individuals are ground‐level or order‐one individuals and the aggregations – classes – are order‐two individuals or “objects” (see, e.g., Diday (1987, 2016), Bock and Diday (2000a,b), Billard and Diday (2006a), or Diday and Noirhomme‐Fraiture (2008)).

On the other hand, suppose Gracie's pulse rate images is the interval images. Gracie is a single individual and a classical value for her pulse rate might be images. However, this interval of values would result from the collection, or aggregation, of Gracie's classical pulse rate values over some specified time period. In the language of symbolic data, this interval represents the pulse rate of the class “Gracie”. However, this interval may be the result of aggregating the classical point values of all individuals named “Gracie” in some larger data base. That is, some symbolic realizations may relate to one single individual, e.g., Gracie, whose pulse rate may be measured as images over time, or to a set of all those Gracies of interest. The context should make it clear which situation prevails.

In this book, symbolic realizations for the observation images can refer interchangeably to the description images of classes or categories or “individuals” images, images, that is, simply, images will be the unit (which is itself a class, category, individual, or observation) that is described by images. Furthermore, in the language of symbolic data, the realization of images is referred to as the “description” images of images, images. For simplicity, we write simply images, images.

2.2 Types of Symbolic Data

2.2.1 Multi‐valued or Lists of Categorical Data

We have a random variable images whose realization is the set of values images from the set of possible values or categories images, where images and images with images are finite. Typically, for a symbolic realization, images, whereas for a classical realization, images. This realization is called a list (of images categories from images) or a multi‐valued realization, or even a multi‐categorical realization, of images. Formally, we have the following definition.

Notice that, in general, the number of categories images in the actual realization differs across realizations (i.e., images), images, and across variables images (i.e., images), images.

Table 2.1 List or multi‐valued data: regional utilities (Example 2.1)

Region   Major utility                     Cost
1        {electricity, coal, wood, gas}    [190, 230]
2        {electricity, oil, coal}          [21.5, 25.5]
3        {coal}                            [40, 53]
4        {other}                           15.5
5        {gas, oil, other}                 [25, 30]
6        {electricity}                     46.0
7        {electricity, coal}               [37, 43]

While list data mostly take qualitative values that are verbal descriptions of an outcome, such as the types of utility usages in Example 2.1, quantitative values such as numerically coded categories may be the recorded value. These are not necessarily the same as ordered categorical values such as {small, medium, large}. Indeed, a feature of categorical values is that there is no prescribed ordering of the listed realizations. For example, for the seventh region in Table 2.1, the description {electricity, coal} is exactly the same description as {coal, electricity}, i.e., it describes the same region. This feature does not carry over to quantitative values such as histograms (see section 2.2.4).
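The point that list realizations carry no ordering can be sketched in code. The following is a minimal illustration only, assuming each multi‐valued realization is stored as a Python `frozenset`; the variable names are hypothetical and the values are those of the utility column of Table 2.1.

```python
# Utility lists from Table 2.1, stored as unordered sets (an assumed encoding).
utilities = {
    1: frozenset({"electricity", "coal", "wood", "gas"}),
    2: frozenset({"electricity", "oil", "coal"}),
    3: frozenset({"coal"}),
    4: frozenset({"other"}),
    5: frozenset({"gas", "oil", "other"}),
    6: frozenset({"electricity"}),
    7: frozenset({"electricity", "coal"}),
}

# The order in which categories are listed does not change the description.
reordered = frozenset(["coal", "electricity"])
same_description = utilities[7] == reordered

# The number of realized categories differs from region to region.
list_sizes = {region: len(cats) for region, cats in utilities.items()}
```

A set type is a natural fit here because equality ignores listing order; an ordered type such as a tuple would wrongly treat the two listings for region 7 as different.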

2.2.2 Modal Multi‐valued Data

Modal lists or modal multi‐valued data (sometimes called modal categorical data) are just list or multi‐valued data but with each realized category occurring with some specified weight such as an associated probability. Examples of non‐probabilistic weights include the concepts of capacities, possibilities, and necessities (see Billard and Diday, 2006a, Chapter 2; see also Definitions 2.6 to 2.9 in section 2.2.5). In this section and throughout most of this book, it is assumed that the weights are probabilities; suitable adjustment for other weights is left to the reader.

Without loss of generality, we can take the number of categories for a given random variable to be the same for all observations, namely the full set of possible categories, by simply giving unrealized categories the probability zero. Furthermore, the non‐modal multi‐valued realization of Eq. 2.2.1 can be written as a modal multi‐valued observation of Eq. 2.2.2 by assuming that the actual realized categories occur with equal probability (so that each of $s$ realized categories has probability $1/s$) and that unrealized categories occur with probability zero, for each observation.
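The conversion just described, from a non‐modal list to a modal list over the full domain, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `to_modal` is hypothetical, and the domain is assumed to consist of the six utility categories that appear in Table 2.1.

```python
from fractions import Fraction


def to_modal(realized, domain):
    """Return {category: probability} over the full domain.

    Realized categories share probability equally; unrealized
    categories receive probability zero.
    """
    realized = set(realized)
    if not realized:
        raise ValueError("at least one category must be realized")
    unknown = realized - set(domain)
    if unknown:
        raise ValueError(f"categories not in domain: {sorted(unknown)}")
    weight = Fraction(1, len(realized))
    return {c: (weight if c in realized else Fraction(0)) for c in domain}


# Assumed domain: the categories seen in Table 2.1.
domain = ["electricity", "coal", "wood", "gas", "oil", "other"]

# Region 2 of Table 2.1: {electricity, oil, coal}.
modal_region2 = to_modal({"electricity", "oil", "coal"}, domain)
```

Exact fractions are used so that the probabilities sum to one without rounding error; floating‐point weights would serve equally well in practice.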

Table 2.2 Modal multi‐valued data: smoking deaths (images) (Example 2.2)

Proportion images of smoking deaths attributable to:
Region images Smoking Lung cancer Respiratory
1 0.628 0.184 0.188
2 0.623 0.202 0.175
3 0.650 0.197 0.153
4 0.626 0.209 0.165
5 0.690 0.160 0.150
6 0.631 0.204 0.165
7 0.648
8 1.000

2.2.3 Interval Data

A classical realization for quantitative data takes a point value on the real line $\mathbb{R}$. An interval‐valued realization takes values from a subset of $\mathbb{R}$. This is formally defined as follows.

There are numerous examples of naturally occurring symbolic data sets. One such scenario exists in the next example.

Table 2.3 Interval data: weather stations (Example 2.4)

Station   $Y_1$ = January    $Y_2$ = July    $Y_3$ = Elevation
1         [-18.4, -7.5]      [17.0, 26.5]     4.82
2         [-23.4, -15.5]     [12.9, 23.0]    14.78
3         [-8.4, 9.0]        [10.8, 23.2]    73.16
4         [10.0, 17.7]       [24.2, 33.8]     2.38
5         [11.5, 17.7]       [25.8, 33.5]     1.44
6         [11.8, 19.2]       [25.6, 32.6]     0.02

2.2.4 Histogram Data

Histogram data usually result from the aggregation of several values of quantitative random variables into a number of sub‐intervals. More formally, we have the following definition.

Usually, histogram sub‐intervals are closed at the left end and open at the right end except for the last sub‐interval, which is closed at both ends. Furthermore, note that the number of histogram sub‐intervals differs, in general, across observations and across variables. For the special case of a single sub‐interval, which then necessarily has relative frequency one, the histogram is an interval.

Table 2.4 Histogram data: flight times (Example 2.5)

Airline   $Y$ = Flight time
 1        {[40, 100), 0.082; [100, 180), 0.530; [180, 240), 0.172; [240, 320), 0.118; [320, 380], 0.098}
 2        {[40, 90), 0.171; [90, 140), 0.285; [140, 190), 0.351; [190, 240), 0.022; [240, 290], 0.171}
 3        {[35, 70), 0.128; [70, 135), 0.114; [135, 195), 0.424; [195, 255], 0.334}
 4        {[20, 40), 0.060; [40, 60), 0.458; [60, 80), 0.259; [80, 100), 0.117; [100, 120], 0.106}
 5        {[200, 250), 0.164; [250, 300), 0.395; [300, 350), 0.340; [350, 400], 0.101}
 6        {[25, 50), 0.280; [50, 75), 0.301; [75, 100), 0.250; [100, 125], 0.169}
 7        {[10, 50), 0.117; [50, 90), 0.476; [90, 130), 0.236; [130, 170], 0.171}
 8        {[10, 50), 0.069; [50, 90), 0.368; [90, 130), 0.514; [130, 170], 0.049}
 9        {[20, 35), 0.066; [35, 50), 0.337; [50, 65), 0.281; [65, 80), 0.208; [80, 95], 0.108}
10        {[20, 40), 0.198; [40, 60), 0.474; [60, 80), 0.144; [80, 100), 0.131; [100, 120], 0.053}

In the context of symbolic data methodology, the starting data are already in a histogram format. All data, including histogram data, can themselves be aggregated to form histograms (see section 2.4.4).
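To make the histogram format concrete, the following minimal sketch aggregates classical point values into a histogram realization, using the convention that sub‐intervals are closed on the left and open on the right, except for the last, which is closed at both ends. The function name `to_histogram` and the six flight times are hypothetical; the sub‐interval end points are those of the first airline in Table 2.4.

```python
from bisect import bisect_right


def to_histogram(values, edges):
    """Aggregate point values into [((lo, hi), relative frequency), ...].

    Sub-intervals are [lo, hi), except the last, which is [lo, hi].
    """
    if not values:
        raise ValueError("no values to aggregate")
    if len(edges) < 2 or any(lo >= hi for lo, hi in zip(edges, edges[1:])):
        raise ValueError("edges must be strictly increasing")
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v <= edges[-1]:
            raise ValueError(f"value {v} lies outside the histogram range")
        # The right-most end point belongs to the last (closed) sub-interval.
        k = len(counts) - 1 if v == edges[-1] else bisect_right(edges, v) - 1
        counts[k] += 1
    n = len(values)
    return [((lo, hi), c / n) for (lo, hi), c in zip(zip(edges, edges[1:]), counts)]


# Hypothetical flight times (minutes) binned on the airline 1 sub-intervals.
flight_times = [45, 60, 100, 150, 180, 380]
hist = to_histogram(flight_times, [40, 100, 180, 240, 320, 380])
```

Note that the value 100 falls in [100, 180) rather than [40, 100), and that 380 is counted in the final closed sub‐interval [320, 380].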

2.2.5 Other Types of Symbolic Data

A so‐called mixed data set is one in which not all of the images variables take the same format. Instead, some may be interval data, some histograms, some lists, etc.

Table 2.5 Mixed‐valued data: joggers (Example 2.6)

Group   $Y_1$ = Pulse rate   $Y_2$ = Running time
 1      [73, 114]            {[5.3, 6.2), 0.3; [6.2, 7.1), 0.5; [7.1, 8.3], 0.2}
 2      [70, 100]            {[5.5, 6.9), 0.4; [6.7, 8.0), 0.4; [8.0, 9.0], 0.2}
 3      [69, 91]             {[5.1, 6.6), 0.4; [6.6, 7.4), 0.4; [7.4, 7.8], 0.2}
 4      [59, 89]             {[3.7, 5.8), 0.6; [5.8, 6.3], 0.4}
 5      [61, 87]             {[4.5, 5.9), 0.4; [5.9, 6.2], 0.6}
 6      [69, 95]             {[4.1, 6.1), 0.5; [6.1, 6.9], 0.5}
 7      [65, 78]             {[2.4, 4.8), 0.3; [4.8, 5.7), 0.5; [5.7, 6.2], 0.2}
 8      [58, 83]             {[2.1, 5.4), 0.2; [5.4, 6.0), 0.5; [6.0, 6.9], 0.3}
 9      [79, 103]            {[4.8, 6.5), 0.3; [6.5, 7.4), 0.5; [7.4, 8.2], 0.2}
10      [40, 60]             {[3.2, 4.1), 0.6; [4.1, 6.7], 0.4}

Other types of symbolic data include probability density functions or cumulative distributions, as in the observations in Table 2.6(a), or models such as the time series models for the observations in Table 2.6(b).  

Table 2.6 Some other types of symbolic data

images Description of images
(a) 1 Distributed as a normal images
2 Distributed as a normal images
3 Distributed as exponential (images)
images images
(b) 5 Follows an AR(1) time‐series model
6 Follows a MAimages time‐series model
7 Is a first‐order Markov chain
images images

The modal multi‐valued data of section 2.2.2 and the histogram data of section 2.2.4 use probabilities as the weights of the categories and the histogram sub‐intervals; see Eqs. 2.2.2 and 2.2.4, respectively. While these weights are the most common seen by statistical analysts, there are other possible weights. First, let us define a more general weighted modal type of observation. We take the number of variables to be $p = 1$; generalization to $p \geq 1$ follows readily.

Thus, for a modal list or multi‐valued observation of Definition 2.2, the category images and the probability images. Likewise, for a histogram observation of Definition 2.4, the sub‐interval images occurs with relative frequency images, which corresponds to the weight images, images. Note, however, that in Definition 2.5 the condition images does not necessarily hold, unlike pure modal multi‐valued and histogram observations (see Eqs. 2.2.2 and 2.2.4, respectively). Thus, in these two cases, the weights images are probabilities or relative frequencies. The following definitions relate to situations when the weights do not necessarily sum to one. As before, images can differ from observation to observation.

More examples for these cases can be found in Diday (1995) and Billard and Diday (2006a). This book will restrict attention to modal list or multi‐valued data and histogram data cases. However, many of the methodologies in the remainder of the book apply equally to any weights images, including those for capacities, credibilities, possibilities, and necessities.

More theoretical aspects of symbolic data and concepts along with some philosophical aspects can be found in Billard and Diday (2006a, Chapter 2 ).

2.3 How do Symbolic Data Arise?

Symbolic data arise in a myriad of ways. One frequent source results when aggregating larger data sets according to some criteria, with the criteria usually driven by specific operational or scientific questions of interest.

For example, a medical data set may consist of millions of observations recording a slew of medical information for each individual for every visit to a healthcare facility since the year 1990 (say). There would be records of demographic variables (such as age, gender, weight, height, …), geographical information (such as street, city, county, state, country of residence, etc.), basic medical test results (such as pulse rate, blood pressure, cholesterol level, glucose, hemoglobin, hematocrit, …), and specific ailments (such as whether or not the patient has diabetes, a heart condition and if so what, i.e., mitral valve syndrome, congestive heart failure, arrhythmia, diverticulitis, myelitis, etc.). There would be information as to whether the patient had a heart attack (and the prognosis) or cancer symptoms (such as lung cancer, lymphoma, brain tumor, etc.). For given ailments, data would be recorded indicating when and what levels of treatments were applied and how often, and so on. The list of possible symptoms is endless. The pieces of information would in analytic terms be the variables (for which the number $p$ is also large), while the information for each individual for each visit to the healthcare facility would be an observation (where the number of observations $n$ in the data set can be extremely large). Trying to analyze this data set by traditional classical methods is likely to be too difficult to manage.

It is unlikely that the user of this data set, whether s/he be a medical insurer or researcher or maybe even the patient him/herself, is particularly interested in the data for a particular visit to the care provider on some specific date. Rather, interest would more likely center on a particular disease (angina, say), or respiratory diseases in a particular location (Lagos, say), and so on. Or, the focus may be on age × gender classes of patients, such as 26‐year‐old men or 35‐year‐old women, or maybe children (aged 17 years and under) with leukemia; again, the list is endless. In other words, the interest is in how characteristics differ between groups of individuals (also called classes or categories, but these categories should not be confused with the categories that make up the lists or multi‐valued types of data of sections 2.2.1 and 2.2.2).

However, when the researcher looks at the accumulated data for a specific group, 50‐year‐old men with angina living in the New England district (say), it is unlikely all such individuals weigh the same (or have the same pulse rate, or the same blood pressure measurement, etc.). Rather, thyroid measurements may take values along the lines of, e.g., images. These values could be aggregated into an interval to give images or they could be aggregated as a histogram realization (especially if there are many values being aggregated). In general, aggregating all the observations which satisfy a given group/class/category will perforce give realizations that are symbolic valued. In other words, these aggregations produce the so‐called second‐level observations of Diday (1987). As we shall see in section 2.4, taking the average of these values for use in a (necessarily) classical methodology will give an answer certainly, but also most likely that answer will not be correct.

Instead of a medical insurer's database, an automobile insurer would aggregate various entities (such as pay‐outs) depending on specific classes, e.g., age × gender of drivers or type of car (Volvo, Renault, Chevrolet, …), including car type by age and gender, or maybe categories of drivers (such as drivers of red convertibles). Statistical agencies publish their census results according to groups or categories of households. For example, salary data are published as ranges, i.e., as intervals in 1000s of $.

Let us illustrate this approach more concretely through the following example.

Most symbolic data sets will arise from these types of aggregations usually of large data sets but it can be aggregation of smaller data sets. A different situation can arise from some particular scientific question, regardless of the size of the data set. We illustrate this via a question regarding hospitalizations of cardiac patients, described more fully in Quantin et al. (2011).

There are numerous other situations which perforce are described by symbolic data. Species data are examples of naturally occurring symbolic data. Data with minimum and maximum values, such as the temperature data of Table 2.3, also occur as a somewhat natural way to record measurements of interest. Many stock market values are reported as high and low values daily (or weekly, monthly, annually). Pulse rates may more accurately be recorded as an interval around 64 rather than as the midpoint value of 64; blood pressure values are notorious for “bouncing around”, so that a given value of say 73 for diastolic blood pressure may more accurately be an interval around 73. Sensitive census data, such as age, may be given as an interval, and so on. There are countless examples.

A question that can arise after aggregation has occurred deals with the handling of outlier values. For example, suppose data aggregated into intervals produced an interval from a few specific values, one of which is 9 while the rest lie between 25 and 30. Or, better yet, suppose there were many, many observations between 25 and 30 along with the single value 9. In mathematical terms, our interval, after aggregation, can be formally written as $[a, b]$, where

$$a = \min_{x_i \in \mathcal{X}} x_i, \qquad b = \max_{x_i \in \mathcal{X}} x_i, \tag{2.3.1}$$

where $\mathcal{X}$ is the set of all $x_i$ values aggregated into the interval $[a, b]$. In this case, we obtain the interval $[9, 30]$. However, intuitively, we conclude that the value 9 is an outlier and really does not belong to the aggregations in the interval $[25, 30]$. Suppose instead of the value 9, we had a value 21, which, from Eq. 2.3.1, gives the interval $[21, 30]$. Now, it may not be at all clear if the value 21 is an outlier or if it truly belongs to the interval of aggregated values. Since most analyses involving interval data assume that observations within an interval are uniformly spread across that interval, the question becomes one of testing for uniformity across those intervals. Stéphan (1998), Stéphan et al. (2000), and Cariou and Billard (2015) have developed tests of uniformity, gap tests and distance tests, to help address this issue. They also give some reduction algorithms to achieve the deletion of genuine outliers.
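The effect of a single extreme value on the aggregated interval can be sketched as follows. This is a minimal illustration, assuming the interval is formed from the minimum and maximum of the aggregated values; the function name `aggregate_interval` and the six bulk values are hypothetical, chosen only to lie between 25 and 30.

```python
def aggregate_interval(values):
    """Aggregate classical point values into the interval (min, max)."""
    values = list(values)
    if not values:
        raise ValueError("no values to aggregate")
    return (min(values), max(values))


# Hypothetical point values lying between 25 and 30.
bulk = [25, 26, 27.5, 28, 29, 30]

without_outlier = aggregate_interval(bulk)        # bulk only
with_outlier = aggregate_interval(bulk + [9])     # one extreme value added
borderline = aggregate_interval(bulk + [21])      # a less clear-cut case
```

A single value of 9 stretches the interval from (25, 30) to (9, 30), more than quadrupling its length. The sketch does not attempt the uniformity, gap, or distance tests cited above; it only shows why such tests are needed.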

2.4 Descriptive Statistics

In this section, basic descriptive statistics, such as sample means, sample variances and covariances, and histograms, for the differing types of symbolic data are briefly described. For quantitative data, these definitions implicitly assume that within each interval, or sub‐interval for histogram observations, observations are uniformly spread across that interval. Expressions for the sample mean and sample variance for interval data were first derived by Bertrand and Goupil (2000). Adjustments for non‐uniformity can be made. For list multi‐valued data, the sample mean and sample variance given herein are simply the respective classical values for the probabilities associated with each of the corresponding categories in the variable domain.

2.4.1 Sample Means

2.4.2 Sample Variances

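Let us consider Eq. 2.4.5 more carefully. For interval observations $Y_u = [a_u, b_u]$, $u = 1, \dots, m$, it can be shown that the total sum of squares (SS), Total SS, i.e., $mS^2$, can be written as

$$mS^2 = \sum_{u=1}^{m} \left[\frac{a_u + b_u}{2} - \bar{Y}\right]^2 + \sum_{u=1}^{m} \frac{(a_u - \bar{Y}_u)^2 + (a_u - \bar{Y}_u)(b_u - \bar{Y}_u) + (b_u - \bar{Y}_u)^2}{3} \tag{2.4.6}$$

where $\bar{Y}$ is the overall mean of Eq. 2.4.2, and where the sample mean of the observation $Y_u$ is

$$\bar{Y}_u = \frac{a_u + b_u}{2}. \tag{2.4.7}$$

The term inside the second summation in Eq. 2.4.6 equals $S^2$ given in Eq. 2.4.5 when $m = 1$. That is, this is a measure of the internal variation, the internal variance, of the single observation $Y_u$. When summed over all such observations, $u = 1, \dots, m$, we obtain the internal variation of all $m$ observations; we call this the Within SS. To illustrate, suppose we have a single observation $Y = [a, b]$ with $a < b$. Then, substituting into Eq. 2.4.5, we obtain the sample variance as $S^2 = (b - a)^2/12 \neq 0$, i.e., interval observations each contain internal variation. The first term in Eq. 2.4.6 is the variation of the interval midpoints across all observations, i.e., the Between SS.

Hence, we can write

$$\text{Total SS} = \text{Between SS} + \text{Within SS} \tag{2.4.8}$$

where

$$\text{Between SS} = \sum_{u=1}^{m} \left[\frac{a_u + b_u}{2} - \bar{Y}\right]^2, \tag{2.4.9}$$

$$\text{Within SS} = \sum_{u=1}^{m} \frac{(a_u - \bar{Y}_u)^2 + (a_u - \bar{Y}_u)(b_u - \bar{Y}_u) + (b_u - \bar{Y}_u)^2}{3}. \tag{2.4.10}$$

By assuming that values across an interval are uniformly spread across the interval, we see that the Within SS can also be obtained from

$$\text{Within SS} = \sum_{u=1}^{m} \int_{a_u}^{b_u} \frac{(y - \bar{Y}_u)^2}{b_u - a_u}\, dy = \sum_{u=1}^{m} \frac{(b_u - a_u)^2}{12}.$$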

Therefore, researchers, who upon aggregation of sets of classical data restrict their analyses to the average of the symbolic observation (such as interval means) are discarding important information; they are ignoring the internal variations (i.e., the Within SS) inherent to their data.

When the data are classically valued, with interval end points $a_u = b_u$ for every observation, then $\bar{Y}_u = a_u$ and hence the Within SS of Eq. 2.4.10 is zero and the Between SS of Eq. 2.4.9 is the same as the Total SS for classical data. Hence, the sample variance of Eq. 2.4.5 for interval data reduces to its classical counterpart for classical point data, as it should.
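The decomposition of the Total SS into Between SS and Within SS can be checked numerically. The following is a minimal sketch, assuming the Bertrand and Goupil (2000) form of the interval sample variance, $S^2 = \frac{1}{3m}\sum_u (a_u^2 + a_u b_u + b_u^2) - \frac{1}{4m^2}\left[\sum_u (a_u + b_u)\right]^2$, and a uniform spread within each interval, so that each interval contributes $(b_u - a_u)^2/12$ to the Within SS. The function name `interval_ss` is hypothetical; the pulse rate intervals are the first three groups of Table 2.5.

```python
import math


def interval_ss(intervals):
    """Return (Total SS, Between SS, Within SS) for a list of (a, b) intervals."""
    m = len(intervals)
    mids = [(a + b) / 2 for a, b in intervals]
    ybar = sum(mids) / m  # overall sample mean of the interval data
    # Total SS = m * S^2 under the assumed Bertrand-Goupil variance.
    total = sum((a * a + a * b + b * b) / 3 for a, b in intervals) - m * ybar**2
    between = sum((c - ybar) ** 2 for c in mids)          # spread of midpoints
    within = sum((b - a) ** 2 / 12 for a, b in intervals)  # internal variation
    return total, between, within


# Pulse rate intervals for groups 1 to 3 of Table 2.5.
total, between, within = interval_ss([(73, 114), (70, 100), (69, 91)])
decomposition_holds = math.isclose(total, between + within)

# Classical point data written as degenerate intervals: no internal variation.
c_total, c_between, c_within = interval_ss([(5, 5), (7, 7), (9, 9)])
```

For the degenerate intervals the Within SS vanishes and the Total SS equals the classical sum of squares about the mean, consistent with the reduction to the classical case noted above.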

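As for intervals, we can show that the total variation for histogram data consists of two parts, as in Eq. 2.4.8. For histogram observations with sub‐intervals $[a_{uk}, b_{uk})$ and relative frequencies $p_{uk}$, $k = 1, \dots, s_u$, $u = 1, \dots, m$, its components are given, respectively, by

$$\text{Between SS} = \sum_{u=1}^{m} \left[\bar{Y}_u - \bar{Y}\right]^2, \tag{2.4.12}$$

$$\text{Within SS} = \sum_{u=1}^{m} \sum_{k=1}^{s_u} p_{uk}\, \frac{(a_{uk} - \bar{Y}_u)^2 + (a_{uk} - \bar{Y}_u)(b_{uk} - \bar{Y}_u) + (b_{uk} - \bar{Y}_u)^2}{3}, \tag{2.4.13}$$

with

$$\bar{Y}_u = \frac{1}{2} \sum_{k=1}^{s_u} p_{uk}\,(a_{uk} + b_{uk}). \tag{2.4.14}$$

It is readily seen that for the special case of interval data, where now $s_u = 1$ and hence $p_{u1} = 1$ for all $u$, the histogram sample variance of Eq. 2.4.11 reduces to the interval sample variance of Eq. 2.4.5.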
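The same decomposition can be sketched for histogram observations. The code below is an illustration only, assuming that values are uniformly spread within each sub‐interval, that the mean of a histogram observation is the probability‐weighted average of its sub‐interval midpoints, and that the total sum of squares is $\sum_u \sum_k p_{uk}(a_{uk}^2 + a_{uk} b_{uk} + b_{uk}^2)/3 - m\bar{Y}^2$. The function name `histogram_ss` and the triple layout `(a, b, p)` are hypothetical; the data are the running time histograms of groups 4 and 5 in Table 2.5.

```python
import math


def histogram_ss(observations):
    """Return (Total SS, Between SS, Within SS).

    Each observation is a list of (a, b, p) triples: sub-interval end
    points and the relative frequency of that sub-interval.
    """
    m = len(observations)
    # Mean of each histogram: weighted average of sub-interval midpoints.
    means = [sum(p * (a + b) / 2 for a, b, p in h) for h in observations]
    ybar = sum(means) / m
    between = sum((yu - ybar) ** 2 for yu in means)
    within = sum(
        p * ((a - yu) ** 2 + (a - yu) * (b - yu) + (b - yu) ** 2) / 3
        for h, yu in zip(observations, means)
        for a, b, p in h
    )
    total = (
        sum(p * (a * a + a * b + b * b) / 3 for h in observations for a, b, p in h)
        - m * ybar**2
    )
    return total, between, within


# Running time histograms for groups 4 and 5 of Table 2.5.
running = [
    [(3.7, 5.8, 0.6), (5.8, 6.3, 0.4)],
    [(4.5, 5.9, 0.4), (5.9, 6.2, 0.6)],
]
h_total, h_between, h_within = histogram_ss(running)

# A histogram with one sub-interval of probability one is just an interval.
i_total, i_between, i_within = histogram_ss([[(7, 13, 1.0)]])
```

For the single sub‐interval case the Within SS equals $(13 - 7)^2/12 = 3$, which agrees with the interval result and illustrates the reduction from histogram to interval data.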

2.4.3 Sample Covariance and Correlation

When the number of variables $p \geq 2$, it is of interest to obtain measures of how these variables depend on each other. One such measure is the covariance. We note that for modal data it is necessary to know the corresponding probabilities for each pair of cross‐sub‐intervals in order to calculate the covariances. This is not an issue for interval data since there is only one possible cross‐interval/rectangle for each observation.

As for the variance, we can show that the sum of products (SP) satisfies

(2.4.16)equation

where

(2.4.17)equation
(2.4.18)equation
(2.4.19)equation

with images obtained from Eq. 2.4.2.

As for the variance, we can show that Eq. 2.4.16 holds where now

(2.4.21)equation
(2.4.22)equation
(2.4.23)equation

with images obtained from Eq. 2.4.3.

2.4.4 Histograms

Brief descriptions of the construction of a histogram based on interval data and on histogram data, respectively, are presented here. More complete details and examples can be found in Billard and Diday (2006a).

Table 2.12 Airlines joint histogram $(Y_1, Y_2)$ (Example 2.16). $Y_1$ = flight time in minutes, $Y_2$ = arrival delay time in minutes

images images images images images
images images images images images images images images images images images images images images images
images images 0.0246 images images 0.0113 images images 0.0143 images images 0.0062 images images 0.0808
images 0.1068 images 0.0676 images 0.0297 images 0.0412 images 0.0874
images 0.0867 images 0.0218 images 0.0132 images 0.0075 images 0.0220
images 0.0293 images 0.0166 images 0.0116 images 0.0106 images 0.0080
images 0.0328 images images 0.0689 images images 0.0388 images images 0.0301 images images 0.0714
images images 0.0215 images 0.2293 images 0.1725 images 0.1950 images 0.2836
images 0.1013 images 0.0976 images 0.1047 images 0.0674 images 0.0933
images 0.0921 images 0.0802 images 0.0521 images 0.0443 images 0.0255
images 0.0398 images images 0.0336 images images 0.0535 images images 0.0182 images images 0.0172
images 0.0463 images 0.1011 images 0.1943 images 0.1503 images 0.0747
images images 0.0070 images 0.0562 images 0.1726 images 0.0700 images 0.0390
images 0.0677 images 0.0449 images 0.0941 images 0.0430 images 0.0130
images 0.0925 images images 0.0344 images images 0.0023 images images 0.0306 images images 0.0288
images 0.0377 images 0.0711 images 0.0097 images 0.1126 images 0.0626
images 0.0449 images 0.0418 images 0.0186 images 0.0381 images 0.0314
images images 0.0123 images 0.0235 images 0.0181 images 0.0270 images 0.0083
images 0.0420 images images 0.0027 images images 0.0045
images 0.0558 images 0.0585 images 0.0210
images 0.0258 images 0.0301 images 0.0217
images 0.0330 images 0.0164 images 0.0057
images images images images images
images images images images images images images images images images images images images images images
images images 0.0079 images images 0.0011 images images 0.0152 images images 0.0113 images images 0.0120
images 0.0421 images 0.0675 images 0.0675 images 0.0249 images 0.0737
images 0.0147 images 0.0660 images 0.0262 images 0.0117 images 0.0556
images 0.0095 images 0.0363 images 0.0193 images 0.0120 images 0.0226
images 0.0076 images images 0.0119 images images 0.0179 images images 0.0652 images images 0.0226
images images 0.0293 images 0.1259 images 0.0537 images 0.2258 images 0.1368
images 0.2793 images 0.1008 images 0.0207 images 0.0964 images 0.1729
images 0.1216 images 0.0463 images 0.0220 images 0.0709 images 0.0632
images 0.0568 images images 0.0197 images images 0.0758 images images 0.0180 images images 0.0135
images 0.0435 images 0.1173 images 0.1860 images 0.1096 images 0.1398
images images 0.0095 images 0.1251 images 0.0978 images 0.0791 images 0.1474
images 0.0798 images 0.0886 images 0.0647 images 0.0520 images 0.0391
images 0.0473 images images 0.0019 images images 0.0468 images images 0.0050 images images 0.0195
images 0.0189 images 0.0038 images 0.1708 images 0.0239 images 0.0481
images 0.0161 images 0.0048 images 0.0826 images 0.0472 images 0.0331
images images 0.0202 images 0.0119 images 0.0331 images 0.0413
images 0.0656 images images 0.0314 images images 0.0050
images 0.0211 images 0.0868 images 0.0195
images 0.0065 images 0.0400 images 0.0378
images 0.0046 images 0.0130 images 0.0435
images images 0.0082
images 0.0446
images 0.0314
images 0.0077
images 0.0063

Table 2.13 Airlines joint histogram $(Y_1, Y_3)$ (Example 2.16). $Y_1$ = flight time in minutes, $Y_3$ = departure delay time in minutes

images images images images images
images images images images images images images images images images images images images images images
images images 0.0835 images images 0.0227 images images 0.0165 images images 0.0106 images images 0.0636
images images images 0.0532 images 0.0114 images 0.0319 images 0.0924
images 0.0381 images 0.0292 images 0.0242 images 0.0111 images 0.0423
images 0.0547 images 0.0122 images 0.0089 images 0.0120 images images 0.0978
images 0.0273 images images 0.1238 images 0.0079 images images 0.0550 images 0.2361
images images 0.0703 images 0.1844 images images 0.0410 images 0.1543 images 0.1399
images 0.0789 images 0.1024 images 0.0896 images 0.0718 images images 0.0312
images 0.0459 images 0.0654 images 0.1572 images 0.0559 images 0.0527
images 0.0681 images images 0.0436 images 0.0454 images images 0.0417 images 0.0600
images 0.0377 images 0.0889 images 0.0347 images 0.1246 images images 0.0255
images images 0.0525 images 0.0628 images images 0.0583 images 0.0705 images 0.0546
images 0.0550 images 0.0405 images 0.0896 images 0.0448 images 0.0510
images 0.0472 images images 0.0471 images 0.2377 images images 0.0470 images images 0.0116
images 0.0595 images 0.0615 images 0.0753 images 0.0971 images 0.0201
images 0.0357 images 0.0436 images 0.0535 images 0.0368 images 0.0213
images images 0.0451 images 0.0187 images images 0.0066 images 0.0275
images 0.0377 images 0.0068 images images 0.0230
images 0.0297 images 0.0181 images 0.0505
images 0.0316 images 0.0088 images 0.0230
images 0.0248 images 0.0084 images 0.0111
images images images images images
images images images images images images images images images images images images images images images
images images 0.0115 images images 0.0214 images images 0.0496 images images 0.0072 images images 0.0135
images 0.0227 images 0.0394 images 0.0551 images 0.0110 images 0.0526
images 0.0151 images 0.0499 images 0.0234 images 0.0154 images images
images 0.0188 images 0.0534 images images 0.0413 images 0.0186 images 0.0466
images 0.0136 images 0.0069 images 0.0427 images 0.0076 images images 0.0662
images images 0.0714 images images 0.0252 images 0.0303 images images 0.0976 images images
images [0.1640 images 0.0648 images images 0.0661 images 0.1411 images 0.1143
images 0.1195 images 0.0884 images 0.1887 images 0.0951 images images
images 0.1121 images 0.0905 images 0.1694 images 0.0784 images images 0.0541
images 0.0634 images 0.0159 images images 0.1019 images 0.0460 images 0.0887
images images 0.0194 images images 0.0358 images 0.1639 images images 0.0611 images 0.1158
images 0.0435 images 0.0897 images 0.0675 images 0.0721 images 0.0812
images 0.0468 images 0.1167 images 0.0444 images images 0.0256
images 0.0377 images 0.0947 images 0.0450 images 0.0361
images 0.0241 images 0.0138 images 0.0359 images 0.0271
images images 0.0216 images images 0.0019 images images 0.0148 images 0.0120
images 0.0432 images 0.0044 images 0.0202
images 0.0278 images 0.0090 images 0.0302
images 0.0170 images 0.0063 images 0.0265
images 0.0084 images 0.0008 images 0.0258
images images 0.0104 images images 0.0289 images images 0.0192
images 0.0292 images 0.0633 images 0.0183
images 0.0295 images 0.0413 images 0.0195
images 0.0210 images 0.0314 images 0.0214
images 0.0082 images 0.0063 images 0.0274

Table 2.14 Airlines joint histogram (images) (Example 2.16). images = arrival delay time in minutes, images = departure delay time in minutes

images images images images images
images images images images images images images images images images images images images images images
images images 0.0484 images images 0.0667 images images 0.0469 images images 0.0434 images images 0.1011
images 0.0139 images 0.0693 images 0.0343 images 0.0443 images 0.0912
images 0.0031 images 0.0122 images 0.0275 images images 0.1157 images 0.0104
images images 0.1349 images images 0.1369 images 0.0002 images 0.3409 images images 0.1101
images 0.1277 images 0.2345 images images 0.0606 images 0.1006 images 0.2935
images 0.0476 images 0.0968 images 0.1156 images 0.0004 images 0.1257
images 0.0076 images 0.0009 images 0.2191 images images 0.0155 images images 0.0163
images images 0.0601 images images 0.0283 images 0.0107 images 0.0665 images 0.0640
images 0.0898 images 0.0702 images images 0.0129 images 0.0962 images 0.1271
images 0.0865 images 0.0920 images 0.0447 images 0.0350 images images 0.0021
images 0.0904 images 0.0270 images 0.1714 images images 0.0027 images 0.0071
images 0.0002 images images 0.0052 images 0.0787 images 0.0066 images 0.0513
images images 0.0064 images 0.0139 images 0.0014 images 0.0164
images 0.0129 images 0.0371 images images 0.0020 images 0.1157
images 0.0197 images 0.1090 images 0.0029
images 0.0824 images 0.0191
images 0.0113 images 0.0488
images images 0.0016 images 0.1030
images 0.0039
images 0.0041
images 0.0334
images 0.1140
images images images images images
images images images images images images images images images images images images images images images
images images 0.0267 images images 0.0220 images images 0.0689 images images 0.0652 images images 0.0135
images 0.0331 images 0.0314 images 0.0813 images 0.0318 images 0.0165
images 0.0145 images 0.0107 images 0.0055 images 0.0076 images 0.0165
images 0.0008 images 0.0019 images images 0.1377 images images 0.1090 images 0.0015
images images 0.0904 images images 0.0691 images 0.2562 images 0.1729 images images 0.0917
images 0.2129 images 0.1601 images 0.0840 images 0.1043 images 0.1398
images 0.1545 images 0.1446 images images 0.0510 images 0.0176 images 0.1023
images 0.0533 images 0.0275 images 0.0978 images images 0.0230 images 0.0361
images 0.0003 images images 0.0199 images 0.0785 images 0.0523 images images 0.0496
images images 0.0147 images 0.0627 images images 0.0014 images 0.0828 images 0.1353
images 0.0486 images 0.1224 images 0.0152 images 0.1128 images 0.1564
images 0.0604 images 0.1310 images 0.1226 images 0.0013 images 0.0827
images 0.1069 images 0.0006 images images 0.0028 images images 0.0045
images 0.0055 images images 0.0023 images 0.0057 images 0.0226
images images 0.0025 images 0.0075 images 0.0101 images 0.0331
images 0.0068 images 0.0275 images 0.0595 images 0.0977
images 0.0074 images 0.1157 images 0.1414
images 0.0378 images 0.0430
images 0.0448
images images 0.0002
images 0.0013
images 0.0019
images 0.0077
images 0.0670

2.5 Other Issues

There are very few theoretical results underpinning the methodologies pertaining to symbolic data. Some theory justifying the weights associated with modal valued observations, such as capacities, credibilities, necessities, possibilities, and probabilities (briefly described in Definitions 2.6–2.9), can be found in Diday (1995) and Diday and Emilion (2003). These concepts include the union and intersection probabilities of Chapter , and are the only choice which gives Galois field sets. Their results embody Choquet (1954) capacities. Other Galois field theory supporting classification and clustering ideas can be found in Brito and Polaillon (2005).

The descriptive statistics described in section 2.4 are empirically based and are usually moment estimators for the underlying means, variances, and covariances. Le‐Rademacher and Billard (2011) have shown that these estimators for the mean and variance of interval data in Eqs. (2.4.2) and (2.4.5), respectively, are the maximum likelihood estimators under reasonable distributional assumptions; likewise, Xu (2010) has shown the moment estimator for the covariance in Eq. (2.4.15) is also the maximum likelihood estimator. These derivations involve separating out the overall distribution from the internal distribution within the intervals, and then invoking conditional moment theory in conjunction with standard maximum likelihood theory. The current work assumed the overall distribution to be a normal distribution with the internal variations following appropriately defined conjugate distributions. Implicit in the formulation of these estimators for the mean, variance, and covariance is the assumption that the points inside a given interval are uniformly spread across the intervals. Clearly, this uniformity assumption can be changed. Le‐Rademacher and Billard (2011), Billard (2008), Xu (2010), and Billard et al. (2016) discuss how these changes can be effected, illustrating with an internal triangular distribution. There is a lot of foundational work that still needs to be done here.
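As a minimal sketch of the moment estimators discussed above, the following Python code computes a sample mean and variance for interval data when the points inside each interval are taken to be uniformly spread. It assumes that Eqs. (2.4.2) and (2.4.5) take the usual Bertrand and Goupil form, with divisor equal to the number of observations; the equations themselves are not reproduced in this section, so this form is an assumption. The function names and the toy intervals are hypothetical and serve only as illustration.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (a, b) with a <= b


def interval_mean(data: Sequence[Interval]) -> float:
    """Mean of interval data: average of the interval midpoints (a + b) / 2."""
    m = len(data)
    return sum(a + b for a, b in data) / (2 * m)


def interval_variance(data: Sequence[Interval]) -> float:
    """Variance of interval data under within-interval uniformity.

    For a uniform distribution on [a, b], E[Y^2] = (a^2 + a*b + b^2) / 3,
    so the variance is the average of these second moments minus the
    squared overall mean (assumed divisor: m, not m - 1).
    """
    m = len(data)
    second_moment = sum(a * a + a * b + b * b for a, b in data) / (3 * m)
    return second_moment - interval_mean(data) ** 2


# Hypothetical toy data: three interval-valued observations.
toy = [(1.0, 3.0), (2.0, 6.0), (4.0, 5.0)]
print(interval_mean(toy), interval_variance(toy))
```

Replacing the uniformity assumption by another internal distribution, such as the triangular distribution mentioned above, would change only the within-interval second moment used in `interval_variance`.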

By and large, methodologies and statistics seem to be intuitively correct when they correspond to their classical counterparts. To date, however, they are not generally rigorously justified theoretically. One governing validity criterion is that methods developed for symbolic data must produce the corresponding classical results when applied to the special case of classical data.
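The validity criterion can be checked numerically in simple cases. The sketch below treats each classical value as a degenerate interval with equal endpoints and compares the interval mean and variance with their classical counterparts. It assumes the Bertrand and Goupil form of the interval estimators with divisor equal to the number of observations, so the comparison is with the population variance; the function names and the sample values are hypothetical.

```python
import math
import statistics


def interval_mean(data):
    """Average of interval midpoints (assumed form of the interval mean)."""
    m = len(data)
    return sum(a + b for a, b in data) / (2 * m)


def interval_variance(data):
    """Interval variance under within-interval uniformity (assumed divisor m)."""
    m = len(data)
    second_moment = sum(a * a + a * b + b * b for a, b in data) / (3 * m)
    return second_moment - interval_mean(data) ** 2


# Hypothetical classical observations, recast as degenerate intervals [y, y].
points = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
degenerate = [(y, y) for y in points]

# With a = b, the symbolic statistics should coincide with the classical ones.
mean_matches = math.isclose(interval_mean(degenerate), statistics.mean(points))
var_matches = math.isclose(
    interval_variance(degenerate), statistics.pvariance(points)
)
print(mean_matches, var_matches)
```

A check of this kind is only a necessary condition: agreement on degenerate intervals does not by itself justify a method for genuinely interval-valued observations.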

Exercises

  1. 2.1 Show that the histogram data of Table 2.4 for images = flight time have the sample statistics images, images and images.
  2. 2.2 Refer to Example 2.16 and use the data of Tables 2.12–2.14 for all images airlines for images = AirTime, images = ArrDelay, and images = DepDelay. Show that the sample statistics are images, images, images, images, images, images, images, images, images, images, images, images.
    Table 2.15 Airlines joint histogram (images) (Exercise 3). images = flight time in minutes, images = arrival delay time in minutes
    images images images images images
    images images images images images images images images images images images images images images images
    images images 0.0246 images images 0.0113 images images 0.0032 images images 0.0062 images images 0.0808
    images 0.1068 images 0.0676 images 0.0408 images 0.0412 images 0.0940
    images 0.0867 images 0.0218 images 0.0132 images 0.0075 images 0.0154
    images 0.0293 images 0.0166 images 0.0116 images 0.0106 images 0.0080
    images 0.0328 images images 0.0689 images images 0.0088 images images 0.0301 images images 0.0714
    images images 0.0215 images 0.2293 images 0.2025 images 0.1950 images 0.3155
    images 0.1013 images 0.0976 images 0.1047 images 0.0674 images 0.0614
    images 0.0921 images 0.0802 images 0.0521 images 0.0443 images 0.0255
    images 0.0398 images images 0.0336 images images 0.0157 images images 0.0182 images images 0.0172
    images 0.0463 images 0.1011 images 0.2320 images 0.1503 images 0.0846
    images images 0.0070 images 0.0562 images 0.1726 images 0.0700 images 0.0291
    images 0.0677 images 0.0449 images 0.0941 images 0.0430 images 0.0130
    images 0.0925 images images 0.0344 images images 0.0005 images images 0.0306 images images 0.0288
    images 0.0377 images 0.0711 images 0.0114 images 0.1126 images 0.0709
    images 0.0449 images 0.0418 images 0.0186 images 0.0381 images 0.0232
    images images 0.0123 images 0.0235 images 0.0181 images 0.0270 images 0.0083
    images 0.0420 images images 0.0027 images images 0.0045
    images 0.0558 images 0.0585 images 0.0279
    images 0.0258 images 0.0301 images 0.0149
    images 0.0330 images 0.0164 images 0.0057
    images images 0.0079 images images 0.0011 images images 0.0193 images images 0.0022 images images 0.0120
    images 0.0421 images 0.0675 images 0.0634 images 0.0340 images 0.0737
    images 0.0147 images 0.0660 images 0.0262 images 0.0117 images 0.0556
    images 0.0074 images 0.0363 images 0.0096 images 0.0120 images 0.0226
    images 0.0046 images images 0.0119 images 0.0096 images images 0.0110 images images 0.0226
    images 0.0050 images 0.1259 images images 0.0179 images 0.2800 images 0.1368
    images images 0.0293 images 0.1008 images 0.0537 images 0.0964 images 0.1729
    images 0.2793 images 0.0463 images 0.0207 images 0.0709 images 0.0632
    images 0.1216 images images 0.0197 images 0.0138 images images 0.0013 images images 0.0135
    images 0.0462 images 0.1173 images 0.0083 images 0.1263 images 0.1398
    images 0.0238 images 0.1251 images images 0.0813 images 0.0791 images 0.1474
    images 0.0303 images 0.0886 images 0.1804 images 0.0520 images 0.0391
    images images 0.0095 images images 0.0019 images 0.0978 images images 0.0003 images images 0.0000
    images 0.0798 images 0.0038 images 0.0234 images 0.0287 images 0.0195
    images 0.0473 images 0.0048 images 0.0413 images 0.0472 images 0.0481
    images 0.0156 images 0.0119 images images 0.0565 images 0.0413 images 0.0331
    images 0.0079 images images 0.0314 images 0.1612 images images 0.0013
    images 0.0115 images 0.0868 images 0.0826 images 0.0233
    images images 0.0202 images 0.0400 images 0.0207 images 0.0378
    images 0.0656 images 0.0130 images 0.0124 images 0.0435
    images 0.0211
    images 0.0062
    images 0.0013
    images 0.0036
    images images 0.0082
    images 0.0446
    images 0.0314
    images 0.0065
    images 0.0030
    images 0.0046

    Table 2.16 Airlines joint histogram (images) (Exercise 3). images = flight time in minutes, images = departure delay time in minutes

    images images images images images
    images images images images images images images images images images images images images images images
    images images 0.0835 images images 0.0227 images images 0.0165 images images 0.0106 images images 0.0636
    images 0.0767 images 0.0532 images 0.0114 images 0.0319 images 0.0924
    images 0.0381 images 0.0292 images 0.0224 images 0.0111 images 0.0423
    images 0.0547 images 0.0122 images 0.0107 images 0.0120 images images 0.0978
    images 0.0273 images images 0.1238 images 0.0079 images images 0.0550 images 0.2361
    images images 0.0703 images 0.1844 images images 0.0410 images 0.1543 images 0.1399
    images 0.0789 images 0.1024 images 0.0896 images 0.0718 images images 0.0312
    images 0.0459 images 0.0654 images 0.1447 images 0.0559 images 0.0527
    images 0.0681 images images 0.0436 images 0.0580 images images 0.0417 images 0.0600
    images 0.0377 images 0.0889 images 0.0347 images 0.1246 images images 0.0255
    images images 0.0525 images 0.0628 images images 0.0583 images 0.0705 images 0.0546
    images 0.0550 images 0.0405 images 0.0896 images 0.0448 images 0.0510
    images 0.0472 images images 0.0471 images 0.2165 images images 0.0470 images images 0.0116
    images 0.0595 images 0.0615 images 0.0966 images 0.0971 images 0.0201
    images 0.0357 images 0.0436 images 0.0535 images 0.0368 images 0.0213
    images images 0.0451 images 0.0187 images images 0.0066 images 0.0275
    images 0.0377 images 0.0068 images images 0.0230
    images 0.0297 images 0.0161 images 0.0505
    images 0.0316 images 0.0107 images 0.0230
    images 0.0248 images 0.0084 images 0.0111
    images images 0.0115 images images 0.0214 images images 0.0496 images images 0.0072 images images 0.0135
    images 0.0227 images 0.0394 images 0.0551 images 0.0110 images 0.0526
    images 0.0151 images 0.0499 images 0.0234 images 0.0154 images 0.0511
    images 0.0188 images 0.0534 images images 0.0413 images 0.0186 images 0.0466
    images 0.0136 images 0.0069 images 0.0427 images 0.0076 images images 0.0662
    images images 0.0714 images images 0.0252 images 0.0303 images images 0.0976 images 0.1368
    images 0.1640 images 0.0648 images images 0.0661 images 0.1411 images 0.1143
    images 0.1195 images 0.0884 images 0.1887 images 0.0951 images 0.0782
    images 0.1121 images 0.0905 images 0.1694 images 0.0784 images images 0.0541
    images 0.0634 images 0.0159 images images 0.1019 images 0.0460 images 0.0887
    images images 0.0194 images images 0.0358 images 0.1639 images images 0.0611 images 0.1158
    images 0.0435 images 0.0897 images 0.0675 images 0.0721 images 0.0812
    images 0.0468 images 0.1167 images 0.0444 images images 0.0256
    images 0.0377 images 0.0947 images 0.0450 images 0.0361
    images 0.0241 images 0.0138 images 0.0359 images 0.0271
    images images 0.0216 images images 0.0019 images images 0.0148 images 0.0120
    images 0.0432 images 0.0044 images 0.0202
    images 0.0278 images 0.0090 images 0.0302
    images 0.0170 images 0.0063 images 0.0265
    images 0.0084 images 0.0008 images 0.0258
    images images 0.0104 images images 0.0289 images images 0.0192
    images 0.0292 images 0.0633 images 0.0183
    images 0.0295 images 0.0413 images 0.0195
    images 0.0210 images 0.0314 images 0.0214
    images 0.0082 images 0.0063 images 0.0274

    Table 2.17 Airlines joint histogram (images) (Exercise 3). images = arrival delay time in minutes, images = departure delay time in minutes

    images images images images images
    images images images images images images images images images images images images images images images
    images images 0.0277 images images 0.1116 images images 0.0469 images images 0.0434 images images 0.1011
    images 0.0328 images 0.1373 images 0.0343 images 0.0443 images 0.0912
    images 0.0049 images 0.0275 images 0.0275 images images 0.1157 images 0.0104
    images images 0.0595 images images 0.0920 images 0.0002 images 0.3409 images images 0.1101
    images 0.1837 images 0.1665 images images 0.0606 images 0.1006 images 0.2935
    images 0.0718 images 0.0815 images 0.1156 images 0.0004 images 0.1257
    images 0.0029 images 0.0009 images 0.2191 images images 0.0155 images images 0.0163
    images images 0.0273 images images 0.0283 images 0.0107 images 0.0665 images 0.0640
    images 0.0960 images 0.0702 images images 0.0129 images 0.0962 images 0.1271
    images 0.1171 images 0.0920 images 0.0447 images 0.0350 images images 0.0021
    images 0.0457 images 0.0270 images 0.1714 images images 0.0027 images 0.0071
    images images 0.0037 images images 0.0052 images 0.0787 images 0.0066 images 0.0513
    images 0.0207 images 0.0139 images 0.0014 images 0.0164
    images 0.0459 images 0.0371 images images 0.0020 images 0.1157
    images 0.1007 images 0.1090 images 0.0025
    images 0.0027 images 0.0168
    images images 0.0006 images 0.0463
    images 0.0037 images 0.0504
    images 0.0076 images images 0.0004
    images 0.0515 images 0.0023
    images 0.0937 images 0.0025
    images 0.0526
    images images 0.0267 images images 0.0220 images images 0.0702 images images 0.0652 images images 0.0135
    images 0.0331 images 0.0314 images 0.0829 images 0.0318 images 0.0165
    images 0.0145 images 0.0107 images 0.0056 images 0.0076 images 0.0165
    images 0.0008 images 0.0019 images images 0.1320 images images 0.1090 images 0.0015
    images images 0.0904 images images 0.0691 images 0.2528 images 0.1729 images images 0.0917
    images 0.2129 images 0.1601 images 0.0829 images 0.1043 images 0.1398
    images 0.1545 images 0.1446 images images 0.0520 images 0.0176 images 0.1023
    images 0.0533 images 0.0275 images 0.0997 images images 0.0230 images 0.0361
    images 0.0003 images images 0.0199 images 0.0801 images 0.0523 images images 0.0496
    images images 0.0147 images 0.0627 images images 0.0014 images 0.0828 images 0.1353
    images 0.0486 images 0.1224 images 0.0154 images 0.1128 images 0.1564
    images 0.0604 images 0.1310 images 0.1250 images 0.0013 images 0.0827
    images 0.1069 images 0.0006 images images 0.0028 images images 0.0045
    images 0.0055 images images 0.0023 images 0.0057 images 0.0226
    images images 0.0025 images 0.0075 images 0.0101 images 0.0331
    images 0.0068 images 0.0275 images 0.0595 images 0.0977
    images 0.0074 images 0.1157 images 0.1414
    images 0.0378 images 0.0430
    images 0.0448
    images images 0.0002
    images 0.0013
    images 0.0019
    images 0.0077
    images 0.0670
  3. 2.3 Consider the airline data with joint distributions in Tables 2.15–2.17.
    1. Using these tables, calculate the sample statistics images, images, images and images, images.
    2. How do the statistics of part 1 differ from those of Exercise 2.2, if at all? If there are differences, what do they tell us about different aggregations?

Appendix
