5.6 Exercises

5.1 Assume that a 10-D base cuboid contains only three base cells: (1) (a1, d2, d3, d4, …, d9, d10), (2) (d1, b2, d3, d4, …, d9, d10), and (3) (d1, d2, c3, d4, …, d9, d10), where a1 ≠ d1, b2 ≠ d2, and c3 ≠ d3. The measure of the cube is count().

(a) How many nonempty cuboids will a full data cube contain?

(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?

(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is “count ≥ 2”?

(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is obtained by replacing a ∗ in c by a non-∗ value) and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
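For a toy instance like this one, the definitions above (nonempty cells, aggregate cells, iceberg cells, closed cells) can all be checked by brute-force enumeration, which is a useful way to verify hand-derived counts. The sketch below is illustrative Python (the variable and function names are ours): it generates every generalization of each base cell, tallies counts, and tests closedness directly from the definition in part (d).

```python
from collections import Counter
from itertools import product

# The three base cells of the 10-D cuboid; the measure is count().
c1 = ("a1",) + tuple(f"d{i}" for i in range(2, 11))
c2 = ("d1", "b2") + tuple(f"d{i}" for i in range(3, 11))
c3 = ("d1", "d2", "c3") + tuple(f"d{i}" for i in range(4, 11))
base = [c1, c2, c3]

def generalizations(cell):
    """Yield every cell obtained by replacing any subset of values with '*'."""
    for mask in product((False, True), repeat=len(cell)):
        yield tuple("*" if m else v for m, v in zip(mask, cell))

# count() of every nonempty cell in the full cube
counts = Counter(g for b in base for g in generalizations(b))

def covers(g, s):
    """True if cell g is a generalization of cell s."""
    return all(gv in ("*", sv) for gv, sv in zip(g, s))

# A cell is closed if no strict specialization has the same count.
closed = [c for c, n in counts.items()
          if not any(d != c and covers(c, d) and counts[d] == n
                     for d in counts)]
```

Enumeration is only feasible because there are three base cells and 2^10 generalizations each; the point of parts (a)-(d) is to reach the same numbers analytically.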

5.2 There are several typical cube computation methods, such as MultiWay [ZDN97], BUC [BR99], and Star-Cubing [XHLW03]. Briefly describe these three methods (i.e., use one or two lines to outline the key points), and compare their feasibility and performance under the following conditions:

(a) Computing a dense full cube of low dimensionality (e.g., less than eight dimensions).

(b) Computing an iceberg cube of around 10 dimensions with a highly skewed data distribution.

(c) Computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions).
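As a reference point for the comparison, the core idea of BUC — expanding group-bys bottom-up from the apex and pruning any partition that fails the iceberg count, which is anti-monotone — fits in a few lines. This is a minimal illustrative sketch in Python, not the paper's optimized, counting-sort-based implementation:

```python
def buc(tuples, dims, cell, minsup, out):
    """Recursively expand group-bys; prune partitions below minsup.

    tuples: list of records (one value per dimension)
    dims:   indices of dimensions still available for grouping
    cell:   the (dimension, value) pairs fixed so far
    """
    if len(tuples) < minsup:        # iceberg pruning: count is anti-monotone
        return
    out[frozenset(cell)] = len(tuples)
    for i, d in enumerate(dims):
        parts = {}
        for t in tuples:            # partition the input on dimension d
            parts.setdefault(t[d], []).append(t)
        for v, part in parts.items():
            buc(part, dims[i + 1:], cell + [(d, v)], minsup, out)

# Toy usage: three tuples over dimensions (A, B), minimum support 2
data = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
result = {}
buc(data, [0, 1], [], minsup=2, out=result)
```

Contrasting this recursion with MultiWay's simultaneous array-based aggregation and Star-Cubing's shared-tree traversal is a good starting point for parts (a)-(c).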

5.3 Suppose a data cube, C, has D dimensions, and the base cuboid contains k distinct tuples.

(a) Present a formula to calculate the minimum number of cells that the cube, C, may contain.

(b) Present a formula to calculate the maximum number of cells that C may contain.

(c) Answer parts (a) and (b) as if the count in each cube cell must be no less than a threshold, v.

(d) Answer parts (a) and (b) as if only closed cells are considered (with the minimum count threshold, v).

5.4 Suppose that a base cuboid has three dimensions, A, B, C, with the following number of cells: |A| = 1,000,000, |B| = 100, and |C| = 1000. Suppose that each dimension is evenly partitioned into 10 portions for chunking.

(a) Assuming each dimension has only one level, draw the complete lattice of the cube.

(b) If each cube cell stores one measure with four bytes, what is the total size of the computed cube if the cube is dense?

(c) State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.

5.5 Often, the aggregate count value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.

(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures.

(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new design.
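One common baseline to compare your design against is a hash table keyed by cell coordinates that stores only the nonzero cells; its space is proportional to the number of nonempty cells rather than to the full matrix, and it supports O(1) expected point retrieval and incremental updates. The class name and interface below are ours, for illustration only:

```python
class SparseCuboid:
    """Store only the nonzero cells of a cuboid as {coordinate tuple: count}."""

    def __init__(self, n_dims):
        self.n_dims = n_dims
        self.cells = {}                 # space: O(number of nonzero cells)

    def add(self, coords, count=1):
        """Incremental update: O(1) expected per inserted tuple."""
        assert len(coords) == self.n_dims
        self.cells[coords] = self.cells.get(coords, 0) + count

    def get(self, coords):
        """Point retrieval: O(1) expected; absent cells read as zero."""
        return self.cells.get(coords, 0)

# Toy usage with three dimensions (city, year, product)
cub = SparseCuboid(3)
cub.add(("beijing", "2024", "laptop"))
cub.add(("beijing", "2024", "laptop"))
```

A full answer should also discuss range queries (where a plain hash map is weak) and alternatives such as chunked, compressed arrays.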

5.6 When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: There exists a huge number of subsets of combinations of dimensions.

(a) Suppose that there are only two base cells, {(a1, a2, a3, …, a100), (a1, a2, b3, …, b100)}, in a 100-D base cuboid. Compute the number of nonempty aggregate cells. Comment on the storage space and time required to compute these cells.

(b) Suppose we are to compute an iceberg cube from (a). If the minimum support count in the iceberg condition is 2, how many aggregate cells will there be in the iceberg cube? Show the cells.

(c) Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data cube. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that map to (or cover) the two following base cells in a 100-D base cuboid, each with a cell count of 10: {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10}.

i. Let the minimum support be 10. How many distinct aggregate cells will there be like the following: {(a1, a2, a3, a4, …, a99, ∗) : 10, …, (a1, a2, ∗, a4, …, a99, a100) : 10, …, (a1, a2, a3, ∗, …, ∗, ∗) : 10}?

ii. If we ignore all the aggregate cells that can be obtained by replacing some constants with ∗’s while keeping the same measure value, how many distinct cells remain? What are the cells?

5.7 Propose an algorithm that computes closed iceberg cubes efficiently.

5.8 Suppose that we want to compute an iceberg cube for the dimensions, A, B, C, D, where we wish to materialize all cells that satisfy a minimum support count of at least v, and where cardinality(A) < cardinality(B) < cardinality(C) < cardinality(D). Show the BUC processing tree (which shows the order in which the BUC algorithm explores a data cube’s lattice, starting from all) for the construction of this iceberg cube.
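The BUC processing tree can be generated mechanically: starting from "all," BUC does a depth-first expansion in which each node extends its prefix only with dimensions that come later in the fixed ordering. The sketch below (illustrative Python; it assumes the dimensions have already been sorted as required, here lowest cardinality first) lists that visiting order:

```python
def buc_order(dims):
    """Depth-first prefix-expansion order used by BUC, starting from 'all'."""
    order = []
    def expand(prefix, rest):
        order.append("".join(prefix) or "all")
        for i, d in enumerate(rest):
            # only extend with dimensions later in the ordering,
            # so each group-by is visited exactly once
            expand(prefix + [d], rest[i + 1:])
    expand([], list(dims))
    return order

order = buc_order("ABCD")   # 2^4 = 16 group-bys, from 'all' down to D
```

Drawing the tree from this order (each call is the child of the call that made it) gives the required processing tree.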

5.9 Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for an avg that is no bigger than some value, v.

5.10 A flight data warehouse for a travel agent consists of six dimensions: traveler, departure (city), departure_time, arrival, arrival_time, and flight; and two measures: count() and avg_fare(), where avg_fare() stores the concrete fare at the lowest level but the average fare at other levels.

(a) Suppose the cube is fully materialized. Starting with the base cuboid [traveler, departure, departure_time, arrival, arrival_time, flight], what specific OLAP operations (e.g., roll-up flight to airline) should one perform to list the average fare per month for each business traveler who flies American Airlines (AA) from Los Angeles in 2009?

(b) Suppose we want to compute a data cube where the condition is that the minimum number of records is 10 and the average fare is over $500. Outline an efficient cube computation method (based on common sense about flight data distribution).

5.11 (Implementation project) There are four typical data cube computation methods: MultiWay [ZDN97], BUC [BR99], H-Cubing [HPDW01], and Star-Cubing [XHLW03].

(a) Implement any one of these cube computation algorithms and describe your implementation, experimentation, and performance. Find another student who has implemented a different algorithm on the same platform (e.g., C++ on Linux) and compare your algorithm performance with his or hers.
Input:

i. An n-dimensional base cuboid table (for n < 20), which is essentially a relational table with n attributes.

ii. An iceberg condition: count(C) ≥ k, where k is a positive integer supplied as a parameter.


Output:

i. The set of computed cuboids that satisfy the iceberg condition, in the order of your output generation.

ii. A summary of the set of cuboids in the form of “cuboid ID: the number of nonempty cells,” sorted in alphabetical order of cuboids (e.g., A: 155, AB: 120, ABC: 22, ABCD: 4, ABCE: 6, ABD: 36), where the number after the colon represents the number of nonempty cells. (This is used to quickly check the correctness of your results.)

(b) Based on your implementation, discuss the following:

i. What challenging computation problems are encountered as the number of dimensions grows large?

ii. How can iceberg cubing solve the problems of part (i) for some data sets (and characterize such data sets)?

iii. Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.

(c) Instead of computing a high-dimensionality data cube, we may choose to materialize the cuboids that have only a small number of dimension combinations. For example, for a 30-D data cube, we may only compute the 5-D cuboids for every possible 5-D combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.

5.12 The sampling cube was proposed for multidimensional analysis of sampling data (e.g., survey data). In many real applications, sampling data can be of high dimensionality (e.g., it is not unusual to have more than 50 dimensions in a survey data set).

(a) How can we construct an efficient and scalable high-dimensional sampling cube in large sampling data sets?

(b) Design an efficient incremental update algorithm for such a high-dimensional sampling cube.

(c) Discuss how to support quality drill-down given that some low-level cells may be empty or contain too few data for reliable analysis.

5.13 The ranking cube was proposed for efficient computation of top-k (ranking) queries in relational databases. Recently, researchers have proposed another kind of query, called a skyline query. A skyline query returns all the objects pi such that pi is not dominated by any other object pj, where dominance is defined as follows. Let the value of pi on dimension d be v(pi, d). We say pi is dominated by pj if and only if for each preference dimension d, v(pj, d) ≤ v(pi, d), and there is at least one d where the equality does not hold.

(a) Design a ranking cube so that skyline queries can be processed efficiently.

(b) Skyline queries are sometimes too strict to be desirable to some users. One may generalize the concept of skyline into generalized skyline as follows: Given a d-dimensional database and a query q, the generalized skyline is the set of the following objects: (1) the skyline objects and (2) the nonskyline objects that are ε-neighbors of a skyline object, where r is an ε-neighbor of an object p if the distance between p and r is no more than ε. Design a ranking cube to process generalized skyline queries efficiently.
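The dominance definition above translates directly into code, and a naive O(n²) skyline computation is a handy correctness oracle against which to validate any cube-based method you design. The sketch below is illustrative Python (names are ours); smaller values are preferred on every dimension, matching the ≤ in the definition:

```python
def dominates(p, q):
    """p dominates q: p <= q on every preference dimension, < on at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Toy usage: five 2-D objects, lower is better on both dimensions
pts = [(1, 4), (2, 2), (3, 1), (3, 3), (4, 4)]
sky = skyline(pts)
```

For part (b), relaxing the filter to also keep points within distance ε of some skyline point yields the generalized skyline.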

5.14 The ranking cube was designed to support top-k (ranking) queries in relational database systems. However, ranking queries are also posed to data warehouses, where ranking is on multidimensional aggregates instead of on measures of base facts. For example, consider a product manager who is analyzing a sales database that stores the nationwide sales history, organized by location and time. To make investment decisions, the manager may pose the following query: “What are the top-10 (state, year) cells having the largest total product sales?” He may further drill down and ask, “What are the top-10 (city, month) cells?” Suppose the system can perform such partial materialization to derive two types of materialized cuboids: a guiding cuboid and a supporting cuboid, where the former contains a number of guiding cells that provide concise, high-level data statistics to guide the ranking query processing, whereas the latter provides inverted indices for efficient online aggregation.

(a) Derive an efficient method for computing such aggregate ranking cubes.

(b) Extend your framework to handle more advanced measures. One such example could be as follows. Consider an organization donation database, where donors are grouped by “age,” “income,” and other attributes. Interesting questions include: “Which age and income groups have made the top-k average amount of donation (per donor)?” and “Which income group of donors has the largest standard deviation in the donation amount?”

5.15 The prediction cube is a good example of multidimensional data mining in cube space.

(a) Propose an efficient algorithm that computes prediction cubes in a given multidimensional database.

(b) For what kind of classification models can your algorithm be applied? Explain.

5.16 Multifeature cubes allow us to construct interesting data cubes based on rather sophisticated query conditions. Can you construct the following multifeature cube by translating the following user requests into queries using the form introduced in this textbook?

(a) Construct a smart shopper cube where a shopper is smart if at least 10% of the goods she buys in each shopping trip are on sale.

(b) Construct a data cube for best-deal products where best-deal products are those products for which the price is the lowest for this product in the given month.

5.17 Discovery-driven cube exploration is a desirable way to mark interesting points among a large number of cells in a data cube. Individual users may have different views on whether a point should be considered interesting enough to be marked. Suppose one would like to mark those objects of which the absolute value of z score is over 2 in every row and column in a d-dimensional plane.

(a) Derive an efficient computation method to identify such points during the data cube computation.

(b) Suppose a partially materialized cube has (d − 1)-dimensional and (d + 1)-dimensional cuboids materialized but not the d-dimensional one. Derive an efficient method to mark those (d − 1)-dimensional cells with d-dimensional children that contain such marked points.
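For the 2-D case, the marking criterion itself is straightforward to state in code, and having a direct implementation helps when reasoning about where in the cube computation the row and column statistics become available. This is an illustrative Python sketch (names are ours); a d-dimensional plane would apply the same test along each of the d axes through a cell:

```python
def mark_outliers(plane, z_thresh=2.0):
    """Mark cells whose |z| exceeds z_thresh in both their row and column.

    plane: 2-D list of numeric measures; returns a set of (row, col) positions.
    """
    def zscores(vals):
        n = len(vals)
        mean = sum(vals) / n
        std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
        return [0.0 if std == 0 else (v - mean) / std for v in vals]

    row_z = [zscores(r) for r in plane]
    col_z = [zscores(c) for c in zip(*plane)]       # transpose for columns
    return {(i, j)
            for i, r in enumerate(row_z)
            for j, z in enumerate(r)
            if abs(z) > z_thresh and abs(col_z[j][i]) > z_thresh}

# Toy usage: a 9x9 plane of zeros with a single spike
plane = [[0] * 9 for _ in range(9)]
plane[4][4] = 9
marked = mark_outliers(plane)       # marked == {(4, 4)}
```

The efficiency questions in parts (a) and (b) then amount to computing these row and column means and standard deviations incrementally during cubing, rather than in a separate pass as done here.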
