12.4 Data Placement

One of the most important decisions a distributed database designer must make is data placement. Proper data placement is a crucial factor in determining the success of a DDBS. There are four basic alternatives, namely, centralized, replicated, partitioned, and hybrid. Some of these require additional analysis to fine-tune the placement of data. In deciding among data placement alternatives, the following factors need to be considered:

  • Locality of data reference. The data should be placed at the site where it is used most often. The designer studies the applications to identify the sites where they are performed and attempts to place the data in such a way that most accesses are local.

  • Reliability of the data. By storing multiple copies of the data in geographically remote sites, the designer maximizes the probability that the data will be recoverable in case of physical damage to any site.

  • Data availability. As with reliability, storing multiple copies assures users that data items will be available to them, even if the site from which the items are normally accessed is unavailable due to failure of the node or its only link.

  • Storage capacities and costs. Nodes can have different storage capacities and storage costs that must be considered in deciding where data should be kept. Storage costs are minimized when a single copy of each data item is kept, but the plunging costs of data storage make this consideration less important.

  • Distribution of processing load. One of the reasons for choosing a distributed system is to distribute the workload so that processing power will be used most effectively. This objective must be balanced against locality of data reference.

  • Communications costs. The designer must consider the cost of using the communications network to retrieve data. Retrieval costs and retrieval time are minimized when each site has its own copy of all the data. However, when the data is updated, the changes must then be sent to all sites. If the data is very volatile, this results in high communications costs for update synchronization.

The four placement alternatives, as shown in FIGURE 12.6, are the following:

A table titled Alternative. The table has 5 columns labeled Criterion, Centralized, Replicated, Partitioned, and Hybrid. The row entries are as follows.
Row 1. Criterion: Locality of Reference. Centralized: Lowest. Replicated: Highest. Partitioned: asterisk should be high. Hybrid: asterisk should be high.
Row 2. Criterion: Reliability. Centralized: Lowest. Replicated: Highest. Partitioned: High for system, low for item. Hybrid: asterisk should be high.
Row 3. Criterion: Availability. Centralized: Lowest. Replicated: Highest. Partitioned: High for system, low for item. Hybrid: asterisk should be high.
Row 4. Criterion: Storage Costs. Centralized: Lowest. Replicated: Highest. Partitioned: Lowest. Hybrid: asterisk should be average.
Row 5. Criterion: Load Distribution. Centralized: Poor. Replicated: Best. Partitioned: Good. Hybrid: asterisk should be good.
Row 6. Criterion: Communication Costs. Centralized: Highest. Replicated: Low, except for updates. Partitioned: asterisk should be low. Hybrid: asterisk should be low.

FIGURE 12.6 Evaluation of Data Placement Alternatives

*Depends on exact data placement decision made

  • Centralized. This alternative consists of a single database and DBMS stored in one location, with users distributed, as illustrated in Figure 12.1. There is no need for a DDBMS or global data dictionary because there is no real distribution of data, only of processing. Retrieval costs are high because all users, except those at the central site, use the network for all accesses. Storage costs are low because only one copy of each item is kept. There is no need for update synchronization, and the standard concurrency control mechanism is sufficient. Reliability is low and availability is poor because a failure at the central node results in the loss of the entire system. The workload can be distributed, but remote nodes need to access the database to perform applications, so locality of data reference is low. This alternative is not a true DDBS.

  • Replicated. With this alternative, a complete copy of the database is kept at each node. Advantages are maximum locality of reference, reliability, data availability, and processing load distribution. Storage costs are highest in this alternative. Communications costs for retrievals are low, but the cost of updates is high because every site must receive every update. If updates are very infrequent, this alternative is a good one.

  • Partitioned. Here, there is only one copy of each data item, but the data is distributed across nodes. To allow this, the database is split into disjoint fragments or parts. If the database is a relational one, fragments can be vertical table subsets (formed by projection) or horizontal subsets (formed by selection) of global relations. FIGURE 12.7 provides examples of fragmentation. In any horizontal fragmentation scheme, each tuple of every relation must be assigned to one or more fragments such that taking the union of the fragments results in the original relation; for the horizontally partitioned case, a tuple is assigned to exactly one fragment. In a vertical fragmentation scheme, the projections must be lossless, so that the original relations can be reconstructed by taking the join of the fragments. The easiest method of ensuring that projections are lossless is, of course, to include the key in each fragment; however, this violates the disjointness condition because key attributes would then be replicated. The designer can choose to accept key replication, or the system can add a tuple ID—a unique identifier for each tuple—invisible to the user. The system would then include the tuple ID in each vertical fragment of the tuple and would use that identifier to perform joins.

    Besides vertical and horizontal fragments, there can be mixed fragments, obtained by successive applications of select and project operations. Partitioning requires careful analysis to ensure that data items are assigned to the appropriate site. If data items have been assigned to the site where they are used most often, locality of data reference will be high with this alternative. Because only one copy of each item is stored, data reliability and availability for a specific item are low. However, failure of a node results in the loss of only that node’s data, so the systemwide reliability and availability are higher than in the centralized case. Storage costs are low, and the communications costs of a well-designed system should be low. The processing load should also be well distributed if the data is properly distributed.

  • Hybrid. In this alternative different portions of the database are distributed differently. For example, those records with high locality of reference are partitioned, while those commonly used by all nodes are replicated, if updates are infrequent. Those that are needed by all nodes but updated so frequently that synchronization would be a problem, might be centralized. This alternative is designed to optimize data placement, so that all the advantages and none of the disadvantages of the other methods are possible. However, very careful analysis of data and processing is required with this plan.

A table with 5 columns labeled s t u I d, last Name, first Name, major, and credits. The row entries are as follows.
Row 1. s t u I d: S 1001. last Name: Smith. first Name: Tom. major: History. credits: 90.
Row 2. s t u I d: S 1002. last Name: Chin. first Name: Ann. major: Math. credits: 36.
Row 3. s t u I d: S 1005. last Name: Lee. first Name: Perry. major: History. credits: 3.
Row 4. s t u I d: S 1010. last Name: Burns. first Name: Edward. major: Art. credits: 63.
Row 5. s t u I d: S 1013. last Name: McCarthy. first Name: Owen. major: Math. credits: 0.
Row 6. s t u I d: S 1015. last Name: Jones. first Name: Mary. major: Math. credits: 42.
Row 7. s t u I d: S 1020. last Name: Rivera. first Name: Jane. major: C S C. credits: 15.
A table with 5 columns labeled s t u I d, last Name, first Name, major, and credits. The row entries are as follows.
Row 1. s t u I d: S 1001. last Name: Smith. first Name: Tom. major: History. credits: 90.
Row 2. s t u I d: S 1002. last Name: Chin. first Name: Ann. major: Math. credits: 36.
Row 3. s t u I d: S 1005. last Name: Lee. first Name: Perry. major: History. credits: 3.
Row 4. s t u I d: S 1010. last Name: Burns. first Name: Edward. major: Art. credits: 63.
Row 5. s t u I d: S 1013. last Name: McCarthy. first Name: Owen. major: Math. credits: 0.
Row 6. s t u I d: S 1015. last Name: Jones. first Name: Mary. major: Math. credits: 42.
Row 7. s t u I d: S 1020. last Name: Rivera. first Name: Jane. major: C S C. credits: 15.
A table with 5 columns labeled s t u I d, last Name, first Name, major, and credits. The row entries are as follows.
Row 1. s t u I d: S 1001. last Name: Smith. first Name: Tom. major: History. credits: 90.
Row 2. s t u I d: S 1002. last Name: Chin. first Name: Ann. major: Math. credits: 36.
Row 3. s t u I d: S 1005. last Name: Lee. first Name: Perry. major: History. credits: 3.
Row 4. s t u I d: S 1010. last Name: Burns. first Name: Edward. major: Art. credits: 63.
Row 5. s t u I d: S 1013. last Name: McCarthy. first Name: Owen. major: Math. credits: 0.
Row 6. s t u I d: S 1015. last Name: Jones. first Name: Mary. major: Math. credits: 42.
Row 7. s t u I d: S 1020. last Name: Rivera. first Name: Jane. major: C S C. credits: 15.

FIGURE 12.7 Fragmentation

As an example, let us consider how the University database schema might be distributed. We assume that the university has a main campus and four branch campuses (North, South, East, and West), and that students have a home campus where they normally register and take classes. However, students can register for any class at any campus. Faculty members teach at a single campus. Each class is offered at a single campus. We will modify the schema slightly to include campus information in our new global schema:

Student open parentheses s t u I d, comma, last Name, comma, first Name, comma, major, comma, credits, comma, home campus close parentheses. The variable s t u I d is underlined. Faculty open parentheses f a c I d, comma, name, comma, department, comma, rank, comma, teach campus close parentheses. The variable f a c I d is underlined. Class open parentheses class Number, comma, campus offered, comma, f a c I d, comma, schedule, comma, room close parentheses. The variable class Number is underlined. Enroll open parentheses s t u I d, comma, class Number, comma, grade, close parentheses. The variables s t u I d and class Number are underlined.

It is reasonable to assume that Student records are used most often at the student’s home campus, so we would partition the Student relation, placing each student’s record at his or her home campus. The same is true for Faculty records. The Class relation could also be partitioned according to the campus where the class is offered. However, this would mean that queries about offerings at other campuses would always require use of the network. If such queries are frequent, an alternative would be to replicate the entire Class relation at all sites. Because updates to this table are relatively infrequent, and most accesses are read-only, replication would reduce communication costs, so we will choose this alternative. The Enroll table would not be a good candidate for replication because during registration, updates must be done quickly, and updating multiple copies is time-consuming. It is desirable to have the Enroll records at the same site as the Student record and as the Class record, but these may not be the same site. Therefore, we might choose to centralize the Enroll records at the main site, possibly with copies at the Student site and the Class site. We can designate the main site as the primary copy for Enroll. The fragmentation scheme would be expressed by specifying the conditions for selecting records, as in

Student subscript Main hyphen sigma subscript home campus hyphen single quote Main single quote open parentheses Student close parentheses.
Student subscript North hyphen sigma subscript home campus hyphen single quote North single quote open parentheses Student close parentheses.
Faculty subscript Main hyphen sigma subscript teach campus hyphen single quote Main single quote open parentheses Faculty close parentheses.
Faculty subscript North hyphen sigma subscript teach campus hyphen single quote North single quote open parentheses Faculty close parentheses.
Enroll subscript North hyphen pi subscript s t u I d, comma, class Number, comma, grade open parentheses sigma subscript student dot home campus hyphen single quote North single quote open parentheses Enroll Natural join Student close parentheses close parentheses union pi subscript s t u I d, comma, class Number, comma, grade open parentheses sigma subscript class dot campus offered hyphen single quote North single quote open parentheses Enroll Natural join Class close parentheses close parentheses.

We are not fragmenting the Class relation. Because we have made different decisions for different parts of the database, this is a hybrid fragmentation scheme. FIGURE 12.8 illustrates the data placement decisions for this example.

An illustration of the Data placement for the distributed university schema. The sites connected to the network are as follows. Main site, West site, North site, South site, and East site. The data in the Main site are Student subscript Main, Faculty subscript Main, class, and Enroll open parentheses primary close parentheses. The data in the West site are Student subscript West, Faculty subscript West, Enroll subscript West, open parentheses secondary close parentheses, and class. The data in the North site are Student subscript North, Faculty subscript North, Enroll subscript North, open parentheses secondary close parentheses, and Class. The data in the South site are Student subscript South, Faculty subscript South, Enroll subscript South, open parentheses secondary close parentheses and class. The data in the East site are Student subscript East, Faculty subscript East, Enroll subscript East, open parentheses secondary close parentheses and class.

FIGURE 12.8 Data Placement for the Distributed University Schema

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.199.122