12.9 Chapter Summary
A distributed database system (DDBS) has multiple sites connected by a communications system, so that data at any site is available to users at other sites. The system can consist of sites that are geographically far apart but linked by telecommunications, or the sites can be close together and linked by a local area network. Advantages of distribution may include local autonomy, improved reliability, better data availability, increased performance, reduced response time, and lower communications costs.
The designer of a DDBS must consider the type of communications system, data models supported, types of applications, and data placement alternatives. Final design choices include distributed processing using a centralized database, client-server systems, parallel databases, peer-to-peer data management, cloud services, or a true distributed database. A distributed database can be homogeneous, where all nodes use the same hardware and software, or heterogeneous, where nodes have different hardware or software. Heterogeneous systems require translations of codes and word lengths due to hardware differences, or of data models and data structures due to software differences.
A DDBS has the following software components: a data communications (DC) component, a local database management system (LDBMS), a global data dictionary (GDD), and a distributed database management system (DDBMS) component. The responsibilities of the DDBMS include providing the user interface, locating the data, processing queries, providing network-wide concurrency control and recovery procedures, and providing translation in heterogeneous systems. Data placement alternatives are centralized, replicated, partitioned, and hybrid. Factors to be considered in making the data placement decision are locality of reference, reliability of data, availability of data, storage capacities and costs, distribution of processing load, and communications costs. The other components of a DDBS may be placed using any of the four alternatives.
Forms of transparency that are desirable in a distributed database include data distribution transparency (which includes fragmentation transparency, location transparency, and replication transparency), DBMS heterogeneity transparency, transaction transparency (which includes concurrency transparency and recovery transparency), and performance transparency.
One of the most difficult tasks of a distributed database management system is transaction management. The CAP (Consistency, Availability, Partition tolerance) theorem states that a distributed system cannot simultaneously guarantee all three properties of consistency, availability, and partition tolerance. Distributed systems usually sacrifice immediate consistency in order to preserve the other two properties, following a BASE (Basically Available, Soft state, Eventually consistent) model rather than the ACID (Atomicity, Consistency, Isolation, Durability) model used for traditional databases. Each site that initiates transactions has a transaction coordinator whose function is to manage all transactions, whether local, remote, or global, that originate at that site.

Concurrency control problems that can arise include the lost update problem, the uncommitted update problem, the inconsistent analysis problem, the nonrepeatable read problem, and the phantom data problem, as well as the multiple-copy consistency problem, which is unique to distributed databases. As in the centralized case, solutions for the distributed case include techniques such as locking and timestamping. The two-phase locking protocol has variations in the distributed environment, including single-site lock manager, distributed lock managers, primary copy, and majority locking. The Read-One-Write-All rule is often used for replicated data, and distributed deadlock is detected using a global wait-for graph.

Timestamping can also be used to guarantee serializability. A timestamp may consist of two parts: a logical clock value and a node identifier. To prevent timestamps at different sites from diverging too much, each site automatically advances its clock whenever it receives a timestamp later than its current value. Recovery protocols include the two-phase commit and the three-phase commit protocols.
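The two-part timestamp and the clock-advancement rule can be sketched as follows. This is a minimal illustration in the style of a Lamport logical clock; the names `Site`, `next_timestamp`, and `receive` are illustrative, not taken from any particular DDBMS.

```python
class Site:
    """One site in a distributed system, issuing two-part timestamps."""

    def __init__(self, node_id):
        self.node_id = node_id   # node identifier; breaks ties between sites
        self.clock = 0           # local logical clock

    def next_timestamp(self):
        """Generate a unique timestamp for a new transaction."""
        self.clock += 1
        # The pair (clock, node_id) is globally unique: equal clock
        # values are ordered by node identifier.
        return (self.clock, self.node_id)

    def receive(self, remote_timestamp):
        """On receiving a later timestamp, advance the local clock so
        that timestamps at different sites do not diverge too much."""
        remote_clock, _ = remote_timestamp
        self.clock = max(self.clock, remote_clock)


a, b = Site(1), Site(2)
t1 = a.next_timestamp()   # (1, 1)
b.receive(t1)             # site 2 advances its clock to 1
t2 = b.next_timestamp()   # (2, 2), ordered after t1
```

Because Python compares tuples element by element, `t2 > t1` holds, giving a total order over timestamps issued anywhere in the system.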
Another important task of the DDBMS is distributed query processing, which involves several steps, the most difficult of which is determining a request-processing strategy and supervising its execution. Standard query optimization techniques must be extended to account for the cost of transferring data between sites. When a join of data stored at different sites is required, the semijoin operation is sometimes used; it can yield substantial savings by reducing the amount of data shipped between sites.
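The semijoin strategy can be sketched with in-memory lists standing in for relations at two sites. The relation and column names below are illustrative assumptions, not from the text.

```python
# Site A holds EMPLOYEE(emp_id, name); Site B holds ASSIGNMENT(emp_id, project).
employees = [(1, "Ann"), (2, "Ben"), (3, "Carla")]
assignments = [(1, "P1"), (3, "P2"), (3, "P3")]

# Step 1: Site B projects just the join column and ships it to Site A.
# This projection is usually far smaller than the full ASSIGNMENT relation.
join_keys = {emp_id for emp_id, _ in assignments}

# Step 2: Site A computes the semijoin EMPLOYEE semijoin ASSIGNMENT,
# keeping only the employee rows that will participate in the join.
reduced_employees = [row for row in employees if row[0] in join_keys]

# Step 3: only the reduced relation is shipped to Site B, where the
# final join is computed.
result = [(e_id, name, project)
          for e_id, name in reduced_employees
          for a_id, project in assignments if a_id == e_id]
```

The saving comes from steps 1 and 3: instead of transferring a whole relation across the network, one site ships only a column of join values and receives back only the rows that actually match.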
A peer-to-peer (P2P) data management system is a distributed system with a high level of decentralization. Blockchain, one of the newest developments in distributed data management, uses P2P computing to securely manage a form of data known as a distributed ledger. A distributed ledger is an append-only data log used to record transactions in distributed applications related to finance, real estate, insurance, and government, as well as other application areas. A key benefit of blockchain technology is that it provides a cryptographically secure way to ensure the integrity of distributed transactions in a decentralized environment.
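The core integrity mechanism can be sketched as an append-only log in which each record is chained to the hash of its predecessor. This is a minimal, single-process illustration with made-up names; real blockchains add consensus protocols, digital signatures, and P2P replication on top of this structure.

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's full contents, including its predecessor's hash."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    """Append a new record, chained to the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "data": data})

def verify(chain):
    """Integrity check: tampering with any earlier block breaks every
    later prev_hash link, because the stored hash no longer matches."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


ledger = []
append_block(ledger, {"from": "A", "to": "B", "amount": 10})
append_block(ledger, {"from": "B", "to": "C", "amount": 4})
assert verify(ledger)

ledger[0]["data"]["amount"] = 999   # tamper with recorded history
assert not verify(ledger)           # the broken hash chain exposes it
```

Because each block's hash covers the previous block's hash, altering any past transaction invalidates the entire chain that follows it, which is what makes the log effectively append-only.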