HDFS architecture principles (and assumptions)

Listed here are the main design principles behind HDFS:

For more detail on these design principles, refer to https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
  • Hardware failure: HDFS assumes that the underlying hardware hosting the data will eventually fail. An HDFS cluster usually consists of hundreds of server machines, each responsible for a chunk of the filesystem’s data. With so many components, at any given point in time some component of HDFS is likely to be faulty or unavailable. Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.
  • Streaming data access: HDFS is designed primarily to store data for batch-mode processing, so it favors fast, streaming access to data. The emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for the applications HDFS targets, so POSIX semantics in a few key areas have been traded away to increase data throughput rates.
  • Large datasets: Handling large datasets efficiently is the primary design goal of HDFS. A typical file in HDFS ranges in size from gigabytes to terabytes, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
  • Simple coherency model: Most applications that use HDFS write data to the filesystem once and then read it many times. Once created, a file is seldom changed. This write-once, read-many assumption keeps the system fast and avoids the data-coherency issues that arise when data is constantly updated (see the sketch after this list).
  • Moving computation is cheaper than moving data: It is cheaper to bring the computation to the data than to bring the data to the computation, especially in the Big Data world where datasets are huge. The main benefit is that network bandwidth is conserved.
  • Portability across heterogeneous hardware and software platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
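To make the write-once, read-many coherency model concrete, here is a minimal sketch (not taken from the HDFS documentation) that uses Hadoop's public FileSystem API to create a file, write it once, and then stream it back. The path /user/demo/principles.txt and a cluster reachable through the default Configuration are assumptions made purely for illustration.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws IOException {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // assumes an HDFS cluster is reachable through that setting.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for this illustration.
        Path file = new Path("/user/demo/principles.txt");

        // Write once: the file is created, written sequentially, and closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("HDFS favors write-once, read-many access.\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read many: the now-immutable file can be streamed repeatedly,
        // by this or any other client, at high throughput.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}

Note that later HDFS versions allow appending to an existing file, but random in-place updates are not supported; that restriction is precisely the simplification the coherency model relies on.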