Best practice 5 – storing large-scale data

With the ever-growing size of data, we often can't fit a dataset on a single local machine and instead need to store it in the cloud or on distributed filesystems. As this is mainly a book on machine learning with Python, we will just touch on some basic areas that you can look into. The two main strategies for storing big data are scale-up and scale-out:

  • A scale-up approach increases the capacity of a single system when data exceeds its current capacity, for example by adding more disks. This is useful when fast data access is required.
  • In a scale-out approach, storage capacity grows incrementally by adding nodes to a storage cluster. Apache Hadoop (https://hadoop.apache.org/) is used to store and process big data on scale-out clusters, where data is spread across hundreds or even thousands of nodes. There are also cloud-based distributed file services, such as S3 in Amazon Web Services (https://aws.amazon.com/s3/) and Google Cloud Storage in Google Cloud (https://cloud.google.com/storage/). They are massively scalable and are designed for secure and durable storage (see the brief sketch after this list for accessing S3 from Python).
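
To give a concrete feel for working with cloud object storage from Python, here is a minimal sketch that uploads a local CSV file to S3 and reads it back into a pandas DataFrame. It assumes the boto3 and s3fs packages are installed and AWS credentials are already configured; the bucket name and file paths are hypothetical placeholders.

```python
import boto3
import pandas as pd

# Hypothetical bucket and object key, used purely for illustration
BUCKET = "my-ml-data-bucket"
KEY = "datasets/train.csv"

# Upload a local file to S3; credentials are picked up from the
# standard AWS configuration (e.g. ~/.aws/credentials)
s3 = boto3.client("s3")
s3.upload_file(Filename="train.csv", Bucket=BUCKET, Key=KEY)

# Read the object straight back into a DataFrame; pandas delegates
# the "s3://" URL to the s3fs library under the hood
df = pd.read_csv(f"s3://{BUCKET}/{KEY}")
print(df.shape)
```

The same pattern applies to Google Cloud Storage, where the gcsfs package lets pandas read "gs://" URLs directly.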