Alluxio, formerly known as Tachyon, is an open source project from UC Berkeley AMPLab. Alluxio is a distributed memory-centric storage system originally developed as a research project in 2012 by Haoyuan Li, then a PhD student and a founding Apache Spark committer at AMPLab. i The project is the storage layer of the Berkeley Data Analytics Stack (BDAS). In 2015, Li founded Alluxio, Inc. to commercialize Alluxio, with $7.5 million in funding from Andreessen Horowitz. Today, Alluxio has more than 200 contributors from 50 organizations around the world, such as Intel, IBM, Yahoo, and Red Hat. Several high-profile companies, such as Baidu, Alibaba, Rackspace, and Barclays, are currently using Alluxio in production. ii
Ion Stoica, co-author of Spark, co-founder and executive chairman of Databricks, co-director of UC Berkeley AMPLab, and PhD co-advisor to Haoyuan Li, has stated, “As a layer that abstracts away the differences of existing storage systems from the cluster computing frameworks such as Apache Spark and Hadoop MapReduce, Alluxio can enable the rapid evolution of the big data storage, similarly to the way the Internet Protocol (IP) has enabled the evolution of the Internet.” iii
Additionally, Michael Franklin, Professor of Computer Science and Director of the AMPLab at UC Berkeley said that “AMPLab has created some of the most important open source technologies in the new big data stack, including Apache Spark. Alluxio is the next project with roots in the AMPLab to have major impact. We see it playing a huge disruptive role in the evolution of the storage layer to handle the expanding range of big data use cases.” iv
Architecture
Alluxio is the middle layer that coordinates data sharing and directs data access, while giving computing frameworks and big data applications high-performance, low-latency data access at memory speed. Alluxio integrates seamlessly with Spark and Hadoop, requiring only minor configuration changes. By taking advantage of Alluxio’s unified namespace feature, applications only need to connect to Alluxio to access data stored in any of the supported storage engines. Alluxio has its own native API as well as a Hadoop-compatible file system interface. The Hadoop-compatible interface lets users run code originally written for Hadoop without any code changes. A REST API provides access from other languages. We will explore the APIs later in the chapter.
Why Use Alluxio?
The typical Hadoop distribution includes more than 20 open source components. Adding another component to your technology stack is probably the furthest thing from your mind. Nevertheless, Alluxio delivers substantial benefits that will make you wonder why Alluxio is not part of core Apache Spark.
Significantly Improve Big Data Processing Performance and Scalability
Over the years, memory has gotten cheaper while its performance has improved. Meanwhile, the performance of hard drives has only gotten marginally better. There is no question that processing data in memory is an order of magnitude faster than processing data on disk. In almost all programming paradigms, we are advised to cache data in memory to improve performance. One of the main advantages of Apache Spark over MapReduce is its ability to cache data. Alluxio takes that to the next level, providing big data applications with not just a caching layer, but a full-blown distributed high-performance memory-centric storage system.
Baidu operates one of the largest Alluxio clusters in the world, with 1,000 worker nodes handling more than 2PB of data. With Alluxio, Baidu is seeing an average of 10x and up to 30x performance improvement in query and processing time, significantly improving Baidu’s ability to make important business decisions. v Barclays published an article describing their experience with Alluxio. Barclays Data Scientist Gianmario Spacagna and Head of Advanced Analytics Harry Powell were able to tune their Spark jobs from hours to seconds using Alluxio. vi Qunar.com, one of China’s largest travel search engines, experienced a 15x to 300x performance improvement using Alluxio. vii
Multiple Frameworks and Applications Can Share Data at Memory Speed
Provides High Availability and Persistence in Case of Application Termination or Failure
Optimize Overall Memory Usage and Minimize Garbage Collection
Reduce Hardware Requirements
Big data processing with Alluxio is significantly faster than with HDFS. IBM’s tests show Alluxio outperforming HDFS by 110x for write IO. ix With that kind of performance, there is less need for additional hardware, which saves you infrastructure and licensing costs.
Alluxio Components
Similar to Hadoop and other Hadoop components, Alluxio has a master/slave architecture.
Primary Master
The primary master manages the global metadata of the cluster.
Secondary Master
The secondary master manages a journal and periodically does a checkpoint.
Worker
Workers store the data and serve requests from applications to read or write data. Workers also manage local resources such as memory and disk space.
Client
The Alluxio client provides a filesystem API for users to communicate with Alluxio.
Installation
There are several ways to install Alluxio. Alluxio runs on YARN, Mesos, Docker, and EC2 to mention a few. x To get you started quickly, I’ll install Alluxio on a single server.
Download the newest version of Alluxio from the Alluxio website.
Let’s format the worker storage directory and Alluxio journal to prepare the worker and master.
Let’s start Alluxio.
I create a 100 MB file and copy it to memory. You can create a bigger file if you have more memory. List the contents of the directory.
Let’s persist the file from memory to the local file system.
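The single-server steps above can be sketched as one shell session. This is only a sketch under stated assumptions: `ALLUXIO_HOME` (the directory where the download was extracted) and the sample file name are hypothetical, and the commands shown are the ones shipped with Alluxio 1.x releases. Starting a local cluster may prompt for sudo so that Alluxio can mount a RAM disk. The Alluxio commands are skipped when no installation is found at `ALLUXIO_HOME`.

```shell
# Assumed locations; adjust to your environment (both names are hypothetical).
ALLUXIO_HOME="${ALLUXIO_HOME:-$HOME/alluxio}"
ALLUXIO_BIN="$ALLUXIO_HOME/bin"
SAMPLE_FILE="${TMPDIR:-/tmp}/sample_100mb.dat"

# True only when an Alluxio installation is present at ALLUXIO_HOME.
have_alluxio() { [ -x "$ALLUXIO_BIN/alluxio" ]; }

if have_alluxio; then
  "$ALLUXIO_BIN/alluxio" format           # format the journal and worker storage
  "$ALLUXIO_BIN/alluxio-start.sh" local   # start a master and a worker on this host
fi

# Create a 100 MB file of zeros: 100 blocks of 1,048,576 bytes each.
dd if=/dev/zero of="$SAMPLE_FILE" bs=1048576 count=100 2>/dev/null

if have_alluxio; then
  # Copy the file into Alluxio, then list the root directory.
  "$ALLUXIO_BIN/alluxio" fs copyFromLocal "$SAMPLE_FILE" /sample_100mb.dat
  "$ALLUXIO_BIN/alluxio" fs ls /
  # Persist the in-memory file to the under storage (the local file system here).
  "$ALLUXIO_BIN/alluxio" fs persist /sample_100mb.dat
else
  echo "Alluxio not found at $ALLUXIO_HOME; skipping Alluxio commands" >&2
fi
```

Increase `count` to create a larger file if the worker has more memory available.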
Apache Spark and Alluxio
You access data in Alluxio from Spark much as you would access data stored in HDFS or S3.
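As a rough illustration, the only visible difference from an HDFS or S3 path is the `alluxio://` URI scheme. The sketch below feeds two lines of Scala to `spark-shell` from a shell script. It assumes the Alluxio master runs on localhost at its default RPC port 19998; the file path and the client jar location are hypothetical placeholders, and the jar name depends on your Alluxio version. Nothing runs unless both `spark-shell` and the jar are present.

```shell
# Hypothetical file in Alluxio; 19998 is the default master RPC port.
ALLUXIO_URI="alluxio://localhost:19998/data/sample.txt"
# Hypothetical path to the Alluxio client jar for your Alluxio version.
ALLUXIO_CLIENT_JAR="${ALLUXIO_CLIENT_JAR:-/path/to/alluxio-client.jar}"

if command -v spark-shell >/dev/null 2>&1 && [ -f "$ALLUXIO_CLIENT_JAR" ]; then
  # Put the client jar on the driver and executor classpaths, then read the
  # file exactly as an HDFS or S3 path would be read.
  spark-shell --driver-class-path "$ALLUXIO_CLIENT_JAR" --jars "$ALLUXIO_CLIENT_JAR" <<EOF
val lines = sc.textFile("$ALLUXIO_URI")
println(lines.count())
EOF
else
  echo "spark-shell or the Alluxio client jar was not found; skipping" >&2
fi
```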
You can also access Alluxio from MapReduce, Hive, Flink, and Presto to mention a few. Check Alluxio’s online documentation for more details.
Administering Alluxio
Alluxio provides web interfaces for the master and the workers to facilitate system administration and monitoring. They show both high-level and detailed information, including space capacity, usage, uptime, start time, and the list of files. Alluxio also provides a command-line interface for typical file system operations.
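A few typical command-line operations might look like the following sketch. It assumes a local Alluxio 1.x installation at a hypothetical `ALLUXIO_HOME` and the default web ports (19999 for the master, 30000 for a worker); the `/reports` directory is made up for illustration. Subcommand flags can differ between releases, so check the usage output of `alluxio fs` for your version.

```shell
# Assumed install location (hypothetical) and default web UI addresses.
ALLUXIO_HOME="${ALLUXIO_HOME:-$HOME/alluxio}"
ALLUXIO="$ALLUXIO_HOME/bin/alluxio"
MASTER_UI="http://localhost:19999"   # default master web port
WORKER_UI="http://localhost:30000"   # default worker web port

if [ -x "$ALLUXIO" ]; then
  "$ALLUXIO" fs mkdir /reports    # create a directory (hypothetical name)
  "$ALLUXIO" fs ls /              # list the root directory
  "$ALLUXIO" fs rm -R /reports    # remove the directory recursively
else
  echo "Alluxio not found at $ALLUXIO_HOME; skipping CLI commands" >&2
fi

# Open these addresses in a browser to reach the web interfaces.
echo "Master web UI: $MASTER_UI"
echo "Worker web UI: $WORKER_UI"
```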
Master
Worker
Apache Ignite
Ignite is another in-memory platform similar to Alluxio. GridGain Systems originally contributed Apache Ignite to the Apache Software Foundation in 2014. It was promoted to a top-level project in 2015. xi It is extremely versatile and can be used as an in-memory data grid, in-memory database, in-memory distributed filesystem, streaming analytics engine, and accelerator for Hadoop and Spark to mention a few. xii
Apache Geode
Geode is a distributed in-memory database designed for transactional applications with low-latency response times and high-concurrency requirements. Pivotal submitted Geode to the Apache Incubator in 2015. It graduated from the Apache Incubator to become a top-level Apache project in November 2016. GemFire, the commercial version of Geode, was a popular low-latency transactional system used in Wall Street trading platforms. xiii
Summary
Spark is a fast in-memory data processing framework. Alluxio can make it significantly faster by providing off-heap storage that makes data sharing across jobs and frameworks more efficient, minimizes garbage collection, and optimizes overall memory usage. Not only will jobs run considerably faster, but you also reduce costs due to decreased hardware requirements. Alluxio is not the only in-memory database available; Apache Ignite and Geode are viable options, as are commercial alternatives such as Oracle Coherence and TimesTen.
This chapter serves as an introduction to distributed in-memory computing, and Alluxio in particular. Alluxio is the default off-heap storage solution for Spark. You can learn more about Alluxio by visiting its website at Alluxio.org or Alluxio.com.
References
- i. Chris A Mattman; “Apache Spark for the Incubator,” Apache Spark, 2013, http://mail-archives.apache.org/mod_mbox/incubator-general/201306.mbox/%3CCDD80F64.D5F9D%[email protected]%3E
- ii. Haoyuan Li; “Alluxio, formerly Tachyon, is Entering a New Era with 1.0 release,” Alluxio, 2016, https://www.alluxio.com/blog/alluxio-formerly-tachyon-is-entering-a-new-era-with-10-release
- iii. Haoyuan Li; “Alluxio, formerly Tachyon, is Entering a New Era with 1.0 release,” Alluxio, 2016, https://www.alluxio.com/blog/alluxio-formerly-tachyon-is-entering-a-new-era-with-10-release
- iv. MarketWired; “Alluxio Virtualizes Distributed Storage for Petabyte Scale Computing at In-Memory Speeds,” Alluxio, 2016, http://www.marketwired.com/press-release/alluxio-virtualizes-distributed-storage-petabyte-scale-computing-in-memory-speeds-2099053.htm
- v. MarketWired; “Alluxio Virtualizes Distributed Storage for Petabyte Scale Computing at In-Memory Speeds,” Alluxio, 2016, http://www.marketwired.com/press-release/alluxio-virtualizes-distributed-storage-petabyte-scale-computing-in-memory-speeds-2099053.htm
- vi. Henry Powell, Gianmario Spacagna; “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds,” DZone, 2016, https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
- vii. Haoyuan Li; “Alluxio Keynote at Strata+Hadoop World Beijing 2016,” Alluxio, 2016, https://www.slideshare.net/Alluxio/alluxio-keynote-at-stratahadoop-world-beijing-2016-65172341
- viii. Mingfei S.; “Getting Started with Tachyon by Use Cases,” Intel, 2016, https://software.intel.com/en-us/blogs/2016/02/04/getting-started-with-tachyon-by-use-cases
- ix. Gil Vernik; “Tachyon for ultra-fast Big Data processing,” IBM, 2015, https://www.ibm.com/blogs/research/2015/08/tachyon-for-ultra-fast-big-data-processing/
- x. Alluxio; “Quick Start Guide,” Alluxio, 2018, https://www.alluxio.org/docs/1.6/en/Getting-Started.html
- xi. Nikita Ivanov; “Fire up big data processing with Apache Ignite,” InfoWorld, 2016, https://www.infoworld.com/article/3135070/data-center/fire-up-big-data-processing-with-apache-ignite.html
- xii. GridGain; “The Foundation of the GridGain In-Memory Computing Platform,” GridGain, 2018, https://www.gridgain.com/technology/apache-ignite
- xiii. Apache Software Foundation; “The Apache Software Foundation Announces Apache® Geode™ as a Top-Level Project,” GlobeNewsWire, 2018, https://globenewswire.com/news-release/2016/11/21/891611/0/en/The-Apache-Software-Foundation-Announces-Apache-Geode-as-a-Top-Level-Project.html