© Butch Quinto 2018
Butch Quinto, Next-Generation Big Data, https://doi.org/10.1007/978-1-4842-3147-0_10

10. Distributed In-Memory Big Data Computing

Butch Quinto, Plumpton, Victoria, Australia

Alluxio, formerly known as Tachyon, is an open source project from UC Berkeley AMPLab. Alluxio is a distributed, memory-centric storage system originally developed as a research project in 2012 by Haoyuan Li, then a PhD student at AMPLab and a founding Apache Spark committer. i The project is the storage layer of the Berkeley Data Analytics Stack (BDAS). In 2015, Li founded Alluxio, Inc. to commercialize Alluxio, receiving a $7.5 million cash infusion from Andreessen Horowitz. Today, Alluxio has more than 200 contributors from 50 organizations around the world, including Intel, IBM, Yahoo, and Red Hat. Several high-profile companies are currently using Alluxio in production, including Baidu, Alibaba, Rackspace, and Barclays. ii

Ion Stoica, co-author of Spark, co-founder and executive chairman of Databricks, co-director of UC Berkeley AMPLab, and PhD co-advisor to Haoyuan Li, has stated, “As a layer that abstracts away the differences of existing storage systems from the cluster computing frameworks such as Apache Spark and Hadoop MapReduce, Alluxio can enable the rapid evolution of the big data storage, similarly to the way the Internet Protocol (IP) has enabled the evolution of the Internet.” iii

Additionally, Michael Franklin, Professor of Computer Science and Director of the AMPLab at UC Berkeley, said that “AMPLab has created some of the most important open source technologies in the new big data stack, including Apache Spark. Alluxio is the next project with roots in the AMPLab to have major impact. We see it playing a huge disruptive role in the evolution of the storage layer to handle the expanding range of big data use cases.” iv

Architecture

Alluxio is a memory-centric distributed storage system that aims to be the de facto storage unification layer for big data. It provides a virtualization layer that unifies access to different storage engines such as the local file system, HDFS, S3, and NFS, and to computing frameworks such as Spark, MapReduce, Hive, and Presto. Figure 10-1 gives you an overview of Alluxio’s architecture.
Figure 10-1. Alluxio Architecture Overview

Alluxio is the middle layer that coordinates data sharing and directs data access, while at the same time providing computing frameworks and big data applications with high-performance, low-latency, memory-speed data access. Alluxio integrates seamlessly with Spark and Hadoop, requiring only minor configuration changes. By taking advantage of Alluxio’s unified namespace feature, applications only need to connect to Alluxio to access data stored in any of the supported storage engines. Alluxio has its own native API as well as a Hadoop-compatible file system interface. The Hadoop-compatible interface enables users to execute code originally written for Hadoop without any code changes, and a REST API provides access from other languages. We will explore the APIs later in the chapter.
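
As a quick illustration of the native API, the sketch below opens a file through the Alluxio 1.x client from Scala. This is a minimal sketch, not part of the original example: it assumes the Alluxio client library is on the classpath, that the client defaults point at a local master, and that a file such as /test01.csv already exists in Alluxio (one is created later in this chapter).

import alluxio.AlluxioURI
import alluxio.client.file.FileSystem

// Obtain a client handle to the Alluxio file system using the default configuration.
val fs = FileSystem.Factory.get()

// Open an existing file in Alluxio and read some of its bytes.
val in = fs.openFile(new AlluxioURI("/test01.csv"))
val buffer = new Array[Byte](1024)
val bytesRead = in.read(buffer)
in.close()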

Alluxio’s unified namespace feature does not support relational data stores such as Kudu, relational databases such as Oracle or SQL Server, or document databases such as MongoDB. Of course, moving data between Alluxio and the storage engines mentioned is still supported: developers can use a computing framework such as Spark to create a DataFrame from a Kudu table and store it in an Alluxio file system in Parquet or CSV format, and vice versa (Figure 10-2).
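
To make that flow concrete, here is a minimal sketch of reading a Kudu table into a DataFrame with the kudu-spark connector and persisting it to Alluxio as Parquet. The Kudu master address (kudumaster01:7051), the table name (customers), and the Alluxio path are hypothetical, and the sketch assumes the kudu-spark package and the Alluxio client are on the Spark classpath.

import org.apache.kudu.spark.kudu._

// Load a Kudu table into a Spark DataFrame using the kudu-spark connector.
val kuduDF = spark.read.options(Map("kudu.master" -> "kudumaster01:7051", "kudu.table" -> "customers")).kudu

// Store the data in Alluxio in Parquet format so other jobs can read it at memory speed.
kuduDF.write.mode("overwrite").parquet("alluxio://localhost:19998/customers_parquet")

// And vice versa: read the Parquet copy from Alluxio back into a DataFrame.
val alluxioDF = spark.read.parquet("alluxio://localhost:19998/customers_parquet")
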
Figure 10-2. Alluxio Technical Architecture

Why Use Alluxio?

The typical Hadoop distribution includes more than 20 open source components. Adding another component to your technology stack is probably the furthest thing from your mind. Nevertheless, Alluxio delivers substantial benefits that will make you wonder why it is not part of core Apache Spark.

Significantly Improve Big Data Processing Performance and Scalability

Over the years, memory has gotten cheaper while its performance has gotten faster. Meanwhile, the performance of hard drives has improved only marginally. There is no question that processing data in memory is an order of magnitude faster than processing data on disk. In almost all programming paradigms, we are advised to cache data in memory to improve performance. One of the main advantages of Apache Spark over MapReduce is its ability to cache data. Alluxio takes that to the next level, providing big data applications not just with a caching layer, but with a full-blown distributed, high-performance, memory-centric storage system.

Baidu operates one of the largest Alluxio clusters in the world, with 1,000 worker nodes handling more than 2 PB of data. With Alluxio, Baidu is seeing an average of 10x and up to 30x improvement in query and processing time, significantly improving Baidu’s ability to make important business decisions. v Barclays published an article describing their experience with Alluxio: data scientist Gianmario Spacagna and Harry Powell, Head of Advanced Analytics, were able to tune their Spark jobs from hours to seconds using Alluxio. vi Qunar.com, one of China’s largest travel search engines, experienced a 15x to 300x performance improvement using Alluxio. vii

Multiple Frameworks and Applications Can Share Data at Memory Speed

A typical Hadoop cluster has multiple sessions running different computing frameworks such as Spark and MapReduce. In the case of Spark, each application gets its own executor processes, which run in their own JVMs and execute tasks as threads, isolating Spark applications from each other. This means that Spark (and MapReduce) applications have no way of sharing data, except by writing to a storage system such as HDFS or S3. As shown in Figure 10-3, a Spark job and a MapReduce job use the same data stored in HDFS or S3. In Figure 10-4, multiple Spark jobs use the same data, with each job storing its own version of the data in its own heap space. viii Not only is data duplicated, but sharing data via HDFS or S3 can be slow, particularly if you’re sharing large amounts of data.
Figure 10-3. Different frameworks sharing data via HDFS or S3

Figure 10-4. Different jobs sharing data via HDFS or S3

By using Alluxio as off-heap storage (Figure 10-5), multiple frameworks and jobs can share data at memory speed, reducing data duplication, increasing throughput, and decreasing latency.
Figure 10-5. Different jobs and frameworks sharing data at memory speed
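
As a minimal sketch of this sharing pattern (the host and path are hypothetical, dataDF stands for a DataFrame prepared by the first job, and both sessions are assumed to have the Alluxio client on their classpath), one Spark session writes a dataset to Alluxio and a second, completely separate session reads the same in-memory copy:

// Session 1: write the prepared dataset to Alluxio.
dataDF.write.mode("overwrite").parquet("alluxio://localhost:19998/shared/customers")

// Session 2 (a different Spark application or another framework entirely):
// read the same copy of the data at memory speed, without going back to HDFS or S3.
val sharedDF = spark.read.parquet("alluxio://localhost:19998/shared/customers")
sharedDF.count()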

Provides High Availability and Persistence in Case of Application Termination or Failure

In Spark, each executor process and its executor memory reside in the same JVM, with all cached data stored in the JVM heap space (Figure 10-6).
Figure 10-6. Spark job with its own heap memory

Figure 10-7. Spark job crashes or completes

When the job completes, or the JVM crashes due to a runtime exception, all the data cached in heap space is lost, as shown in Figures 10-7 and 10-8.
Figure 10-8. Spark job crashes or completes. Heap space is lost

The solution is to use Alluxio as off-heap storage (Figure 10-9).
Figure 10-9. Spark using Alluxio as off-heap storage

In this case, even if the Spark JVM crashes, the data is still available in Alluxio (Figures 10-10 and 10-11).
Figure 10-10. Spark job crashes or completes

Figure 10-11. Spark job crashes or completes. Heap space is lost. Off-heap memory is still available.
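
A minimal sketch of this behavior (the host and paths are hypothetical, dataRDD stands for an RDD built by the first application, and the Alluxio client is assumed to be on the Spark classpath): the first session writes its working set to Alluxio, and after that JVM exits or crashes, a brand-new session can pick the data up where the first one left off.

// Session 1: save the RDD to Alluxio before the application terminates.
dataRDD.saveAsTextFile("alluxio://localhost:19998/backup/test01")

// Session 2, started after the first JVM has completed or crashed:
// the data is still available in Alluxio memory.
val recoveredRDD = sc.textFile("alluxio://localhost:19998/backup/test01")
recoveredRDD.count()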

Optimize Overall Memory Usage and Minimize Garbage Collection

By using Alluxio, memory usage is considerably more efficient, since data is shared across jobs and frameworks rather than duplicated in each application’s heap. And because the data is stored off-heap, garbage collection is minimized as well, further improving the performance of jobs and applications (Figure 10-12).
Figure 10-12. Multiple Spark and MapReduce jobs can access the same data stored in Alluxio
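
To see the difference in practice, compare an ordinary on-heap cache with the Alluxio approach. This is a sketch rather than a benchmark; the host and path are hypothetical, and dataDF stands for any DataFrame in the session.

// On-heap caching: cached blocks live inside each executor JVM, are visible only
// to this application, and add to its garbage collection load.
dataDF.cache()
dataDF.count()   // materializes the cache in executor heap memory

// The Alluxio alternative keeps the data off-heap, so executor heaps stay small,
// GC pauses are reduced, and other jobs can reuse the same copy.
dataDF.write.mode("overwrite").parquet("alluxio://localhost:19998/cache/customers")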

Reduce Hardware Requirements

Big data processing with Alluxio is significantly faster than with HDFS alone. IBM’s tests show Alluxio outperforming HDFS by 110x for write IO. ix With that kind of performance, there is less need for additional hardware, saving you infrastructure and licensing costs.

Alluxio Components

Similar to Hadoop and most other big data components, Alluxio has a master/slave architecture.

Primary Master

The primary master manages the global metadata of the cluster.

Secondary Master

The secondary master manages a journal and periodically does a checkpoint.

Worker

Workers store the data and serve requests from applications to read or write data. Workers also manage local resources such as memory and disk space.

Client

The Alluxio client provides a filesystem API for users to communicate with Alluxio.

Installation

There are several ways to install Alluxio. Alluxio runs on YARN, Mesos, Docker, and EC2 to mention a few. x To get you started quickly, I’ll install Alluxio on a single server.

Download Alluxio from the Alluxio website. I am using version 1.4.0 in the examples that follow.

wget http://alluxio.org/downloads/files/1.4.0/alluxio-1.4.0-bin.tar.gz
tar xvfz alluxio-1.4.0-bin.tar.gz
cd alluxio-1.4.0

Let’s format the worker storage directory and Alluxio journal to prepare the worker and master.

./bin/alluxio format
Waiting for tasks to finish...
All tasks finished, please analyze the log at /opt/alluxio-1.4.0/bin/../logs/task.log.
Formatting Alluxio Master @ server01

Let’s start Alluxio.

./bin/alluxio-start.sh local
Waiting for tasks to finish...
All tasks finished, please analyze the log at /opt/alluxio-1.4.0/bin/../logs/task.log.
Waiting for tasks to finish...
All tasks finished, please analyze the log at /opt/alluxio-1.4.0/bin/../logs/task.log.
Killed 0 processes on server01
Killed 0 processes on server01
Starting master @ server01. Logging to /opt/alluxio-1.4.0/logs
Formatting RamFS: /mnt/ramdisk (4000mb)
Starting worker @ server01. Logging to /opt/alluxio-1.4.0/logs
Starting proxy @ server01. Logging to /opt/alluxio-1.4.0/logs

I create a 100 MB file and copy it to Alluxio memory with copyFromLocal. You can create a bigger file if you have more memory. Listing the contents of the directory shows that the file is now stored in memory.

./bin/alluxio fs ls /
[root@server01 alluxio-1.4.0]# ./bin/alluxio fs copyFromLocal /root/test01.csv /
Copied /root/test01.csv to /
./bin/alluxio fs ls /
-rw-r--r--     root           root           103.39MB  05-22-2017 22:21:14:925  In Memory      /test01.csv

Let’s persist the file from memory to the local file system.

./bin/alluxio fs persist /test01.csv
persisted file /test01.csv with size 108416290

Apache Spark and Alluxio

You access data in Alluxio from Spark in much the same way you would access data stored in HDFS or S3.

// Read the CSV file stored in Alluxio as an RDD of lines.
val dataRDD = sc.textFile("alluxio://localhost:19998/test01.csv")
// Split each line into its comma-separated fields.
val parsedRDD = dataRDD.map{ _.split(",") }
// Describe the records with a case class and convert the RDD to a DataFrame.
case class CustomerData(userid: Long, city: String, state: String, age: Short)
val dataDF = parsedRDD.map{ a => CustomerData(a(0).toLong, a(1), a(2), a(3).toShort) }.toDF
dataDF.show()
+------+---------------+-----+---+
|userid|           city|state|age|
+------+---------------+-----+---+
|   300|       Torrance|   CA| 23|
|   302|Manhattan Beach|   CA| 21|
+------+---------------+-----+---+
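
Writing back to Alluxio works the same way. As a small continuation of the example (the output paths are hypothetical, and Spark 2.x is assumed), the DataFrame can be saved to Alluxio in CSV or Parquet format so that other jobs and frameworks can pick it up:

// Save the DataFrame back to Alluxio in CSV format.
dataDF.write.mode("overwrite").csv("alluxio://localhost:19998/output/customers_csv")

// Or in Parquet format, which is usually the better choice for analytical workloads.
dataDF.write.mode("overwrite").parquet("alluxio://localhost:19998/output/customers_parquet")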

You can also access Alluxio from MapReduce, Hive, Flink, and Presto to mention a few. Check Alluxio’s online documentation for more details.

Administering Alluxio

Alluxio provides web interfaces for both the master and the workers to facilitate system administration and monitoring. They give you both high-level and detailed information on space capacity, usage, uptime, start time, and the list of files, to name a few. Alluxio also provides a command-line interface for typical file system operations.

Master

You can access Alluxio’s master home page by visiting http://<Master IP Address>:19999 (Figure 10-13).
Figure 10-13. Master home page

Worker

You can access each Alluxio worker’s web interface by visiting http://<Worker IP Address>:30000 (Figure 10-14).
Figure 10-14. Worker home page

Apache Ignite

Ignite is another in-memory platform similar to Alluxio. GridGain Systems contributed Apache Ignite to the Apache Software Foundation in 2014, and it was promoted to a top-level project in 2015. xi It is extremely versatile and can be used as an in-memory data grid, in-memory database, in-memory distributed file system, streaming analytics engine, and accelerator for Hadoop and Spark, to mention a few. xii

Apache Geode

Geode is a distributed in-memory database designed for transactional applications with low-latency response times and high-concurrency requirements. Pivotal submitted Geode to the Apache Incubator in 2015, and it graduated to become a top-level Apache project in November 2016. GemFire, the commercial version of Geode, was a popular low-latency transactional system used in Wall Street trading platforms. xiii

Summary

Spark is a fast in-memory data processing framework. Alluxio can make it significantly faster by providing off-heap storage that makes data sharing across jobs and frameworks more efficient, minimizes garbage collection, and optimizes overall memory usage. Not only will jobs run considerably faster, but you will also reduce costs due to decreased hardware requirements. Alluxio is not the only distributed in-memory storage option available; Apache Ignite and Apache Geode are viable alternatives, as are commercial products such as Oracle Coherence and Oracle TimesTen.

This chapter serves as an introduction to distributed in-memory computing, and to Alluxio in particular. Alluxio, which began life as Tachyon, Spark’s original off-heap storage backend, remains a popular off-heap storage layer for Spark. You can learn more about Alluxio by visiting its website at Alluxio.org or Alluxio.com.

References

i.

ii. Haoyuan Li; “Alluxio, formerly Tachyon, is Entering a New Era with 1.0 release,” Alluxio, 2016, https://www.alluxio.com/blog/alluxio-formerly-tachyon-is-entering-a-new-era-with-10-release

iii. Haoyuan Li; “Alluxio, formerly Tachyon, is Entering a New Era with 1.0 release,” Alluxio, 2016, https://www.alluxio.com/blog/alluxio-formerly-tachyon-is-entering-a-new-era-with-10-release

iv. MarketWired; “Alluxio Virtualizes Distributed Storage for Petabyte Scale Computing at In-Memory Speeds,” Alluxio, 2016, http://www.marketwired.com/press-release/alluxio-virtualizes-distributed-storage-petabyte-scale-computing-in-memory-speeds-2099053.htm

v. MarketWired; “Alluxio Virtualizes Distributed Storage for Petabyte Scale Computing at In-Memory Speeds,” Alluxio, 2016, http://www.marketwired.com/press-release/alluxio-virtualizes-distributed-storage-petabyte-scale-computing-in-memory-speeds-2099053.htm

vi. Henry Powell, Gianmario Spacagna; “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds,” DZone, 2016, https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon

vii. Haoyuan Li; “Alluxio Keynote at Strata+Hadoop World Beijing 2016,” Alluxio, 2016, https://www.slideshare.net/Alluxio/alluxio-keynote-at-stratahadoop-world-beijing-2016-65172341

viii. Mingfei S.; “Getting Started with Tachyon by Use Cases,” Intel, 2016, https://software.intel.com/en-us/blogs/2016/02/04/getting-started-with-tachyon-by-use-cases

ix. Gil Vernik; “Tachyon for ultra-fast Big Data processing,” IBM, 2015, https://www.ibm.com/blogs/research/2015/08/tachyon-for-ultra-fast-big-data-processing/

x. Alluxio; “Quick Start Guide,” Alluxio, 2018, https://www.alluxio.org/docs/1.6/en/Getting-Started.html

xi. Nikita Ivanov; “Fire up big data processing with Apache Ignite,” InfoWorld, 2016, https://www.infoworld.com/article/3135070/data-center/fire-up-big-data-processing-with-apache-ignite.html

xii. GridGain; “The Foundation of the GridGain In-Memory Computing Platform,” GridGain, 2018, https://www.gridgain.com/technology/apache-ignite

xiii. Apache Software Foundation; “The Apache Software Foundation Announces Apache® Geode™ as a Top-Level Project,” GlobeNewsWire, 2018, https://globenewswire.com/news-release/2016/11/21/891611/0/en/The-Apache-Software-Foundation-Announces-Apache-Geode-as-a-Top-Level-Project.html