Big data tools – what to learn

Big data is not just big in terms of data; it is also big in terms of the number of related tools and technologies to learn. Many of the tools are either open source or based on open source, and developers and analysts are highly encouraged to participate in open forums and contribute to them. However, you should be realistic and stop chasing every single release or new tool that appears in the market almost daily.

In this section, I will cover an end-to-end data life cycle for typical big data projects in financial organizations, using tools that are popular and reasonably established. I am not endorsing any tool, and have named a few of them only to give you some background.

Getting your data into HDFS

Before you can do anything with your big data, you need to get it into HDFS using a variety of tools.

A few tools to put the data into HDFS are:

  • Shell commands: These can be used when you have the data file already accessible on your local network. Data manipulation on HDFS is very similar to using Linux shell commands and is easy to pick up. For example, you can copy a large transaction data file into HDFS, perform some calculations, and then export the results back to the local network (see the command sketch after this list).
  • Sqoop: This can be used when you have the data residing in a relational database already. For example, if you have a requirement to move all transaction data older than 6 months from your RDBMS to your HDFS, Sqoop comes in very handy.
  • Flume: This can be used when you need to gather log data from the Web or other systems and load it into HDFS for further analytics. For example, if you need to access external systems to pull in massive volumes of financial news data for research, Flume is a good tool.
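
To give you a feel for these tools, here is a minimal command-line sketch. The file names, database connection details, and the Flume agent and its configuration file are hypothetical, and the exact options vary by tool and distribution version:

    # Shell commands: copy a local transaction file into HDFS and pull results back out
    hadoop fs -mkdir -p /data/transactions
    hadoop fs -put /local/data/transactions_2015.csv /data/transactions/
    hadoop fs -get /data/results/summary.csv /local/reports/

    # Sqoop: import transactions older than 6 months from a relational database into HDFS
    sqoop import \
        --connect jdbc:mysql://dbserver:3306/bankdb \
        --username etl_user -P \
        --table transactions \
        --where "txn_date < DATE_SUB(CURRENT_DATE, INTERVAL 6 MONTH)" \
        --target-dir /data/archive/transactions

    # Flume: start an agent (defined in news-agent.conf) that streams news feed data into HDFS
    flume-ng agent --conf ./conf --conf-file news-agent.conf --name newsAgent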

I recommend you self-learn these:

  • For developers: Shell commands, Sqoop, and Flume
  • For analysts: Shell commands to manipulate data for analysis

Querying data from HDFS

Once the data has been loaded into HDFS, developers and analysts need to query the data and apply various transformations according to business requirements.

A few tools to query the data from HDFS are:

  • Pig: This can be used if your data operations are simple (such as filter, join, and group) and you prefer procedural languages. It is quite common for developers to load the data into HDFS and let analysts play with it using Pig scripts.
  • Hive: This can be used if your data operations are simple (such as filter, join, and group) and you prefer SQL. You can map Hive tables to structured HDFS files and query them using Hive Query Language.
  • MapReduce Java: This uses Map and Reduce programs written in Core Java, generally suited for very complex algorithms where high-level languages such as Pig and Hive fall short.
  • MapReduce non-Java: You can also write MapReduce jobs in a programming language of your choice and execute them with the Hadoop streaming utility. This is useful if you want to port existing code written in other programming languages to Hadoop (see the query sketches after this list).
  • Other Apache and industry tools are built to take advantage of SQL-like queries of HDFS storage. A few such products are Greenplum, Hadapt, and Impala.
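
To give these a flavour, here are hedged, minimal examples run from the command line. The transaction data, field names, and the mapper.py/reducer.py scripts are made up for illustration, a transactions table is assumed to already exist in the Hive metastore, and the streaming jar location varies by distribution (in practice you would also keep the Pig script in its own .pig file rather than inline):

    # Hive: filter and group structured transaction data using SQL-like syntax
    hive -e "SELECT account_id, SUM(amount)
             FROM transactions
             WHERE amount > 10000
             GROUP BY account_id"

    # Pig: the same filter-and-group logic expressed as a procedural script
    pig -e "
      txns    = LOAD '/data/transactions' USING PigStorage(',')
                AS (account_id:chararray, txn_date:chararray, amount:double);
      large   = FILTER txns BY amount > 10000.0;
      grouped = GROUP large BY account_id;
      totals  = FOREACH grouped GENERATE group, SUM(large.amount);
      DUMP totals;"

    # Hadoop streaming: run a MapReduce job with mapper/reducer scripts written in any language
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /data/transactions \
        -output /data/txn_totals \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py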

I recommend you self-learn these:

  • For developers: Although not essential, it will be beneficial to learn Java, as there may be a need to write MapReduce Java programs. Learning Pig and Hive will be very useful as well.
  • For analysts: Both Pig and Hive for data analysis.

SQL on Hadoop

Anyone who has been involved in data projects knows how useful it is to write SQL queries on your data. There are multiple established products you can use to apply a SQL layer on top of the Hadoop platform. Some of them are:

  • Hive: You can create your own Hive tables and load data directly from sources, which can be queried using HiveQL.
  • Stinger.Next: Until now, Hive has been a write-once, read-only database that doesn't allow updates and transactions. Stinger.Next is an initiative, with phased releases in late 2014 and mid-2015, to provide scalable sub-second query response and transactions with inserts, updates, and deletes.
  • Drill: Based on Google's Dremel (the technology behind BigQuery), Drill is a low-latency query engine that can work with multiple types of data stores in parallel.
  • Spark SQL: This is used within financial organizations for real-time, in-memory, parallel processing on Hadoop data. Spark SQL builds on Spark and is 10–100 times faster than Hive due to its in-memory processing.
  • Phoenix: This is a SQL layer for HBase, built for low-latency read/write operations.
  • Cloudera Impala: Like Hive, this queries Hadoop data using a similar syntax, but does not use MapReduce-based execution (see the sketch after this list).
  • Other vendor offerings include HAWQ for Pivotal HD, Oracle Big Data SQL, and IBM BigSQL.
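
As a rough illustration of how interchangeable the syntax is, the same query (against the hypothetical transactions table used earlier) can be pushed through several of these engines from the command line; the exact shells and options depend on your distribution and versions:

    # Hive, via HiveQL
    hive -e "SELECT account_id, COUNT(*) FROM transactions GROUP BY account_id"

    # Spark SQL command-line shell: the same query, executed in memory by Spark
    spark-sql -e "SELECT account_id, COUNT(*) FROM transactions GROUP BY account_id"

    # Cloudera Impala shell: similar syntax again, but no MapReduce-based execution
    impala-shell -q "SELECT account_id, COUNT(*) FROM transactions GROUP BY account_id"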

I recommend you self-learn these, but as long as you are familiar with basic SQL, that is good enough.

Real time

Hadoop is a batch processing system by design and not really a real-time analytics system. Even though we hear of real-time tools such as Storm, Spark, and IBM InfoSphere Streams, they are actually near real time, working on mini-batches. However, they can be argued to be real time because, with sub-second responses, they feel real time to human eyes.

Real-time analytics is used for credit card fraud analytics, live call center data processing, or live stock index patterns.

The key tools in this paradigm are:

  • The database ideally needs to be a wide-column NoSQL store such as HBase or Cassandra
  • Spark is an in-memory engine with programming support for Scala, Java, and Python
  • Elasticsearch and Solr, both based on Lucene, are search and indexing software for real-time web search and recommendations
  • Kafka remains the top Apache project for publish-subscribe systems (see the sketch after this list)
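
As a tiny, hedged example of the publish-subscribe part, here is how a Kafka topic might be created and exercised with the console tools that ship with Kafka; the topic name, ZooKeeper and broker addresses are illustrative, and the script options differ between Kafka versions:

    # Create a topic for live card transactions
    kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 1 --partitions 3 --topic card-transactions

    # Publish an event (in practice a payment system would write here)
    echo '{"card":"1234","amount":250.00}' | \
        kafka-console-producer.sh --broker-list localhost:9092 --topic card-transactions

    # Subscribe and read events back, as a stream processor such as Spark Streaming would
    kafka-console-consumer.sh --zookeeper localhost:2181 \
        --topic card-transactions --from-beginning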

I recommend you self-learn these:

  • For developers: Although slightly advanced, if real-time data processing excites you, learning at least one wide-column NoSQL database and Spark will be a good start
  • For analysts: In addition to basic SQL to write queries on HBase, learning Spark SQL will be really useful

Data governance and operations

As briefly discussed in the previous chapter, a few widely used tools in this space are Apache Falcon, Sentry, Kerberos, Oozie, and ZooKeeper. In many cases, Hadoop distributions provide their own data management tools with better user interfaces.

I recommend you self-learn:

  • It helps if developers are able to use Falcon, Oozie, ZooKeeper, or distribution-provided administration tools (a small Oozie sketch follows). However, some may argue that these tasks fall within an administrator's responsibility.
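
For example, a minimal Oozie interaction from the command line might look like the following; the Oozie server URL and the job.properties file are placeholders for your own environment:

    # Submit and run a workflow described by job.properties, then check on it
    oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run
    oozie job -oozie http://oozie-server:11000/oozie -info <job-id>   # job-id returned by the run command
    oozie jobs -oozie http://oozie-server:11000/oozie -filter status=RUNNING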

ETL tools

In addition to the ETL functionality provided by the Apache Hadoop tools, premium ETL tools such as Informatica, DataStage, SAS, Talend, Syncsort, and so on provide connectors to Hadoop platforms, and some even allow you to generate executables that run directly on Hadoop engines. The key selling points are:

  • No hand coding is required, which makes the resulting jobs easier to maintain
  • You can continue with in-house ETL tool specialists instead of hiring new programmers

I recommend you self-learn:

  • ETL tools are fourth-generation, drag-and-drop tools and can be learned very quickly

Data analytics and business intelligence

R is undoubtedly the most popular tool for statistical analytics, but it needs a slightly better user interface to be widely accepted in financial organizations.

Although there are premium BI tools within financial organizations as part of their existing data warehouses and BI platforms, the new breed of BI tools such as Tableau, Pentaho, Spotfire, and Datameer are more suited to Hadoop and have already made inroads into the financial sector.

I recommend you self-learn:

  • Analysts could learn R and one of the BI tools in use by their organization