We will discuss the big data components responsible for functions such as storage, resource management, governance, processing, and analysis. Most of these components are packaged into enterprise-grade, commercially supported Hadoop distributions, which will be discussed later in more detail.
Data storage is where your raw data lives. It is a reliable, fault-tolerant distributed filesystem that contains structured and unstructured data.
The data is stored either on a distributed on-premises filesystem, such as the Hadoop Distributed File System (HDFS), or on a cloud-based system, such as Amazon S3. Data can also be stored in NoSQL databases, such as HBase or Cassandra, which run on top of Hadoop storage.
To move data into a big data ecosystem, there are data integration tools, such as Flume and Sqoop, or web-service interfaces, such as REST and SOAP for Amazon S3.
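To make this concrete, a Sqoop import that pulls a relational table into HDFS might look like the following sketch; the JDBC URL, credentials, table name, and target directory are all hypothetical placeholders:

```shell
# Import a relational table into HDFS using four parallel mappers
# (connection details and paths are illustrative only)
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/raw/transactions \
  -m 4
```

The `-P` flag prompts for the password interactively rather than exposing it on the command line.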
NoSQL databases are non-relational, distributed, generally open source, and horizontally scalable.
The term NoSQL is slightly misleading, as many NoSQL databases do support SQL-like operations. The term is now popularly interpreted as Not Only SQL.
Some other characteristics are schema-free, easy replication support, integrated caching, simple API, and eventually consistent/BASE (not ACID).
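The schema-free characteristic can be illustrated with a toy key-value document store in plain Python; this is a sketch for illustration only, not a real NoSQL client API:

```python
# Minimal sketch of a schema-free key-value document store: records in the
# same logical "table" need not share a common set of fields, and no schema
# is declared or enforced up front.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        # Any dict-shaped document is accepted as-is; no schema validation.
        self._docs[key] = dict(document)

    def get(self, key):
        return self._docs.get(key)

store = DocumentStore()
store.put("user:1", {"name": "Alice", "email": "alice@example.com"})
store.put("user:2", {"name": "Bob", "twitter": "@bob"})  # different fields, no error
```

In a relational database, adding the `twitter` field would require altering the table schema first; here the two documents simply coexist.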
In comparison to relational databases, NoSQL databases have superior performance and are more scalable if you need to handle the following:
There are four main types of NoSQL databases:
The market is flooded, with dozens of NoSQL vendors, each claiming superiority over the others. Most databases of a given type follow similar architectures and development methodologies, so it is not uncommon for organizations to stick to only a few vendors.
Effective resource management is a must, especially when multiple applications run concurrently and compete for compute and data resources. Resource managers such as YARN and Mesos handle the allocation, de-allocation, and efficient utilization of these resources.
There is also a collection of tools to manage the workflow, provisioning, and distributed coordination; for example, Oozie and Zookeeper.
Data governance is all about taking care of the data, ensuring that the metadata (information about data) is recorded properly and the data is only accessed by authorized people and systems. There is a collection of tools to manage metadata, authentication, authorization, security, privacy settings, and lineage. The tools used are Apache Falcon, HCatalog, Sentry, and Kerberos.
Batch computing is an efficient way to process large data volumes, as the data is collected over a period of time and then processed in bulk. MapReduce jobs and Pig and Hive scripts are commonly used for batch processing.
Typically, a batch process will have persistent-disk-based data storage for input, output, and intermediate results.
Examples include end-of-day risk metrics calculation and historical trade analytics.
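The MapReduce programming model behind such batch jobs can be sketched in plain Python (no Hadoop involved): a map step emits key-value pairs, the framework groups them by key, and a reduce step aggregates each group. The classic word count below is an illustration, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in the input records.
def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group all emitted values by key (done by the framework in Hadoop).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: aggregate the values for each key.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

lines = ["the trade settled", "the trade failed"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"the": 2, "trade": 2, "settled": 1, "failed": 1}
```

In a real batch job, the input, intermediate, and output data would all live on persistent disk-based storage such as HDFS.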
Real-time computing is low-latency data processing, usually with a subsecond response. Spark and Storm are popular frameworks for real-time processing.
Typically, a real-time process will perform in-memory processing of continuous data input, and it doesn't necessarily need persistent storage for output and intermediate results.
Examples include live processing of tweets and stock prices, fraud detection, and system monitoring.
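The pattern can be sketched in pure Python as follows; this is an illustration of the streaming idea, not the Storm or Spark Streaming API, and the window size and deviation threshold are arbitrary choices. Each event is handled as it arrives, keeping only a small rolling window in memory:

```python
from collections import deque

# Flag stock prices that deviate sharply from a short rolling average --
# a toy stand-in for fraud detection or bad-tick monitoring on a live feed.
def monitor(prices, window=3, threshold=0.10):
    recent = deque(maxlen=window)  # only this window is held in memory
    alerts = []
    for price in prices:
        if recent:
            avg = sum(recent) / len(recent)
            if abs(price - avg) / avg > threshold:
                alerts.append(price)
                continue  # don't let outliers pollute the baseline
        recent.append(price)
    return alerts

alerts = monitor([100.0, 101.0, 100.5, 150.0, 100.8])
# 150.0 deviates ~49% from the rolling average and is flagged
```

Nothing here is written to persistent storage; state is limited to the rolling window, which is what makes subsecond responses feasible.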
Data integration tools bring data in and out of Hadoop ecosystems. In addition to tools provided by Apache or Hadoop distributions, such as Flume and Sqoop, there are many premium vendors such as Informatica, Talend, Syncsort, and Datameer, which are a one-stop shop for the majority of data integration needs. As many data integration tools are user-interface-based and use fourth-generation languages, they are easy to use and will keep your Hadoop ecosystem free from complicated MapReduce low-level programs.
Machine learning is the development of models and algorithms that learn from input data and improve based on feedback, rather than following explicit instructions. The most popular suite of machine learning libraries comes from Mahout, but it is not uncommon to program using Spark's MLlib library or custom Java-based MapReduce.
A few examples are speech recognition, anomaly or fraud detection, forecasting, and recommending products.
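As a flavor of the product-recommendation case, the toy item-to-item recommender below learns which products co-occur in past shopping baskets and suggests frequent co-purchases. It is a pure-Python sketch, not Mahout's or MLlib's API, and the basket data is invented:

```python
from collections import Counter
from itertools import combinations

# "Training": count how often each pair of items appears in the same basket.
def fit(baskets):
    co = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

# "Prediction": recommend the items most often bought alongside the given one.
def recommend(co, item, n=1):
    scores = Counter({b: c for (a, b), c in co.items() if a == item})
    return [product for product, _ in scores.most_common(n)]

baskets = [["bread", "butter"], ["bread", "butter", "jam"], ["beer", "chips"]]
model = fit(baskets)
# recommend(model, "bread") -> ["butter"], the most frequent co-purchase
```

The model is driven entirely by the input baskets: adding more data changes the recommendations without changing a single line of logic, which is the defining trait of machine learning described above.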
Hadoop distributions can be combined with different business intelligence and data visualization vendors that can connect to the underlying Hadoop platform to produce management and analytical reports.
There are many vendors in this space and nowadays almost all leading BI tools provide connectors to Hadoop platforms. The most trendy ones are Tableau, SAS, Splunk, Qlikview, and Datameer.
Gartner predicted that roles directly related to big data would number around 4.4 million worldwide by 2015, ranging from basic server maintenance to high-end data science innovation.
As more companies seek to run analytics on their big data, the number of roles is only likely to increase. 2015 is likely to see a marked rise in the number of people directly involved in big data, as the market has started maturing and more businesses now have proven business benefits.
The technical skills required are very diverse: server maintenance, low-level MapReduce programming, NoSQL database administration, data analytics, visualization design, data integration, data science, machine learning, and business analysis. Even non-technical roles, such as project management and front-office staff in finance, trading, marketing, and sales who analyze the results, will need retraining in the new set of analytics and visualization tools.
There are very few people with big data or Hadoop skills, and demand is very high, so they are generally paid above the market average. The pay packages for Hadoop roles tend to be better if:
The skills required are so diverse that you can choose a career in a subject that you are passionate about.
Some of the most popular job roles are:
People with development skills in Java, C#, relational databases, and server administration will find it easier to learn Hadoop development, get hands-on with a few projects, and choose to either specialize in a tool or programming language. They can also learn more skills across a variety of different components and take on architect/technical lead roles.
People with analysis skills who already understand the business processes and integration with technology can learn a few high-level programming languages such as Pig, R, Python, or BI tools. Experience in BI tools and high-level programming works best with good business domain knowledge.
Although I am trying here to divide this into two simple career paths—development and analysis—in the real world there is a significant overlap in all the mentioned job roles.
As long as you have excellent development and analysis skills and are also ready to learn mathematics and business (through either formal education or experience), there is nothing stopping you from becoming the "data scientist"—claimed to be the sexiest job of the 21st century by Harvard Business Review.