Azure HDInsight is a complete cloud-based version of Apache Hadoop and is equivalent to the Hortonworks Data Platform (HDP) Hadoop distribution. Apache Hadoop is a framework for the distributed processing and analysis of large datasets across clusters of computers.
Azure HDInsight currently supports the following cluster types:
- Apache Hadoop: Clusters based on Apache Hadoop use the Hadoop Distributed File System (HDFS), YARN for resource management, and the MapReduce programming model. This cluster type is intended for the parallel processing and analysis of batch data.
- Apache Spark: Apache Spark is a parallel processing framework that supports in-memory processing to improve the performance of applications that analyze large amounts of data. Spark supports SQL queries, streaming data, and machine learning workloads.
- Apache HBase: Apache HBase is a NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially on the scale of billions of rows by millions of columns.
- Machine Learning Server (formerly known as Microsoft R Server): Machine Learning Server is a server for hosting and managing parallel, distributed R processes. It gives data scientists, statisticians, and R programmers on-demand access to scalable, distributed analysis methods in HDInsight.
- Apache Storm: Apache Storm is a distributed real-time computation system for the fast processing of large data streams. It is offered as a managed cluster in HDInsight.
- Interactive Query (Apache Hive LLAP): This cluster type provides in-memory caching for interactive and faster Hive queries.
- Apache Kafka: Apache Kafka is an open-source platform for building streaming data pipelines and applications. It also provides message queue functionality that allows you to publish and subscribe to data streams.
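
To make the MapReduce programming model mentioned for the Apache Hadoop cluster type more concrete, the following is a minimal sketch of a word count in plain Python. It does not use Hadoop or any HDInsight API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel on many nodes, and the framework would handle the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read its input from HDFS.
    sample = ["big data big clusters", "data in clusters"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, both steps can be distributed across machines without coordination. That independence is what makes the model suitable for batch processing on clusters.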