There are various solutions for big data storage and analysis. These three sections
discuss just a few.
Distributed computing refers to data storage and processing achieved through many
smaller, networked computers rather than on a single supercomputer or very large server.
Apache™ Hadoop® is a leading example of open-source software that uses distributed computing
to achieve big data storage and analysis. The interested reader can find many articles
about Hadoop at www.sas.com.
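Hadoop itself is a large Java framework, but the pattern it distributes across a cluster, MapReduce, is simple enough to sketch. Below is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases using a word count; the function names and data are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict

# Minimal single-machine sketch of the MapReduce pattern that Hadoop
# distributes across a cluster. Names here are illustrative, not Hadoop APIs.

def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big storage", "distributed storage scales"]

# In a real cluster, each map task would run on the node holding its data block.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # e.g. {'big': 2, 'data': 1, 'needs': 1, 'storage': 2, ...}
```

The point of the pattern is that the map and reduce steps are independent per key, so the framework can spread them over as many machines as are available.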
There are many advantages to using distributed computing:
- Distributed computing can achieve dramatic improvements in storage efficiency over
other solutions. Efficiency is essential when dealing with the volume of big data.
- Distributed computing offers greater processing speed and power, which means it
can more easily deal with the velocity of big data.
- Scalability, the ability to grow or shrink your processing capacity quickly and easily,
is facilitated by systems such as Hadoop, because you can add or remove smaller
machines as required.
- Distributed systems like Hadoop do not force the data into a particular shape.
Therefore, you can store, and potentially analyze, completely unstructured data.
This is not the case with structured data warehouses, as discussed next.
- Cost is lower. Networking many smaller commodity machines is cheaper than buying
fewer, bigger machines, and such a cluster is usually easier to upgrade.
- Data integrity and safety are far better. Because the tasks are spread over many
(potentially thousands of) small machines, the framework deals with an individual
machine failure by quickly excluding the failed machine, and redundant copies of the
data are usually kept on other nodes, as sketched after this list.
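To make the fault-tolerance point concrete, here is a toy Python sketch of block replication, the mechanism used by HDFS, Hadoop's file system: each block of data is stored on several different nodes, so losing one machine loses no data. HDFS does default to three replicas per block, but the node names, block names, and round-robin placement below are illustrative assumptions, not Hadoop's actual placement algorithm.

```python
import itertools

REPLICATION_FACTOR = 3  # HDFS defaults to 3 copies of each block

# Hypothetical cluster: assign each block to REPLICATION_FACTOR distinct nodes.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["block_a", "block_b", "block_c", "block_d"]

placement = {}
cycle = itertools.cycle(nodes)
for block in blocks:
    placement[block] = [next(cycle) for _ in range(REPLICATION_FACTOR)]

def surviving_copies(placement, failed_node):
    """Copies of each block still reachable after one machine fails."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}

after_failure = surviving_copies(placement, "node2")
assert all(after_failure.values())  # every block still has at least one copy
print(after_failure)
```

Because every block survives on at least one other node, the framework can simply exclude the failed machine and re-replicate its blocks elsewhere in the background.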
SAS has positioned itself as an excellent partner for Hadoop in particular, providing
an end-to-end solution for preparing big data, passing it to Hadoop, and analyzing
the processed data.