Chapter 7. Big Data with Python

With the advent of cloud computing technologies, big data has become increasingly commonplace. What exactly is big data, and how can you work with it to gather useful information? How different is big data from the kind of data we come across every day? This chapter answers these questions and introduces you to the use of big data in finance. Big data tools provide the scalability and reliability needed to analyze large volumes of data coming from multiple sources. To meet these needs, Apache Hadoop became the primary choice for financial institutions and enterprises. As such, it is crucial for financial engineers to be familiar with Hadoop for financial applications.

As we begin to process large datasets, we also need a way to store this data. The de facto standard language for managing relational databases has been Structured Query Language (SQL). Digital data is varied in nature, however, and the need for other ways of storing it became the motivation for non-SQL products. One such family of nonrelational databases is NoSQL, which is commonly read as Not Only SQL. Besides allowing SQL-like languages for data management, NoSQL allows the storage of nonstructured data, such as key-value pairs, graphs, or documents. Because of its simpler design, it can also be faster in certain circumstances. One area where NoSQL is used in finance is the storage of incoming tick data. This chapter will introduce you to the use of NoSQL for tick data storage.
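To make the idea of document-style storage concrete, the following minimal sketch represents two ticks as self-describing documents using only the Python standard library. The field names (symbol, time, bid, ask, last, volume) and all the values are made up for illustration and do not come from any real data feed. No database is involved here; the point is only that records with different fields can sit side by side without a fixed schema.

```python
import json

# Hypothetical tick records; field names and values are illustrative only.
ticks = [
    {"symbol": "AAPL", "time": "2014-01-02T09:30:00", "bid": 79.38, "ask": 79.40},
    # The second tick carries extra fields. No schema change is needed.
    {"symbol": "AAPL", "time": "2014-01-02T09:30:01", "bid": 79.39, "ask": 79.41,
     "last": 79.40, "volume": 200},
]

# A document store keeps each record as a self-describing document.
# Here each one is simply serialized to a JSON string.
documents = [json.dumps(tick) for tick in ticks]

# Reading the documents back restores the original records.
restored = [json.loads(doc) for doc in documents]

# Documents with different sets of fields coexist in the same collection,
# so queries select on whichever fields happen to be present.
with_volume = [tick for tick in restored if "volume" in tick]
```

A relational table would need a column for every field up front, with nulls for ticks that lack a value. In the document form, each tick carries only the fields it actually has.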

In this chapter, we will cover the following topics:

  • An introduction to big data, Apache Hadoop, and its components
  • Getting Hadoop and running a QuickStart virtual machine
  • Using the Hadoop HDFS file store
  • Performing a simple word count on an e-book using MapReduce with Python
  • Testing the MapReduce program before running it on Hadoop
  • Performing a MapReduce operation on the daily price changes of a stock
  • Analyzing the results of the MapReduce operation with Python
  • An introduction to NoSQL
  • Getting and running MongoDB
  • Getting and installing the PyMongo module for Python
  • An introduction to databases and collections with PyMongo
  • Insert, delete, find, and sort tick data with a NoSQL collection

Introducing big data

There has been a lot of excitement about big data and the skills needed to work with it. Before going further, it is important to define what big data is and how you can work with it to gather useful information. How different is big data from the kind of data we come across every day, say news stories, reports, literature, or even audio?

Big data is data captured at high velocity that accumulates in such large quantities that it takes up terabytes or petabytes of storage. Common software tools are inadequate for capturing, processing, maintaining, and managing such data within a tolerable amount of time. Analytical tools are applied to these large datasets to uncover information and relationships that could be used for forecasting or other analytical activities.

With the advent of cloud computing technologies, big data has become increasingly commonplace. Massive amounts of information can be stored in the cloud at lower costs. The move from relational databases to nonrelational solutions, such as NoSQL, allows nonstructured data to be captured at a more rapid rate. By performing data analytics on the captured information, companies are able to improve their operational efficiency, analyze patterns, run targeted marketing campaigns, and improve customer satisfaction.

Financial sector companies are integrating big data analytics into their operations. Analytics are performed in real time on customer transactions to identify abnormal behavior and detect fraud. Customer records, spending habits, and even activities on social media sites can be used to promote products and services through customer segmentation. In the area of risk and credit analytics, where data comes in from multiple sources, big data tools provide the scalability and reliability needed for the analysis.
