Chapter 5. Spark for Geographic Analysis

Geographic processing is a powerful use case for Spark. This chapter explains how data scientists can process geographic data with Spark to produce map-based views of very large datasets. We will demonstrate how to process spatio-temporal datasets via Spark integrations with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. As the Internet of Things (IoT) and other location-aware datasets become ever more common, and the volume of moving-object data climbs, Spark will become a critical tool for closing the geoprocessing gap between spatial functionality and processing scalability. The chapter shows how to conduct advanced geopolitical analysis of global news, with a view to using that data to analyze oil prices.

In this chapter, we will cover the following topics:

  • Using Spark to ingest and preprocess geolocated data
  • Storing appropriately indexed geodata, using Geohash indexing inside GeoMesa
  • Running complex spatio-temporal queries, filtering data across time and space
  • Using Spark and GeoMesa together to perform advanced geographic processing in order to study change over time
  • Using Spark to calculate density maps and to visualize changes in these maps over time
  • Querying and integrating spatial data across map layers to build new insights

GDELT and oil

The premise of this chapter is that we can use GDELT data to predict, to a greater or lesser extent, the price of oil from historic events. The accuracy of our predictor will depend on many variables, including the level of detail of our events, the number of events used, and our hypotheses about the nature of the relationship between oil and these events.

The oil industry is very complex and is driven by many factors. It has been found, however, that most major oil price fluctuations are largely explained by shifts in the demand for crude oil. The price also increases during times of greater demand for stock, and historically it has been high in times of geopolitical tension in the Middle East. Political events in particular have a strong influence on the oil price, and it is this aspect that we will concentrate on.

Crude oil is produced by many countries around the world; there are, however, three main benchmarks that producers use for pricing:

  • Brent: Produced by various entities in the North Sea
  • WTI: West Texas Intermediate (WTI) covering entities in the mid-west and Gulf Coast regions of North America
  • OPEC: Produced by the members of OPEC: Algeria, Angola, Ecuador, Gabon, Indonesia, Iran, Iraq, Kuwait, Libya, Nigeria, Qatar, Saudi Arabia, UAE, and Venezuela

The first thing we need to do, then, is obtain historical pricing data for the three benchmarks. Downloadable data can be found in many places on the Internet.

Now that we know oil prices are primarily determined by supply and demand, our first hypothesis is that supply and demand are affected, to a large extent, by world events, and thus that we can use those events to predict what supply and demand are likely to be.

We want to determine whether the oil price will rise or fall during the next day, week, or month. As we have used GDELT throughout the book, we will take that knowledge and expand it to run some very large processing jobs. Before we start, it's worth discussing the path we are going to take and the reasons for the decisions made. The first area of concern is how GDELT relates to oil; this will define the scope of the initial work and provide a base upon which we can build later. It is important to decide here how to use GDELT and what the consequences of that decision will be. For example, we could decide to use all of the data for all of the time, but the processing time required would be very large indeed, since a single day of GDELT data can average 15 MB for events and 1.5 GB for GKG. Therefore, we should analyze the contents of the two sets and try to establish what our initial data input will be.
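The rise-or-fall question above can be framed as a labeling problem over a price series. The following is a minimal sketch in plain Python, not the approach used later in the book; the function name `direction_labels`, the `horizon` parameter, and the sample prices are all assumptions made for illustration.

```python
from typing import List, Optional


def direction_labels(prices: List[float], horizon: int) -> List[Optional[int]]:
    """Label each day 1 if the price is higher `horizon` days later, else 0.

    The final `horizon` days have no future price to compare against,
    so they are labeled None.
    """
    labels: List[Optional[int]] = []
    for i, price in enumerate(prices):
        j = i + horizon
        if j < len(prices):
            labels.append(1 if prices[j] > price else 0)
        else:
            labels.append(None)
    return labels


# Hypothetical daily closing prices, invented purely for illustration.
closes = [50.0, 51.2, 50.8, 52.1, 51.9, 53.0]

next_day = direction_labels(closes, horizon=1)
five_day = direction_labels(closes, horizon=5)
```

Changing `horizon` to roughly 7 or 30 gives the weekly and monthly variants of the same question, at the cost of losing that many labels at the end of the series.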

GDELT events

Looking through the GDELT schema, there are a number of points that could be useful; the events schema primarily revolves around identifying the two primary actors in a story and relating an event to them. There is also the ability to look at events at different levels, so we will have good flexibility to work at higher or lower levels of complexity, depending upon how our results work out. For example:

The EventCode field holds a CAMEO action code, for example 0251 (Appeal for easing of administrative sanctions), which can also be used at the coarser levels 025 (Appeal to yield) and 02 (Appeal).
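Because the coarser CAMEO levels are prefixes of the full code, moving between levels of detail can be sketched as simple string slicing. The example below is a plain-Python illustration under that assumption; the helper names `cameo_levels` and `count_by_level`, the level names, and the sample codes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def cameo_levels(event_code: str) -> Dict[str, str]:
    """Split a CAMEO code into its root (2-digit), base (3-digit) and full forms.

    Codes are kept as strings so that leading zeros are preserved.
    A 2-digit code simply yields itself at every level.
    """
    code = event_code.strip()
    if len(code) < 2 or not code.isdigit():
        raise ValueError(f"unexpected CAMEO code: {event_code!r}")
    return {"root": code[:2], "base": code[:3], "full": code}


def count_by_level(event_codes: Iterable[str], level: str) -> Counter:
    """Count events after rolling each code up to the requested level."""
    return Counter(cameo_levels(code)[level] for code in event_codes)


# Hypothetical sample of event codes, for illustration only.
sample = ["0251", "025", "02", "0311"]
by_root = count_by_level(sample, "root")
by_base = count_by_level(sample, "base")
```

Rolling codes up to the root level reduces the number of distinct event types considerably, which is one way to trade detail for simplicity when testing the effect of event granularity.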

Our second hypothesis is, therefore, that the level of detail of the event will affect the accuracy of our algorithm.

Other interesting labels are GoldsteinScale, NumMentions, and Lat/Lon. The GoldsteinScale label is a number from -10 to +10 that attempts to capture the theoretical potential impact that a type of event can have on the stability of a country; this is a good match for what we have already established about the influence of political events on oil prices. The NumMentions label indicates how often the event has appeared across all source documents; this could help us to assign an importance to events if we find that we need to reduce the number of events assessed in our processing. For example, we could process the data and find the top 10, 100, or 1000 events in the last hour, day, or week based upon how often they have been mentioned. Finally, the Lat/Lon label attempts to assign a geographical point of reference to the event, which is very useful when we want to produce maps in GeoMesa.
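The idea of keeping only the most-mentioned events can be sketched as a top-N selection. This is a minimal plain-Python illustration rather than a Spark job; the function `top_events`, the `id` key, and the sample records are assumptions, while `NumMentions` and `GoldsteinScale` follow the label names used above.

```python
import heapq
from typing import Any, Dict, List


def top_events(events: List[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the n events with the highest NumMentions, most mentioned first."""
    return heapq.nlargest(n, events, key=lambda event: event["NumMentions"])


# Hypothetical events, invented for illustration only.
events = [
    {"id": "a", "NumMentions": 12, "GoldsteinScale": -5.0},
    {"id": "b", "NumMentions": 340, "GoldsteinScale": 3.4},
    {"id": "c", "NumMentions": 57, "GoldsteinScale": -10.0},
    {"id": "d", "NumMentions": 4, "GoldsteinScale": 1.0},
]

top_two = top_events(events, 2)
top_two_ids = [event["id"] for event in top_two]
```

On a Spark DataFrame the same selection would typically be expressed as a descending sort on the mentions column followed by a limit, applied after filtering to the time window of interest.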

GDELT GKG

The GKG schema summarizes the content of the events and provides enhanced information specific to that content. Areas of interest for our purposes include Counts, Themes, GCAM, and Locations. The Counts field maps any numeric mention, potentially allowing us to calculate a severity, for example KILLS=47. The Themes field lists all of the themes based on the GDELT category list; this could help us to learn, over time, which particular areas affect oil prices. The GCAM field is the result of content analysis of the event; a quick perusal of the GCAM list shows us that there are some possibly useful dimensions to look out for:

c9.366    9    366   WORDCOUNT   eng   Roget's Thesaurus 1911 Edition   CLASS III - RELATED TO MATTER/3.2 INORGANIC MATTER/3.2.3 IMPERFECT FLUIDS/366 OIL
c18.172   18   172   WORDCOUNT   eng   GDELT GKG Themes                 ENV_OIL
c18.314   18   314   WORDCOUNT   eng   GDELT GKG Themes                 ECON_OILPRICE
Finally, the Locations field provides geographic information similar to that in the Events data, and so can also be used to visualize maps.
