Geographic processing is a powerful use case for Spark. The aim of this chapter is to explain how data scientists can use Spark to process geographic data and produce map-based views of very large datasets. We will demonstrate how to process spatio-temporal datasets via Spark's integration with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. As the Internet of Things (IoT) and other location-aware datasets become ever more common, and the volume of moving-object data climbs, Spark will become a critical tool for closing the gap between spatial functionality and processing scalability. This chapter shows how to conduct geopolitical analysis of global news with a view to using that data to analyze oil prices.
In this chapter, we will cover the following topics:
The premise of this chapter is that we can manipulate GDELT data to determine, to a greater or lesser extent, the price of oil based on historic events. The accuracy of our predictor will depend on many variables, including the level of detail of our events, the number of events used, and our hypotheses about the nature of the relationship between oil and these events.
The oil industry is very complex and is driven by many factors. It has been found, however, that most major oil price fluctuations are largely explained by shifts in the demand for crude oil. The price also increases during times of greater demand for stock, and historically has been high in times of geopolitical tension in the Middle East. In particular, political events have a strong influence on the oil price, and it is this aspect that we will concentrate on.
Crude oil is produced by many countries around the world; there are, however, three main benchmarks that are used by producers for pricing:
Algeria, Angola, Ecuador, Gabon, Indonesia, Iran, Iraq, Kuwait, Libya, Nigeria, Qatar, Saudi Arabia, UAE, and Venezuela
The first thing we need to do, then, is to obtain the historical pricing data for the three benchmarks. Downloadable data can be found in many places on the Internet, for example:
Now that we know that oil prices are primarily determined by supply and demand, our first hypothesis is that supply and demand are affected, to a large extent, by world events, and thus that we can predict what supply and demand are likely to be.
We want to try to determine whether the oil price will rise or fall during the next day, week, or month and, as we have used GDELT throughout the book, we will take that knowledge and expand it to run some very large processing jobs. Before we start, it's worth discussing the path we are going to take and the reasons for the decisions made. The first area of concern is how GDELT relates to oil; this will define the scope of the initial work and provide a base upon which we can build later. It is important here that we decide how to leverage GDELT and what the consequences of that decision will be. For example, we could decide to use all of the data for all of the time, but the processing time required would be very large indeed: a single day of GDELT averages around 15 MB of events data and 1.5 GB of GKG data. Therefore, we should analyze the contents of the two sets and try to establish what our initial data input will be.
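To make the scale concrete, the following is a rough back-of-the-envelope sketch in Python that extrapolates the daily averages quoted above to longer periods. The function name, the constant names, and the 1 GB = 1024 MB conversion are assumptions made for illustration; real daily volumes vary.

```python
# Rough sizing of a GDELT ingest, using the per-day averages quoted in the text.
EVENTS_MB_PER_DAY = 15        # average daily size of the events data
GKG_MB_PER_DAY = 1.5 * 1024   # 1.5 GB per day of GKG, assuming 1 GB = 1024 MB


def estimated_volume_gb(days: int, include_gkg: bool = True) -> float:
    """Return the approximate input size in GB for the given number of days."""
    per_day_mb = EVENTS_MB_PER_DAY + (GKG_MB_PER_DAY if include_gkg else 0)
    return days * per_day_mb / 1024


# Print an estimate for a few illustrative time windows.
for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365)]:
    events_only = estimated_volume_gb(days, include_gkg=False)
    everything = estimated_volume_gb(days)
    print(f"{label:>5}: events only {events_only:8.2f} GB, "
          f"events + GKG {everything:8.2f} GB")
```

Even on these approximate figures, the GKG data dominates the total, which is one reason to settle the initial input set before launching any large job.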
Looking through the GDELT schema, there are a number of points that could be useful; the events schema primarily revolves around identifying the two primary actors in a story and relating an event to them. There is also the ability to look at events at different levels, so we will have good flexibility to work at higher or lower levels of complexity, depending upon how our results work out. For example:
The EventCode field is a CAMEO action code: 0251 (Appeal for easing of administrative sanctions) and can also be used at the levels 02 (Appeal) and 025 (Appeal to yield).
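The hierarchy in the CAMEO code can be exploited by simple prefix truncation. The following Python sketch shows one way to roll events up to a coarser level; the function name, the dictionary keys, and the sample codes are illustrative assumptions, not part of GDELT itself.

```python
from collections import Counter


def cameo_levels(event_code: str) -> dict:
    """Split a CAMEO code into its 2-digit, 3-digit and full-detail levels."""
    code = event_code.strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"unexpected CAMEO code: {event_code!r}")
    return {
        "root": code[:2],                           # e.g. 02
        "base": code[:3] if len(code) >= 3 else None,  # e.g. 025
        "full": code,                               # e.g. 0251
    }


# Hypothetical sample of event codes, used only to demonstrate the roll-up.
sample_codes = ["0251", "0256", "0231", "1823", "190"]

# Count events at the coarsest (root) level.
by_root = Counter(cameo_levels(c)["root"] for c in sample_codes)
print(cameo_levels("0251"))
print(by_root)
```

Working at the root level gives fewer, denser categories, while the full code keeps the most detail; switching between them only requires choosing a different key.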
Our second hypothesis, therefore, is that the level of detail of the event will affect the accuracy of our algorithm.
Other interesting labels are GoldsteinScale, NumMentions, and Lat/Lon. The GoldsteinScale label is a number from -10 to +10 and it attempts to capture the theoretical potential impact that type of event can have on the stability of a country; a great match based on what we have already established about the stability of oil prices. The NumMentions label gives us an indication of how often the event has appeared across all source documents; this could help us to assign an importance to events if we find that we need to reduce the number of assessed events in our processing. For example, we could process the data and find the top 10, 100, or 1000 events in the last hour, day, or week based upon how often they have been mentioned. Finally, the Lat/Lon label information attempts to assign a geographical point of reference for the event, making this very useful for when we want to produce maps in GeoMesa.
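The idea of keeping only the most-mentioned events can be sketched in a few lines of Python. This is a minimal, single-machine illustration rather than a Spark job; the record layout, the id values, and every number in the sample are invented for demonstration, and only the field names GoldsteinScale, NumMentions, Lat, and Lon come from the discussion above.

```python
import heapq


def top_events(events, n):
    """Return the n events with the highest NumMentions, most mentioned first."""
    return heapq.nlargest(n, events, key=lambda e: e["NumMentions"])


# Invented sample records; the values carry no real-world meaning.
sample_events = [
    {"id": "a", "GoldsteinScale": -7.0, "NumMentions": 12, "Lat": 33.3, "Lon": 44.4},
    {"id": "b", "GoldsteinScale": 3.0, "NumMentions": 250, "Lat": 24.7, "Lon": 46.7},
    {"id": "c", "GoldsteinScale": -10.0, "NumMentions": 87, "Lat": 35.7, "Lon": 51.4},
    {"id": "d", "GoldsteinScale": 1.0, "NumMentions": 3, "Lat": 9.1, "Lon": 7.4},
]

for event in top_events(sample_events, 2):
    print(event["id"], event["NumMentions"], event["GoldsteinScale"])
```

Using a bounded selection such as heapq.nlargest avoids sorting the whole collection when only a small top-N is needed.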
The GKG schema is related to summarizing the content of the events and providing enhanced information specific to that content. Areas of interest for our purposes include Counts, Themes, GCAM, and Locations; the Counts field maps any numeric mention, thus potentially allowing us to calculate a severity, for example KILLS=47. The Themes field lists all of the themes based on the GDELT category list; this could help us to machine learn particular areas, over time, which affect oil prices. The GCAM field is the result of content analysis of the event; a quick perusal of the GCAM list shows us that there are some possibly useful dimensions to look out for:
c9.366 9 366 WORDCOUNT eng Roget's Thesaurus 1911 Edition CLASS III - RELATED TO MATTER/3.2 INORGANIC MATTER/3.2.3 IMPERFECT FLUIDS/366 OIL
c18.172 18 172 WORDCOUNT eng GDELT GKG Themes ENV_OIL
c18.314 18 314 WORDCOUNT eng GDELT GKG Themes ECON_OILPRICE
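As a hedged illustration of how these dimensions might be pulled out of a record, the Python sketch below assumes that the GCAM field arrives as comma-separated key:value pairs (for example wc:125,c18.172:4). That layout, the sample string, and the function names are assumptions for illustration and should be checked against the GKG codebook before use.

```python
# The three oil-related GCAM dimensions listed above, with short labels.
OIL_DIMENSIONS = {
    "c9.366": "ROGET_OIL",
    "c18.172": "ENV_OIL",
    "c18.314": "ECON_OILPRICE",
}


def parse_gcam(field: str) -> dict:
    """Parse an assumed 'key:value,key:value' GCAM string into a dict of floats."""
    scores = {}
    for pair in field.split(","):
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # skip empty or malformed entries
        try:
            scores[key.strip()] = float(value)
        except ValueError:
            continue  # skip entries whose value is not numeric
    return scores


def oil_scores(field: str) -> dict:
    """Return the oil-related dimensions only, defaulting to 0.0 when absent."""
    scores = parse_gcam(field)
    return {label: scores.get(dim, 0.0) for dim, label in OIL_DIMENSIONS.items()}


# Invented sample value, used only to exercise the parser.
sample_gcam = "wc:125,c9.366:2,c18.172:4,v10.1:3.21"
print(oil_scores(sample_gcam))
```

Defaulting missing dimensions to 0.0 keeps every record the same shape, which is convenient if the scores are later used as features.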
And finally, we have the Locations field, which provides similar information to the Events, and thus also can be used for visualization of maps.