CHAPTER 3
Alternative Data Risks and Challenges

3.1 LEGAL ASPECTS OF DATA

Recently, new legislation, such as the EU General Data Protection Regulation (GDPR),1 has been enacted. The aim of GDPR is to protect all EU citizens from privacy and data breaches and to give them control over their personal data. Hence, GDPR is already affecting how investors can obtain and use alternative data in cases where the data may contain what could be considered the personal data of individuals in the European Union. Indeed, many alternative datasets contain personal information (e.g. credit card panel data and location data). Therefore, their use for investing must always be preceded by due diligence checks.

Let's first look more rigorously at how GDPR defines “personal data.” The definition is broader than the US notion of “personally identifiable information” (PII). In the EU, a key question to ask when deciding whether data is “personal data” is whether a person can be identified from it. In other words, is it possible to reverse-engineer the data, perhaps by combining it with other data sources, to uniquely identify that person? Hence, according to the European Commission definition, “For data to be truly anonymized, the anonymization must be irreversible.” For example, if the name was removed from a dataset of individuals but the address remained, it would be fairly straightforward to recover the name (or at least narrow it down to a household) by joining with a dataset of addresses and names.

If we take a very broad attribute, such as an individual's sex, this will obviously split a population into only two groups and so is far from a unique characteristic. However, if we then add more attributes, such as date of birth, the combination of attributes can become close to unique, even if no single attribute is unique in isolation. The more demographic attributes are associated with an individual, the more “unique” that record becomes. Furthermore, we need to ask whether collecting certain attributes is absolutely necessary, or whether doing so could be viewed as contentious and unwarranted.

Rocher, Hendrickx, and Montjoye (2019) flag various instances where supposedly anonymized datasets have been reverse-engineered. They create a generative model to reidentify individuals from a dataset. Using their model, they note that with 15 demographic attributes it is possible to render 99.98% of people in Massachusetts unique. Most of the attributes are relatively common, such as date of birth, gender, and ZIP code, and wouldn't necessarily be classified as alternative data.
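
To make this intuition concrete, the sketch below (a minimal illustration with synthetic data and pandas; all attribute values and sizes are made up) measures how the share of uniquely identifiable records grows as demographic attributes are combined.

```python
import numpy as np
import pandas as pd

# Synthetic population with a few illustrative demographic attributes.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "birth_date": pd.to_datetime("1950-01-01")
                  + pd.to_timedelta(rng.integers(0, 365 * 50, size=n), unit="D"),
    "zip_code": rng.integers(1000, 2000, size=n),  # ~1,000 hypothetical ZIP codes
})

def unique_share(frame: pd.DataFrame, attributes: list) -> float:
    """Fraction of records whose combination of attribute values appears exactly once."""
    counts = frame.groupby(attributes).size()
    return (counts == 1).sum() / len(frame)

# Each added attribute makes more records uniquely identifiable.
for attrs in (["sex"], ["sex", "birth_date"], ["sex", "birth_date", "zip_code"]):
    print(attrs, round(unique_share(df, attrs), 4))
```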

Montjoye, Hidalgo, Verleysen, and Blondel (2013) give an example of how the uniqueness of individuals can be derived from an alternative dataset. They use a dataset of 15 months of human location data, derived from mobile phones. They note that when this location data is hourly and of suitable spatial resolution, it is sufficient to uniquely identify 95% of people.

In the United States, PII is limited to categories such as names, addresses, telephone numbers, and the like, whereas under GDPR personal data can additionally include IP addresses, location, web cookies, photographs, and so on. Hence, all PII is personal data, but not all personal data is considered PII.

Across the world, local laws regulate data protection to differing degrees. We cannot detail all of them here, but Figure 3.1 shows the levels of enforcement of data protection laws across jurisdictions worldwide at the time of writing.

Data protection laws restrict the amount of alternative data that can be used. Onboarding data must therefore come after a careful due diligence check of whether it contains personal data. Assurances from data vendors cannot offload this burden from the shoulders of data buyers, and appropriate procedures and internal controls must be put in place to ensure that data protection laws are not breached. Insurance policies can be used as part of the risk mitigation methods to handle the financial costs of data breach risks. However, it should be noted that insurance may not offset all costs, some of which, such as reputational damage, are difficult to quantify.

FIGURE 3.1 Comparison of data protection laws around the world.

Source: DLA Piper.

The limitations on what data we can use mean that, in principle, we cannot always have a complete picture of, say, the potential earnings of an EU company, or of a non-EU company with regard to its EU operations,2 if we use personal data to infer them (e.g. data on the people who bought a company's products). Luckily, we do not always need to pinpoint information down to the person level; a more aggregated view is often sufficient. For example, the number of people who visited a shopping mall on each day of the year is an aggregated metric that will suffice to predict sales and earnings. Therefore, whenever we do not need person-level information, we can simply request anonymized aggregated counts directly from the data vendor, instead of buying granular data and doing the aggregation ourselves. Whatever the workarounds available to obtain the information an investor needs, there is no doubt that data protection laws pose a constraint that can reduce (but not eliminate!) the usability of alternative data.

Web scraping is another area where legal questions may arise. A lot of data on the web sits on private websites and behind paywalls. However, many web pages are publicly accessible. Does this mean that we can freely reuse the content that is viewable by users on a public website? Each website has its own terms of usage, which in some cases may prohibit web scraping of content. In many instances, firms seek to monetize the content of their websites by doing their own internal analysis, which is repackaged for clients to access. Alternatively, firms may sell machine-readable access via APIs, either to the raw data or to a structured representation. It is therefore perhaps unsurprising that many firms seek to prevent web scraping of their web content through their terms of usage. At the time of writing there is a lawsuit on the use of web-scraped data that is being closely watched by hedge funds (Saacks, 2019). In September 2019, the Ninth Circuit Court of Appeals sided with hiQ against LinkedIn. LinkedIn had been seeking to prevent hiQ from web scraping publicly accessible LinkedIn user pages (see Condon, 2019). hiQ had been using the data to provide services for HR professionals. Condon notes that the “judge concluded that, even if LinkedIn users had some interest in withholding their publicly-available data, those interests did not outweigh hiQ's interest in continuing its business.” The ruling was seen as a positive development for firms sourcing data from the web.

Another legal issue associated with alternative data is whether a particular dataset constitutes material non-public information (MNPI). Deloitte (2017) notes that the mere fact that data is accessible, such as certain content on the web that might be tricky to find without advanced coding techniques, does not necessarily make it public. In some cases, they note, certain firms might be less willing to purchase data that appears particularly predictive of information that is embargoed until its official release, such as quarterly earnings.

This leads us again to the concept of exclusivity for datasets. Theoretically, if a dataset is more exclusive, we might conjecture that it is less likely to suffer from alpha decay, particularly if it is most likely to be traded for strategies that have a low capacity. Hence, typically, such datasets are likely to be much more expensive. Fortado, Wigglesworth, and Scannell (2017) note that exclusive datasets can be a “double-edged sword,” quoting Rado Lipuš of Neudata, and that some large funds prefer to avoid them. This is related not only to the expense of such datasets, but also to the desire to avoid any potential legal risks associated with them. They also note that in the past New York's attorney general has intervened to stop a data vendor distributing exclusive content to premium subscribers. We have already discussed auctioning datasets and giving the winners of the auction restricted access to the data (or low-latency access) to avoid overcrowding and maximize the revenues of the data vendor. It is important for a vendor to investigate whether such auctions are appropriate if the data could be considered MNPI. Currently, this is still a legally blurred area.

The legal aspects of data do not purely govern whether data can be purchased. Data users often face legal restrictions on how they can use purchased data, and these are set out in the data license. Is the license firm-wide, or restricted to a small number of users? Does it restrict redistribution of the raw data or of derived indices? All these contractual limitations can also influence the decision of whether to acquire a dataset.

3.2 RISKS OF USING ALTERNATIVE DATA

There are many risks associated with using alternative data, which are discussed by Deloitte (2017). Some of these risks are likely to be faced mostly by early adopters. Some relate to the legal risks discussed earlier, such as privacy regulation like GDPR. Alternatively, it could be the case that the data is collected in a way that violates a website's terms of usage, such as through web scraping, as already mentioned. It should be noted that traditional datasets can have similar issues. For example, a license may allow internal usage of a certain common market dataset; however, this does not automatically mean it can be repackaged and included in datasets sold externally.

Other risks might relate to the quality of the data or its validity, a matter we touched upon when discussing the many Vs of Big Data. Admittedly, data quality and validity have also been issues for traditional datasets. Even with market data, we might have fat-finger values, missing values, and so on. However, with alternative data we face additional issues. In particular, if we think about social media, a large amount of content is not neutral and may be totally false. As with more traditional datasets, it can also be the case that certain alternative datasets disappear over time. If our models are heavily dependent on such datasets, this will make a strategy more difficult to maintain (see Section 5.2.10) and audit. There might be many reasons for this to happen, such as data vendors closing down, or simply the raw data no longer being available because the vendor has discontinued it. There have been instances where changes in law, such as GDPR, have resulted in the disappearance of certain datasets.

Further risks include employee turnover, which can result in leakage of intellectual property. This has always been an issue with financial markets, where firms have sought to protect themselves from employees moving with particular knowledge of intellectual property. This has resulted in noncompete clauses being enforced. This is no different for dealing with alternative data, which requires specialist skills that are difficult to source. Potentially, one way to reduce employee turnover is to continually train employees so they can build their skillsets and also become more productive in the process. This is especially relevant in a fast-evolving area, such as alternative data.

However, those starting to use alternative data even after many of these issues have been resolved face other risks. Deloitte (2017) points out that these firms will essentially have to be playing catchup with established players in the field. As we noted earlier, developing a strategy for alternative data does not purely involve hiring a few data scientists. It requires data strategists, data scientists, and data engineers. It also requires the business to be able to utilize these resources and have the right processes in place. Creating such a framework takes time and cannot be done overnight. It is also difficult to execute successfully.

Those late to using alternative data might face “blind spots,” as certain alternative datasets that they do not yet know how to use become common. Indeed, this can already be observed with some alternative datasets that have become more ubiquitous, such as consumer transaction data and estimated quarterly earnings for US retailers. For those firms late to the area, it could also result in a loss of assets under management, as investors see them as firms that are behind the curve. In substance, latecomers face a strategic extinction risk.

3.3 CHALLENGES OF USING ALTERNATIVE DATA

Starting to use alternative data might not be that straightforward. First, it could come in an unstructured form. If this is the case, being able to use it requires first creating a structured dataset from which a model can be built and tested. Subsequently, unstructured data must be continuously converted into structured data to feed into the model at the production stage. Second, data might contain streaks of missing values, outliers, and other anomalies. These should be treated before any modeling is attempted, unless we have a strong reason to believe that their amount is negligible. Third, in many applications, data from multiple sources must be integrated in order to enrich the feature set and hence enable more powerful data mining and prediction than would be possible by analyzing single sources in isolation. Aggregating diverse data sources comes with some practical challenges as well. Data from different sources is seldom in the same format and frequency; it could come with different delays, and identifiers in different data sources could require some treatment before being matched with a good level of confidence. Let's examine these issues in more detail.

In substance, the steps that data should be subjected to before the modeling stage (not necessarily in the following sequence) are:

  1. Matching entity identifiers between different data sources
  2. Treating missing data
  3. Converting unstructured data into structured
  4. Treating outliers in the data

In what follows, we examine these steps in more detail. We will dedicate separate chapters to missing data (Chapters 7 and 8) and outliers (Chapter 9).

3.3.1 Entity Matching

One of the biggest hurdles in matching different datasets is the fact that the name of an entity3 can differ between sources because of the multitude of ways to spell it or because of typographical errors. Take, for example, the simple case of the abbreviation for limited companies, which could have a number of variations, such as limited, LTD, Ltd, or the like. This problem is not static and is not limited to the model training phase. Indeed, it will resurface live in production as new entities appear in the data sources (e.g. new companies being registered) and as companies disappear through events such as takeovers. In the later section on natural language processing, we discuss many other examples to illustrate the importance of entity matching. Advances have been made in the area of record linkage, especially since 2000, and a variety of techniques and libraries are now widely available. Luckily, for tickers there is the common CUSIP standard, which can be used to join datasets together by ticker. This can be particularly useful if we want to join up many different alternative datasets that might refer to a specific company.

However, for entities such as people and organizations, even once we might have detected them, many different standards might be used by data vendors. This makes it tricky to join together these datasets by entity. To alleviate this problem, Refinitiv have open sourced their PermIDs for many different types of entities, such as people and organizations. These are available from https://permid.org/. Very granular entries, such as subsidiaries, are available on a subscription basis.

As Christen (2012) explains, integrating data from different sources consists of three tasks. The first one is “schema matching.” It is concerned with identifying database tables, attributes, and conceptual structures (such as ontologies, XML schemas, and UML diagrams) from disparate databases that contain data that correspond to the same type of information. The second is “data matching.” It consists of identifying and matching individual records from disparate databases that refer to the same entities. The third task, known as “data fusion,” is the process of merging pairs or groups of records that have been classified as matches (i.e. that are assumed to refer to the same entity) into a clean and consistent record that represents an entity. We should note, however, that some alternative data may have no particular schema, because it may be unstructured.

Data matching itself is divided into five steps: data preprocessing, indexing, record comparison, classification, and evaluation. There is also a human review step, if necessary.

The aim of data preprocessing is to ensure that the attributes used for the matching have the same structure and that their content follows the same formats. This means cleaning and standardizing the data into well-defined and consistent formats. Inconsistencies in the way information is represented and encoded also need to be resolved. Data preprocessing thus deals with removing unwanted characters and words, expanding abbreviations and correcting misspellings, segmenting attributes into well-defined and consistent output attributes (e.g. splitting an address into street name, number, postcode, etc.), and verifying the correctness of the attribute values (e.g. correcting company names against an external database).
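
As a hedged illustration of this preprocessing step, the snippet below normalizes company names before matching. The suffix mapping and cleaning rules are assumptions for the sketch, not an exhaustive standard.

```python
import re

# Illustrative mapping of common legal-form suffixes to a canonical token.
SUFFIXES = {"limited": "ltd", "incorporated": "inc", "corporation": "corp"}

def normalize_company_name(name: str) -> str:
    """Lowercase, strip punctuation and extra whitespace, and canonicalize suffixes."""
    name = name.lower().strip()
    name = re.sub(r"[^\w\s]", " ", name)                    # drop punctuation
    tokens = [SUFFIXES.get(tok, tok) for tok in name.split()]
    return " ".join(tokens)

print(normalize_company_name("Acme Holdings, Ltd."))    # -> "acme holdings ltd"
print(normalize_company_name("ACME HOLDINGS LIMITED"))  # -> "acme holdings ltd"
```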

Once the database tables have been cleansed and standardized, they are ready to be matched. This means potentially comparing each pair of records in the two tables. If each table contains one million records, this translates into one trillion record-pair comparisons, which can take several days of computing time. Indexing is a way to reduce the number of comparison operations by filtering out pairs that are unlikely to be a match and by creating candidate records. Several techniques exist to do so, and blocking is one of the most widely used.
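
A minimal sketch of blocking, assuming names have already been normalized as above: only records sharing a blocking key (here, simply the first token of the name) become candidate pairs, which shrinks the comparison set dramatically.

```python
from collections import defaultdict
from itertools import product

def candidate_pairs(records_a, records_b):
    """Index two record sets by a simple blocking key (first token of the normalized
    name) and yield only within-block candidate pairs.
    `records_a` / `records_b` are iterables of (record_id, normalized_name)."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for rid, name in records_a:
        if name:
            blocks_a[name.split()[0]].append(rid)
    for rid, name in records_b:
        if name:
            blocks_b[name.split()[0]].append(rid)
    for key in blocks_a.keys() & blocks_b.keys():
        # Only pairs within the same block proceed to detailed comparison.
        yield from product(blocks_a[key], blocks_b[key])

a = [(1, "acme holdings ltd"), (2, "beta systems inc")]
b = [(10, "acme holding ltd"), (11, "gamma corp")]
print(list(candidate_pairs(a, b)))   # only the two "acme" records are compared
```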

In the record comparison step, the candidate records generated in the previous step are compared in more detail by taking into account all the attributes (e.g. additional fields containing the address of the company or its activity). Rather than exact matching, which could miss many entities that are the same but appear slightly different due to things like typographical mistakes, approximate matching is usually conducted. This is done by generating a similarity score between records, which is a number between 0 and 1. A similarity of 1.0 corresponds to an exact match between two values, while a similarity of 0.0 corresponds to total dissimilarity. Scores between 0.0 and 1.0 correspond to some degree of similarity between two values. For each candidate record pair, several attributes are generally compared, resulting in a vector of numerical similarity values for each pair. These vectors are called comparison vectors.

Once the comparison vectors have been calculated, pairs of entities have to be assigned to a class: match, non-match, or potential match. In the latter case, a human can be used to resolve the uncertainty and assign a match or non-match class manually. This can be done by thresholding the sum of the elements of the comparison vectors. For example, if the comparison vectors have 10 attributes, then the sum of their elements must lie in the interval [0,10]. A thresholding can be defined as follows: [0,4) non-match, [4,6) potential match, [6,10] match. A potential match is escalated for manual review, but we must say that this can be a slow and error-prone process. An external service such as Amazon Mechanical Turk can be used to outsource this process by crowdsourcing it. We must stress that any sort of manual process like this, whether done internally or externally, needs to have clear and definable criteria outlined; otherwise the accuracy is likely to be very low.
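
The snippet below sketches the comparison and classification steps under thresholds of the kind described above. It uses Python's standard-library SequenceMatcher as a crude stand-in for a proper string-similarity measure (such as Jaro-Winkler), and the records and thresholds are illustrative.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a production system would use e.g. Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()

def comparison_vector(rec_a: dict, rec_b: dict, attributes: list) -> list:
    """One similarity score per compared attribute."""
    return [similarity(str(rec_a[attr]), str(rec_b[attr])) for attr in attributes]

def classify(vector: list, lower: float, upper: float) -> str:
    """Threshold the summed similarities into match / potential match / non-match."""
    score = sum(vector)
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"   # escalated for manual review
    return "non-match"

attrs = ["name", "address", "city"]
a = {"name": "acme holdings ltd", "address": "1 main st", "city": "london"}
b = {"name": "acme holding ltd", "address": "1 main street", "city": "london"}
vec = comparison_vector(a, b, attrs)
print(vec, classify(vec, lower=1.5, upper=2.5))  # thresholds scaled to 3 attributes
```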

The last step is concerned with evaluating the quality of the matches and non-matches. Metrics like the F-score, borrowed from the machine learning field, are commonly used. The quality of the matching is influenced by all the steps described above. The preprocessing step helps make two different values similar. The indexing step leaves out very dissimilar records. The algorithms in the comparison step, as well as the thresholds and the manual process in the classification step, also influence the final results.
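
Evaluation can reuse standard classification metrics. A minimal sketch with scikit-learn follows, assuming a small manually labeled evaluation sample is available; the label arrays are toy values for illustration only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = true match, 0 = true non-match, for a manually labeled evaluation sample.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
# Predictions produced by the matching pipeline on the same candidate pairs.
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```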

We also note that how we store the matching results is important, especially when it comes to backtesting investment strategies. In this case, we want to make sure that at any point in time of the backtest, we are not inadvertently using data from the future. This can introduce an upward bias to our results and make our backtest unrepresentative. Essentially, data can “leak” from the future to our backtest.

We will make the distinction at this point between transaction time and belief time. A transaction time denotes when a record was inserted into the database. It is usually recorded automatically as a timestamp by the database system and cannot be modified. Belief time refers to the time when the fact inserted into the database is valid.4 For example, we might believe that country X has a GDP5 figure for 2015 of, say, $1 trillion. We might have this belief and insert it as a record as of December 31, 2016. We might then update our belief on January 31, 2017, and insert it as a new record with the new GDP figure. Belief times, in general, can be intervals, points in time, or a series of points in time.

Constructing the database in such a (bi-temporal) way means that we can now find out what our belief was at any given past transaction time (e.g. what was our belief as of January 15, 2017, with regard to the GDP of country X). Thus, bi-temporal databases of this kind allow retroactive updates coming into effect after the period of time the data is referencing. They also support proactive updates coming into effect before the period of time the data is referencing.
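
A minimal sketch of such a point-in-time ("as-of") query, assuming a pandas DataFrame as the record store; the column names and the revised GDP figure are illustrative.

```python
import pandas as pd

# Each row records a belief about the 2015 GDP of country X and when it was stored.
records = pd.DataFrame({
    "belief_time":      pd.to_datetime(["2015-12-31", "2015-12-31"]),  # period the fact refers to
    "transaction_time": pd.to_datetime(["2016-12-31", "2017-01-31"]),  # when the belief was stored
    "gdp_usd_trn":      [1.00, 1.05],                                  # illustrative figures
})

def as_of(df: pd.DataFrame, query_time: str) -> pd.DataFrame:
    """Return the latest belief recorded on or before `query_time` (point-in-time view)."""
    known = df[df["transaction_time"] <= pd.Timestamp(query_time)]
    return known.sort_values("transaction_time").tail(1)

print(as_of(records, "2017-01-15"))  # still shows the $1.00trn belief, avoiding look-ahead bias
```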

The results from entity matching should be stored in a way such that there are bi-temporal relationships between a permanent entity identifier and the entity attributes used in the matching process. This enables point-in-time or as-of queries to be used and allows for historical analysis without bias. This issue about point-in-time recording is also applicable to the underlying dataset itself, in addition to any history of the entity relationships.

3.3.2 Missing Data

Across many different fields, ranging from finance and economics to energy and transportation, to geophysical, meteorological, and sensor data, one of the challenges when working with data is that it is rarely complete. For instance, about 28% of publications in finance between 1995 and 1999 are reported to contain on average about 20% missing values (see Kofman, 2003). As analyzed in Rezvan et al. (2015), a sample of more than 100 papers in medical research between 2008 and 2013 typically contained missingness fractions exceeding 20%. The reasons for data being incomplete are manifold and usually domain specific. Possibilities include faulty sensors or processes, incomplete records, mistakes in data collection, the unavailability of certain information, or other very specific reasons. Often it is also not known exactly why data is missing. In most cases it is not possible to recover missing values through additional data collection or measurements. Therefore, when building data applications, one has to accept incomplete data as the norm and devise appropriate strategies for dealing with it. We will dedicate one full chapter (Chapter 7) to missing data and will present detailed case studies in Chapter 8.

3.3.3 Structuring the Data

According to widely cited statistics, 80%–95% of the data in the world comes in unstructured form: text, images, videos, and the like. Data can also be semi-structured, for example, XML files containing both text and tags. Regardless of the origin of the data (individuals, institutions, or sensors), making it usable requires converting it to a structured form sharing a common format. Once it is in a structured form, it becomes easier to analyze.

There are some necessary steps for this to happen. Once data has been captured into a raw digital format, it needs to be preprocessed and validated at every step. Quite often, data can be of such low quality that it makes no sense to use it any further. Therefore, at each major stage of preprocessing it is logical to perform a validation check that lets through only the data that is good enough to proceed to more downstream tasks. When reading documents electronically, for example, it would be important to perform quality checks on PDFs first to assess whether they are “extractable.” These checks can include assessing whether PDFs have sufficient contrast, reasonable DPI, a lack of noise, and so on. If the quality is very bad, then it is logical to drop these specific observations. If the quality is average, then we can try to fix it. If we assess that the quality is good enough after these various preprocessing steps, we can start doing Optical Character Recognition (OCR). After performing OCR and before trying to process the extracted information, we can do additional checks, this time, for example, on the tables or text specific to the business case at hand.
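
A hedged sketch of such pre-OCR quality checks, assuming pdf2image (which requires poppler), Pillow/NumPy, and pytesseract (which requires a Tesseract install) are available; the thresholds are placeholders to tune for the business case at hand.

```python
import numpy as np
from pdf2image import convert_from_path   # renders PDF pages to PIL images
import pytesseract

def page_quality_ok(img, min_width_px=1000, min_contrast=30.0) -> bool:
    """Very rough checks: rendered resolution (via pixel width) and the spread of
    grayscale intensities as a simple contrast proxy."""
    gray = np.asarray(img.convert("L"), dtype=float)
    return img.width >= min_width_px and gray.std() >= min_contrast

def extract_text(pdf_path: str) -> str:
    """OCR only the pages that pass the basic quality checks."""
    pages = convert_from_path(pdf_path, dpi=300)
    texts = []
    for page in pages:
        if not page_quality_ok(page):
            continue                      # drop, or route low-quality pages for repair
        texts.append(pytesseract.image_to_string(page))
    return "\n".join(texts)
```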

In the case of web text, preprocessing might also involve removing data that is superfluous for deciphering any meaning, such as HTML tags and other code. These parts of the text are primarily for a computer to interpret and do not aid human interpretation. It also means removing sections of the text that are human-readable but are unlikely to be of interest, such as the navigation bars, page numbers, and disclaimers. By the end of this step, we should be left with the body text of the article. This body text can be structured using NLP (Natural Language Processing) to add additional metadata to help with interpretation. Earlier stages of NLP will include steps such as word segmentation to pick out individual words. Downstream from that, part-of-speech tagging can be applied to identify which words are verbs and nouns, for example. The final structured output can be viewed as a summary of the raw data, which could be more easily stored in a database and analyzed than the original unstructured dataset.
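
A minimal sketch of this pipeline, assuming BeautifulSoup for stripping markup and spaCy's small English model (which must be installed separately) for word segmentation and part-of-speech tagging; the URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def extract_body_text(url: str) -> str:
    """Fetch a page and strip markup, scripts, and navigation-style boilerplate."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()                # drop code and boilerplate sections
    return " ".join(soup.get_text(separator=" ").split())

def tag_tokens(text: str):
    """Word segmentation plus part-of-speech tags as simple structured metadata."""
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc if not token.is_space]

body = extract_body_text("https://example.com/some-article")  # placeholder URL
print(tag_tokens(body)[:10])
```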

Later on, the text may be classified to identify its overall topic. Named entity recognition is also key to identifying proper nouns of interest, such as people, places, and brands. This is usually combined with entity matching, too, so that entities tagged in text can be mapped to tradable instruments. Sentiment analysis can be used to understand how positive or negative the text is. For speech data, we also have the additional step of applying speech recognition in order to transcribe the audio into written text.
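
Building on the body text extracted in the earlier sketch, named entities and a simple sentiment score can be layered on top. VADER from NLTK is used here purely as an illustrative lexicon-based scorer (its vader_lexicon resource must be downloaded), and the entity labels kept are an assumption.

```python
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

def structure_article(text: str) -> dict:
    """Return entities (people, organizations, places, products) plus a compound sentiment score."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents
                if ent.label_ in {"PERSON", "ORG", "GPE", "PRODUCT"}]
    sentiment = sia.polarity_scores(text)["compound"]    # in [-1, 1]
    return {"entities": entities, "sentiment": sentiment}

print(structure_article("Acme Corp beat earnings expectations, boosting optimism in London."))
```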

The equivalent of NLP for images is computer vision. Just as with NLP, the goal of computer vision is to get an understanding of the data from a human perspective. It encompasses a number of different methods. Like text, images need to be cleaned before any further higher-level steps are taken for interpretation. The first steps for images will include image processing, such as changing the contrast and sharpening, as well as the removal of noise. Other tasks include edge detection and image segmentation to split an image into various regions or to simplify it; these tasks can be tackled by convolutional neural networks (CNNs). These image preprocessing steps are essential preparation for higher levels of analysis later.
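
The snippet below sketches these lower-level preprocessing steps with OpenCV; the file path and parameter values are placeholders to adjust for the imagery at hand.

```python
import cv2

def preprocess_image(path: str):
    """Basic cleaning: grayscale, contrast enhancement, denoising, then edge detection."""
    img = cv2.imread(path)                              # path is a placeholder
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    equalized = cv2.equalizeHist(gray)                  # simple contrast enhancement
    denoised = cv2.GaussianBlur(equalized, (5, 5), 0)   # suppress noise before edge detection
    edges = cv2.Canny(denoised, threshold1=50, threshold2=150)
    return denoised, edges

clean, edges = preprocess_image("parking_lot.jpg")      # e.g. a satellite image tile
```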

From a higher-level perspective, computer vision tries to interpret an image in order to add metadata to it and to structure it. These computer vision tasks include image recognition or classification for the entire image. They can also involve picking out specific objects in an image, namely object detection, where we seek to create a bounding box around objects. This includes object classification and object identification. One simple example of object classification could be to classify a “burger” and then identify its specific type, such as “Whopper.” We could view facial recognition as a very specific example of object identification. In recent years, machine learning, and in particular deep learning techniques, have become very suitable for tasks within computer vision such as image classification. The use of machine learning has not been confined to the higher-level tasks only. It has also been helpful for a number of image processing tasks, such as image colorization and removing blurring from an image. While many of the tasks associated with computer vision are also applicable to video, some are very specific to video, such as object movement tracking or lip reading.

Computer vision can also be used as part of an NLP task when our input text is not already in a digitized text format but is instead within an image. This can occur when the input text consists of handwriting. We can use OCR to pick out printed text not only from the documents discussed earlier but also when reading road signs for self-driving cars. We discuss the structuring of images and computer vision in Section 4.5 with use cases in more detail in Chapter 13, and natural language processing in Section 4.6 with use cases in Chapter 15.

Even if data already has a relatively common structure, such as trade transaction data, we might still want to add other fields to help with additional classification of the dataset. In the case of transaction data, this is likely to involve adding tags to describe the general type of the counterparties, for example, whether they are sell side, buy side, or corporate firms. As with many types of structuring, this will involve joining the data with other datasets.
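
A minimal sketch of such a tagging step, assuming a hypothetical counterparty reference table; all identifiers and values are made up.

```python
import pandas as pd

trades = pd.DataFrame({"counterparty_id": ["CP1", "CP2", "CP3"],
                       "notional": [1_000_000, 250_000, 5_000_000]})
# Hypothetical reference data mapping counterparties to a broad type.
reference = pd.DataFrame({"counterparty_id": ["CP1", "CP2", "CP3"],
                          "counterparty_type": ["sell side", "buy side", "corporate"]})

# Enrich the transaction records with the counterparty-type tag.
tagged = trades.merge(reference, on="counterparty_id", how="left")
print(tagged)
```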

3.3.4 Treatment of Outliers6

Data, even if structured, is invariably fraught with records that could substantially deviate from expected patterns. As with missing data, the primary cause of such technical outliers could be faulty sensors, processes, or mistakes in data collection. These technical outliers can also be referred to as unwanted anomalies or noise. As Huber (1974) puts it, noise accommodation refers to immunizing a statistical model estimation against anomalous observations. Other outliers are not technical but something that is inherent in the data itself and that we actually want to model (e.g. credit card fraud transactions, insurance claims, extreme events in financial time series, or cyber-breaches).

Three types of outlier detection techniques exist7 – supervised, semi-supervised, and unsupervised:

  • Supervised anomaly detection assumes the existence of a labeled dataset of outliers versus normal observations on which a classifier can be trained. Then the model is used on new data records to determine which class they belong to.
  • Semi-supervised anomaly detection assumes the existence of a labeled dataset only for the normal class. A model is then built for the class corresponding to normal behavior, and used to identify outliers in the test data.
  • Unsupervised anomaly detection means that a labeled dataset is not required, which makes this the most widely used approach. The techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the test data (see the sketch following this list).
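
As a hedged illustration of the unsupervised case, the sketch below flags outliers in a synthetic numeric series with scikit-learn's IsolationForest; the contamination rate and the injected anomalies are assumptions to calibrate per dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic daily observations (e.g. footfall counts) with a few injected anomalies.
series = rng.normal(loc=100, scale=10, size=365)
series[[50, 200, 340]] = [400, 5, 320]

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(series.reshape(-1, 1))   # -1 marks an outlier, 1 a normal point

outlier_days = np.where(labels == -1)[0]
print("flagged observations:", outlier_days)
```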

Depending on the domain and nature of the data, the type of anomaly, and the challenges associated with anomaly detection, different techniques may be applicable. We will discuss those in much greater detail in Chapter 9.

3.4 AGGREGATING THE DATA

Let's say we have already structured the data to some extent and have already flagged and treated the outliers. Whatever input data we have, whether images or text, is now in a standardized format. Our dataset is also tagged with metadata fields to help describe the data. Some of these might be text based (like tickers) or numerical. The numerical fields might be car counts, sentiment scores, and so on.

The next step is to aggregate the data to make it more readily usable in a trading strategy or a financial model. Typically, time series derived from our alternative data might be available at an irregular frequency, while our financial model might expect data at a regular frequency (e.g. every minute, or daily). Hence, we should think about resampling our dataset to fit. If we are getting high-frequency observations from news data, we can think about computing a summary statistic to describe the whole day, whether a mean, a median, or some range. Obviously, this resampling entails the loss of some information, but it is essential for creating useful inputs that can be incorporated into a comprehensive model. The final output is likely to be an index of some sort that can be used as an input into another model.
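
A minimal sketch of this resampling step with pandas, assuming an irregularly spaced intraday sentiment series (synthetic here) that we collapse into daily summary statistics.

```python
import numpy as np
import pandas as pd

# Irregularly spaced sentiment scores, e.g. one per news article, over five days.
rng = np.random.default_rng(0)
timestamps = pd.to_datetime("2020-01-01") + pd.to_timedelta(
    np.sort(rng.uniform(0, 5 * 24 * 3600, size=200)), unit="s")
sentiment = pd.Series(rng.normal(0, 1, size=200), index=timestamps, name="sentiment")

# Collapse to a regular daily frequency with a few summary statistics.
daily = sentiment.resample("D").agg(["mean", "median", "count"])
print(daily)
```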

We could employ many other types of aggregation, in addition to aggregation by frequency. Another common type is aggregation based upon the ticker, the location, or indeed any other category-style tags. Indeed, many of the use cases later in the book employ alternative data that has been aggregated by category or ticker. In some cases, it might be a legal requirement to aggregate parts of the dataset before distribution, to ensure that specific people or counterparties are not identifiable (see Section 3.1).

3.5 SUMMARY

In the alternative data investment-driven process, there are some potential risks and pitfalls that we have pointed out so far. First, many data sources could contain rapidly decaying signals, no signals at all, or be simply too expensive compared to the strength of the signal that can be extracted. Second, even if there is a signal as of today, there is no guarantee that it will persist long enough in the future to justify the initial investment (the price of the data and infrastructure costs). Third, finding talent with the right skillset and domain knowledge is still a challenge at the time of writing. This could be a significant source of model risk. Finally, in a rapidly evolving world, new laws could emerge in different geographies and suddenly preclude the use of some types of alternative data (e.g. personal data).

We will show in what follows that having the right approach and strategy to navigate the complexities deriving from the use of alternative data is an absolute necessity if one wants to reap the rewards hidden in it. Although this sounds like a difficult journey, we believe that in the end it will be worth the effort. But before that, let's turn to some of the methodological challenges that can be encountered along the way.

In this chapter, we also talked about many of the challenges associated with alternative datasets. One of these is entity matching. This involves being able to link references to entities such as brands or people to traded assets. These references need to be recorded in a point-in-time format. More broadly, alternative datasets need to be structured. Often they come in forms such as images and text, without a common format. We need to convert such alternative datasets into a more readily consumable form for investors, such as numerical time series. Other challenges we mentioned are not exclusive to alternative datasets, such as being able to deal with missing data and being able to pick out outliers. We will discuss those in greater detail in Chapters 7, 8, and 9.

NOTES

  1. It took effect on May 25, 2018.
  2. In the case of extra-EU companies, the extent of their EU operations could be limited, so a better estimate of their earnings could be possible, provided that the local data protection laws they operate under are not as stringent as the GDPR.
  3. An entity can be a company, a person, a product, or a security, for example.
  4. This type of database is called temporal.
  5. Past GDP figures of countries are frequently revised months after they are first officially released.
  6. We will use the words “anomaly” and “outlier” interchangeably.
  7. See Chandola (2009).