Index

  • 1M implied volatility
    • regressing news volume, contrast, 315f
    • USD/JPY news volume, contrast, 315f
  • 50-day SMA, 82
  •  
  • A/B test, 35
  • accounts payable (accpayable), financial statement item, 241
  • ADP private payroll change, nonfarm payrolls (US change) (contrast), 45f
  • ALFRED time series, 306
  • Algorithms
    • feature detection algorithms, properties, 89f
    • features/feature detection algorithm, 87–89
    • selection decision, 86–87
  • Alpha, 45–46
    • capture data, 256
  • Alternative data, 3
    • adoption curve, 16f
    • brands, association, 24f
    • buy side total spend, 25f
    • capacity, 16–19
    • characteristics, 6
    • collection, cost, 6
    • defining, 5–7
    • dimensions, 19–23
    • forecasts, 339f
    • history, shortness, 6
    • inputs, usage, 127–128
    • maintenance process, 113–114
    • process, 105–114
    • reasons, 11–14
    • risks/challenges, 47
    • segmentation, 7–9, 8t
    • strategies, 35–39, 229t
    • survey data, comparison, 245–247
    • team usage, structuring, 114–115
    • usage, 6, 14f, 50–57
    • use processes, 105
    • users, identification, 15–16
    • value, 27
    • vendors, identification, 23–24
  • Alternative datasets
    • buy side usage, 24–26
    • commercial release frequency, 23f
    • derivation, web scraping (usage), 25f
    • maturing alternative datasets, advantages, 45–46
    • usage, 226t–227t
  • Amazon
    • Comprehend, 102
    • Mechanical Turk, 53
    • revenue, actual revenue changes (contrast/alternative data forecasts), 339f
  • Amazon Web Services (AWS), 79, 338
  • Amelia II (CRAN), 147
  • Amelia imputed time series, examples, 168f, 172f
  • Amelia+MSSA, 165, 167
  • Angle-Based Outlier Factor (ABOF), 203–204
  • Anomalies, 181
    • point anomalies, 184
  • Apache MXNet, 79
  • APIs, usage, 30, 71, 79, 91, 112
  • Approximate methods, rank, 142f
  • Approximate models, 139, 140
  • Arbitrage pricing theory (APT), 122–123
  • Area under the ROC curve (AUC), mean/standard deviation/MSE values, 148f
  • Articles per ticker, average daily count, 311f
  • Artificial Intelligence (AI), 4
  • Asian Crisis (1997), 86
  • Asset class
    • breadth/depth, 20
    • coverage, 20
    • relevance, 19–20
  • Asset price, signals, 29
  • Assets under management (AUM), 16, 360
  • Auctions
    • types, 43
    • usage, 42
  • Auto-associative neural network (AANN), 143
  • Auto-correlation, absence, 64
  • Autoencoders, 71
    • neural networks, 76–77
  • Automatic identification system (AIS)
    • crude oil exports, comparison, 286f
    • data, collection, 285
    • transmitters/messages, usage, 284
  • Automatic Identification System (AIS), 117
  • Automotive company data
    • alternative data strategies, CAGR basis, 229t
    • company list, 240
    • core factors, 223
      • aggregation, 224
    • delayed data, usage, 224
    • direct approach, 223–238
    • factors, 226t–227t, 229–238
    • factors CAGRs, 239t
    • financial statement items, description, 241–242
    • freshest automotive factors summary statistics, 228t
    • freshest data, usage, 224
    • Gaussian processes, example, 238–239
    • long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
    • Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
    • ratios, usage, 242–243
    • reporting delays, country ranking, 244
    • stocks, holding (heatmap), 222f
    • time averaged Spearman rank correlations, 236t–237t
  • Automotive fundamental data, 205, 206–211
    • book-to-market ranking, 214
    • Chevrolet Cruze, country unit sales/registration, 210t
    • equal weighted benchmarks, 219t
    • IHS Markit databases, usage, 206–207
    • indirect approach, 211–222
    • information, examples, 217–218
    • long portfolio, creation, 214
    • non-quarterly reporting companies, examination, 218
    • process, 209f
    • production volume, mean percent, 208f
    • sales volume, mean percent, 208f
    • Stage 1 process, 213–215
    • stages, 213–223
    • stocks, ranking, 214
    • strategies, CAGR ranking, 219t, 220t
    • supporting statistics, 217
    • Tesla, value, 218
    • tradeable companies, long/short-portfolio sizes, 218t
    • transaction costs, 217–218
    • universe, assumption, 215
  • Average percentage deviation, usage, 270t
  • Azure clusters, 79
  •  
  • Backtesting, usage/nonusage, 35–39
  • Backtests, 54, 213
  • Bagging, 67
  • Bag-of-words, 94
  • Bayesian principal component analysis (BPCA), 140
  • Bayes theorem, usage, 69–70
  • BeautifulSoup, 100
  • Benchmark model (BM), 263
  • Berners-Lee, Tim, 299
  • Bias, 60–61, 61f
  • Bidirectional Encoder Representations from Transformers (BERT), 95, 101–102
  • Big Data, 9–11, 50
  • Billion Prices Project, The, 321–322
  • Binned R² analyses, usage, 109
  • Bitly, 324–325
  • Blob-based feature detectors, 88f
  • Blockchain technology, usage, 31
  • Bloomberg News article count, S&P500 (contrast), 310f
  • Book-to-market ratios, 123
  • Brands, alternative data (association), 24f
  • Brazil
    • English attention, local content (comparison), 332f
    • YoY retail sales, SpendingPulse Brazil retail sales YoY (contrast), 337f
  • Buyers, data pricing perspective, 40–41
  •  
  • Caffe, 80
  • CAPEX, 38
  • Capital, allocation (increase), 18
  • Capital Asset Pricing Model (CAPM), 119–124
  • Car counts, 271–277
    • basis, steps, 272–273
    • data, 275f, 276f
    • earnings, contrast, 274f
  • Carhart model, 124
  • Car parks
    • data, DINEOF imputation (example), 175f
    • image, 175f–178f
  • Carry-based factor model, 125
  • Causality (machine learning assumption/limitation), 84–85
  • Central bank intervention, modeling, 346–348
  • Chevrolet Cruze, country unit sales/registration, 210t
  • China
    • GDP growth rate, PMI (contrast), 13f
    • SpaceKnow satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
  • China PMI
    • China GDP QoQ, contrast, 13f
    • manufacturing (surprises), consensus/SMI/hybrid (contrast), 280f
    • manufacturing (measurement), satellite data (usage), 277–280
  • Chinese yuan (CNY) intervention/official data (contrast), model estimates (comparison), 347, 348f
  • Clairvoyance, 211–212
    • impact, 213, 215–216
    • Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
    • Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
  • Classification methods, 139–140
  • CLS Group, establishment, 352
  • Cluster-based outlier factor (CBLOF) algorithm, 188
  • Clustering-based unsupervised machine learning techniques, 70–71
  • Collective outliers, 184
  • Commodity trading advisor (CTA), risk-adjusted returns, 18
  • Common equity (equity), financial statement item, 241
  • Company event study (pooled survey), case study, 249–252
  • Company removal, example, 220
  • Compounded annual growth rate (CAGR), 211–212, 217–222, 225, 229–230
    • factors CAGRs, 239t
    • production, 235
    • Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
    • ranking, 219t
  • Concept most common/average (CMC), 140, 142
  • Consensus estimates (independent variable), 292
  • Consumer Price inflation, measurement, 321–322
  • Consumer receipts, 337–340
  • Consumer transactions, 335
  • Content, topic/sentiment identification, 93
  • Continuous bag of words (CBOW), 95
  • Convolutional neural networks (CNNs), 56, 69, 76, 272
    • convolutional/flat layers, inclusion, 76f
    • usage, 89–90
  • CoreNLP, 101
  • Corner feature detectors, 88f
  • Corporate aircraft, takeover target visits, 297f
  • Corporate data, 341
  • Corporate jet location data, 296–298
  • Corporate Sustainability Assessment (CSA), RobecoSAM creation, 129
  • Corpus of Contemporary American English (COCA), 98–99
  • Cost of Goods Sold (cogs), financial statement item, 241
  • Cost value (CV), 34
  • Cox regression models, 72
  • Credit default swap (CDS), 136, 151
    • data, usage, 111, 154–157
    • time series data, clustering, 156f
  • Credit transaction data, 336–337
  • Critical line approach (Markowitz), 71
  • Cross-sectional trading approach, time series trading approach (contrast), 126
  • Cross-validation (CV), 61–63
  • Crowdsourced data, 245
    • case studies, 249–254
    • contributors, hierarchy, 246f
    • product, 247–249
    • usage, 247
  • Crowdsourcing analyst estimates survey, 255
  • Crude oil production, OPEC ranking, 253f
  • Crude oil supplies (tracking), shipping data (usage), 283–286
  • Cryptocurrency price actions (understanding), Wikipedia (usage), 330
  • CScores, 198–199
    • histogram plot, 197f
  • Currency Composition of Official Foreign Exchange Reserve (COFER) data, 347f
  • Currency crisis risk, quantification, 344–346
  • Currency markets, central bank intervention (modeling), 346–348
  • Currency pair, purchase/sale, 354
  • Current liabilities (currliab), financial statement item, 241
  • CUSIP standard, 52
  •  
  • Daily flow returns, 355f
  • Data
    • aggregation, 57–58
    • assets, 33, 105
    • availability, 21–22
    • bias, 21
    • clarity, 111
    • delayed data, usage, 224
    • external consistency, 111
    • external marketing value, 44–45
    • free data, 20
    • frequency, 20–21
    • freshest data, usage, 224
    • fusion, 52
    • internal consistency, 111
    • legal aspects, 47–50
    • markets, 29–31
    • mining, 124–126
    • missing data, 51, 54
    • monetary value, 31–35, 39–45
    • onboarding, performing, 106, 110
    • originality, 22
    • outliers, treatment, 51
    • points, distinction, 182
    • preprocessing, performing, 106, 110–111
    • pricing, perspective, 40–45
    • protection laws, comparison, 48f
    • quality, 21, 111
    • science team, setup cost, 116f
    • services, 30
    • sources, entity identifiers (matching), 51
    • strategies, evaluation, 35–39
    • structuring, 55–56
    • team (creation), big bang hiring strategy, 115
    • test data generation, 154–157
    • timeliness/completeness, 111
    • transformation, stages, 9f
    • underusage, 15
    • uniqueness, 111
    • unstructured data, conversion, 51
    • upside sharing, external sales, 43–44
    • usage, limitations, 49
    • validity/veracity, 111
    • values, 32–33, 136
    • vendors, 116–117
    • view, representation, 184
  • Data-as-a-Service (DaaS), 117
  • Data interpolation with empirical orthogonal functions (DINEOF), 152–153, 160–162
  • Datasets
    • identification, 107–108
    • price, assignation, 27
    • restricted information set, 86
    • shift, types, 85
    • time stamps, 110
    • traditional datasets, 320–321
    • usage, 186t, 269
  • Debit card transaction data, 336–337
  • Decision boundary, example, 68f
  • Decision trees, 67
  • Deep learning (DL), 72–80, 82–83
    • defining, 77
    • examples, 73–74
    • high-level deep learning libraries, 79
    • libraries, 77–80
    • low-level deep learning libraries, 77–79
    • middle-level deep learning libraries, 79
    • usage, 89–90
  • Deletion (missing data treatment), 137–138, 143
  • Density-based techniques, 203
  • Deterministic techniques, usage, 160–164
  • Directionality factor, 82
  • Direct prediction, 129–132
  • Discretionary investors, 38–39
  • Distance-based techniques, 202–203
  • Diversification, factor investing benefit, 127
  • Do not impute (DNI), 140
  • Due diligence, performing, 105, 108
  • Dutch auction, 43
  • Dwell time, 288
  •  
  • Earnings before interest and taxes (ebit), financial statement item, 241
  • Earnings, car counts (contrast), 274f
  • Earnings per share (EPS), 271–277
    • estimation, mobile phone location data (usage), 291–295
    • examples, 275f, 276f
    • news/Twitter data, contrast, 294f
    • regressing footfall, contrast, 294f
  • EBIT and depreciation (opincome), financial statement item, 242
  • EBIT-to-EV ratio, 216, 242
  • Economic Sentiment Indicator (ESI), 260–262
  • Economic theory, test, 220
  • Economic value (EV), 35
  • Edge feature detectors, 88f
  • Efficient Market Hypothesis (EMH), 27
  • Emerging Market (EM) currencies basket (trading), macro-economy attention (usage), 333f
  • Emerging Market Foreign Exchange (EMFX), 323, 330–333
  • Empirical orthogonal function (EOF), 160–164
  • English auction, 43
  • Enhanced Vegetation Index (EVI), 278
  • Entities, identification, 93
  • Entity identifiers, matching, 51
  • Entity matching, 52–54
  • Environmental Social Governance (ESG) factors, 128–129
  • Equal weighted benchmark, long top 33% strategy excess returns (contrast/Pearson correlations), 233t–234t
  • Equal weighted benchmarks, 219t
  • Equities (trading), innovation measures (usage), 342–344
  • Errors, types, 64
  • Eurozone
  • EUR/USD
    • bid/ask spread, 356f, 357f
    • daily abs net flow, 353f
    • daily volume, 352f
    • ON implied volatility, FOMC news volume (contrast), 317f
    • index, EUR/USD fund flow score (contrast), 354f
    • overnight volatility, 317f, 318f
    • trading, intraday basis, 308f
    • ON volatility levels, 317f
  • Exhaust data, 7
  • Expectation conditional maximization (ECM), 149
  • Expectation maximization (EM) procedure, 143, 159–160
  • Explorer VI satellite, 267, 268f
  • Exponential MACD, 82
  • Exports/lights/GDP, annual correlation, 270t
  •  
  • Factor
    • CAGRs, 239t
    • correlations, 232–238
    • factor-based strategies, 126–127
    • generation, 224–225
    • identification, 212
    • modeling/forecasting, 212
    • performance, 225–229
    • removal, 224
  • Factor investing, 119
    • benefits, 127
    • cost, reduction, 127
    • usage, reasons, 126–127
  • Factor models, 120–126
    • approaches, 125–126
    • definition, 120
    • modeling sequences, examples, 130f–131f
    • types, 121–122
  • Fama-French 3-factor model, 123–124
  • Fear gauge, 328
  • Feature detection algorithms, properties, 89f
  • Feature detectors, types, 88f
  • Features/feature detection algorithm, 87–89
  • Fed communications, 316–320
  • Fed communications index
    • categorical/continuous variables, mixture, 199
    • CScores, histogram plot, 197f
    • event types, 196f, 200f
    • fields, tagging, 194
    • input variables, usage, 199
    • log(text length), histogram plot, 195f
    • outlier detection, case study, 194
    • rules-based approaches, 198
    • speakers, talkativeness (ranking), 197f
  • Federal Open Market Committee (FOMC), 111, 183, 194–198
    • communications, availability, 319
    • EUR/USD ON volatility levels, 317f
    • meetings, 66, 295–296, 316
    • news volume, EUR/USD ON implied volatility (contrast), 317f
    • sentiment index, 320f
    • stock market reaction forecast, Twitter data (usage), 308–309
  • Feed forward neural networks, 75–76
  • Financial markets
    • alternative data, relationship, 6
    • PMI, impact, 263–265
  • Financial problems, modeling techniques (suggestions), 83t
  • Financial ratios, usage, 129
  • First-Price Sealed-Bid auction, 43
  • Flat-fee models, 30
  • Footfall
    • regressing footfall, reported EPS (contrast), 294f
    • reported EPS, contrast, 293f
    • score (independent variable), 292
  • Foreign Exchange (FX), 5, 341
    • average crisis rates, 346f
    • daily flow returns, 355f
    • data, 6
    • flow data, institutional FX flow data (relationship), 351–355
    • spot returns, net flow (multiple regressions), 353f
    • trading, machine-readable news (usage), 310–316
    • trend returns, 355f
    • trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
    • volatility (understanding), machine-readable news (usage), 310–316
  • Free data, presence, 20
  • Freemium models, free services/value-added services (combination), 30
  • Free services, value-added services (combination), 30
  • Freshest automotive factors summary statistics, 228t
  • Fundamental factor model, 121, 122
  • Funds from operations (ffo), financial statement item, 241
  • Fuzzy k-means clustering (FKMI), 140, 142
  • FX Risk Tool (Oxford Economics), 345
  •  
  • Gaussian distributions, 202
  • Gaussian Finite Mixture Models, 185
  • Gaussian mixture model (GMM), 143
  • Gaussian processes (GPs), 80–82
    • example, 238–239
    • orthogonality/nonlinearity, 238
    • representation, 81
  • Gaussian Process Regression (GPR), 238
  • GBP/USD intraday volatility, UK PMI Services (basis), 265f
  • General Data Protection Regulation (GDPR), 47, 50, 287
  • General partners, AUM ranking, 361f
  • Generative adversarial neural networks (GANs), 63, 77
  • Gensim, 101
  • Geospatial Insight dataset, usage, 272
  • GitHub, 79
  • Glmnet, 72
  • Global outliers, local outliers (contrast), 184
  • Global Vectors for Word Representation (GloVe), 95
  • Google
    • Cloud Natural Language, 102
    • Cloud Speech-to-Text, 102
    • Domestic Trend, 325–326
    • regressing Google domestic trend indices, 326f
    • search volume, example, 326f
    • Shock Sentiment, 326, 327f
    • trends data, usage, 325–327
  • Government data, 341
  • Grapedata, 247–256
  • Gross Domestic Product (GDP), 259
    • exports/lights, annual correlation, 270t
    • growth correlations, 262t
    • proxying, 270
    • release, 11
  •  
  • Hang Seng index, share price (performance), 252f
  • Happiness Sentiment Index, 304, 305f
  • Hedonometer
    • average score, 304f
    • happiest/saddest words (ranking), 302f
    • Index, 302–305, 303f, 322
  • Heuristics-based approaches, 203–204
  • Hierarchical clustering, 70–71
  • Hierarchical density-based spatial clustering of applications with noise (HDBSCAN), 199
  • High-capacity strategies, properties, 18
  • High-frequency data, usage, 355–357
  • High-level deep learning libraries, 79
  • High-level neural network libraries, 79
  • Histogram-based outlier score (HBOS), 198
  • Histogram-based statistical outlier (HBOS) detector, 188
  • Holding period, usage, 213
  • Homoscedastic errors, 64
  • HTML tags, removal, 300
  • Hyperspace, contents, 89
  •  
  • I/B/E/S dataset, 255
  • Ignore missing (IM), 140
  • IHS Markit (IHSM), 23, 259, 285
    • databases, usage, 206–207
    • data features, 243
    • process, 209f
  • Images
    • classification, deep learning/CNNs (usage), 89–90
    • features/feature detection algorithm, 87–89
    • imaging tools, 91
    • satellite image data, dataset augmentation, 90–91
    • structuring, 87–91
  • Imputation methods, 152
    • multiple imputation (MI) methods, 157–160
    • ranking, 143f
    • values, computation, 148f
  • Imputation metrics, 154
  • Imputation-posterior (I-P) form, 158
  • Imputation step (I-step), 158
  • Imputation technique, classifiers (rank), 141f
  • Index market, evolution, 127–128
  • Indicator computation, car counts basis (steps), 272–273
  • Indirect prediction, 129–132
  • Induction learning methods, 140
  • Industrial data, 341
  • Information coefficient (IC), 217, 230
  • Information ratios, 312
  • Innovation measures, usage, 342–344
  • Input dataset error rates, LERS new classification (usage), 146f–147f
  • Institutional FX flow data, FX spot (relationship), 351–355
  • Interest rate swaps (IRSs), 71
  • Internal exhaust source, requirements, 24
  • inventory, financial statement item, 241
  • Investment
    • capacity, increase, 127
    • management constituents, phase identification, 16f
    • strategy, 22, 105
    • value, decay, 27–29
  • Investopedia search data, usage, 328–329, 328f
  • Investor Anxiety Index (IAI), 328
    • usage, 329f
    • Volatility Index (VIX), contrast, 328f, 329f
  • Investors
    • anxiety (measurement), Investopedia search data (usage), 328–329, 328f
    • attention, 323–325
    • discretionary investors, 38–39
    • systematic investors, 36–38
  • Isolation-Based Outliers, 204
  • Isolation forest (ISO), 199
  •  
  • Joint Organizations Data Initiative (JODI) Oil World Database, usage, 285
  • JX Mobile III, 249
    • launch, payment willingness, 251f
    • test version, usage question, 250f
  • JX PC III, monthly spending question, 251f
  •  
  • KDB, usage, 110
  • Keras, 79
  • Kernel Density Estimation, 185
  • Kernels, usage, 81
  • Kernel trick, example, 69f
  • Kingsoft, 250
    • share price, performance, 252f
    • survey, questions, 256–257
  • k-means (K-means), 70, 143, 198
  • k-means clustering information (KMI), 140, 142, 149
  • k-nearest neighbors (KNN) (KNNI), 140, 147–149, 187, 199
    • regression/classification, 149
    • usage, 143
  • Kriging, 80–81
  •  
  • Lasagne, 79
  • Latency arbitrage, 307
  • Latent Dirichlet Allocation (LDA), 97
  • Latent semantic analysis (LSA), 97
  • Lazy learning, 139, 140
    • methods, rank, 142f
  • Learning from Examples based on Rough Sets (LERS), 146–147
  • LERS new classification, usage, 146f–147f
  • Licensees, number, 35
  • Lights/exports/GDP, annual correlation, 270t
  • Light vehicle production (IHS Markit database), 206
  • Light vehicle sales (IHS Markit database), 207
  • Linear regression (LR), 64–65, 84, 87, 348
    • neural network function, 73, 73f
    • visualization, 65f
  • Liquidity
    • social media, impact, 309
    • understanding, high-frequency data (usage), 355–357
  • Local least squares imputation (LLSI), 140
  • Locally linear reconstruction (LLR), 149
  • Local outlier factor (LOF), 187, 203
    • score visualization, example, 187f
  • Local outliers, global outliers (contrast), 184
  • Location data, 283
  • Logistic regression, 65–67, 82
    • neural network function, visualization, 74f
    • single class logistic regression, neural network function, 73
    • visualization, 66f
  • Log-likelihood, 159
  • Long-only portfolios (derivation), visa/patent data (usage), 343f
    • in sample/out-of-sample, 344f
  • Long portfolio, creation, 214
  • Long short-term memory (LSTM), 76, 87
  • Long threshold, usage, 213
  • Long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
  • Low-capacity strategies, 18
  • Low-level deep learning libraries, 77–79
  • Low-level neural network libraries, 77–79
  •  
  • Machine learning (ML), 4
    • algorithms, calibration, 61
    • bias/variance/noise, 60–61
    • clustering-based unsupervised machine learning techniques, 70–71
    • cross-validation (CV), 61–62
    • deep learning, 72–80
    • definitions, 60
    • examination, 62–63
    • fit, expected error (equation), 60
    • Gaussian processes (GPs), 80–82
    • libraries, 71–72
    • neural networks, 72–80
    • procedures, usage, 143
    • processing layers, involvement, 40
    • reinforcement learning, 63
    • supervised learning, 62
    • supervised machine learning techniques, 64–70
    • techniques, 59, 60, 82–87
    • unsupervised learning, 63
    • unsupervised machine learning techniques, 71
  • Machine-readable news, usage, 310–316
  • Macro data, forecasting, 129–130
  • Macroeconomic factor model, 121, 122
  • Macro-economy attention, usage, 333f
  • Malls, visits (comparison), 290f
  • Market data, 351
  • Market participants, alternative data usage, 6
  • Market themes (measurement), Google trends data (usage), 325–327
  • Market value (MV), 34–35
  • Markov Chain Monte Carlo sampling, 157
  • Marks & Spencer, car count/earnings (contrast), 274f
  • Material non-public information (MNPI), 49, 109
  • Matlab ports, 72
  • Matplotlib, 91
  • Matrix factorization, 162–166
  • Maturing alternative datasets, advantages, 45–46
  • Maximization step (M-step), 159
  • Maximum likelihood estimation (MLE), 159
  • Mean absolute percentage error (MAPE), 154
  • Mean quintile gap (MQG), 217, 225, 230
  • Mean relative deviation (MRD), 154
    • metrics, summary statistics, 166t–167t
  • Mergers and acquisitions (M&As), 296–298
  • Metadata
    • addition, 93
    • identification, 300
  • Micro-clusters, 184
  • Middle-level deep learning libraries, 79
  • Middle-level neural network libraries, 79
  • Misclassification error rate, examples, 145f–146f
  • MissForest: Random Forest imputation, 180
  • Missing at Random (MAR), 137
  • Missing Completely at Random (MCAR), 136, 137, 148, 155–157
  • Missing data, 54, 135
    • case studies, 151
    • classification, 136–138, 143
    • classifier design, deletion, 143
    • deletion, 137–138
    • distinctions, 136–137
    • fraction, usage, 153
    • imputation/estimation, 143
    • inclusion, 144f
    • incomplete cases, deletion, 143
    • misclassification error rate, 145f–146f
    • predictive imputation, 138
    • replacement, 138
  • Missing data treatments, 51, 137–138
    • Farhangfar et al. perspective, 148
    • Garcia-Laencina et al. perspective, 143–146
    • Grzymala-Busse et al. perspective, 146–147
    • Jerez et al. perspective, 147–148
    • Kang et al. perspective, 149
    • literature overview, 139–149
    • Luengo et al. perspective, 139–143
    • Zou et al. perspective, 147
  • Missingness patterns
    • imposition, example, 164f
    • occurrence, number (histogram), 156f
  • Missing Not at Random (MNAR), 137
  • Missing values
    • consecutive missing values, length statistics (usage), 153
    • total fraction, usage, 153
  • Mixture of Gaussians (MoG), 149
  • Mobile phone location data
    • independent variables, 292
    • usage, 287–295
  • Model backtesting, 213
  • Model-based nowcast, 307
  • Model-based procedures, usage, 143
  • Model-based techniques, 202
  • Model forecasts, comparison, 270t
  • Monopoly, impact, 42–43
  • Montreal Institute for Learning Algorithms (MILA), Theano development (cessation), 78
  • Multicollinearity, presence, 64
  • Multi-layer perceptron (MLP), 140, 143
    • hidden layer, inclusion, 75f
    • neural network, 75
  • Multiple imputation (MI) methods, 137, 138, 148, 157–160
  • Multiple imputation with chained equations (MICE), 153, 157
    • imputed time series, 168f
    • package, norm, 158
    • procedure, description, 178–179
    • usage, 179
  • Multiple singular spectral analysis (MSSA), 152–153, 162–164
    • imputation, example, 170f
    • imputed time series, example, 173f
    • usage, 180
  • Multi-task learning (MTL), 143
  • Multivariate credit default swap time series
    • CDS data, 154–157
    • deterministic techniques, 160–164
    • EOF-based techniques, 160–164
    • imputation metrics, 154
    • missing data classification, 153–154
    • missing values, imputing, 152
    • MRD metrics, summary statistics, 166t–167t
    • results, 164–173
    • test data generation, 154–157
  • Multi-variate normal (MVN)
  • MXNet (Apache), 79
  •  
  • Naïve Bayes (NB), 69–70, 140
  • Named entity recognition, 92–93
  • Natural language processing (NLP), 55, 78, 91–102
    • challenges, 97–98
    • defining, 91–93
    • languages/texts, differences, 98–99
    • normalization, 93–94
    • speech, involvement, 99–100
    • tasks, classification problem, 96
    • tools, 100–102
    • word embeddings, creation, 94–96
  • NDAs, negotiation, 30
  • Negative change, ratios, 242–243
  • Net flow, spot returns (multiple regressions), 353f
  • Net income (netincome), financial statement item, 242
  • Net-Income-to-EV ratio, 216, 242
  • Neural networks (NNs), 72–80, 184
    • examples, 73–74
    • frameworks, 79–80
    • high-level neural network libraries, 79
    • libraries, 77–80
    • low-level neural network libraries, 77–79
    • middle-level neural network libraries, 79
    • types, 75–77
  • News, 309–320
    • articles per ticker, average daily count, 311f
    • Bloomberg News article count, S&P500 (contrast), 310f
    • trend correlation, contrast, 313f
    • trend information ratio, contrast, 313f
    • trend model returns, contrast, 314f
    • trend model YoY returns, contrast, 314f
  • News data, 299
    • reported EPS, contrast, 294f
  • newspaper3k, 101
  • News score (independent variable), 292
  • New York Fed meetings, 295–296
  • NLTK, 101
  • No-free-lunch (NFL) theorems, 82
  • Noise, 60–61, 88, 182
    • cause, 60
  • Nonfarm payrolls (NFPs), US change
    • ADP private payroll change, contrast, 45f
    • Twitter-based forecast, actual release/Bloomberg consensus survey (contrast), 307f
    • Twitter data, usage, 305
  • Nonfarm payrolls (surprise), USD/JPY 1-minute move (contrast), 306f
  • Non-negative matrix factorization (NMF), 97
  • Non-financial problems, modeling techniques (suggestions), 83t
  • Non-quarterly reporting companies, examination, 218
  • Non-stationarity (machine learning assumption/limitation), 85–86
  • Norges Bank Investment Management, 128
  • Normalization, 93–94
  • Normalized Difference Vegetation Index (NDVI), 278
  • Normally distributed errors, 64
  • Normal neighborhood, selection (difficulties), 190f
  • Nowcasting
    • Eurozone (EZ) GDP growth, 260f
    • GDP growth, 262–263
  • NumPy, 77, 80
  •  
  • Official Foreign Exchange Reserve, currency composition (COFER data), 347f
  • Oil and gas production (Q&A survey), case study, 252–254
  • Oil prices/supply changes, contrast, 254f
  • One-class SVM, 188
  • OPEC, 252
    • crude oil production estimates, 253f
    • oil supply changes, oil prices changes (contrast), 254f
  • OpenCV, 91
  • OpenSky dataset, usage, 297
  • OPEX, 38
  • Optical Character Recognition (OCR), 55
  • Original Equipment Manufacturers (OEM), decision-making, 206–207
  • Outliers
    • anomalies, 181
    • definition/classification, 182–183
    • flagging, 200f
    • global outliers, local outliers (contrast), 184
    • local outlier factor (LOF), 187
    • temporal structure, 183
    • treatment, 51, 56–57
  • Outliers, detection
    • algorithms, comparative evaluation, 185–188, 186t
    • approaches, 182–183
    • case study, 194
    • density-based techniques, 203
    • distance-based techniques, 202–203
    • heuristics-based approaches, 203–204
    • model-based techniques, 202
    • problem, setup, 184–185
    • techniques, 57
    • unsupervised ML techniques, usage, 198–199
  • Outliers, explanations
    • Angiulli et al. explanation, 192–193
    • approaches, 189–193
    • Duan et al. explanation, 191–192
    • Micenkova et al. explanation, 189–190
    • rank statistic, usage (problem), 191f
  •  
  • Packaging models, 30
  • Pandas, 80
  • Passive investing, 127
  • Passive strategies, 126
  • pattern, 101
  • Pattern classification methods, missing data (inclusion), 144f
  • Pay-per-use models, 40
  • Payroll readership, usage, 323–325
  • PDFMiner, 101
  • Percentage ratios, 243
  • Personal data, definition, 47
  • Pillow, 91
  • Point anomalies, 184
  • Point-of-sale (POS) devices, usage, 338
  • Poisson regression models, 72
  • Pooled surveys
    • company event study (pooled survey), case study, 249–252
    • usage, 247
  • Portfolio, effects, 22
  • Posterior step (P-step), 158
  • Predicted R squared coefficient, true R squared coefficient (contrast), 154
  • Predictive imputation (missing data treatment), 138
  • Predictive mean matching (PMM), 158, 165
  • Pricing
    • discriminatory pricing mechanisms, 42f
    • equation, 40
  • Principal component analysis (PCA), 71, 76
  • Principal Component Regression (PCR), 238
  • Private equity
  • Private firms, performance (understanding), 363
  • Private markets, alternative data, 359
  • Probabilistic Graphical Model (PGM), example, 130f
  • Processed data (data transformation stage), 9f
  • Process expense, 34
  • Processing level, 21
  • Processing libraries, 80
  • prod_volume_prev_1m_pct_change_prev_2m_mean, 232, 235, 238
  • Proof-of-concept (POC), 106, 112
  • Pseudo-time, basis, 262
  • Publishing lag, 21
  • Purchasing Managers Indexes (PMI), 259
    • China PMI, China GDP QoQ (contrast), 13f
    • impact, 263–265
    • indicators, appropriateness, 108
    • manufacturing (measurement), satellite data (usage), 277–280
    • performance, 261–262
    • release, 11–12
    • US GDP growth rate, contrast, 12f
  • Python ports, 72
  • PyTorch, 78
  •  
  • Q&A surveys
    • oil and gas production (Q&A survey), case study, 252–254
    • usage, 247
  • Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
  • Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
  • Q_pct_delta_ffo, stocks holding (heatmap), 222f
  • Quarterly benchmark
    • Q_pct_delta_ffo returns plot, contrast, 222f
    • revenues_sales_prev_3m_sum_prev_1m_pct_change, contrast, 230f
    • usa_sales_volume_prev_12m_sum_prev_3m_pct_change returns plot, contrast, 232f
    • ww_market_share_prev_1m_pct_change returns plot, contrast, 231f
  •  
  • Radial basis function network (RBFN), 140
  • Random forest (RF), 67–68, 184
  • Ranking factor, usage, 213, 216–217
  • Raw data (data transformation stage), 9f
  • Real estate investment trust (REIT) ETF (trading), mobile phone location data (usage), 288–291
  • Rectified Linear Unit (RELU), 90
  • Recurrent neural networks (RNNs), 76, 87, 143
  • Regressing consensus, 275f
    • estimates/footfall, reported EPS (contrast), 293f
  • Regressing footfall, reported EPS (contrast), 294f
  • Regressing Google domestic trend indices, 326f
  • Regressing news
    • sentiment, 276f
    • volume, 1M implied volatility (contrast), 315f
  • Regression, 62
    • linear regression, 64–65
    • logistic regression, 65–66
    • models, 72
    • softmax regression, 67
  • Reinforcement learning, 63
  • Replacement (missing data treatment), 138
  • Reported EPS, regressing consensus estimates/footfall (contrast), 293f
  • Reports, usage, 247
  • Research cost, 21
  • REST API, usage, 102
  • Restricted information set, 86
  • Retail activity (understanding), mobile phone location data (usage), 287–295
  • Retailers, car counts/EPS, 271–277
  • Returns, sensitivity, 18
  • Revenue, maximization (equation), 42
  • revenues_sales_prev_3m_sum_prev_1m_pct_change, 232, 238
    • quarterly sales volume, monthly change, 229–230
    • quintile CAGR, 230f
    • returns plot, quarterly benchmark (contrast), 230f
  • RIPPER, 148
  • Risk-adjusted returns, 312
  • Risk managers, 39
  • Risk metrics, 39
  • Risks, pre-assessment, 106, 109
  • Risk tolerance levels, 36
  • Root mean square error (RMSE), 154
    • computation, 161
  • Root Mean Square Forecasting Errors (RMSFE), 2643
  • Ross, Stephen, 122
  • R squared coefficient, differences, 154
  • RSSA, 180
  • Rule induction learning, 139
    • methods, rank, 141f
  •  
  • Sales/revenue (sales), financial statement item, 242
  • Sales-to-EV ratio, 216, 242
  • sales_volume_prev_1m_pct_change_prev_2m_mean, 232, 235
  • Satellite data, usage, 277–280
  • Satellite images
    • aerial photography, 267
    • analysis, process (steps), 174
    • case study, 173
    • data, dataset augmentation, 90–91
  • Satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
  • scikit-image, 77, 91
  • scikit-learn, 59, 71–72, 77, 101
  • SciPy, 77, 80
  • SciPy.ndimage, 91
  • Scrapy, 100
  • Search volume, example, 326f
  • Self-organizing map (SOM), 143
  • Sellers, data pricing perspective, 41–45
  • Semi-supervised anomaly detection, 57
  • Sentiment
    • analysis, classification problem, 96
    • identification, 93
    • social media, impact, 309
  • Sequential minimal optimization (SMO), 140
  • Sharpe ratio, 128, 212, 239
  • Shipping data, usage, 283–286
  • Short threshold, usage, 213
  • Shure, Sennheiser (MoM) (Amazon spend comparison), 339f
  • Signals
    • data transformation stage, 9f
    • existence, pre-assessment, 106, 109–110
    • extraction, performing, 106, 111–112
  • SimpleCV, 91
  • Simple moving average (SMA), application, 331–332
  • Single class logistic regression, neural network function, 73
  • Singular spectral analysis (SSA), 162
  • Singular value decomposition (SVD), 71, 160
  • Singular value decomposition imputation (SVDI), 140
  • sinkr, 180
  • Siri, usage, 99–100
  • Smart beta indices, alternative data inputs (usage), 127–128
  • Social media, 300–309
    • data, 299
    • Hedonometer index, 302–305
  • Social Media Analytics, 301
  • Soft data, 260
  • Softmax regression, 67
    • neural network function, 73–74, 74f
  • Software libraries, usage, 179–180
  • spaCy, 101
  • Spearman correlation, 235
  • Speech, involvement, 99–100
  • SpeechRecognition, 102
  • SpendingPulse Brazil retail sales YoY, Brazil YoY retail sales (contrast), 337f
  • SpendingPulse index (MasterCard), 337
  • Spot returns, net flow (multiple regressions), 353f
  • Standard and Poor's 500 (S&P500), 77, 265
    • Bloomberg News article count, contrast, 310f
    • Google Shock Sentiment, contrast, 327f
    • Google Shock Sentiment scatter, contrast, 327f
    • Happiness Sentiment Index, contrast, 305f
    • returns, 82
    • trading, IAI/VIX (usage), 329f
  • Statistical factor model, 121–122
  • Stochastic discount factor, definition/equation, 41
  • Stocks
    • exchanges, stock ranking, 211
    • heatmap, 222f
    • market reaction (forecast), Twitter data (usage), 308–309
  • Stocktwits data/sentiment factor, 82, 309
  • Strategic risks, impact, 11
  • Strategy
    • capacity, 16–19
    • data transformation stage, 9f
    • high-capacity strategies, properties, 18
    • investment strategy, time frequency, 22
    • loss making, 18
    • low-capacity strategies, 18
    • setup, 105, 106–107
  • Stride, 90
  • Structuring level, 21
  • Subject matter experts (SMEs), impact/usage, 107, 112
  • Supervised anomaly detection, 57
  • Supervised learning, 62
  • Supervised machine learning techniques, 64–70
    • assumptions, 64
  • Support vector machine (SVM) (SVMI), 68–69, 140, 142, 148, 162, 184
  • Survey data, 245
    • alternative data use, 245–247
    • case studies, 249–254
    • contributors, hierarchy, 246f
    • product, 247–249
    • usage, 247
  • Surveys
    • crowdsourcing analyst estimates survey, 255
    • process, 249f
    • technical considerations, 254–255
    • timeline, example, 248f
  • Synthetic 2D data, DINEOF imputation (example), 161f
  • Systematic investors, 36–38
  •  
  • tabula-py, 101
  • Taxi ride data, 295–296
  • Technology, score, 22
  • TensorFlow, 59, 77–79, 101
    • Tutorials, 94
  • Tesla, value, 218
  • TextBlob, 101
  • Text data, 299
  • TF-IDF, 94
  • TF Learn, 79
  • Thasos Mall Foot Traffic Index, 288
    • YoY, US retail sales YoY (contrast), 289f
  • Theano, 77, 78
  • The many, Big Data (contrast), 9–11
  • Tickers, usage (change), 18
  • TickerTags, 98
  • Time
    • frame, impact, 217
    • removal, 219
  • Time averaged Spearman rank correlations, 236t–237t
  • Time series data, examples, 164f, 171f
  • Time series trading approach, cross-sectional trading approach (contrast), 126
  • Topic
    • identification, 93
    • modeling, 96–97
  • Total assets (totassets), financial statement item, 242
  • Tradeable companies, long/short-portfolio sizes, 218t
  • Transaction cost analysis (TCA), 356
  • Transaction costs, 217–218
    • impact, 18f
    • increase, impact, 18
  • Transaction time, 54
  • Trend returns, 355f
  • Trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
  • Trial availability, 22
  • True R squared coefficient, predicted R squared coefficient (contrast), 154
  • t-statistics, report, 314
  • Turkey PVIX indicator, USD/TRY 1M implied volatility (contrast), 331f
  • Twitter data, 309
    • reported EPS, contrast, 294f
    • usage, 305–308
  • Twitter mood data, usage, 15
  • Twitter score (independent variable), 292
  • Two-part-tariff models, 31
  •  
  • Unstructured data, conversion, 51
  • Unsupervised anomaly detection, 57
  • Unsupervised learning, 63
  • Unsupervised ML techniques, usage, 198–199
  • usa_sales_volume_prev_12m_sum_prev_3m_pct_change, 229, 232, 238
    • quintile CAGR, 232f
    • returns plot, quarterly benchmark (contrast), 232f
    • yearly US sales volume, quarterly change, 231
  • USD/JPY
    • bid/ask spread, 357f
    • news sentiment score, weekly returns (contrast), 312f
    • news volume, 1M implied volatility (contrast), 315f
    • trading, intraday basis, 308f
  • USD/TRY 1M implied volatility, Turkey PVIX indicator (contrast), 331f
  • US employment report, payrolls clicks, 324f
  • US export growth, forecasting, 269–271
  • US GDP growth rate, PMI (contrast), 12f
  • US/global new vehicle registration/sales (IHS Markit database), 207
  • US ISM, US GDP QoQ (contrast), 12f
  • US retail sales YoY, Thasos Foot Traffic Index YoY (contrast), 289f
  • UST 10Y yield changes, 320f
  • US Treasury yields, 316–320
  •  
  • Value-at-Risk (VaR), enhancement, 346
  • Value-of-information, 199
  • Variance, 60–61
    • bias, balance, 61f
    • cause, 60
  • Vendors
    • due diligence, performing, 105, 108
    • identification, 23–24
    • monopoly, impact, 42–43
  • Venture capital
    • firms, defining, 360
    • transactions, database charting, 362
  • Vickrey auction, 43
  • Viscosity factor, 82
  • Vision, setup, 105, 106–107
  • Volatility Index (VIX), 63, 77
    • Investor Anxiety Index (IAI), contrast, 328f, 329f
    • usage, 329f
  • Volume, Variety, Velocity, Variability, Veracity, Validity, Value (Big Data characteristics), 9–10
  •  
  • Walmart, earnings per share (consensus/footfall contrast), 291f
  • Web data, 299
    • collection, 299–300
    • search volume example, 326f
  • Web page
    • body text, capture, 300
    • content, downloading, 300
    • time stamp, assignation, 300
  • Web scraping, usage, 25f, 300
  • Web sources, 320–322
  • Weighted k-NN (WKNNI), 140
  • Wikipedia, usage, 330
  • Word2vec, 94–96
  • Word embeddings, creation, 94–96
  • Words, frequency (example), 99f
  • Word tokenization/segmentation, usage, 92
  • Wrappers, writing, 112
  • ww_market_share_prev_1m_pct_change, 229, 232, 235
    • quintile CAGR, 231f
    • returns plot, quarterly benchmark (contrast), 231f
    • worldwide market shares, monthly change, 230
  •  
  • XLNet, 96
  • XRT, returns/trading (Thasos Mall Foot Traffic index basis), 288–289, 290f
  •  
  • YARN clusters, 79