- 1M implied volatility
- regressing news volume, contrast, 315f
- USD/JPY news volume, contrast, 315f
- 50-day SMA, 82
-
- A/B test, 35
- accounts payable (accpayable), financial statement item, 241
- ADP private payroll change, nonfarm payrolls (US change) (contrast), 45f
- ALFRED time series, 306
- Algorithms
- feature detection algorithms, properties, 89f
- features/feature detection algorithm, 87–89
- selection decision, 86–87
- Alpha, 45–46
- Alternative data, 3
- adoption curve, 16f
- brands, association, 24f
- buy side total spend, 25f
- capacity, 16–19
- characteristics, 6
- collection, cost, 6
- defining, 5–7
- dimensions, 19–23
- forecasts, 339f
- history, shortness, 6
- inputs, usage, 127–128
- maintenance process, 113–114
- process, 105–114
- reasons, 11–14
- risks/challenges, 47
- segmentation, 7–9, 8t
- strategies, 35–39, 229t
- survey data, comparison, 245–247
- team usage, structuring, 114–115
- usage, 6, 14f, 50–57
- use processes, 105
- users, identification, 15–16
- value, 27
- vendors, identification, 23–24
- Alternative datasets
- buy side usage, 24–26
- commercial release frequency, 23f
- derivation, web scraping (usage), 25f
- maturing alternative datasets, advantages, 45–46
- usage, 226t–227t
- Amazon
- Comprehend, 102
- Mechanical Turk, 53
- revenue, actual revenue changes (contrast/alternative data forecasts), 339f
- Amazon Web Services (AWS), 79, 338
- Amelia II (CRAN), 147
- Amelia imputed time series, examples, 168f, 172f
- Amelia+MSSA, 165, 167
- Angle-Based Outlier Factor (ABOF), 203–204
- Anomalies, 181
- Apache MXNet, 79
- APIs, usage, 30, 71, 79, 91, 112
- Approximate methods, rank, 142f
- Approximate models, 139, 140
- Arbitrage pricing theory (APT), 122–123
- Area under the ROC curve (AUC), mean/standard deviation/MSE values, 148f
- Articles per ticker, average daily count, 311f
- Artificial Intelligence (AI), 4
- Asian Crisis (1997), 86
- Asset class
- breadth/depth, 20
- coverage, 20
- relevance, 19–20
- Asset price, signals, 29
- Assets under management (AUM), 16, 360
- Auctions
- Auto-associative neural network (AANN), 143
- Auto-correlation, absence, 64
- Autoencoders, 71
- Automatic identification system (AIS)
- crude oil exports, comparison, 286f
- data, collection, 285
- transmitters/messages, usage, 284
- Automatic Identification System (AIS), 117
- Automotive company data
- alternative data strategies, CAGR basis, 229t
- company list, 240
- core factors, 223
- delayed data, usage, 224
- direct approach, 223–238
- factors, 226t–227t, 229–238
- factors CAGRs, 239t
- financial statement items, description, 241–242
- freshest automotive factors summary statistics, 228t
- freshest data, usage, 224
- Gaussian processes, example, 238–239
- long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- ratios, usage, 242–243
- reporting delays, country ranking, 244
- stocks, holding (heatmap), 222f
- time averaged Spearman rank correlations, 236t–237t
- Automotive fundamental data, 205, 206–211
- book-to-market ranking, 214
- Chevrolet Cruze, country unit sales/registration, 210t
- equal weighted benchmarks, 219t
- IHS Markit databases, usage, 206–207
- indirect approach, 211–222
- information, examples, 217–218
- long portfolio, creation, 214
- non-quarterly reporting companies, examination, 218
- process, 209f
- production volume, mean percent, 208f
- sales volume, mean percent, 208f
- Stage 1 process, 213–215
- stages, 213–223
- stocks, ranking, 214
- strategies, CAGR ranking, 219t, 220t
- supporting statistics, 217
- Tesla, value, 218
- tradeable companies, long/short-portfolio sizes, 218t
- transaction costs, 217–218
- universe, assumption, 215
- Average percentage derivation, usage, 270t
- Azure clusters, 79
-
- Backtesting, usage/nonusage, 35–39
- Backtests, 54, 213
- Bagging, 67
- Bag-of-words, 94
- Bayesian principal component analysis (BPCA), 140
- Bayes theorem, usage, 69–70
- BeautifulSoup, 100
- Benchmark model (BM), 263
- Berners-Lee, Tim, 299
- Bias, 60–61, 61f
- Bidirectional Encoder Representations from Transformers (BERT), 95, 101–102
- Big Data, 9–11, 50
- Billion Prices Project, The, 321–322
- Binned R2 analyses, usage, 109
- Bitly, 324–325
- Blob-based feature detectors, 88f
- Blockchain technology, usage, 31
- Bloomberg News article count, S&P500 (contrast), 310f
- Book-to-market ratios, 123
- Brands, alternative data (association), 24f
- Brazil
- English attention, local content (comparison), 332f
- YoY retail sales, SpendingPulse Brazil retail sales YoY (contrast), 337f
- Buyers, data pricing perspective, 40–41
-
- Caffe, 80
- CAPEX, 38
- Capital, allocation (increase), 18
- Capital Asset Pricing Model (CAPM), 119–124
- Car counts, 271–277
- Carhart model, 124
- Car parks
- data, DINEOF imputation (example), 175f
- image, 175f–178f
- Carry-based factor model, 125
- Causality (machine learning assumption/limitation), 84–85
- Central bank intervention, modeling, 346–348
- Chevrolet Cruze, country unit sales/registration, 210t
- China
- GDP growth rate, PMI (contrast), 13f
- SpaceKnow satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
- China PMI
- China GDP QoQ, contrast, 13f
- manufacturing (surprises), consensus/SMI/hybrid (contrast), 280f
- manufacturing (measurement), satellite data (usage), 277–280
- Chinese yuan (CNY) intervention/official data (contrast), model estimates (comparison), 347, 348f
- Clairvoyance, 211–212
- impact, 213, 215–216
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
- Classification methods, 139–140
- CLS Group, establishment, 352
- Cluster-based outlier factor (CBLOF) algorithm, 188
- Clustering-based unsupervised machine learning techniques, 70–71
- Collective outliers, 184
- Commodity trading advisor (CTA), risk-adjusted returns, 18
- Common equity (equity), financial statement item, 241
- Company event study (pooled survey), case study, 249–252
- Company removal, example, 220
- Compounded annual growth rate (CAGR), 211–212, 217–222, 225, 229–230
- factors CAGRs, 239t
- production, 235
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- ranking, 219t
- Concept most common/average (CMC), 140, 142
- Consensus estimates (independent variable), 292
- Consumer Price inflation, measurement, 321–322
- Consumer receipts, 337–340
- Consumer transactions, 335
- Content, topic/sentiment identification, 93
- Continuous bag of words (CBOW), 95
- Convolutional neural networks (CNNs), 56, 69, 76, 272
- convolutional/flat layers, inclusion, 76f
- usage, 89–90
- CoreNLP, 101
- Corner feature detectors, 88f
- Corporate aircraft, takeover target visits, 297f
- Corporate data, 341
- Corporate jet location data, 296–298
- Corporate Sustainability Assessment (CSA), RobecoSAM creation, 129
- Corpus of Contemporary American English (COCA), 98–99
- Cost of Goods Sold (cogs), financial statement item, 241
- Cost value (CV), 34
- Cox regression models, 72
- Credit default swap (CDS), 136, 151
- data, usage, 111, 154–157
- time series data, clustering, 156f
- Credit transaction data, 336–337
- Critical line approach (Markowitz), 71
- Cross-sectional trading approach, time series trading approach (contrast), 126
- Cross-validation (CV), 61–63
- Crowdsourced data, 245
- case studies, 249–254
- contributors, hierarchy, 246f
- product, 247–249
- usage, 247
- Crowdsourcing analyst estimates survey, 255
- Crude oil production, OPEC ranking, 253f
- Crude oil supplies (tracking), shipping data (usage), 283–286
- Cryptocurrency price actions (understanding), Wikipedia (usage), 330
- CScores, 198–199
- Currency Composition of Official Foreign Exchange Reserve (COFER) data, 347f
- Currency crisis risk, quantification, 344–346
- Currency markets, central bank intervention (modeling), 346–348
- Currency pair, purchase/sale, 354
- Current liabilities (currliab), financial statement item, 241
- CUSIP standard, 52
-
- Daily flow returns, 355f
- Data
- aggregation, 57–58
- assets, 33, 105
- availability, 21–22
- bias, 21
- clarity, 111
- delayed data, usage, 224
- external consistency, 111
- external marketing value, 44–45
- free data, 20
- frequency, 20–21
- freshest data, usage, 224
- fusion, 52
- internal consistency, 111
- legal aspects, 47–50
- markets, 29–31
- mining, 124–126
- missing data, 51, 54
- monetary value, 31–35, 39–45
- onboarding, performing, 106, 110
- originality, 22
- outliers, treatment, 51
- points, distinction, 182
- preprocessing, performing, 106, 110–111
- pricing, perspective, 40–45
- protection laws, comparison, 48f
- quality, 21, 111
- science team, setup cost, 116f
- services, 30
- sources, entity identifiers (matching), 51
- strategies, evaluation, 35–39
- structuring, 55–56
- team (creation), big bang hiring strategy, 115
- test data generation, 154–157
- timeliness/completeness, 111
- transformation, stages, 9f
- underusage, 15
- uniqueness, 111
- unstructured data, conversion, 51
- upside sharing, external sales, 43–44
- usage, limitations, 49
- validity/veracity, 111
- values, 32–33, 136
- vendors, 116–117
- view, representation, 184
- Data-as-a-Service (DaaS), 117
- Data interpolation with empirical orthogonal functions (DINEOF), 152–153, 160–162
- Datasets
- identification, 107–108
- price, assignation, 27
- restricted information set, 86
- shift, types, 85
- time stamps, 110
- traditional datasets, 320–321
- usage, 186t, 269
- Debit card transaction data, 336–337
- Decision boundary, example, 68f
- Decision trees, 67
- Deep learning (DL), 72–80, 82–83
- defining, 77
- examples, 73–74
- high-level deep learning libraries, 79
- libraries, 77–80
- low-level deep learning libraries, 77–79
- middle-level deep learning libraries, 79
- usage, 89–90
- Deletion (missing data treatment), 137–138, 143
- Density-based techniques, 203
- Deterministic techniques, usage, 160–164
- Directionality factor, 82
- Direct prediction, 129–132
- Discretionary investors, 38–39
- Distance-based techniques, 202–203
- Diversification, factor investing benefit, 127
- Do not impute (DNI), 140
- Due diligence, performing, 105, 108
- Dutch auction, 43
- Dwell time, 288
-
- Earnings before interest and taxes (ebit), financial statement item, 241
- Earnings, car counts (contrast), 274f
- Earnings per share (EPS), 271–277
- estimation, mobile phone location data (usage), 291–295
- examples, 275f, 276f
- news/Twitter data, contrast, 294f
- regressing footfall, contrast, 294f
- EBIT and depreciation (opincome), financial statement item, 242
- EBIT-to-EV ratio, 216, 242
- Economic Sentiment Indicator (ESI), 260–262
- Economic theory, test, 220
- Economic value (EV), 35
- Edge feature detectors, 88f
- Efficient Market Hypothesis (EMH), 27
- Emerging Market (EM) currencies basket (trading), macro-economy attention (usage), 333f
- Emerging Market Foreign Exchange (EMFX), 323, 330–333
- Empirical orthogonal function (EOF), 160–164
- English auction, 43
- Enhanced Vegetation Index (EVI), 278
- Entities, identification, 93
- Entity identifiers, matching, 51
- Entity matching, 52–54
- Environmental Social Governance (ESG) factors, 128–129
- Equal weighted benchmark, long top 33% strategy excess returns (contrast/Pearson correlations), 233t–234t
- Equal weighted benchmarks, 219t
- Equities (trading), innovation measures (usage), 342–344
- Errors, types, 64
- Eurozone
- EUR/USD
- bid/ask spread, 356f, 357f
- daily abs net flow, 353f
- daily volume, 352f
- ON implied volatility, FOMC news volume (contrast), 317f
- index, EUR/USD fund flow score (contrast), 354f
- overnight volatility, 317f, 318f
- trading, intraday basis, 308f
- ON volatility levels, 317f
- Exhaust data, 7
- Expectation conditional maximization (ECM), 149
- Expectation maximization (EM) procedure, 143, 159–160
- Explorer VI satellite, 267, 268f
- Exponential MACD, 82
- Exports/lights/GDP, annual correlation, 270t
-
- Factor
- CAGRs, 239t
- correlations, 232–238
- factor-based strategies, 126–127
- generation, 224–225
- identification, 212
- modeling/forecasting, 212
- performance, 225–229
- removal, 224
- Factor investing, 119
- benefits, 127
- cost, reduction, 127
- usage, reasons, 126–127
- Factor models, 120–126
- approaches, 125–126
- definition, 120
- modeling sequences, examples, 130f–131f
- types, 121–122
- Fama-French 3-factor model, 123–124
- Fear gauge, 328
- Feature detection algorithms, properties, 89f
- Feature detectors, types, 88f
- Features/feature detection algorithm, 87–89
- Fed communications, 316–320
- Fed communications index
- categorical/continuous variables, mixture, 199
- CScores, histogram plot, 197f
- event types, 196f, 200f
- fields, tagging, 194
- input variables, usage, 199
- log(text length), histogram plot, 195f
- outlier detection, case study, 194
- rules-based approaches, 198
- speakers, talkativeness (ranking), 197f
- Federal Open Market Committee (FOMC), 111, 183, 194–198
- communications, availability, 319
- EUR/USD ON volatility levels, 317f
- meetings, 66, 295–296, 316
- news volume, EUR/USD ON implied volatility (contrast), 317f
- sentiment index, 320f
- stock market reaction forecast, Twitter data (usage), 308–309
- Feed forward neural networks, 75–76
- Financial markets
- alternative data, relationship, 6
- PMI, impact, 263–265
- Financial problems, modeling techniques (suggestions), 83t
- Financial ratios, usage, 129
- First-Price Sealed-Bid auction, 43
- Flat-fee models, 30
- Footfall
- regressing footfall, reported EPS (contrast), 294f
- reported EPS, contrast, 293f
- score (independent variable), 292
- Foreign Exchange (FX), 5, 341
- average crisis rates, 346f
- daily flow returns, 355f
- data, 6
- flow data, institutional FX flow data (relationship), 351–355
- spot returns, net flow (multiple regressions), 353f
- trading, machine-readable news (usage), 310–316
- trend returns, 355f
- trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
- volatility (understanding), machine-readable news (usage), 310–316
- Free data, presence, 20
- Freemium models, free services/value-added services (combination), 30
- Free services, value-added services (combination), 30
- Freshest automotive factors summary statistics, 228t
- Fundamental factor model, 121, 122
- Funds from operations (ffo), financial statement item, 241
- Fuzzy k-means clustering (FKMI), 140, 142
- FX Risk Tool (Oxford Economics), 345
-
- Gaussian distributions, 202
- Gaussian Finite Mixture Models, 185
- Gaussian mixture model (GMM), 143
- Gaussian processes (GPs), 80–82
- example, 238–239
- orthogonality/nonlinearity, 238
- representation, 81
- Gaussian Process Regression (GPR), 238
- GBP/USD intraday volatility, UK PMI Services (basis), 265f
- General Data Protection Regulation (GDPR), 47, 50, 287
- General partners, AUM ranking, 361f
- Generative adversarial neural networks (GANs), 63, 77
- Gensim, 101
- Geospatial Insight dataset, usage, 272
- GitHub, 79
- Glmnet, 72
- Global outliers, local outliers (contrast), 184
- Global Vectors for Word Representation (GloVe), 95
- Google
- Cloud Natural Language, 102
- Cloud Speech-to-Text, 102
- Domestic Trend, 325–326
- regressing Google domestic trend indices, 326f
- search volume, example, 326f
- Shock Sentiment, 326, 327f
- trends data, usage, 325–327
- Government data, 341
- Grapedata, 247–256
- Gross Domestic Product (GDP), 259
- exports/lights, annual correlation, 270t
- growth correlations, 262t
- proxying, 270
- release, 11
-
- Hang Seng index, share price (performance), 252f
- Happiness Sentiment Index, 304, 305f
- Hedonometer
- Heuristics-based approaches, 203–204
- Hierarchical clustering, 70–71
- Hierarchical density-based spatial clustering of applications with noise (HDBSCAN), 199
- High-capacity strategies, properties, 18
- High-frequency data, usage, 355–357
- High-level deep learning libraries, 79
- High-level neural network libraries, 79
- Histogram-based outlier score (HBOS), 198
- Histogram-based statistical outlier (HBOS) detector, 188
- Holding period, usage, 213
- Homoscedastic errors, 64
- HTML tags, removal, 300
- Hyperspace, contents, 89
-
- I/B/E/S dataset, 255
- Ignore missing (IM), 140
- IHS Markit (IHSM), 23, 259, 285
- databases, usage, 206–207
- data features, 243
- process, 209f
- Images
- classification, deep learning/CNNs (usage), 89–90
- features/feature detection algorithm, 87–89
- imaging tools, 91
- satellite image data, dataset augmentation, 90–91
- structuring, 87–91
- Imputation methods, 152
- multiple imputation (MI) methods, 157–160
- ranking, 143f
- values, computation, 148f
- Imputation metrics, 154
- Imputation-posterior (I-P) form, 158
- Imputation step (I-step), 158
- Imputation technique, classifiers (rank), 141f
- Index market, evolution, 127–128
- Indicator computation, car counts basis (steps), 272–273
- Indirect prediction, 129–132
- Induction learning methods, 140
- Industrial data, 341
- Information coefficient (IC), 217, 230
- Information ratios, 312
- Innovation measures, usage, 342–344
- Input dataset error rates, LERS new classification (usage), 146f–147f
- Institutional FX flow data, FX spot (relationship), 351–355
- Interest rate swaps (IRSs), 71
- Internal exhaust source, requirements, 24
- inventory, financial statement item, 241
- Investment
- capacity, increase, 127
- management constituents, phase identification, 16f
- strategy, 22, 105
- value, decay, 27–29
- Investopedia search data, usage, 328–329, 328f
- Investor Anxiety Index (IAI), 328
- usage, 329f
- Volatility Index (VIX), contrast, 328f, 329f
- Investors
- anxiety (measurement), Investopedia search data (usage), 328–329, 328f
- attention, 323–325
- discretionary investors, 38–39
- systematic investors, 36–38
- Isolation-Based Outliers, 204
- Isolation forest (ISO), 199
-
- Joint Organizations Data Initiative (JODI) Oil World Database, usage, 285
- JX Mobile III, 249
- launch, payment willingness, 251f
- test version, usage question, 250f
- JX PC III, monthly spending question, 251f
-
- KDB, usage, 110
- Keras, 79
- Kernel Density Estimation, 185
- Kernels, usage, 81
- Kernel trick, example, 69f
- Kingsoft, 250
- share price, performance, 252f
- survey, questions, 256–257
- k-means (K-means), 70, 143, 198
- k-means clustering imputation (KMI), 140, 142, 149
- k-nearest neighbors (KNN) (KNNI), 140, 147–149, 187, 199
- regression/classification, 149
- usage, 143
- Kriging, 80–81
-
- Lasagne, 79
- Latency arbitrage, 307
- Latent Dirichlet Allocation (LDA), 97
- Latent semantic analysis (LSA), 97
- Lazy learning, 139, 140
- Learning from Examples based on Rough Sets (LERS), 146–147
- LERS new classification, usage, 146f–147f
- Licensees, number, 35
- Lights/exports/GDP, annual correlation, 270t
- Light vehicle production (IHS Markit database), 206
- Light vehicle sales (IHS Markit database), 207
- Linear regression (LR), 64–65, 84, 87, 348
- neural network function, 73, 73f
- visualization, 65f
- Liquidity
- social media, impact, 309
- understanding, high-frequency data (usage), 355–357
- Local least squares imputation (LLSI), 140
- Locally linear reconstruction (LLR), 149
- Local outlier factor (LOF), 187, 203
- score visualization, example, 187f
- Local outliers, global outliers (contrast), 184
- Location data, 283
- Logistic regression, 65–67, 82
- neural network function, visualization, 74f
- single class logistic regression, neural network function, 73
- visualization, 66f
- Log-likelihood, 159
- Long-only portfolios (derivation), visa/patent data (usage), 343f
- in-sample/out-of-sample, 344f
- Long portfolio, creation, 214
- Long short-term memory (LSTM), 76, 87
- Long threshold, usage, 213
- Long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
- Low-capacity strategies, 18
- Low-level deep learning libraries, 77–79
- Low-level neural network libraries, 77–79
-
- Machine learning (ML), 4
- algorithms, calibration, 61
- bias/variance/noise, 60–61
- clustering-based unsupervised machine learning techniques, 70–71
- cross-validation (CV), 61–62
- deep learning, 72–80
- definitions, 60
- examination, 62–63
- fit, expected error (equation), 60
- Gaussian processes (GPs), 80–82
- libraries, 71–72
- neural networks, 72–80
- procedures, usage, 143
- processing layers, involvement, 40
- reinforcement learning, 63
- supervised learning, 62
- supervised machine learning techniques, 64–70
- techniques, 59, 60, 82–87
- unsupervised learning, 63
- unsupervised machine learning techniques, 71
- Machine-readable news, usage, 310–316
- Macro data, forecasting, 129–130
- Macroeconomic factor model, 121, 122
- Macro-economy attention, usage, 333f
- Malls, visits (comparison), 290f
- Market data, 351
- Market participants, alternative data usage, 6
- Market themes (measurement), Google trends data (usage), 325–327
- Market value (MV), 34–35
- Markov Chain Monte Carlo sampling, 157
- Marks & Spencer, car count/earnings (contrast), 274f
- Material non-public information (MNPI), 49, 109
- Matlab ports, 72
- Matplotlib, 91
- Matrix factorization, 162–166
- Maturing alternative datasets, advantages, 45–46
- Maximization step (M-step), 159
- Maximum likelihood estimation (MLE), 159
- Mean absolute percentage error (MAPE), 154
- Mean quintile gap (MQG), 217, 225, 230
- Mean relative deviation (MRD), 154
- metrics, summary statistics, 166t–167t
- Mergers and acquisitions (M&As), 296–298
- Metadata
- addition, 93
- identification, 300
- Micro-clusters, 184
- Middle-level deep learning libraries, 79
- Middle-level neural network libraries, 79
- Misclassification error rate, examples, 145f–146f
- MissForest: Random Forest imputation, 180
- Missing at Random (MAR), 137
- Missing Completely at Random (MCAR), 136, 137, 148, 155–157
- Missing data, 54, 135
- case studies, 151
- classification, 136–138, 143
- classifier design, deletion, 143
- deletion, 137–138
- distinctions, 136–137
- fraction, usage, 153
- imputation/estimation, 143
- inclusion, 144f
- incomplete cases, deletion, 143
- misclassification error rate, 145f–146f
- predictive imputation, 138
- replacement, 138
- Missing data treatments, 51, 137–138
- Farhangfar et al. perspective, 148
- Garcia-Laencina et al. perspective, 143–146
- Grzymala-Busse et al. perspective, 146–147
- Jerez et al. perspective, 147–148
- Kang et al. perspective, 149
- literature overview, 139–149
- Luengo et al. perspective, 139–143
- Zou et al. perspective, 147
- Missingness patterns
- imposition, example, 164f
- occurrence, number (histogram), 156f
- Missing Not at Random (MNAR), 137
- Missing values
- consecutive missing values, length statistics (usage), 153
- total fraction, usage, 153
- Mixture of Gaussians (MoG), 149
- Mobile phone location data
- independent variables, 292
- usage, 287–295
- Model backtesting, 213
- Model-based nowcast, 307
- Model-based procedures, usage, 143
- Model-based techniques, 202
- Model forecasts, comparison, 270t
- Monopoly, impact, 42–43
- Montreal Institute for Learning Algorithms (MILA), Theano development (cessation), 78
- Multicollinearity, presence, 64
- Multi-layer perceptron (MLP), 140, 143
- hidden layer, inclusion, 75f
- neural network, 75
- Multiple imputation (MI) methods, 137, 138, 148, 157–160
- Multiple imputation with chained equations (MICE), 153, 157
- imputed time series, 168f
- package, norm, 158
- procedure, description, 178–179
- usage, 179
- Multiple singular spectral analysis (MSSA), 152–153, 162–164
- imputation, example, 170f
- imputed time series, example, 173f
- usage, 180
- Multi-task learning (MTL), 143
- Multivariate credit default swap time series
- CDS data, 154–157
- deterministic techniques, 160–164
- EOF-based techniques, 160–164
- imputation metrics, 154
- missing data classification, 153–154
- missing values, imputing, 152
- MRD metrics, summary statistics, 166t–167t
- results, 164–173
- test data generation, 154–157
- Multi-variate normal (MVN)
- MXNet (Apache), 79
-
- Naïve Bayes (NB), 69–70, 140
- Named entity recognition, 92–93
- Natural language processing (NLP), 55, 78, 91–102
- challenges, 97–98
- defining, 91–93
- languages/texts, differences, 98–99
- normalization, 93–94
- speech, involvement, 99–100
- tasks, classification problem, 96
- tools, 100–102
- word embeddings, creation, 94–96
- NDAs, negotiation, 30
- Negative change, ratios, 242–243
- Net flow, spot returns (multiple regressions), 353f
- Net income (netincome), financial statement item, 242
- Net-Income-to-EV ratio, 216, 242
- Neural networks (NNs), 72–80, 184
- examples, 73–74
- frameworks, 79–80
- high-level neural network libraries, 79
- libraries, 77–80
- low-level neural network libraries, 77–79
- middle-level neural network libraries, 79
- types, 75–77
- News, 309–320
- articles per ticker, average daily count, 311f
- Bloomberg News article count, S&P500 (contrast), 310f
- trend correlation, contrast, 313f
- trend information ratio, contrast, 313f
- trend model returns, contrast, 314f
- trend model YoY returns, contrast, 314f
- News data, 299
- reported EPS, contrast, 294f
- newspaper3k, 101
- News score (independent variable), 292
- New York Fed meetings, 295–296
- NLTK, 101
- No-free-lunch (NFL) theorems, 82
- Noise, 60–61, 88, 182
- Nonfarm payrolls (NFPs), US change
- ADP private payroll change, contrast, 45f
- Twitter-based forecast, actual release/Bloomberg consensus survey (contrast), 307f
- Twitter data, usage, 305
- Nonfarm payrolls (surprise), USD/JPY 1-minute move (contrast), 306f
- Non-negative matrix factorization (NMF), 97
- Non-problems, modeling techniques (suggestions), 83t
- Non-quarterly reporting companies, examination, 218
- Non-stationarity (machine learning assumption/limitation), 85–86
- Norges Bank Investment Management, 128
- Normalization, 93–94
- Normalized Difference Vegetation Index (NDVI), 278
- Normally distributed errors, 64
- Normal neighborhood, selection (difficulties), 190f
- Nowcasting
- Eurozone (EZ) GDP growth, 260f
- GDP growth, 262–263
- NumPy, 77, 80
-
- Official Foreign Exchange Reserve, currency composition (COFER data), 347f
- Oil and gas production (Q&A survey), case study, 252–254
- Oil prices/supply changes, contrast, 254f
- One-class SVM, 188
- OPEC, 252
- crude oil production estimates, 253f
- oil supply changes, oil prices changes (contrast), 254f
- OpenCV, 91
- OpenSky dataset, usage, 297
- OPEX, 38
- Optical Character Recognition (OCR), 55
- Original Equipment Manufacturers (OEM), decision-making, 206–207
- Outliers
- anomalies, 181
- definition/classification, 182–183
- flagging, 200f
- global outliers, local outliers (contrast), 184
- local outlier factor (LOF), 187
- temporal structure, 183
- treatment, 51, 56–57
- Outliers, detection
- algorithms, comparative evaluation, 185–188, 186t
- approaches, 182–183
- case study, 194
- density-based techniques, 203
- distance-based techniques, 202–203
- heuristics-based approaches, 203–204
- model-based techniques, 202
- problem, setup, 184–185
- techniques, 57
- unsupervised ML techniques, usage, 198–199
- Outliers, explanations
- Angiulli et al. explanation, 192–193
- approaches, 189–193
- Duan et al. explanation, 191–192
- Micenkova et al. explanation, 189–190
- rank statistic, usage (problem), 191f
-
- Packaging models, 30
- Pandas, 80
- Passive investing, 127
- Passive strategies, 126
- pattern, 101
- Pattern classification methods, missing data (inclusion), 144f
- Pay-per-use models, 40
- Payroll readership, usage, 323–325
- PDFMiner, 101
- Percentage ratios, 243
- Personal data, definition, 47
- Pillow, 91
- Point anomalies, 184
- Point-of-sale (POS) devices, usage, 338
- Poisson regression models, 72
- Pooled surveys
- company event study (pooled survey), case study, 249–252
- usage, 247
- Portfolio, effects, 22
- Posterior step (P-step), 158
- Predicted R squared coefficient, true R squared coefficient (contrast), 154
- Predictive imputation (missing data treatment), 138
- Predictive mean matching (PMM), 158, 165
- Pricing
- discriminatory pricing mechanisms, 42f
- equation, 40
- Principal component analysis (PCA), 71, 76
- Principal Component Regression (PCR), 238
- Private equity
- Private firms, performance (understanding), 363
- Private markets, alternative data, 359
- Probabilistic Graphical Model (PGM), example, 130f
- Processed data (data transformation stage), 9f
- Process expense, 34
- Processing level, 21
- Processing libraries, 80
- prod_volume_prev_1m_pct_change_prev_2m_mean, 232, 235, 238
- Proof-of-concept (POC), 106, 112
- Pseudo-time, basis, 262
- Publishing lag, 21
- Purchasing Managers Indexes (PMI), 259
- China PMI, China GDP QoQ (contrast), 13f
- impact, 263–265
- indicators, appropriateness, 108
- manufacturing (measurement), satellite data (usage), 277–280
- performance, 261–262
- release, 11–12
- US GDP growth rate, contrast, 12f
- Python ports, 72
- PyTorch, 78
-
- Q&A surveys
- oil and gas production (Q&A survey), case study, 252–254
- usage, 247
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
- Q_pct_delta_ffo, stocks holding (heatmap), 222f
- Quarterly benchmark
- Q_pct_delta_ffo returns plot, contrast, 222f
- revenues_sales_prev_3m_sum_prev_1m_pct_change, contrast, 230f
- usa_sales_volume_prev_12m_sum_prev_3m_pct_change returns plot, contrast, 232f
- ww_market_share_prev_1m_pct_change returns plot, contrast, 231f
-
- Radial basis function network (RBFN), 140
- Random forest (RF), 67–68, 184
- Ranking factor, usage, 213, 216–217
- Raw data (data transformation stage), 9f
- Real estate investment trust (REIT) ETF (trading), mobile phone location data (usage), 288–291
- Rectified Linear Unit (RELU), 90
- Recurrent neural networks (RNNs), 76, 87, 143
- Regressing consensus, 275f
- estimates/footfall, reported EPS (contrast), 293f
- Regressing footfall, reported EPS (contrast), 294f
- Regressing Google domestic trend indices, 326f
- Regressing news
- sentiment, 276f
- volume, 1M implied volatility (contrast), 315f
- Regression, 62
- linear regression, 64–65
- logistic regression, 65–66
- models, 72
- softmax regression, 67
- Reinforcement learning, 63
- Replacement (missing data treatment), 138
- Reported EPS, regressing consensus estimates/footfall (contrast), 293f
- Reports, usage, 247
- Research cost, 21
- REST API, usage, 102
- Restricted information set, 86
- Retail activity (understanding), mobile phone location data (usage), 287–295
- Retailers, car counts/EPS, 271–277
- Returns, sensitivity, 18
- Revenue, maximization (equation), 42
- revenues_sales_prev_3m_sum_prev_1m_pct_change, 232, 238
- quarterly sales volume, monthly change, 229–230
- quintile CAGR, 230f
- returns plot, quarterly benchmark (contrast), 230f
- RIPPER, 148
- Risk-adjusted returns, 312
- Risk managers, 39
- Risk metrics, 39
- Risks, pre-assessment, 106, 109
- Risk tolerance levels, 36
- Root mean square error (RMSE), 154
- Root Mean Square Forecasting Errors (RMSFE), 2643
- Ross, Stephen, 122
- R squared coefficient, differences, 154
- RSSA, 180
- Rule induction learning, 139
-
- Sales/revenue (sales), financial statement item, 242
- Sales-to-EV ratio, 216, 242
- sales_volume_prev_1m_pct_change_prev_2m_mean, 232, 235
- Satellite data, usage, 277–280
- Satellite images
- aerial photography, 267
- analysis, process (steps), 174
- case study, 173
- data, dataset augmentation, 90–91
- Satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
- scikit-image, 77, 91
- scikit-learn, 59, 71–72, 77, 101
- SciPy, 77, 80
- SciPy.ndimage, 91
- Scrapy, 100
- Search volume, example, 326f
- Self-organizing map (SOM), 143
- Sellers, data pricing perspective, 41–45
- Semi-supervised anomaly detection, 57
- Sentiment
- analysis, classification problem, 96
- identification, 93
- social media, impact, 309
- Sequential minimal optimization (SMO), 140
- Sharpe ratio, 128, 212, 239
- Shipping data, usage, 283–286
- Short threshold, usage, 213
- Shure, Sennheiser (MoM) (Amazon spend comparison), 339f
- Signals
- data transformation stage, 9f
- existence, pre-assessment, 106, 109–110
- extraction, performing, 106, 111–112
- SimpleCV, 91
- Simple moving average (SMA), application, 331–332
- Single class logistic regression, neural network function, 73
- Singular spectral analysis (SSA), 162
- Singular value decomposition (SVD), 71, 160
- Singular value decomposition imputation (SVDI), 140
- sinkr, 180
- Siri, usage, 99–100
- Smart beta indices, alternative data inputs (usage), 127–128
- Social media, 300–309
- data, 299
- Hedonometer index, 302–305
- Social Media Analytics, 301
- Soft data, 260
- Softmax regression, 67
- neural network function, 73–74, 74f
- Software libraries, usage, 179–180
- spaCy, 101
- Spearman correlation, 235
- Speech, involvement, 99–100
- SpeechRecognition, 102
- SpendingPulse Brazil retail sales YoY, Brazil YoY retail sales (contrast), 337f
- SpendingPulse index (MasterCard), 337
- Spot returns, net flow (multiple regressions), 353f
- Standard and Poor's 500 (S&P500), 77, 265
- Bloomberg News article count, contrast, 310f
- Google Shock Sentiment, contrast, 327f
- Google Shock Sentiment scatter, contrast, 327f
- Happiness Sentiment Index, contrast, 305f
- trading, IAI/VIX (usage), 329f
- Statistical factor model, 121–122
- Stochastic discount factor, definition/equation, 41
- Stocks
- exchanges, stock ranking, 211
- heatmap, 222f
- market reaction (forecast), Twitter data (usage), 308–309
- Stocktwits data/sentiment factor, 82, 309
- Strategic risks, impact, 11
- Strategy
- capacity, 16–19
- data transformation stage, 9f
- high-capacity strategies, properties, 18
- investment strategy, time frequency, 22
- loss making, 18
- low-capacity strategies, 18
- setup, 105, 106–107
- Stride, 90
- Structuring level, 21
- Subject matter experts (SMEs), impact/usage, 107, 112
- Supervised anomaly detection, 57
- Supervised learning, 62
- Supervised machine learning techniques, 64–70
- Support vector machine (SVM) (SVMI), 68–69, 140, 142, 148, 162, 184
- Survey data, 245
- alternative data use, 245–247
- case studies, 249–254
- contributors, hierarchy, 246f
- product, 247–249
- usage, 247
- Surveys
- crowdsourcing analyst estimates survey, 255
- process, 249f
- technical considerations, 254–255
- timeline, example, 248f
- Synthetic 2D data, DINEOF imputation (example), 161f
- Systematic investors, 36–38
-
- tabula-py, 101
- Taxi ride data, 295–296
- Technology, score, 22
- TensorFlow, 59, 77–79, 101
- Tesla, value, 218
- TextBlob, 101
- Text data, 299
- TF-IDF, 94
- TF Learn, 79
- Thasos Mall Foot Traffic Index, 288
- YoY, US retail sales YoY (contrast), 289f
- Theano, 77, 78
- The many, Big Data (contrast), 9–11
- Tickers, usage (change), 18
- TickerTags, 98
- Time
- frame, impact, 217
- removal, 219
- Time averaged Spearman rank correlations, 236t–237t
- Time series data, examples, 164f, 171f
- Time series trading approach, cross-sectional trading approach (contrast), 126
- Topic
- identification, 93
- modeling, 96–97
- Total assets (totassets), financial statement item, 242
- Tradeable companies, long/short-portfolio sizes, 218t
- Transaction cost analysis (TCA), 356
- Transaction costs, 217–218
- impact, 18f
- increase, impact, 18
- Transaction time, 54
- Trend returns, 355f
- Trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
- Trial availability, 22
- True R squared coefficient, predicted R squared coefficient (contrast), 154
- t-statistics, report, 314
- Turkey PVIX indicator, USD/TRY 1M implied volatility (contrast), 331f
- Twitter data, 309
- reported EPS, contrast, 294f
- usage, 305–308
- Twitter mood data, usage, 15
- Twitter score (independent variable), 292
- Two-part-tariff models, 31
-
- Unstructured data, conversion, 51
- Unsupervised anomaly detection, 57
- Unsupervised learning, 63
- Unsupervised ML techniques, usage, 198–199
- usa_sales_volume_prev_12m_sum_prev_3m_pct_change, 229, 232, 238
- quintile CAGR, 232f
- returns plot, quarterly benchmark (contrast), 232f
- yearly US sales volume, quarterly change, 231
- USD/JPY
- bid/ask spread, 357f
- news sentiment score, weekly returns (contrast), 312f
- news volume, 1M implied volatility (contrast), 315f
- trading, intraday basis, 308f
- USD/TRY 1M implied volatility, Turkey PVIX indicator (contrast), 331f
- US employment report, payrolls clicks, 324f
- US export growth, forecasting, 269–271
- US GDP growth rate, PMI (contrast), 12f
- US/global new vehicle registration/sales (IHS Markit database), 207
- US ISM, US GDP QoQ (contrast), 12f
- US retail sales YoY, Thasos Foot Traffic Index YoY (contrast), 289f
- UST 10Y yield changes, 320f
- US Treasury yields, 316–320
-
- Value-at-Risk (VaR), enhancement, 346
- Value-of-information, 199
- Variance, 60–61
- bias, balance, 61f
- cause, 60
- Vendors
- due diligence, performing, 105, 108
- identification, 23–24
- monopoly, impact, 42–43
- Venture capital
- firms, defining, 360
- transactions, database charting, 362
- Vickrey auction, 43
- Viscosity factor, 82
- Vision, setup, 105, 106–107
- Volatility Index (VIX), 63, 77
- Investor Anxiety Index (IAI), contrast, 328f, 329f
- usage, 329f
- Volume, Variety, Velocity, Variability, Veracity, Validity, Value (Big Data characteristics), 9–10
-
- Walmart, earnings per share (consensus/footfall contrast), 291f
- Web data, 299
- collection, 299–300
- search volume example, 326f
- Web page
- body text, capture, 300
- content, downloading, 300
- time stamp, assignation, 300
- Web scraping, usage, 25f, 300
- Web sources, 320–322
- Weighted k-NN (WKNNI), 140
- Wikipedia, usage, 330
- Word2vec, 94–96
- Word embeddings, creation, 94–96
- Words, frequency (example), 99f
- Word tokenization/segmentation, usage, 92
- Wrappers, writing, 112
- ww_market_share_prev_1m_pct_change, 229, 232, 235
- quintile CAGR, 231f
- returns plot, quarterly benchmark (contrast), 231f
- worldwide market shares, monthly change, 230
-
- XLNet, 96
- XRT, returns/trading (Thasos Mall Foot Traffic index basis), 288–289, 290f