Index

1M implied volatility
- regressing news volume, contrast, 315f
- USD/JPY news volume, contrast, 315f
50-day SMA, 82

A/B test, 35
accounts payable (accpayable), financial statement item, 241
ADP private payroll change, nonfarm payrolls (US change) (contrast), 45f
ALFRED time series, 306
Algorithms
- feature detection algorithms, properties, 89f
- features/feature detection algorithm, 87–89
- selection decision, 86–87
Alpha, 45–46
- capture data, 256
Alternative data, 3
- adoption curve, 16f
- brands, association, 24f
- buy side total spend, 25f
- capacity, 16–19
- characteristics, 6
- collection, cost, 6
- defining, 5–7
- dimensions, 19–23
- forecasts, 339f
- history, shortness, 6
- inputs, usage, 127–128
- maintenance process, 113–114
- process, 105–114
- reasons, 11–14
- risks/challenges, 47
- segmentation, 7–9, 8t
- strategies, 35–39, 229t
- survey data, comparison, 245–247
- team usage, structuring, 114–115
- usage, 6, 14f, 50–57
- use processes, 105
- users, identification, 15–16
- value, 27
- vendors, identification, 23–24
Alternative datasets
- buy side usage, 24–26
- commercial release frequency, 23f
- derivation, web scraping (usage), 25f
- maturing alternative datasets, advantages, 45–46
- usage, 226t–227t
Amazon
- Comprehend, 102
- Mechanical Turk, 53
- revenue, actual revenue changes (contrast/alternative data forecasts), 339f
Amazon Web Services (AWS), 79, 338
Amelia II (CRAN), 147
- usage, 180
Amelia imputed time series, examples, 168f, 172f
Amelia+MSSA, 165, 167
Angle-Based Outlier Factor (ABOF), 203–204
Anomalies, 181
- point anomalies, 184
Apache MXNet, 79
APIs, usage, 30, 71, 79, 91, 112
Approximate methods, rank, 142f
Approximate models, 139, 140
Arbitrage pricing theory (APT), 122–123
Area under the ROC curve (AUC), mean/standard deviation/MSE values, 148f
Articles per ticker, average daily count, 311f
Artificial Intelligence (AI), 4
Asian Crisis (1997), 86
Asset class
- breadth/depth, 20
- coverage, 20
- relevance, 19–20
Asset price, signals, 29
Assets under management (AUM), 16, 360
- ranking, 361f
Auctions
- types, 43
- usage, 42
Auto-associative neural network (AANN), 143
Auto-correlation, absence, 64
Autoencoders, 71
- neural networks, 76–77
Automated identification system (AIS)
- crude oil exports, comparison, 286f
- data, collection, 285
- transmitters/messages, usage, 284
Automatic Identification System (AIS), 117
Automotive company data
- alternative data strategies, CAGR basis, 229t
- company list, 240
- core factors, 223
  - aggregation, 224
- delayed data, usage, 224
- direct approach, 223–238
- factors, 226t–227t, 229–238
- factors CAGRs, 239t
- financial statement items, description, 241–242
- freshest automotive factors summary statistics, 228t
- freshest data, usage, 224
- Gaussian processes, example, 238–239
- long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- ratios, usage, 242–243
- reporting delays, country ranking, 244
- stocks, holding (heatmap), 222f
- time averaged Spearman rank correlations, 236t–237t
Automotive fundamental data, 205, 206–211
- book-to-market ranking, 214
- Chevrolet Cruze, country unit sales/registration, 210t
- equal weighted benchmarks, 219t
- IHS Markit databases, usage, 206–207
- indirect approach, 211–222
- information, examples, 217–218
- long portfolio, creation, 214
- non-quarterly reporting companies, examination, 218
- process, 209f
- production volume, mean percent, 208f
- sales volume, mean percent, 208f
- Stage 1 process, 213–215
- stages, 213–223
- stocks, ranking, 214
- strategies, CAGR ranking, 219t, 220t
- supporting statistics, 217
- Tesla, value, 218
- tradeable companies, long/short-portfolio sizes, 218t
- transaction costs, 217–218
- universe, assumption, 215
Average percentage derivation, usage, 270t
Azure clusters, 79

Backtesting, usage/nonusage, 35–39
Backtests, 54, 213
Bagging, 67
Bag-of-words, 94
Bayesian principal component analysis (BPCA), 140
Bayes theorem, usage, 69–70
BeautifulSoup, 100
Benchmark model (BM), 263
Berners-Lee, Tim, 299
Bias, 60–61, 61f
Bidirectional Encoder Representations from Transformers (BERT), 95, 101–102
Big Data, 9–11, 50
Billion Prices Project, The, 321–322
Binned R² analyses, usage, 109
Bitly, 324–325
Blob-based feature detectors, 88f
Blockchain technology, usage, 31
Bloomberg News article count, S&P500 (contrast), 310f
Book-to-market ratios, 123
Brands, alternative data (association), 24f
Brazil
- English attention, local content (comparison), 332f
- YoY retail sales, SpendingPulse Brazil retail sales YoY (contrast), 337f
Buyers, data pricing perspective, 40–41

Caffe, 80
CAPEX, 38
Capital, allocation (increase), 18
Capital Asset Pricing Model (CAPM), 119–124
Car counts, 271–277
- basis, steps, 272–273
- data, 275f, 276f
- earnings, contrast, 274f
Carhart model, 124
Car parks
- data, DINEOF imputation (example), 175f
- image, 175f–178f
Carry-based factor model, 125
Causality (machine learning assumption/limitation), 84–85
Central bank intervention, modeling, 346–348
Chevrolet Cruze, country unit sales/registration, 210t
China
- GDP growth rate, PMI (contrast), 13f
- SpaceKnow satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
China PMI
- China GDP QoQ, contrast, 13f
- manufacturing (surprises), consensus/SMI/hybrid (contrast), 280f
- manufacturing (measurement), satellite data (usage), 277–280
Chinese yuan (CNY) intervention/official data (contrast), model estimates (comparison), 347, 348f
Clairvoyance, 211–212
- impact, 213, 215–216
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
Classification methods, 139–140
CLS Group, establishment, 352
Cluster-based outlier factor (CBLOF) algorithm, 188
Clustering-based unsupervised machine learning techniques, 70–71
Collective outliers, 184
Commodity trading advisor (CTA), risk-adjusted returns, 18
Common equity (equity), financial statement item, 241
Company event study (pooled survey), case study, 249–252
Company removal, example, 220
Compounded annual growth rate (CAGR), 211–212, 217–222, 225, 229–230
- factors CAGRs, 239t
- production, 235
- Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
- ranking, 219t
Concept most common/average (CMC), 140, 142
Consensus estimates (independent variable), 292
Consumer Price inflation, measurement, 321–322
Consumer receipts, 337–340
Consumer transactions, 335
Content, topic/sentiment identification, 93
Continuous bag of words (CBOW), 95
Convolutional neural networks (CNNs), 56, 69, 76, 272
- convolutional/flat layers, inclusion, 76f
- usage, 89–90
CoreNLP, 101
Corner feature detectors, 88f
Corporate aircraft, takeover target visits, 297f
Corporate data, 341
Corporate jet location data, 296–298
Corporate Sustainability Assessment (CSA), RobecoSAM creation, 129
Corpus of Contemporary American (COCA) English, 98–99
Cost of Goods Sold (cogs), financial statement item, 241
Cost value (CV), 34
Cox regression models, 72
Credit default swap (CDS), 136, 151
- data, usage, 111, 154–157
- time series data, clustering, 156f
Credit transaction data, 336–337
Critical line approach (Markowitz), 71
Cross-sectional trading approach, time series trading approach (contrast), 126
Cross-validation (CV), 61–63
Crowdsourced data, 245
- case studies, 249–254
- contributors, hierarchy, 246f
- product, 247–249
- usage, 247
Crowdsourcing analyst estimates survey, 255
Crude oil production, OPEC ranking, 253f
Crude oil supplies (tracking), shipping data (usage), 283–286
Cryptocurrency price actions (understanding), Wikipedia (usage), 330
CScores, 198–199
- histogram plot, 197f
Currency Composition of Official Foreign Exchange Reserve (COFER) data, 347f
Currency crisis risk, quantification, 344–346
Currency markets, central bank intervention (modeling), 346–348
Currency pair, purchase/sale, 354
Current liabilities (currliab), financial statement item, 241
CUSIP standard, 52

Daily flow returns, 355f
Data
- aggregation, 57–58
- assets, 33, 105
- availability, 21–22
- bias, 21
- clarity, 111
- delayed data, usage, 224
- external consistency, 111
- external marketing value, 44–45
- free data, 20
- frequency, 20–21
- freshest data, usage, 224
- fusion, 52
- internal consistency, 111
- legal aspects, 47–50
- markets, 29–31
- mining, 124–126
- missing data, 51, 54
- monetary value, 31–35, 39–45
- onboarding, performing, 106, 110
- originality, 22
- outliers, treatment, 51
- points, distinction, 182
- preprocessing, performing, 106, 110–111
- pricing, perspective, 40–45
- protection laws, comparison, 48f
- quality, 21, 111
- science team, setup cost, 116f
- services, 30
- sources, entity identifiers (matching), 51
- strategies, evaluation, 35–39
- structuring, 55–56
- team (creation), big bang hiring strategy, 115
- test data generation, 154–157
- timeliness/completeness, 111
- transformation, stages, 9f
- underusage, 15
- uniqueness, 111
- unstructured data, conversion, 51
- upside sharing, external sales, 43–44
- usage, limitations, 49
- validity/veracity, 111
- values, 32–33, 136
- vendors, 116–117
- view, representation, 184
Data-as-a-Service (DaaS), 117
Data interpolation with empirical orthogonal functions (DINEOF), 152–153, 160–162
- application, 174
- imputation, 161f, 170f, 173, 175f
- usage, 180
Datasets
- identification, 107–108
- price, assignation, 27
- restricted information set, 86
- shift, types, 85
- time stamps, 110
- traditional datasets, 320–321
- usage, 186t, 269
Debit card transaction data, 336–337
Decision boundary, example, 68f
Decision trees, 67
Deep learning (DL), 72–80, 82–83
- defining, 77
- examples, 73–74
- high-level deep learning libraries, 79
- libraries, 77–80
- low-level deep learning libraries, 77–79
- middle-level deep learning libraries, 79
- usage, 89–90
Deletion (missing data treatment), 137–138, 143
Density-based techniques, 203
Deterministic techniques, usage, 160–164
Directionality factor, 82
Direct prediction, 129–132
Discretionary investors, 38–39
Distance-based techniques, 202–203
Diversification, factor investing benefit, 127
Do not impute (DNI), 140
Due diligence, performing, 105, 108
Dutch auction, 43
Dwell time, 288

Earnings before interest and taxes (ebit), financial statement item, 241
Earnings, car counts (contrast), 274f
Earnings per share (EPS), 271–277
- estimation, mobile phone location data (usage), 291–295
- examples, 275f, 276f
- news/Twitter data, contrast, 294f
- regressing footfall, contrast, 294f
EBIT and depreciation (opincome), financial statement item, 242
EBIT-to-EV ratio, 216, 242
Economic Sentiment Indicator (ESI), 260–262
Economic theory, test, 220
Economic value (EV), 35
Edge feature detectors, 88f
Efficient Market Hypothesis (EMH), 27
Emerging Market (EM) currencies basket (trading), macro-economy attention (usage), 333f
Emerging Market Foreign Exchange (EMFX), 323, 330–333
Empirical orthogonal function (EOF), 160–164
English auction, 43
Enhanced Vegetation Index (EVI), 278
Entities, identification, 93
Entity identifiers, matching, 51
Entity matching, 52–54
Environmental Social Governance (ESG) factors, 128–129
Equal weighted benchmark, long top 33% strategy excess returns (contrast/Pearson correlations), 233t–234t
Equal weighted benchmarks, 219t
Equities (trading), innovation measures (usage), 342–344
Errors, types, 64
Eurozone
- Composite PMI, 261, 261f
- GDP, 261f
- model performance, 263t
EUR/USD
- bid/ask spread, 356f, 357f
- daily abs net flow, 353f
- daily volume, 352f
- ON implied volatility, FOMC news volume (contrast), 317f
- index, EUR/USD fund flow score (contrast), 354f
- overnight volatility, 317f, 318f
- trading, intraday basis, 308f
- ON volatility levels, 317f
Exhaust data, 7
Expectation conditional maximization (ECM), 149
Expectation maximization (EM) procedure, 143, 159–160
Explorer VI satellite, 267, 268f
Exponential MACD, 82
Exports/lights/GDP, annual correlation, 270t

Factor
- CAGRs, 239t
- correlations, 232–238
- factor-based strategies, 126–127
- generation, 224–225
- identification, 212
- modeling/forecasting, 212
- performance, 225–229
- removal, 224
Factor investing, 119
- benefits, 127
- cost, reduction, 127
- usage, reasons, 126–127
Factor models, 120–126
- approaches, 125–126
- definition, 120
- modeling sequences, examples, 130f–131f
- types, 121–122
Fama-French 3-factor model, 123–124
Fear gauge, 328
Feature detection algorithms, properties, 89f
Feature detectors, types, 88f
Features/feature detection algorithm, 87–89
Fed communications, 316–320
Fed communications index
- categorical/continuous variables, mixture, 199
- CScores, histogram plot, 197f
- event types, 196f, 200f
- fields, tagging, 194
- input variables, usage, 199
- log(text length), histogram plot, 195f
- outlier detection, case study, 194
- rules-based approaches, 198
- speakers, talkativeness (ranking), 197f
Federal Open Market Committee (FOMC), 111, 183, 194–198
- communications, availability, 319
- EUR/USD ON volatility levels, 317f
- meetings, 66, 295–296, 316
- news volume, EUR/USD ON implied volatility (contrast), 317f
- sentiment index, 320f
- stock market reaction forecast, Twitter data (usage), 308–309
Feed forward neural networks, 75–76
Financial markets
- alternative data, relationship, 6
- PMI, impact, 263–265
Financial problems, modeling techniques (suggestions), 83t
Financial ratios, usage, 129
First-Price Sealed-Bid auction, 43
Flat-fee models, 30
Footfall
- regressing footfall, reported EPS (contrast), 294f
- reported EPS, contrast, 293f
- score (independent variable), 292
Foreign Exchange (FX), 5, 341
- average crisis rates, 346f
- daily flow returns, 355f
- data, 6
- flow data, institutional FX flow data (relationship), 351–355
- spot returns, net flow (multiple regressions), 353f
- trading, machine-readable news (usage), 310–316
- trend returns, 355f
- trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
- volatility (understanding), machine-readable news (usage), 310–316
Free data, presence, 20
Freemium models, free services/value-added services (combination), 30
Free services, value-added services (combination), 30
Freshest automotive factors summary statistics, 228t
Fundamental factor model, 121, 122
Funds from operations (ffo), financial statement item, 241
Fuzzy k-means clustering (FKMI), 140, 142
FX Risk Tool (Oxford Economics), 345

Gaussian distributions, 202
Gaussian Finite Mixture Models, 185
Gaussian mixture model (GMM), 143
Gaussian processes (GPs), 80–82
- example, 238–239
- orthogonality/nonlinearity, 238
- representation, 81
Gaussian Process Regression (GPR), 238
GBP/USD intraday volatility, UK PMI Services (basis), 265f
General Data Protection Regulation (GDPR), 47, 50, 287
General partners, AUM ranking, 361f
Generative adversarial neural networks (GANs), 63, 77
Gensim, 101
Geospatial Insight dataset, usage, 272
GitHub, 79
Glmnet, 72
Global outliers, local outliers (contrast), 184
Global Vectors for Word Representation (GloVe), 95
Google
- Cloud Natural Language, 102
- Cloud Speech-to-Text, 102
- Domestic Trend, 325–326
- regressing Google domestic trend indices, 326f
- search volume, example, 326f
- Shock Sentiment, 326, 327f
- trends data, usage, 325–327
Government data, 341
Grapedata, 247–256
Gross Domestic Product (GDP), 259
- exports/lights, annual correlation, 270t
- growth correlations, 262t
- proxying, 270
- release, 11

Hang Seng index, share price (performance), 252f
Happiness Sentiment Index, 304, 305f
Hedonometer
- average score, 304f
- happiest/saddest words (ranking), 302f
- Index, 302–305, 303f, 322
Heuristics-based approaches, 203–204
Hierarchical clustering, 70–71
Hierarchical density-based spatial clustering of applications with noise (HDBSCAN), 199
High-capacity strategies, properties, 18
High-frequency data, usage, 355–357
High-level deep learning libraries, 79
High-level neural network libraries, 79
Histogram-based outlier score (HBOS), 198
Histogram-based statistical outlier (HBOS) detector, 188
Holding period, usage, 213
Homoscedastic errors, 64
HTML tags, removal, 300
Hyperspace, contents, 89

I/B/E/S dataset, 255
Ignore missing (IM), 140
IHS Markit (IHSM), 23, 259, 285
- databases, usage, 206–207
- data features, 243
- process, 209f
Images
- classification, deep learning/CNNs (usage), 89–90
- features/feature detection algorithm, 87–89
- imaging tools, 91
- satellite image data, dataset augmentation, 90–91
- structuring, 87–91
Imputation methods, 152
- multiple imputation (MI) methods, 157–160
- ranking, 143f
- values, computation, 148f
Imputation metrics, 154
Imputation-posterior (I-P) form, 158
Imputation step (I-step), 158
Imputation technique, classifiers (rank), 141f
Index market, evolution, 127–128
Indicator computation, car counts basis (steps), 272–273
Indirect prediction, 129–132
Induction learning methods, 140
Industrial data, 341
Information coefficient (IC), 217, 230
Information ratios, 312
Innovation measures, usage, 342–344
Input dataset error rates, LERS new classification (usage), 146f–147f
Institutional FX flow data, FX spot (relationship), 351–355
Interest rate swaps (IRSs), 71
Internal exhaust source, requirements, 24
inventory, financial statement item, 241
Investment
- capacity, increase, 127
- management constituents, phase identification, 16f
- strategy, 22, 105
- value, decay, 27–29
Investopedia search data, usage, 328–329, 328f
Investor Anxiety Index (IAI), 328
- usage, 329f
- Volatility Index (VIX), contrast, 328f, 329f
Investors
- anxiety (measurement), Investopedia search data (usage), 328–329, 328f
- attention, 323–325
- discretionary investors, 38–39
- systematic investors, 36–38
Isolation-Based Outliers, 204
Isolation forest (ISO), 199

Joint Organizations Data Initiative (JODI) Oil World Database, usage, 285
JX Mobile III, 249
- launch, payment willingness, 251f
- test version, usage question, 250f
JX PC III, monthly spending question, 251f

KDB, usage, 110
Keras, 79
Kernel Density Estimation, 185
Kernels, usage, 81
Kernel trick, example, 69f
Kingsoft, 250
- share price, performance, 252f
- survey, questions, 256–257
k-means (K-means), 70, 143, 198
k-means clustering information (KMI), 140, 142, 149
k-nearest neighbors (KNN) (KNNI), 140, 147–149, 187, 199
- regression/classification, 149
- usage, 143
Kriging, 80–81

Lasagne, 79
Latency arbitrage, 307
Latent Dirichlet Allocation (LDA), 97
Latent semantic analysis (LSA), 97
Lazy learning, 139, 140
- methods, rank, 142f
Learning from Examples based on Rough Sets (LERS), 146–147
LERS new classification, usage, 146f–147f
Licensees, number, 35
Lights/exports/GDP, annual correlation, 270t
Light vehicle production (IHS Markit database), 206
Light vehicle sales (IHS Markit database), 207
Linear regression (LR), 64–65, 84, 87, 348
- neural network function, 73, 73f
- visualization, 65f
Liquidity
- social media, impact, 309
- understanding, high-frequency data (usage), 355–357
Local least squares imputation (LLSI), 140
Locally linear reconstruction (LLR), 149
Local outlier factor (LOF), 187, 203
- score visualization, example, 187f
Local outliers, global outliers (contrast), 184
Location data, 283
Logistic regression, 65–67, 82
- neural network function, visualization, 74f
- single class logistic regression, neural network function, 73
- visualization, 66f
Log-likelihood, 159
Long-only portfolios (derivation), visa/patent data (usage), 343f
- in sample/out-of-sample, 344f
Long portfolio, creation, 214
Long short-term memory (LSTM), 76, 87
Long threshold, usage, 213
Long top 33% strategy excess returns, equal weighted benchmark (contrast/Pearson correlations), 233t–234t
Low-capacity strategies, 18
Low-level deep learning libraries, 77–79
Low-level neural network libraries, 77–79

Machine learning (ML), 4
- algorithms, calibration, 61
- bias/variance/noise, 60–61
- clustering-based unsupervised machine learning techniques, 70–71
- cross-validation (CV), 61–62
- deep learning, 72–80
- definitions, 60
- examination, 62–63
- fit, expected error (equation), 60
- Gaussian processes (GPs), 80–82
- libraries, 71–72
- neural networks, 72–80
- procedures, usage, 143
- processing layers, involvement, 40
- reinforcement learning, 63
- supervised learning, 62
- supervised machine learning techniques, 64–70
- techniques, 59, 60, 82–87
- unsupervised learning, 63
- unsupervised machine learning techniques, 71
Machine-readable news, usage, 310–316
Macro data, forecasting, 129–130
Macroeconomic factor model, 121, 122
Macro-economy attention, usage, 333f
Malls, visits (comparison), 290f
Market data, 351
Market participants, alternative data usage, 6
Market themes (measurement), Google trends data (usage), 325–327
Market value (MV), 34–35
Markov Chain Monte Carlo sampling, 157
Marks & Spencer, car count/earnings (contrast), 274f
Material non-public information (MNPI), 49, 109
Matlab ports, 72
Matplotlib, 91
Matrix factorization, 162–166
Maturing alternative datasets, advantages, 45–46
Maximization step (M-step), 159
Maximum likelihood estimation (MLE), 159
Mean absolute percentage error (MAPE), 154
Mean quintile gap (MQG), 217, 225, 230
Mean relative deviation (MRD), 154
- metrics, summary statistics, 166t–167t
Mergers and acquisitions (M&As), 296–298
Metadata
- addition, 93
- identification, 300
Micro-clusters, 184
Middle-level deep learning libraries, 79
Middle-level neural network libraries, 79
Misclassification error rate, examples, 145f–146f
MissForest: Random Forest imputation, 180
Missing at Random (MAR), 137
Missing Completely at Random (MCAR), 136, 137, 148, 155–157
Missing data, 54, 135
- case studies, 151
- classification, 136–138, 143
- classifier design, deletion, 143
- deletion, 137–138
- distinctions, 136–137
- fraction, usage, 153
- imputation/estimation, 143
- inclusion, 144f
- incomplete cases, deletion, 143
- misclassification error rate, 145f–146f
- predictive imputation, 138
- replacement, 138
Missing data treatments, 51, 137–138
- Farhangfar et al. perspective, 148
- Garcia-Laencina et al. perspective, 143–146
- Grzymala-Busse et al. perspective, 146–147
- Jerez et al. perspective, 147–148
- Kang et al. perspective, 149
- literature overview, 139–149
- Luengo et al. perspective, 139–143
- Zou et al. perspective, 147
Missingness patterns
- imposition, example, 164f
- occurrence, number (histogram), 156f
Missing Not at Random (MNAR), 137
Missing values
- consecutive missing values, length statistics (usage), 153
- total fraction, usage, 153
Mixture of Gaussians (MoG), 149
Mobile phone location data
- independent variables, 292
- usage, 287–295
Model backtesting, 213
Model-based nowcast, 307
Model-based procedures, usage, 143
Model-based techniques, 202
Model forecasts, comparison, 270t
Monopoly, impact, 42–43
Montreal Institute for Learning Algorithms (MILA) Theano development (cessation), 78
Multicollinearity, presence, 64
Multi-layer perceptron (MLP), 140, 143
- hidden layer, inclusion, 75f
- neural network, 75
Multiple imputation (MI) methods, 137, 138, 148, 157–160
Multiple imputation with chained equations (MICE), 153, 157
- imputed time series, 168f
- package, norm, 158
- procedure, description, 178–179
- usage, 179
Multiple singular spectral analysis (MSSA), 152–153, 162–164
- imputation, example, 170f
- imputed time series, example, 173f
- usage, 180
Multi-task learning (MTL), 143
Multivariate credit default swap time series
- CDS data, 154–157
- deterministic techniques, 160–164
- EOF-based techniques, 160–164
- imputation metrics, 154
- missing data classification, 153–154
- missing values, imputing, 152
- MRD metrics, summary statistics, 166t–167t
- results, 164–173
- test data generation, 154–157
Multi-variate normal (MVN)
- assumption, 158–159
- distribution, 155, 157
- test, 155
MXNet (Apache), 79

Naïve Bayes (NB), 69–70, 140
Named entity recognition, 92–93
Natural language processing (NLP), 55, 78, 91–102
- challenges, 97–98
- defining, 91–93
- languages/texts, differences, 98–99
- normalization, 93–94
- speech, involvement, 99–100
- tasks, classification problem, 96
- tools, 100–102
- word embeddings, creation, 94–96
NDAs, negotiation, 30
Negative change, ratios, 242–243
Net flow, spot returns (multiple regressions), 353f
Net income (netincome), financial statement item, 242
Net-Income-to-EV ratio, 216, 242
Neural networks (NNs), 72–80, 184
- examples, 73–74
- frameworks, 79–80
- high-level neural network libraries, 79
- libraries, 77–80
- low-level neural network libraries, 77–79
- middle-level neural network libraries, 79
- types, 75–77
News, 309–320
- articles per ticker, average daily count, 311f
- Bloomberg News article count, S&P500 (contrast), 310f
- trend correlation, contrast, 313f
- trend information ratio, contrast, 313f
- trend model returns, contrast, 314f
- trend model YoY returns, contrast, 314f
News data, 299
- reported EPS, contrast, 294f
newspaper3k, 101
News score (independent variable), 292
New York Fed meetings, 295–296
NLTK, 101
No-free-lunch (NFL) theorems, 82
Noise, 60–61, 88, 182
- cause, 60
Nonfarm payrolls (NFPs), US change
- ADP private payroll change, contrast, 45f
- Twitter-based forecast, actual release/Bloomberg consensus survey (contrast), 307f
- Twitter data, usage, 305
Nonfarm payrolls (surprise), USD/JPY 1-minute move (contrast), 306f
Non-negative matrix factorization (NMF), 97
Non-problems, modeling techniques (suggestions), 83t
Non-quarterly reporting companies, examination, 218
Non-stationarity (machine learning assumption/limitation), 85–86
Norges Bank Investment Management, 128
Normalization, 93–94
Normalized Difference Vegetation Index (NDVI), 278
Normally distributed errors, 64
Normal neighborhood, selection (difficulties), 190f
Nowcasting
- Eurozone (EZ) GDP growth, 260f
- GDP growth, 262–263
NumPy, 77, 80

Official Foreign Exchange Reserve, currency composition (COFER data), 347f
Oil and gas production (Q&A survey), case study, 252–254
Oil prices/supply changes, contrast, 254f
One-class SVM, 188
OPEC, 252
- crude oil production estimates, 253f
- oil supply changes, oil prices changes (contrast), 254f
OpenCV, 91
OpenSky dataset, usage, 297
OPEX, 38
Optical Character Recognition (OCR), 55
Original Equipment Manufacturers (OEM), decision-making, 206–207
Outliers
- anomalies, 181
- definition/classification, 182–183
- flagging, 200f
- global outliers, local outliers (contrast), 184
- local outlier factor (LOF), 187
- temporal structure, 183
- treatment, 51, 56–57
Outliers, detection
- algorithms, comparative evaluation, 185–188, 186t
- approaches, 182–183
- case study, 194
- density-based techniques, 203
- distance-based techniques, 202–203
- heuristics-based approaches, 203–204
- model-based techniques, 202
- problem, setup, 184–185
- techniques, 57
- unsupervised ML techniques, usage, 198–199
Outliers, explanations
- Angiulli et al. explanation, 192–193
- approaches, 189–193
- Duan et al. explanation, 191–192
- Micenkova et al. explanation, 189–190
- rank statistic, usage (problem), 191f

Packaging models, 30
Pandas, 80
Passive investing, 127
Passive strategies, 126
pattern, 101
Pattern classification methods, missing data (inclusion), 144f
Pay-per-use models, 40
Payroll readership, usage, 323–325
PDFMiner, 101
Percentage ratios, 243
Personal data, definition, 47
Pillow, 91
Point anomalies, 184
Point-of-sale (POS) devices, usage, 338
Poisson regression models, 72
Pooled surveys
- company event study (pooled survey), case study, 249–252
- usage, 247
Portfolio, effects, 22
Posterior step (P-step), 158
Predicted R squared coefficient, true R squared coefficient (contrast), 154
Predictive imputation (missing data treatment), 138
Predictive mean matching (PMM), 158, 165
Pricing
- discriminatory pricing mechanisms, 42f
- equation, 40
Principal component analysis (PCA), 71, 76
Principal Component Regression (PCR), 238
Private equity
- datasets, 362
- defining, 360
Private firms, performance (understanding), 363
Private markets, alternative data, 359
Probabilistic Graphical Model (PGM), example, 130f
Processed data (data transformation stage), 9f
Process expense, 34
Processing level, 21
Processing libraries, 80
prod_volume_prev_1m_pct_change_prev_2m_mean, 232, 235, 238
Proof-of-concept (POC), 106, 112
Pseudo-time, basis, 262
Publishing lag, 21
Purchasing Managers Indexes (PMI), 259
- China PMI, China GDP QoQ (contrast), 13f
- impact, 263–265
- indicators, appropriateness, 108
- manufacturing (measurement), satellite data (usage), 277–280
- performance, 261–262
- release, 11–12
- US GDP growth rate, contrast, 12f
Python ports, 72
PyTorch, 78

Q&A surveys
- oil and gas production (Q&A survey), case study, 252–254
- usage, 247
Q_pct_delta_ffo quintile CAGRs, 3-months clairvoyance, 221f
Q_pct_delta_ffo returns plot, quarterly benchmark (contrast), 222f
Q_pct_delta_ffo, stocks holding (heatmap), 222f
Quarterly benchmark
- Q_pct_delta_ffo returns plot, contrast, 222f
- revenues_sales_prev_3m_sum_prev_1m_pct_change, contrast, 230f
- usa_sales_volume_prev_12m_sum_prev_3m_pct_change returns plot, contrast, 232f
- ww_market_share_prev_1m_pct_change returns plot, contrast, 231f

Radial basis function network (RBFN), 140
Random forest (RF), 67–68, 184
- comparison, 166t
- imputation, 169f, 180
Ranking factor, usage, 213, 216–217
Raw data (data transformation stage), 9f
Real estate investment trust (REIT) ETF (trading), mobile phone location data (usage), 288–291
Rectified Linear Unit (RELU), 90
Recurrent neural networks (RNNs), 76, 87, 143
Regressing consensus, 275f
- estimates/footfall, reported EPS (contrast), 293f
Regressing footfall, reported EPS (contrast), 294f
Regressing Google domestic trend indices, 326f
Regressing news
- sentiment, 276f
- volume, 1M implied volatility (contrast), 315f
Regression, 62
- linear regression, 64–65
- logistic regression, 65–66
- models, 72
- softmax regression, 67
Reinforcement learning, 63
Replacement (missing data treatment), 138
Reported EPS, regressing consensus estimates/footfall (contrast), 293f
Reports, usage, 247
Research cost, 21
REST API, usage, 102
Restricted information set, 86
Retail activity (understanding), mobile phone location data (usage), 287–295
Retailers, car counts/EPS, 271–277
Returns, sensitivity, 18
Revenue, maximization (equation), 42
revenues_sales_prev_3m_sum_prev_1m_pct_change, 232, 238
- quarterly sales volume, monthly change, 229–230
- quintile CAGR, 230f
- returns plot, quarterly benchmark (contrast), 230f
RIPPER, 148
Risk-adjusted returns, 312
Risk managers, 39
Risk metrics, 39
Risks, pre-assessment, 106, 109
Risk tolerance levels, 36
Root mean square error (RMSE), 154
- computation, 161
Root Mean Square Forecasting Errors (RMSFE), 2643
Ross, Stephen, 122
R squared coefficient, differences, 154
RSSA, 180
Rule induction learning, 139
- methods, rank, 141f

Sales/revenue (sales), financial statement item, 242
Sales-to-EV ratio, 216, 242
sales_volume_prev_1m_pct_change_prev_2m_mean, 232, 235
Satellite data, usage, 277–280
Satellite images
- aerial photography, 267
- analysis, process (steps), 174
- case study, 173
- data, dataset augmentation, 90–91
Satellite manufacturing index, Chinese/Caixin PMI manufacturing (contrast), 279f
scikit-image, 77, 91
scikit-learn, 59, 71–72, 77, 101
SciPy, 77, 80
SciPy.ndimage, 91
Scrapy, 100
Search volume, example, 326f
Self-organizing map (SOM), 143
Sellers, data pricing perspective, 41–45
Semi-supervised anomaly detection, 57
Sentiment
- analysis, classification problem, 96
- identification, 93
- social media, impact, 309
Sequential minimal optimization (SMO), 140
Sharpe ratio, 128, 212, 239
- change, 18
- usage, 217
Shipping data, usage, 283–286
Short threshold, usage, 213
Shure, Sennheiser (MoM) (Amazon spend comparison), 339f
Signals
- data transformation stage, 9f
- existence, pre-assessment, 106, 109–110
- extraction, performing, 106, 111–112
SimpleCV, 91
Simple moving average (SMA), application, 331–332
Single class logistic regression, neural network function, 73
Singular spectral analysis (SSA), 162
Singular value decomposition (SVD), 71, 160
Singular value decomposition imputation (SVDI), 140
sinkr, 180
Siri, usage, 99–100
Smart beta indices, alternative data inputs (usage), 127–128
Social media, 300–309
- data, 299
- Hedonometer index, 302–305
Social Media Analytics, 301
Soft data, 260
Softmax regression, 67
- neural network function, 73–74, 74f
Software libraries, usage, 179–180
spaCy, 101
Spearman correlation, 235
Speech, involvement, 99–100
SpeechRecognition, 102
SpendingPulse Brazil retail sales YoY, Brazil YoY retail sales (contrast), 337f
SpendingPulse index (MasterCard), 337
Spot returns, net flow (multiple regressions), 353f
Standard & Poor's 500 (S&P500), 77, 265
- Bloomberg News article count, contrast, 310f
- Google Shock Sentiment, contrast, 327f
- Google Shock Sentiment scatter, contrast, 327f
- Happiness Sentiment Index, contrast, 305f
- returns, 82
- trading, IAI/VIX (usage), 329f
Statistical factor model, 121–122
Stochastic discount factor, definition/equation, 41
Stocks
- exchanges, stock ranking, 211
- heatmap, 222f
- market reaction (forecast), Twitter data (usage), 308–309
Stocktwits data/sentiment factor, 82, 309
Strategic risks, impact, 11
Strategy
- capacity, 16–19
- data transformation stage, 9f
- high-capacity strategies, properties, 18
- investment strategy, time frequency, 22
- loss making, 18
- low-capacity strategies, 18
- setup, 105, 106–107
Stride, 90
Structuring level, 21
Subject matter experts (SMEs), impact/usage, 107, 112
Supervised anomaly detection, 57
- example, 68f
Supervised learning, 62
Supervised machine learning techniques, 64–70
- assumptions, 64
Support vector machine (SVM) (SVMI), 68–69, 140, 142, 148, 162, 184
- one-class SVM, 185, 188
Survey data, 245
- alternative data use, 245–247
- case studies, 249–254
- contributors, hierarchy, 246f
- product, 247–249
- usage, 247
Surveys
- crowdsourcing analyst estimates survey, 255
- process, 249f
- technical considerations, 254–255
- timeline, example, 248f
Synthetic 2D data, DINEOF imputation (example), 161f
Systematic investors, 36–38

tabula-py, 101
Taxi ride data, 295–296
Technology, score, 22
TensorFlow, 59, 77–79, 101
- Tutorials, 94
Tesla, value, 218
TextBlob, 101
Text data, 299
TF-IDF, 94
TF Learn, 79
Thasos Mall Foot Traffic Index, 288
- YoY, US retail sales YoY (contrast), 289f
Theano, 77, 78
The many, Big Data (contrast), 9–11
Tickers, usage (change), 18
TickerTags, 98
Time
- frame, impact, 217
- removal, 219
Time averaged Spearman rank correlations, 236t–237t
Time series data, examples, 164f, 171f
Time series trading approach, cross-sectional trading approach (contrast), 126
Topic
- identification, 93
- modeling, 96–97
Total assets (totassets), financial statement item, 242
Tradeable companies, long/short-portfolio sizes, 218t
Transaction cost analysis (TCA), 356
Transaction costs, 217–218
- impact, 18f
- increase, impact, 18
Transaction time, 54
Trend returns, 355f
Trend strategies/daily flow-based strategies, risk-adjusted returns, 354f
Trial availability, 22
True R squared coefficient, predicted R squared coefficient (contrast), 154
t-statistics, report, 314
Turkey PVIX indicator, USD/TRY 1M implied volatility (contrast), 331f
Twitter data, 309
- reported EPS, contrast, 294f
- usage, 305–308
Twitter mood data, usage, 15
Twitter score (independent variable), 292
Two-part-tariff models, 31

Unstructured data, conversion, 51
Unsupervised anomaly detection, 57
Unsupervised learning, 63
Unsupervised ML techniques, usage, 198–199
usa_sales_volume_prev_12m_sum_prev_3m_pct_change, 229, 232, 238
- quintile CAGR, 232f
- returns plot, quarterly benchmark (contrast), 232f
- yearly US sales volume, quarterly change, 231
USD/JPY
- bid/ask spread, 357f
- news sentiment score, weekly returns (contrast), 312f
- news volume, 1M implied volatility (contrast), 315f
- trading, intraday basis, 308f
USD/TRY 1M implied volatility, Turkey PVIX indicator (contrast), 331f
US employment report, payrolls clicks, 324f
US export growth, forecasting, 269–271
US GDP growth rate, PMI (contrast), 12f
US/global new vehicle registration/sales (IHS Markit database), 207
US ISM, US GDP QoQ (contrast), 12f
US retail sales YoY, Thasos Foot Traffic Index YoY (contrast), 289f
UST 10Y yield changes, 320f
US Treasury yields, 316–320

Value-at-Risk (VaR), enhancement, 346
Value-of-information, 199
Variance, 60–61
- bias, balance, 61f
- cause, 60
Vendors
- due diligence, performing, 105, 108
- identification, 23–24
- monopoly, impact, 42–43
Venture capital
- firms, defining, 360
- transactions, database charting, 362
Vickrey auction, 43
Viscosity factor, 82
Vision, setup, 105, 106–107
Volatility Index (VIX), 63, 77
- Investor Anxiety Index (IAI), contrast, 328f, 329f
- usage, 329f
Volume, Variety, Velocity, Variability, Veracity, Validity, Value (Big Data characteristics), 9–10

Walmart, earnings per share (consensus/footfall contrast), 291f
Web data, 299
- collection, 299–300
- search volume example, 326f
Web page
- body text, capture, 300
- content, downloading, 300
- time stamp, assignation, 300
Web scraping, usage, 25f, 300
Web sources, 320–322
Weighted k-NN (WKNNI), 140
Wikipedia, usage, 330
Word2vec, 94–96
Word embeddings, creation, 94–96
Words, frequency (example), 99f
Word tokenization/segmentation, usage, 92
Wrappers, writing, 112
ww_market_share_prev_1m_pct_change, 229, 232, 235
- quintile CAGR, 231f
- returns plot, quarterly benchmark (contrast), 231f
- worldwide market shares, monthly change, 230

XLNet, 96
XRT, returns/trading (Thasos Mall Foot Traffic Index basis), 288–289, 290f

YARN clusters, 79

Z-score, 191–192, 204
