Appendix. Exercises: Examples and Answers

Answers are listed by chapter below.

Chapter 2

1.

What are three questions you would ask the product designer?

Some questions you might want to ask the product designer before getting started include:

  • What are some examples of forecasts that you would like to be able to provide to FMI’s app users?
  • How do you imagine the users interacting with the forecasts?
  • How do the forecasts fit in with the rest of the components in the app?

2.

What are three good questions you might ask of the data?

Some example questions include:

  • Is it possible to make reliable forecasts into the future?
  • If so, how far into the future are the forecasts reliable?
  • Is it possible to classify transactions such as withdrawals, purchases, and so on into various grades of “good” and “bad” in terms of their effect on the user’s financial health, and to show how they affect forecasts?

3.

What are three possible goals for the project?

Possible goals might include:

  • A forecast that can reliably inform the user when an account balance will probably reach a certain level—for example, a bank account being empty, hitting a savings goal, or a credit card hitting the credit limit.
  • Providing the users with suggestions for financial behavior changes to improve their forecasts.
  • A visualization that conveys forecasts clearly and concisely.

Chapter 3

1.

List three potential data sources that you expect would be good to examine for this project. For each, also list how you would expect to access the data (for example, database, API, and so on).

Data sources may include:

  • FMI’s internal databases; I would guess they are relational, SQL-based databases, but they could also be NoSQL.
  • The APIs of the banks and other financial institutions from whom FMI originally pulls its data; these are probably XML- or JSON-based APIs.
  • Users might provide some helpful data voluntarily, such as transactions, categories, or other attributes; these must be designed into the product, would originally be web actions, and could be put into FMI’s internal database.

2.

Consider the three project goals you listed in exercise question 3 in the last chapter. What data would you need in order to achieve them?

The goals would require, respectively:

  • Making reliable forecasts would require several months (minimum) of complete transaction data for each relevant account, plus enough transaction attributes to classify them into various categories such as “repeating” or “one-time” transactions, in order to improve the accuracy of the forecasts.
  • Providing suggestions to users would require the same as forecasts (previous bullet item), plus a more stringent requirement for transaction attributes and the ability to classify them. A longer history of data (one year?) might also be required.
  • A data visualization may not have any special data requirements, but reliable transaction attributes would certainly help in cases where, for example, spending categories or transaction types are part of the visualizations.

Chapter 4

1.

You’re about to begin pulling data from FMI’s internal database, which is a relational database. List three potential problems that you would be wary of while accessing and using the data.

It would be good to watch out for, for example:

  • Missing, invalid, or incorrect values in the database.
  • Names and other identifier-type strings that don’t always match up between fields or tables—for example, having “John Public” in one place and “John Q Public” in another. These are unequal strings that may or may not represent the same people.
  • If you have to join some database tables, it’s best to make sure that the field you are joining on matches the records well. I would check to see how many records from each table have properly matching records from the other table, and vice versa.
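The match check in the last bullet can be sketched in a few lines. This is a minimal illustration and not FMI's actual schema: the function name `join_coverage` and the sample account IDs are hypothetical, and the check compares distinct key values, not individual records.

```python
def join_coverage(left_keys, right_keys):
    """Report how well two tables match on a join key (distinct values only)."""
    left, right = set(left_keys), set(right_keys)
    matched = left & right
    return {
        "left_only": sorted(left - right),    # keys with no partner in the right table
        "right_only": sorted(right - left),   # keys with no partner in the left table
        "left_match_rate": len(matched) / len(left) if left else 0.0,
        "right_match_rate": len(matched) / len(right) if right else 0.0,
    }

# Hypothetical key columns pulled from an accounts table and a transactions table.
account_ids_in_accounts = [101, 102, 103, 104]
account_ids_in_transactions = [101, 101, 103, 105]

report = join_coverage(account_ids_in_accounts, account_ids_in_transactions)
print(report)
```

A low match rate in either direction is a signal to investigate before trusting any result that depends on the join.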

2.

In the internal database, for each financial transaction you find a field called description containing a string (plain text) that seems to provide some useful information. But the string doesn’t seem to have a consistent format from entry to entry, and no one you’ve talked to can tell you exactly how that field is generated or any more information about it. What would be your strategy for trying to extract information from this field?

My strategy would be something like:

  1. Pull a manageable collection (a few hundred?) of transactions (a random set would be good) from the database, examine the description fields, and try to notice any patterns.
  2. Ask myself: does the string seem to be mainly comma-separated, semicolon-separated, pipe-separated, JSON, or free-form? What characters or terms are repeated from entry to entry?
  3. If a format does emerge, check to make sure it applies to the whole collection I’m working with.
  4. If there still doesn’t seem to be a common format, focus on the aspects of the string that are most interesting to me. For example, if I am interested in the ZIP codes of the transactions (and they appear nowhere else in the database), I would examine how the ZIP codes are represented in the string. If each description string contains a substring such as "ZIP code: XXXXX", then I could write a script that searches for the substring "ZIP code:", captures the next five characters, and records them as the ZIP code. I would try to use context (the text around the desired data) for all interesting nuggets I find within the description string.
  5. If none of the above suffices to extract some specific information from the description string, I would begin to expand the elements of context and format that I use to detect and capture data within the string. If there isn’t an obvious context such as a substring "ZIP code:", I would try to figure out a combination of punctuation, separators, whitespace, and other format elements that distinguish the data from its neighbors. Perhaps the ZIP code is always the first five-digit number that appears between two commas, as in ",XXXXX,", or maybe it always appears directly after a two-letter state abbreviation, such as "OH,XXXXX,". A script can easily find substrings matching these patterns, but it might also capture unwanted data if I’m not careful.
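Steps 4 and 5 might look something like the following sketch, which tries the most specific context first and falls back to looser patterns. The three patterns and the sample description strings are assumptions made up for illustration; real descriptions would dictate the actual patterns.

```python
import re

# Ordered from most to least specific; all three are hypothetical formats.
ZIP_PATTERNS = [
    re.compile(r"ZIP code:\s*(\d{5})"),         # explicit label
    re.compile(r"\b[A-Z]{2},\s*(\d{5})(?=,)"),  # directly after a two-letter state code
    re.compile(r",\s*(\d{5})(?=,)"),            # any five digits between two commas
]

def extract_zip(description):
    """Return the first ZIP-like substring found, or None if no pattern matches."""
    for pattern in ZIP_PATTERNS:
        match = pattern.search(description)
        if match:
            return match.group(1)
    return None

samples = [
    "POS PURCHASE ZIP code: 43215 REF 99",
    "COFFEE SHOP,COLUMBUS,OH,43215,USA",
    "REF,12345,COFFEE",   # the loosest pattern captures a reference number here
    "ATM WITHDRAWAL",
]
for text in samples:
    print(repr(text), "->", extract_zip(text))
```

The third sample shows the risk mentioned in step 5: the loosest pattern happily captures a five-digit number that may not be a ZIP code at all, so results from fallback patterns deserve a manual spot check.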

Chapter 5

1.

Given that a main goal of the app is to provide accurate forecasts, describe three types of descriptive statistics you would want to perform on the data in order to understand it better.

Some useful descriptive statistics might be:

  • The typical, minimum, and maximum number of transactions per month for the set of financial accounts you intend to forecast. More transactions would be more informative for statistical forecasting methods.
  • The variances or some other measures of fluctuation of account balances. Smaller variances likely make it easier to forecast.
  • A set of plots of account balances over time for randomly chosen accounts. Looking at plots can provide far more information than mean and variance. It can also give you a good feeling for account balance behavior and get you thinking about statistical methods that might be helpful for forecasting.
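The first two statistics above can be computed with the standard library alone. The sketch below assumes transaction dates arrive as ISO strings (YYYY-MM-DD) and that balances are a plain list of numbers; both the data and the function names are hypothetical.

```python
from collections import Counter
from statistics import median, pvariance

def monthly_transaction_counts(dates):
    """Count transactions per calendar month from ISO date strings."""
    # Months with no transactions at all do not appear in the result.
    return Counter(d[:7] for d in dates)

def summarize_counts(counts):
    """Typical (median), minimum, and maximum transactions per month."""
    values = list(counts.values())
    return {"min": min(values), "median": median(values), "max": max(values)}

# Hypothetical data for a single account.
dates = ["2024-01-03", "2024-01-17", "2024-01-29",
         "2024-02-05", "2024-03-02", "2024-03-20"]
balances = [1200.0, 950.0, 1100.0, 800.0]

counts = monthly_transaction_counts(dates)
summary = summarize_counts(counts)
balance_variance = pvariance(balances)  # population variance of the balances

print(summary, balance_variance)
```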

2.

Assume that you’re strongly considering trying to use a statistical model to classify repeating and one-time financial transactions. What are three assumptions you might have about the transactions in one of these two categories?

Some possible assumptions are:

  • Repeating transactions occur weekly, monthly, or on some other regular interval.
  • Repeating transactions have the same label, name, or other identifying attribute in every instance.
  • One-time transactions are usually larger than the average transaction, in cases such as vacations, large purchases, windfalls, and so forth.
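The first assumption can be tested directly against data. The following sketch flags a group of transactions as repeating when the gaps between consecutive dates are nearly constant; the function name and the three-day jitter threshold are assumptions chosen for illustration, not tuned values.

```python
from datetime import date
from statistics import mean, pstdev

def looks_repeating(dates, max_jitter_days=3.0):
    """Return True if the gaps between consecutive dates are nearly constant."""
    if len(dates) < 3:
        return False  # too few occurrences to judge regularity
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps) > 0 and pstdev(gaps) <= max_jitter_days

# Hypothetical examples: a monthly rent payment and some irregular purchases.
rent = [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]
irregular = [date(2024, 1, 5), date(2024, 1, 9), date(2024, 3, 30)]

print(looks_repeating(rent), looks_repeating(irregular))
```

Comparing the output of a check like this against transactions that are known to repeat would show how well the assumption holds in practice.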

Chapter 6

1.

Suppose that your preliminary analyses and descriptive statistics from the previous chapter lead you to believe that you can probably generate some reliable forecasts for active users with financial accounts that have many transactions, but you don’t think you can do the same for users and accounts with relatively few transactions. Translate this finding into the “What is possible? What is valuable? What is efficient?” framework for adjusting goals.

A possible translation:

  • What is possible? It is probably possible to make reliable forecasts for users with the highest-quality data, but almost certainly not possible for users with the lowest-quality data.
  • What is valuable? It is most valuable to provide reliable forecasts for the most active app users. Less active users may not care as much, and if they do, they might be enticed to become more active in the interest of getting access to good forecasts.
  • What is efficient? It is probably most efficient to start with the user accounts with high-quality data, generate reliable forecasts, and then try to apply the methods to lower-quality account data, tweaking and improving the methods until the work needed to (maybe) make them work for lower-quality data is too much to justify.

2.

Based on your answer to the previous question, describe a general plan for generating forecasts within the app.

One possible basic plan is:

  1. Work with the highest-quality account data to develop a statistical method and software for generating reliable forecasts.
  2. If you are unsuccessful at step 1, revisit the goals and consider whether it is worth spending more time on this one.
  3. If you are successful with step 1, apply the methods and software to lower-quality data and find a sort of minimum quality level for reliable forecasts.
  4. Optionally, improve the forecasting methods to better handle lower-quality data, but weigh the time spent against the marginal benefit. Discuss with the product designer and jointly decide how much work to put in.
  5. When everyone is satisfied with the success, scope, and reliability of the forecasts, refine and prep your software for integration with the web app. Be sure to include some notion of not having enough high-quality data; for example, the software should tell the web app "NOT ENOUGH DATA; NO FORECAST GENERATED" whenever the established data-quality threshold is not met.
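The data-quality gate in step 5 could be as simple as the sketch below. The thresholds, the function name, and the placeholder forecast (current balance plus the mean monthly net flow) are all assumptions for illustration; the real threshold would come out of step 3.

```python
NO_FORECAST_MESSAGE = "NOT ENOUGH DATA; NO FORECAST GENERATED"
MIN_MONTHS = 6          # hypothetical threshold
MIN_TRANSACTIONS = 30   # hypothetical threshold

def forecast_or_message(monthly_net_flows, transaction_count, current_balance):
    """Return a one-month forecast, or an explicit message when data is too sparse."""
    if len(monthly_net_flows) < MIN_MONTHS or transaction_count < MIN_TRANSACTIONS:
        return {"status": "no_forecast", "message": NO_FORECAST_MESSAGE}
    # Placeholder model: next month's balance = current balance + mean net flow.
    average_flow = sum(monthly_net_flows) / len(monthly_net_flows)
    return {"status": "ok", "forecast": current_balance + average_flow}

sparse = forecast_or_message([100.0, 100.0], 5, 500.0)
healthy = forecast_or_message([100.0, -50.0, 100.0, -50.0, 100.0, -50.0], 40, 500.0)
print(sparse)
print(healthy)
```

Returning a structured status instead of a bare number keeps the web app from accidentally displaying a forecast that was never generated.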

Chapter 7

1.

Describe two different statistical models you might use to make forecasts of personal financial accounts. For each, give at least one potential weakness or disadvantage.

Possible statistical models include:

  • Linear regression on end-of-month account balances for the past six months. This would capture the general trajectory of the account, but weaknesses include: it doesn’t capture fluctuations in balances very well, in case spending varies widely; and, if, for example, a large monthly check (income) does not arrive on time, it might fall into the following month, potentially throwing off the end-of-month balance—and thus the regression results—by a lot.
  • Estimate average monthly income and average monthly expenses for the account for the past six months; assume future months will have approximately the same levels of income and expenses. This model likely avoids the preceding problem of a check not arriving on time, but a weakness is that six individual months is not many data points, and the model would be susceptible to large one-time or abnormal transactions anywhere.
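The first model, and its sensitivity to a late check, can be seen in a short sketch. This is an ordinary least-squares fit written by hand with months numbered 0, 1, 2, and so on; the balances are made-up numbers.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def forecast_balance(ys, months_ahead):
    """Extrapolate the fitted line months_ahead past the last observed month."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + months_ahead)

# Hypothetical end-of-month balances: steady growth of 100 per month.
steady = [1000.0, 1100.0, 1200.0, 1300.0, 1400.0, 1500.0]
# Same account, but the last paycheck slipped into the following month.
late_check = [1000.0, 1100.0, 1200.0, 1300.0, 1400.0, 500.0]

print(forecast_balance(steady, 1))
print(fit_line(late_check)[0])  # the fitted slope turns negative
```

With only six points, a single off month is enough to flip the sign of the trend, which is the weakness described in the first bullet.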

2.

Assume you’ve been successful in creating a classifier that can accurately put transactions into the categories regular, one-time, and any other reasonable categories you’ve thought of. Describe a statistical model for forecasting that makes use of these classifications to improve accuracy.

I would also try to create a category for normal expenses, which represents day-to-day expenses like coffee, groceries, dinner, or cocktails but which may not strictly fall into the other two categories previously mentioned, repeating and one-time. Normal expenses probably aren’t as large as one-time transactions, but aren’t considered repeating, either. For example, if I go out to eat a couple of times per week, it’s not really a one-time expense because I do it (predictably) every week, and it’s not repeating, either, because I go to different places and on different days of the week.

I would use monthly amounts falling into each of these three categories to calculate baselines for each of them. For repeating transactions, I would use six to nine months of data, placing higher weight on more recent months. So, I would have a weighted average monthly amount for repeating transactions and normal expenses, but I would leave out one-time transactions. My forecasts would be based on assuming that the sum of repeating transactions and normal expenses would continue into the near future, with the intention of somehow showing the user how one-time transactions can affect their forecasts and their financial situation (but this is more a UX and product question than a statistical one).

Because I am such a fan of acknowledging uncertainty when it exists, I would also likely calculate for my forecasts a notion of variance, with more variance for forecasts farther into the future, and also for accounts with more erratic income and expenses.
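A rough sketch of this model follows. The linear recency weights, the signed-amount convention (income positive, expenses negative), and the assumption that months are independent, so that variance grows linearly with the forecast horizon, are simplifications chosen for illustration and not the only reasonable choices.

```python
from statistics import pvariance

def recency_weighted_mean(monthly_amounts):
    """Weighted mean with linearly increasing weights; last item is the newest month."""
    weights = range(1, len(monthly_amounts) + 1)
    return sum(w * a for w, a in zip(weights, monthly_amounts)) / sum(weights)

def forecast_with_uncertainty(balance, repeating, normal, months_ahead):
    """Project the balance forward; one-time transactions are deliberately left out."""
    net = [r + n for r, n in zip(repeating, normal)]
    monthly_change = recency_weighted_mean(net)
    # Assumes independent months, so variance scales with the horizon.
    variance = pvariance(net) * months_ahead
    return balance + monthly_change * months_ahead, variance

# Hypothetical six months of signed monthly totals per category.
repeating = [500.0] * 6
normal = [-300.0, -320.0, -280.0, -310.0, -290.0, -300.0]

point, variance = forecast_with_uncertainty(1000.0, repeating, normal, 3)
print(point, variance)
```

Accounts with more erratic monthly totals produce a larger variance at every horizon, which matches the intent described above.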

Chapter 8

1.

What are your two top choices of software for performing the calculations necessary for forecasting in this project and why? What’s a disadvantage for each of these?

My top two choices would be:

  • Python— I can easily do calculations in Python, but a disadvantage is that if my code is going to be deployed with the live Filthy Money Forecasting app, it will have to be relatively bulletproof and bug-free; I’d probably try to recruit a production developer to help with that. I would also have to find a way to make Python interface well with existing code if it doesn’t already.
  • Whatever language/tool is already doing some calculation in the existing app— I can probably just add a component to the code, but the disadvantages might be, depending on the language, that it might not have good built-in statistical libraries, and I might not know this language yet, so I’d have to learn it.

2.

Do your two choices in question 1 have built-in functions for linear regression or other methods for time-series forecasting? What are they?

Respectively:

  • Python does. Between the packages NumPy, scikit-learn, and others, Python has some of the best statistical functionality among programming languages; for example, numpy.polyfit and scikit-learn’s LinearRegression both fit linear regressions.
  • It depends on the language; I would consult the internet before committing.

Chapter 9

1.

What are three supplementary (not strictly statistical) software products that might be used during this project, and why?

Some products might include:

  • A relational database, because FMI already has one and it contains most of the data you will need.
  • A high-performance compute server or cluster, because there is a lot of data and a lot of computation to do.
  • Cloud computing services, because FMI might not already have the computing resources necessary to perform the forecasting calculations within a reasonable time.

2.

Suppose that FMI’s internal relational database is hosted on a single server, which is backed up every night to a server at an offsite location. Give a reason why this could be a good architecture and one reason why it might be bad.

The good: if the single server is powerful enough and the total data set is small enough for the server to handle everything quickly and efficiently, this architecture is probably easier to manage than a larger, distributed system.

The bad: data sets grow, and at some point the server will be overwhelmed either by the sheer size of the data or by the computational cost of managing and querying it; also, a single server is probably a single point of failure.

Chapter 10

1.

List three people (by role or expertise) at FMI with whom you will probably be talking the most while executing your project plan and briefly state why you will probably talk to them so much.

I would guess that I would talk to these people the most:

  • The web app developer responsible for the FMF app— Integrating the statistical software with the app will probably take significant communication, negotiation, and agreement to make sure the two pieces align properly.
  • The product designer responsible for the app— They will probably have many questions about how to interpret output from your statistical application, and you will want to have a good idea of how they expect the user to interact with the data in order to make sure your statistics are applicable.
  • An internal database expert— If you haven’t already used the database extensively, there are likely nuances in structure, access, and efficiency that may not be immediately clear and that may be very helpful to you.

2.

Suppose that the product designer has spoken with the management team, and they all agree that your statistical application must generate a forecast for all user accounts, including ones with extremely sparse data. Priorities have shifted from making sure the forecasts are good to making sure that every forecast exists. What would you do to address that?

If a forecast must exist no matter what, then I would probably assume that low-data-quality accounts will stay at their current (or month-end) balances for the near future. Without the data to justify forecasting a move up or down, the status quo is probably the best choice. I would, however, talk this through with the product designer and others before implementing it.

Chapter 11

1.

Suppose your boss or another management figure has asked for a report summarizing the results of your work with forecasting. What would you include in the report?

I would include, first, a summary of expected forecast accuracies for user accounts with high-quality data, and follow it with results from lower-quality data. This frames the results in terms of what is possible when users are fully engaged with the main application (which is an issue for all of FMI and not just for me) and also highlights one of the biggest successes of the project. Second, I would discuss the distribution of high- and low-quality data in user accounts in order to illustrate to what extent data quality issues affect the app. Lastly, I would provide a few specific examples (with data) of how the forecasts can be used to influence user behavior, such as warning about a bank account that will soon be empty or a credit limit that might be reached soon. This connects the forecasting project directly to user engagement, which is often the most important thing to management types working with app development.

2.

Suppose the lead product designer for the FMF web app asks you to write a paragraph for the app users, explaining the forecasts generated by your application, specifically regarding reliability and accuracy. What would you write?

Forecasts generated by this app are based on your recent spending and earning habits, as well as on any other transactions or events occurring within any of your accounts that are connected to FMF. The more consistent you are in your financial habits, the better our forecasts become—yet another reason to be in control of your finances! Unfortunately, in some cases we may not have enough data to make a forecast at all; in these cases, you may want to connect more accounts to FMF and engage with the app more regularly in order to have these powerful forecasts at your fingertips and maximize your financial well-being. Thanks for making FMF the most powerful personal finance site on the web!

Chapter 12

1.

Suppose that you’ve finished your statistical software, integrated it with the web app, and most of the forecasts appearing in the app seem to be incorrect. List three good places to check to try to diagnose the problem.

I’d check these first:

  • The output from the statistical application. If it’s outputting correct results, then the problem lies in the app that is making use of them. If not, then the problem is in the statistical part.
  • If the problem seems to be in the web app, I’d check the database internal to that app. If the data is stored incorrectly, then there is likely a problem between data intake and when it is stored. If it is stored correctly, then there is a problem in the way the data is handled between when it is pulled from the app’s internal storage and shown on the screen.
  • If the problem seems to be in the statistical application, I would think about the way the data flows through the statistical model and examine results coming directly out of the model, before they are transformed into the format in which they are sent to the web application.
  • BONUS 4th ANSWER— Check the initial data going into any part of your application. If there’s something wrong with the database query, nothing that comes after will be correct.

2.

After deploying the app, the product team informs you that it’s about to send selected users a survey regarding the app. You may submit three open-ended questions that will be included in the survey. What would you submit?

I would probably submit the following questions:

  • Have the financial forecasts in the app been informative to you? In what ways?
  • Have you acted upon information within the financial forecasts? How?
  • Do you feel that the financial forecasts are accurate? Why or why not?

Chapter 13

1.

Think of a project (data science or otherwise) you’ve worked on in the past, preferably more than a year ago. Where are the materials and resources related to that project? If someone asked you to repeat, restart, or continue that project today, would you be able to? Could you have done anything back then to make it easier today?

This is mainly a thought experiment. It depends entirely on where you work and what you’re working on.

2.

Consider your current job and place of employment. Where are shared resources kept? Can you find what you’re looking for easily? If it were your job to come up with a detailed plan and policy for archiving completed projects, what would you include?

This is another thought experiment, but useful for improving the outcomes of future projects. Solid plans for documenting and storing projects can be incredibly helpful.
