As with any project, you start with some understanding of what you are trying to build. The better you understand the problem, the better you can solve it.
Once you understand what you're trying to do, the next question (in the context of building a machine learning model) is: what data do I need? Answering it means exploring what data is already available and what data you may need to generate yourself.
With the task and the data requirements understood, the next step is to decide which algorithm (or model) to use. The choice depends on your task and the data you have. In some instances you may need to create your own model, but more often than not an adequate model will already be available, or at least an architecture that you can train with your own data. The following table shows some typical computer vision tasks and their machine learning counterparts:
| Task | Machine learning algorithm |
| --- | --- |
| Label images | Image classification |
| Recognize multiple objects and their locations | Object detection and semantic segmentation |
| Find similar images | Image similarity |
| Create stylized images | Style transfer |
The next step is to train your model. This is typically an iterative process involving a lot of fine-tuning, which continues until the model performs its task well enough on data it has not been trained on.
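The idea of judging a model on data it was not trained on can be illustrated with a small hold-out evaluation. The following is a minimal sketch under stated assumptions, not a recipe for a real image model: the two-number "features", the nearest-centroid "training", and every function name (`make_toy_samples`, `split_holdout`, `train_centroids`, `predict`, `accuracy`) are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random


def make_toy_samples(n_per_class=50, seed=0):
    """Generate synthetic (features, label) pairs; a stand-in for real image data."""
    rng = random.Random(seed)
    # Hypothetical 2-D feature centers per fruit (assumption, not real measurements).
    centers = {"apple": (0.9, 0.2), "banana": (0.2, 0.9)}
    samples = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            features = (rng.gauss(cx, 0.05), rng.gauss(cy, 0.05))
            samples.append((features, label))
    return samples


def split_holdout(samples, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, holdout)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train):
    """Toy 'training': compute the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in samples if predict(centroids, features) == label)
    return correct / len(samples)


samples = make_toy_samples()
train, holdout = split_holdout(samples)
model = train_centroids(train)            # fit on the training portion only
holdout_accuracy = accuracy(model, holdout)  # measure on data the model never saw
print(f"held-out accuracy: {holdout_accuracy:.2f}")
```

In practice you would repeat the train-and-evaluate step, adjusting the model or its settings between runs, and stop once the held-out score is good enough for your task.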
Finally, with a trained model, you can deploy and use your model in your application. This process is summarized in the following diagram:
To make the concepts in this chapter more concrete, let's work with a hypothetical brief: build a fun application that helps toddlers learn the names of fruits. You and your team have come up with the concept of a game that asks the toddler to find a specific fruit. The toddler earns points when they correctly identify the fruit using the device's camera. With our task now defined, let's discuss what data we need.
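Before turning to the data, it may help to see how small the game rule itself is once a model can label what the camera sees. The sketch below is only an illustration of that rule: the `Prediction` type, the `score_round` function, the confidence threshold, and the point value are all hypothetical, and the prediction is assumed to come from whatever image classifier the application ends up using.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical output of an image classifier for one camera frame."""
    label: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def score_round(target_fruit, prediction, min_confidence=0.6, points_per_hit=10):
    """Return the points earned for one round of the fruit-finding game.

    Points are awarded only when the predicted label matches the requested
    fruit (ignoring case) and the classifier is confident enough. Both the
    threshold and the point value are illustrative assumptions.
    """
    is_match = prediction.label.lower() == target_fruit.lower()
    if is_match and prediction.confidence >= min_confidence:
        return points_per_hit
    return 0


# Example round: the game asks for a banana and the camera frame is classified.
earned = score_round("banana", Prediction(label="Banana", confidence=0.92))
print(f"points earned: {earned}")
```

Everything difficult in this application sits behind the `Prediction` value, which is why the questions of data and model come first.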