Gene Amdahl was among the first to study parallel processing, in the 1960s. He proposed Amdahl's law, which is still applicable today and provides a basis for understanding the various trade-offs involved in designing a parallel computing solution. Amdahl's law can be explained as follows:
It is based on the observation that in any computing process, not all parts can be executed in parallel: there will always be a sequential portion that cannot be parallelized, and this portion limits the overall speedup that adding more processors can achieve.
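This limit can be quantified. If a fraction s of the work is inherently sequential, then with n processors the theoretical speedup is 1 / (s + (1 - s) / n). The following sketch illustrates this; the function name and the example values are illustrative, not from the original text:

```python
# Amdahl's law: with a sequential fraction s of the total work,
# the maximum speedup on n processors is 1 / (s + (1 - s) / n).

def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Return the theoretical speedup for a given sequential fraction."""
    s = sequential_fraction
    return 1.0 / (s + (1.0 - s) / processors)

if __name__ == "__main__":
    # Even with many processors, a 10% sequential portion
    # caps the achievable speedup below 10x.
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:>5} processors -> speedup {amdahl_speedup(0.1, n):.2f}")
```

Note how the curve flattens: going from 64 to 1024 processors barely helps when 10% of the work is sequential, which is exactly the trade-off Amdahl's law captures.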
Let's look at a particular example. Assume that we want to read a large number of files stored on a computer and want to train a machine learning model using the data found in these files.
Let's call this whole process P. P can be divided into the following two subprocesses:
- P1: Scan the files in the directory, create a list of filenames that match the input criteria, and pass it on.
- P2: Read the files, create the data processing pipeline, process the files, and train the model.
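The split above can be sketched in code: P1 is inherently sequential (one pass over the directory), while P2 can be parallelized across files. This is a minimal sketch; the directory layout, the `.csv` filter, and the per-file work (line counting as a stand-in for real processing and training) are assumptions for illustration:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def scan_files(directory: str, suffix: str = ".csv") -> list[str]:
    """P1: sequentially build the list of matching filenames."""
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if name.endswith(suffix)]


def process_file(path: str) -> int:
    """P2 unit of work: count lines as a stand-in for real processing."""
    with open(path) as fh:
        return sum(1 for _ in fh)


def run(directory: str) -> int:
    filenames = scan_files(directory)        # P1: sequential portion
    with ProcessPoolExecutor() as pool:      # P2: parallel over files
        counts = pool.map(process_file, filenames)
    return sum(counts)
```

However fast P2 runs across many workers, the total runtime still includes P1 in full, which is why the sequential portion dominates as parallelism grows.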