Different types of concurrency

There are various ways to achieve concurrency in Python, including threads, tasks, processes, and so on. First, although we say that concurrent tasks run simultaneously, this is not always the case. In fact, threads and tasks don't really run in parallel; instead, the CPU switches between different threads so fast that they seem to be running in parallel, but it always executes one thread at a time. This is enforced by a part of the Python interpreter called the Global Interpreter Lock, or GIL. Threading can still speed up your code, because the interpreter switches to another thread whenever the current one is waiting, for example, for data to be loaded from the network; we'll talk about that in a minute.
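As a minimal sketch of this effect, the following snippet runs five simulated "downloads" in threads (the `fetch` function and 0.1-second delay are made up for illustration). Because `time.sleep` releases the GIL, the waits overlap, and the total time is close to one delay rather than five:

```python
import threading
import time

def fetch(results, i):
    # Simulate waiting on the network; time.sleep releases the GIL,
    # so other threads can run while this one waits.
    time.sleep(0.1)
    results[i] = i * 2

results = {}
threads = [threading.Thread(target=fetch, args=(results, i)) for i in range(5)]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The five 0.1-second waits overlap, so elapsed is roughly 0.1 s,
# far less than the 0.5 s a sequential loop would take.
print(sorted(results.items()), elapsed)
```

Note that if `fetch` did pure computation instead of sleeping, the threads would take turns holding the GIL and you would see no speedup at all.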

Even then, there are multiple ways to execute code on one CPU. Python's built-in threading library allows the operating system to stop threads and switch between them; the code itself doesn't need to do anything. The problem with threading, in general, is that the OS can stop a thread at any moment, even in the middle of writing or computing data, so you should be extremely careful when sharing any data between threads and not use it anywhere until all of the computations are complete. The problem of safely sharing data is often referred to as thread safety.
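To illustrate, here is a hedged sketch of the classic shared-counter problem. The `counter += 1` statement is actually a read, an addition, and a write; without a lock, the OS could switch threads between those steps and updates would be lost. The `threading.Lock` makes the whole update atomic:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, another thread could interleave its own
        # read-modify-write with this one, silently losing updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, guaranteed, because the lock serializes the updates
```

Remove the `with lock:` block and the final count may come out lower than 400,000, depending on how the OS happens to schedule the threads.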

Another built-in library, asyncio (one that enables asynchronous functions, which we touched on in Chapter 18, Serving Models with a REST API), works slightly differently: tasks declare that they are done or blocked, at which point another task will start (or continue) running. Thus, a task cannot be switched away from in the middle of an operation; switching happens only when the task itself allows it.
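A small sketch makes the cooperative nature of this visible (the task names and delays are arbitrary). Each `await` is the explicit point where a task declares it is blocked and lets others run; between awaits, it cannot be interrupted:

```python
import asyncio

order = []

async def task(name, delay):
    order.append(f"{name} started")
    # 'await' is where this task yields control to the event loop;
    # the code between awaits always runs without interruption.
    await asyncio.sleep(delay)
    order.append(f"{name} finished")

async def main():
    # gather schedules both tasks on the same event loop
    await asyncio.gather(task("a", 0.02), task("b", 0.01))

asyncio.run(main())
print(order)
```

Both tasks start in the order they were scheduled, but "b" finishes first because its sleep is shorter; the switches happen only at the `await` points, never in the middle of the `append` calls.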

However, you can run parts of your code truly in parallel (this feature is usually called parallelism). There are two ways to do this. First, you can leverage the other CPU cores on your machine; most modern computers have at least two or four. To do that, you can use the built-in multiprocessing library, or any code/library built on top of it (for example, Dask can run on multiple cores of one machine). While this approach allows you to actually run code in parallel, it has a large overhead of copying data and orchestrating the processes. Because of that fixed cost, multiprocessing generally does not make sense, except for computationally heavy operations.
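As a minimal sketch of this approach (the `heavy` function and its inputs are made up for illustration), a `multiprocessing.Pool` distributes a CPU-bound function across worker processes, each with its own interpreter and its own GIL:

```python
import multiprocessing
import os

def heavy(n):
    # A CPU-bound function: under the GIL, threads could not run this
    # in parallel, but separate processes can.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each worker is a separate Python process; arguments and results
    # are pickled and copied between processes, which is the overhead
    # mentioned above.
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        results = pool.map(heavy, [100, 200, 300])
    print(results)
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn rather than fork, workers re-import the module, and without the guard they would recursively try to create pools of their own.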

Lastly, yet another option is to run code simultaneously on many machines. Even a few years ago, this option was rarely feasible for ordinary developers, but with modern cloud-based infrastructure and software tools such as Kubernetes (which we'll discuss later), it has become accessible and relatively cheap. There is no built-in library for that, but frameworks such as Dask and PySpark can help. Running on multiple machines has the same issues as multiprocessing, to the power of ten: deploying machines, loading data, orchestrating tasks, and then pulling the results together is a huge overhead! But, for better or for worse, there is simply no alternative for huge computations on datasets that don't fit into one machine's memory. The good news is that, once a cluster is running, you can easily add more machines when needed; there is virtually no limit (except for the price, of course).
