Parallel programming is ubiquitous in many applications in science and engineering, such as aerodynamics, weather and ocean prediction, and machine learning. Parallel programming lets you distribute work between many CPUs, allowing the program to finish sooner. Distributing the work also reduces the amount of memory needed by the program, so parallelism allows running large programs that otherwise wouldn’t fit into the memory of a single computer. Fortran is natively parallel, which means that the syntax used to express parallel programs is built into the language itself.
In chapter 7, your first foray into parallel Fortran programming was through coarrays. They allowed you to distribute the work among multiple CPUs, exchange data between them, and perform the computations faster. In this chapter, we’ll take it a step further and explore three new parallel concepts: teams, events, and collectives. We’ll use these new features toward the final implementation of the tsunami simulator that we’ve been developing in this book.
Teams and events provide advanced means for controlling program flow and synchronization. Collectives allow you to implement common parallel patterns across images without directly invoking coarrays. At the end of the chapter, you’ll walk away with the working knowledge to implement advanced parallel patterns in Fortran from scratch, or use them to augment an existing Fortran application. Together, images, coarrays, teams, events, and collectives provide a comprehensive toolbox to express any parallel algorithm that you can think of. This chapter will show you how.
Chapter 7 introduced the parallel programming concepts in Fortran, including images, synchronization, and data exchange using coarrays. I strongly recommend that you read that chapter before starting this one. Nevertheless, let’s refresh our memory on these concepts before we build further on them.
Fortran refers to any parallel process as an image. Under the hood, an image can be a process running on a dedicated CPU core or a thread implemented by the operating system. A parallel Fortran program runs on all images, and each image loads its own copy of the program in RAM. The built-in functions this_image
and num_images
are available. The former returns the number of the current image, and the latter returns the total number of images that are running the program. Each image runs the program independently from all other images until they’re synchronized using the sync
all
statement. These concepts allow us to inquire about images and synchronize them. However, they don’t help us regarding exchanging data between images. To do this, Fortran has a special data structure called a coarray. A coarray can be coindexed to access data on remote images--we can copy data to and from other images by indexing a coarray with the target image number.
Teams, events, and collectives build directly on these concepts. Teams let you separate groups of images by different roles, while events make communicating status updates between teams (or just images) simple. Consider a weather prediction model, for example. The simulation can’t start without the initial data coming in, and the team that writes data to disk needs to wait for the simulation team to finish their part of the job. Posting and waiting for events from different teams is how we can synchronize them. Finally, collectives will allow you to perform common parallel calculations, such as sum, minimum, or maximum, without directly invoking coarrays.
As we work on implementing these features in the tsunami simulator, we’ll focus mainly on monitoring the time stepping progress of the simulation and extracting some useful statistics about the simulated water height field. Although a real-world application is likely to employ teams, events, and collectives for more complex tasks, such as downloading and processing remote data, writing model output to disk, and serving data to clients, focusing on a simple and minimal task will help us learn and better understand in detail how these features work.
Fortran 2018 introduced teams to allow the programmer to assign different tasks to groups of images. For example, if you’re computing a weather simulation on 16 images, you could assign them different roles (figure 12.1).
In this specific example, the images are distributed in the following setup:
One image queries a remote server and downloads satellite data when available.
Another is in charge of monitoring the progress of the simulation and logging appropriate information to a text file.
Two images are responsible for writing simulation output files to disk.
The remaining 12 images are churning away with the heavy task of simulation, without getting distracted by other chores.
Let’s apply a subset of this pattern to the tsunami simulator we’ve been developing.
In this section, we’ll use teams to augment our tsunami simulator and assign different roles to parallel images working concurrently. For brevity and to not get bogged down in the details of what the specific roles could be in real-world simulation software, we’ll create only two teams: the compute team and the logging team. While the compute team is churning away at the heavy task of number-crunching, the logging team will monitor and report the progress of the simulator. Logging is a relatively lightweight task, so we’ll assign only one image to the logging team, and the rest will go to the compute team. Thus, if we run the program on four parallel images, one will be logging progress, while the remaining three will be crunching numbers. This is a simplified variant of the approach illustrated in figure 12.1.
The updated tsunami program that uses teams will look as shown in listing 12.1. This listing shows only the added code relative to where we left off with the tsunami simulator in chapter 10. Don’t worry about coding this up just yet; here, I’m merely giving you an overview of what’s coming later in the chapter.
program tsunami use iso_fortran_env, only: team_type ❶ ... type(team_type) :: new_team ❷ integer :: team_num ❸ ... team_num = 1 ❹ if (this_image() == 1) team_num = 2 ❺ form team(team_num, new_team) ❻ change team(new_team) ❼ if (team_num == 1) then ... ❽ else if (team_num == 2) then ... ❾ end if end team ❿ end program tsunami
❶ Imports team_type from the iso_fortran_env module
❷ Declares a new team_type instance
❸ Team number variable that we’ll use to identify sibling teams
❹ All images will go to team 1 by default.
❺ Only the first image will go to team 2.
❼ Changes the current team for each image
❽ The original simulator code is assigned to team 1.
❾ The logging code for team 2 goes here.
❿ Closes the change team construct
This listing summarizes the concepts of forming new teams and switching the execution context between them. First, a team is modeled using a new built-in type, team_
type
, available from the the iso_fortran_env
module. To begin working with teams, we import team_type
and declare an instance of it, in this case new_team
. We also need a unique integer scalar to refer to different teams by their number, in this case team_num
. This variable is used to assign images to different teams. In the form
team
statement in this example, we assign all images to team 1, except the first one, which we assign to team 2. The form
team
statement only creates new teams; it doesn’t affect the execution.
This is where the change
team
construct comes in--it instructs all images that execute it to switch to a new team--in this case, new_team
. Note that change
team
is a construct, like an if
block or a do
loop, and is paired with a matching end
team
statement.
Within the change
team
construct, the images are now running in their new teams. We can assign code to be executed to each team by checking the value of the team number. Teams will work on different tasks, and will also need to synchronize and exchange data from time to time.
Figure 12.2 illustrates this process, albeit with a bit different team organization.
The key concepts introduced here are forming new teams (form
team
statement) and changing the current team (change
team
construct). The form
team
statement creates new teams and encodes the information about which image on the current team will belong to which new team. The change
team
construct moves images to the newly created teams. Within the change
team
construct, the images have new image numbers assigned to them. Teams can work independently from one another, synchronize, and even exchange data.
You may also wonder why we need separate statements for forming and changing teams. We need them because these two operations are fundamentally different in nature: form
team
instructs the compiler to define new teams and assign images to them, analogous to defining a new function; change
team
, on the other hand, switches the execution context between already created teams, which is analogous to calling a function. Don’t worry if this seems like a lot and not everything is clear yet. We’ll go over each element in detail as we work through this section.
Note In case you’re familiar with MPI programming (discussed earlier in the book), whether in C or Fortran, teams are analogous to MPI communicators.
Before we dive into the implementation of teams in the tsunami simulator, let’s look at the syntax of forming new teams, which will apply to any parallel program that uses them. In the beginning of the program, there’s only one team, and we’ll refer to it as the initial team. All images that run the program start in the initial team by default. If you intend to work with teams at all, the first thing you’ll do is form new teams within the initial team using the form
team
statement. You can make as many new teams as you want. In this example, we’ll create two new teams--one for the first half of all images and the other for the rest--as shown in the following listing.
program form_team use iso_fortran_env, only: team_type ❶ implicit none type(team_type) :: new_team ❷ integer :: team_num ❸ team_num = 1 ❹ if (this_image() > num_images() / 2) team_num = 2 ❺ form team(team_num, new_team) ❻ end program form_team
❶ Imports team_type from the iso_fortran_env module
❷ Declares a new team_type instance
❸ Team number variable that we’ll use to identify sibling teams
❹ All images will go to team 1 by default.
Besides the basic housekeeping, like importing the team_type
from the iso_fortran _env
module and declaring the team and team number variables, there are two key elements here. First, we decide how many new teams to create and which images will go to each team. We do this by assigning values to the integer variable team_num
on every image. Second, we execute the form
team
statement, which creates new teams and internally assigns the images to them. If you compile and run this program, there will be no output. This is expected, as a form
team
statement on its own doesn’t emit any output.
A form_team
statement must be executed by all images on the current team. The first form
team
statement in the program is thus always executed by all images in the program. This statement also synchronizes all the images on the team, implying a sync
all
under the hood. (See chapter 8 for a refresher on synchronizing images.) This is the syntax of the form
team
statement:
form team(team_num, team_variable[, new_index, stat, errmsg])
team_num
is a positive, scalar, integer constant or expression that uniquely identifies the team to be created.
new_index
is a scalar integer that allows you to specify an image number that this image will have on the new team.
stat
is the integer status code, with a zero value in case of success and nonzero otherwise
errmsg
is the character string with an informative error message, if stat
returns a nonzero value.
team_num
and team_variable
are required input parameters. The value of team_num
across images determines how many teams will be created with the form
team
statement and which image will belong to which new team. If multiple new teams are created, their numbers don’t need to be contiguous, but they need to be unique positive integers. new_index
is an optional input parameter that you can use to specify the number of the image on the new team, which is otherwise compiler-dependent. If provided, the values of new_index
must be unique and less than or equal to the number of images being assigned to the new team. stat
and errmsg
, both optional output parameters, have the same meaning and behavior as they do in the allocate
and deallocate
statements in chapter 5. As you’ll see throughout the remainder of this chapter, all parallel features introduced by Fortran 2018 have error handling built in.
Now that we have two new teams, how do we instruct images on each team to do certain kinds of work? Recall that by default, all images start on the same team. We need to switch each image to a new team to get it to work on a different task. We do this with the change
team
construct. Following the form_team
statement in listing 12.2, we’ll add this snippet:
change team(new_team) ❶ print *, 'Image', this_image(), 'of', num_images(), & ❷ 'is on team', team_number() ❷ end team ❸
❶ Switches execution to a new team
❷ Reports the image and team number
Now you understand why change
team
is a construct--every change
team
statement must be paired with a matching end
team
statement. A change
team
statement instructs all images that execute it to switch to the team specified in parentheses. The code inside the change
team
construct executes on the new (child) team until the end
team
statement, when the images return to the original (parent) team. Similar to this_image
, which returns the image number, team_number
returns a scalar integer value of the current team.
Let’s save this program in a file called change_team.f 90, compile it, and run it on five images:
caf change_team.f90 -o change_team ❶ cafrun -n 5 --oversubscribe ./change_team ❷ Image 1 of 2 is on team 1 ❸ Image 2 of 2 is on team 1 ❸ Image 1 of 3 is on team 2 ❸ Image 2 of 3 is on team 2 ❸ Image 3 of 3 is on team 2 ❸
❶ Compiles the program using the OpenCoarrays compiler wrapper
❷ Runs the program on five parallel processes
What’s happening here? Each image prints three numbers to the screen: its own image number (this_image()
), the total number of images (num_images()
), and its team number (team_number()
). Let’s look at the values in reverse, from right to left. First, we see that there are two images on team 1 and three on team 2. This is what we expected, as we instructed form
team
to first assign two images (out of five total) to team 1, and the rest to team 2. So far, so good. Second, notice that two of the images report a total number of images of 2, and the remaining three report a total number of images of 3. This means that when executed within the change
team
construct, num_images()
now doesn’t represent the total number of images running the whole program, but the total number of images within the current team. Finally, looking at the current image number, it seems that our original images 3, 4, and 5 now have numbers 1, 2, and 3 on their new team. Conclusion: when executed within the change
team
construct, functions this_image
and num_images
operate in the context of the current team.
Note that the Fortran Standard doesn’t prescribe what the new image numbers on the newly formed teams will be, and leaves the numbering of images on new teams as implementation- (compiler-) dependent. If you need to ensure specific image indices on new teams (or preserve the ones from the initial team), use the new_index
argument in the form
team
statement, described in the previous subsection.
The syntax for the change
team
construct is
[name:] change team(team_value[, stat, errmsg]) ❶ ... ❷ end team [name] ❸
❶ Switches all images to the new team with team_value
❷ All code here is executed in the context of the new team.
❸ Synchronizes images and returns them to the original team
team_value
is an input scalar variable or expression of type team_type
.
stat
and errmsg
are optional output parameters that have the same meaning as in the form
team
statement.
name
is an optional label for the construct, much like a labeled do
loop.
At the beginning of a change
team
construct, all images that execute it switch to the team provided in parentheses. Inside the construct, these images execute within the new team. When they reach the end
team
statement, the images automatically synchronize and return to the original (parent) team that they were on immediately before the change
team
statement.
We’ve learned so far, both from coarrays in chapter 7 and from developing the parallel tsunami simulator, that synchronizing images is crucial for writing correct parallel programs. Recall that when we have data dependency between parallel images, one image must wait for data from another image before proceeding with its own calculation. This subsection explains how synchronization of images works within teams, and how to synchronize multiple teams as a whole.
The essential synchronization mechanism you learned in chapter 7 was the sync
all
statement, which placed a barrier in the code at which every image had to wait for all others before proceeding. At the point of a sync
all
statement, we considered all images to be synchronized. Another option that’s available to us, when we need to synchronize the current image with some but not all other images, is the sync
images
statement. For example, we used sync
all
in the sync_edges
method of the Field
type in the tsunami simulator (see section 10.4) to synchronize every image with all other images. Using sync
images
, we can instead synchronize each image only with its four neighbors, in mod_field.f 90, subroutine sync_edges
:
... sync images(set(neighbors)) ❶ edge(1:je-js+1,1)[neighbors(1)] = self % data(is,js:je) ❷ edge(1:je-js+1,2)[neighbors(2)] = self % data(ie,js:je) ❷ edge(1:ie-is+1,3)[neighbors(3)] = self % data(is:ie,js) ❷ edge(1:ie-is+1,4)[neighbors(4)] = self % data(is:ie,je) ❷ sync images(set(neighbors)) ❸ self % data(is-1,js:je) = edge(1:je-js+1,2) ❹ self % data(ie+1,js:je) = edge(1:je-js+1,1) ❹ self % data(is:ie,js-1) = edge(1:ie-is+1,4) ❹ self % data(is:ie,je+1) = edge(1:ie-is+1,3) ❹ ...
❶ Synchronizes with neighbors before copy into buffer
❷ Copies data into the coarray buffer, edge
❸ Synchronizes with neighbors again before copying out of buffer
❹ Copies data from coarray buffer into the field array
The same behavior holds in the context of teams: sync
all
and sync
images
statements now operate within the team in which they’re executed. For example, if you have two teams and you’ve switched the images to them using the change
team
construct, issuing sync
all
synchronizes the images within each team, but not the teams themselves. Ditto for sync
images
. Although this may be confusing at first, you’ll get used to it over time as you practice working with teams. Just remember: sync
all
and sync
images
statements always operate only within the current team and can’t affect the images outside of the team. In the next subsection, you’ll see how you can synchronize between teams.
In the sync
images
snippet, set(neighbors)
ensures that we pass unique values of neighbors
to sync
images
. We’ll define set
in the same module in mod_field.f 90, as shown in the following listing.
pure recursive function set(a) result(res) ❶ integer, intent(in) :: a(:) integer, allocatable :: res(:) if (size(a) > 1) then res = [a(1), set(pack(a(2:), .not. a(2:) == a(1)))] ❷ else res = a end if end function set
❶ The recursive attribute allows a function to call itself.
❷ Eliminates nonunique elements from the array, one at a time
This is the first time we encounter the recursive
attribute. This attribute allows a function or subroutine to invoke itself. The crux of this function is in the fifth line of the listing, where we recursively reduce the array by removing duplicate elements, one by one, using the built-in function pack
. For a refresher on pack
, see section 5.4, where we used it for the first time. Note that Fortran 2018--the latest iteration of the language as of this writing--makes all procedures recursive by default, so specifying the recursive
attribute won’t be necessary anymore. I still include it here because most Fortran compilers have yet to catch up with this recent development.
Having established that sync
all
and sync
image
statements operate only within the current team and can’t affect the images outside of it, we need a mechanism to synchronize between the teams. Back to our working tsunami example from listing 12.1, where we began incorporating teams for the simulation and logging tasks:
change team(new_team) if (team_num == 1) then ... ❶ else if (team_num == 2) then ... ❷ end if end team
As logging depends on the data from the simulation team, we need a way to synchronize images between different teams. This is where the new sync
team
statement comes in, as shown in the following listing.
use iso_fortran_env, only: initial_team, team_type ❶ ... change team(new_team) if (team_num == 1) then ... ❷ sync team(get_team(initial_team)) ❸ else if (team_num == 2) then sync team(get_team(initial_team)) ❸ ... ❹ end if end team
❶ Imports the initial_team constant from the module
❸ Synchronizes with all images that belong to the initial team
sync
team
has been introduced to the language to allow synchronizing images within the parent team without leaving the change
team
construct. To use it, we need to provide it a team value over which to synchronize. In practice, this will typically be a parent team or some other ancestor team (see the “Exercise 1” sidebar for an example of multiple levels of teams), but can also be the current team or the child team. To refer to a team such as the initial team, which we never defined as a variable, we use the get _team
built-in function, and pass it the initial_team
constant available from the iso_fortran_env
module. Besides the initial_team
integer constant, iso_fortran _env
also provides the parent_team
and current_team
constants.
For brevity, we won’t get bogged down with the exact code that the logging team will execute. In practice, it could be monitoring the time stepping progress of the simulation team, checking and processing files written to disk, printing simulation statistics to the screen, and perhaps even serving them as a web server. An important element to most of these activities is getting the data from the simulation team.
I mentioned in the previous subsection that one of the activities the logging team could be performing is monitoring the time stepping of the simulation team. If they’re operating independently and concurrently, how can the logging team know each time the simulation team steps forward? To demonstrate the exchange of data between teams, let’s send the time step count from the simulation team to the logging team. To do this, we’ll make our time step count variable a coarray, and we’ll use the team number in the image selector when referencing that coarray, as shown in the following listing.
integer(ik) :: time_step_count[*] ❶ ... change team(new_team) if (team_num == 1) then ... time_loop: do n = 1, num_time_steps ... time_step_count[1, team_number=2] = n ❷ end do time_loop else if (team_num == 2) then n = 0 time_step_count = 0 do ❸ if (time_step_count > n) then ❹ n = time_step_count print *, 'tsunami logger: step ', n, 'of', num_time_steps, 'done' if (n == num_time_steps) exit ❺ end if end do end if end team
❶ Declares time step count as a coarray
❷ Copies n into time_step_count on image 1 of team 2
❹ Runs this code if time_step_count has been updated
❺ Leaves the loop if we’ve reached the end
In listing 12.5, we’ve declared the time_step_count
integer coarray, which we’ll use to exchange the time step count between the simulation team and the logging team. To send the data, we’ll use the usual coarray indexing syntax from chapter 7, with a twist: here, we also specify the team number in the image selector (the values between square brackets). When we write time_step_count[1,
team_number=2]
=
n
, we’re saying “Copy the value of n
into the time_step_count
variable on image 1 of team number 2.” This means that the image number is relative to the team in question--image 1 on team 1 is different from image 1 on team 2. On the logging team, we initialize the local value of time_step_count
to zero, loop indefinitely, and check for its value in each iteration. Every time time_step_count
is incremented by the simulation team, we print its value to the screen.
While this is a somewhat trivial example--printing a single integer to the screen is not that much work--it illustrates how to effectively offload heavy compute work to other teams. In a real-world app, while the simulation team is busy crunching numbers, one team could be writing the output files to disk, while another could be serving them as a web server. The results of the tsunami simulator won’t change with introduction of teams into the code, because they affect only how the code and its order of execution are organized. The simulation part of the code, which is responsible for producing numerical results, is now running in its dedicated team rather than on all images. While teams don’t necessarily unlock any new capability relative to original image control and synchronization mechanisms, they allow you to more cleanly express distribution of work among images. This becomes especially important for larger, more complex apps.
In the previous section, we used teams to distribute work among groups of images. Teams allow us to express some parallel patterns and synchronization more elegantly than we otherwise could by controlling individual images directly. Fortran 2018 introduces another new parallel concept called events, provided through the built-in derived type called event_type
. In a nutshell, you can post events from one or more images, and query or wait for those events from others. Figure 12.3 illustrates how events are implemented in Fortran.
You can read this diagram in any order. An alert event is an instance of event_ type
. Image 1 triggers the alert on image 2 by issuing event
post(alert[2])
. This statement is nonblocking, which means that image 1 can immediately move on with whatever code follows. All instances of event_type
keep a count of posted events internally. This count is incremented on every event
post
statement, from any image. Image 2 issues event
wait(alert)
. This is a blocking statement, which means that image 2 will wait until the alert is posted. When it finally happens, event
wait
decrements the internal event count. Alternatively, image 2 can also poll the number of alerts in a nonblocking fashion with the built-in subroutine event_query
.
That’s all there is to it! Let’s first tinker with posting and waiting for events in an example of sending a notification, and then we’ll dive into the syntax and rules of events.
In this section, we’ll build from our tsunami teams example and use events to post updates from the simulation team to the logging team about data being written to disk. While this is technically doable with coarrays alone, you’ll see that events are a perfect candidate for such parallel patterns. Before we jump back into the tsunami, let’s see how events work from a simple push notification example.
Sending a notification from one process to another will be important in any scenario in which you have data dependency between processes. Examples include a long-running data mining job by a worker process, waited on by a process whose role is to write a report for the user (see figure 12.1), or waiting for data to become available on a remote server.
This example will demonstrate using events to wait for another image to complete a long-running job. It doesn’t matter what the actual job is--here, we’ll emulate it by making the image wait for five seconds. When the time is up, the image will send a notification to another image that’s waiting for it. The following listing shows the complete program.
program push_notification use iso_fortran_env, only: event_type ❶ implicit none type(event_type) :: notification[*] ❷ if (num_images() /= 2) error stop & ❸ 'This program must be run on 2 images' ❸ if (this_image() == 1) then print *, 'Image', this_image(), 'working a long job' call execute_command_line('sleep 5') ❹ print *, 'Image', this_image(), 'done and notifying image 2' event post(notification[2]) ❺ else print *, 'Image', this_image(), 'waiting for image 1' event wait(notification) ❻ print *, 'Image', this_image(), 'notified from image 1' end if end program push_notification
❶ Imports event_type from the built-in module
❷ Declares an instance of event_type as a coarray
❸ Requires running on two images
❹ Simulates a long job by waiting for five seconds
❺ Posts the event to notification on image 2
❻ On image 2, waits for notification
First, we import event_type
and declare a coarray instance of it. Like team_type
, event_type
is also provided by the iso_fortran_env
module. An event variable must either be declared as a coarray or be a component of a coarray derived type. Then, from image 1, we post the event by executing the event
post
statement on the notification
variable, with image 2 as the target. This increments the event count in the notification
variable, which can now be queried or waited for on image 2. On the other side, image 2 issues the matching event
wait
statement. This statement blocks the execution on image 2 until image 1 has posted the event.
If you compile and run this program, you’ll get
caf push_notification.f90 -o push_notification ❶ cafrun -n 2 ./push_notification ❷ Image 1 working a long job Image 2 waiting for image 1 Image 1 done and notifying image 2 Image 2 notified from image 1
❶ Compiles the program using the OpenCoarrays compiler
❷ Runs the program on two images
Notice the order of printed lines in the output. The sequence of operations is set by the event
post
and event
wait
statements. Because image 1 is working on a long job (here emulated by sleeping for five seconds), image 2 will announce that it’s waiting for image 1 before it receives the notification and will print that the message was received only after event
wait
has executed. The following two subsections describe the general syntax of event
post
and event
wait
statements.
The first step to any work with events is to post them using the event
post
statement, which takes the general form
event post(event_var[, stat, errmsg])
where event_var
is a variable of event_type
, and stat
and errmsg
have the same meaning as they do in the form
team
and change
team
statements.
While not strictly required by the language, you’d always want to post to an event variable on another image by coindexing it (indexing a coarray); for example
type(event_type) :: notification[*]
event post(notification[this_image() + 1]) ❶
❶ Posts a notification to the next image
You can post to an event variable as many times and as frequently as you want, with or without matching event
wait
statements. Every time you do, an internal event count for that event variable is incremented. You can also post to an event from more than one image. You’ll see soon how this mechanism can be used to make multiple event posts and wait for them only on some occasions.
Images posting events is just one side of the transaction. For an image to wait for the event that it owns, it needs to execute the event
wait
statement. This statement has the syntax
event wait(event_var[, until_count, stat, errmsg)
event_var
is a scalar variable of event_type
and has the same meaning as in event
post
.
until_count
is an optional integer expression that’s the number of posted events for which to wait, with a default value of 1.
stat
and errmsg
are optional output parameters for error handling and have the same meaning as before.
In a nutshell, event
wait
blocks the image that executes it until some other image posts an event to it. If until_count
is provided and greater than 1, the image will wait until that many events have been posted. On successful execution of event
wait
, the internal event count is decremented by until_count
, if provided, and by 1 otherwise. For example, this statement
event wait(notification, until_count=100)
blocks the executing image until 100 events have been posted to the notification
variable from any other image. Once executed, the internal event count is decremented by exactly 100. Note that this doesn’t mean that the event count is always reduced to zero, because remote event posts can keep incrementing the event count before event
wait
has time to return.
Using event
wait
together with the until_count
parameter allows you to not block on every posted event, but only on some number of events. However, it also illustrates a restriction to event
wait
: it’s impossible for the image that listens for events to know how many have been posted without explicitly blocking execution with event
wait
. This is indeed rather limiting. To poll events without blocking the current image, Fortran provides a built-in subroutine event_query
, which we’ll explore in the next subsection.
As you work with events, you’ll soon find it useful to query an event variable to find out how many times an event has been posted. The built-in subroutine event_query
does exactly this
call event_query(event_var, count[, stat])
where event_var
is the input variable of type event_type
, and count
is the output integer number of events posted. Unlike the event
wait
statement, calling event _query
doesn’t block execution but simply returns the count of posted events. event_query
is a read-only operation, so it doesn’t decrement the event count like event
wait
does. This makes it more suitable for implementation of nonblocking parallel algorithms, as you’ll find out in the “Exercise 2” sidebar.
In chapter 7, you learned how to use coarrays and their square bracket syntax to exchange values between parallel images. This mechanism for data exchange is simple and to the point--you as the programmer explicitly instruct the computer to send and receive data between images. For common calculations across many images, such as a global sum or maximum and minimum values of distributed arrays, implementing such parallel algorithms using coarrays directly can be tedious and prone to errors. Fortran 2018 introduced collective subroutines to perform common parallel operations on distributed data.
Take, for example, a climate model that predicts the air temperature over the globe far into the future. As a climate scientist or a policy maker, you’d be interested in finding out what the global minimum, maximum, and average value of air temperature or mean sea level was over time. However, if the climate model was running in parallel (almost all of them are!), calculating the global temperature statistics would not be trivial, because every CPU would have the data only for the region that it was computing for. In the simplest implementation, you’d have to do the following:
Calculate minimum, maximum, and average values on each CPU for its region.
Calculate the global statistics on one CPU based on arrays of regional statistics.
We went through this exercise with a simple dataset back in chapter 7 when we were first introduced to coarrays. Now, collective subroutines (I’ll refer to them as collectives) can do some of the heavy lifting for you.
Let’s try this out in the tsunami simulator. In our working version of the simulator so far, for every time step, we were reporting the time step count to the screen, while the program was writing raw data into files in the background:
program tsunami ... ❶ time_loop: do n = 1, num_time_steps ❷ if (this_image() == 1) & ❸ print *, 'Computing time step', & ❸ n, '/', num_time_steps ❸ ... ❹ end do time_loop ❺ end program tsunami
❶ Initialization part of the program
❷ Iterates through simulation time steps
❸ If image 1, reports time step count to the screen
❹ The simulation calculation goes here.
❺ End of the simulation time loop
At the beginning of each time step, we print the current time step count and the total number of time steps to the screen. We do this only from one image to avoid printing the same message from all images. Let’s augment this short report by adding the minimum, maximum, and average water height value to each print statement. Like in the thought experiment of a parallel climate model, the water height values here are also distributed across parallel images. The following listing shows how we’d calculate global minimum and maximum values using standard collectives co_min
and co_max
, respectively.
... real(ik) :: hmin, hmax ❶ ... time_loop: do n = 1, num_time_steps ... hmin = minval(h % data) ❷ call co_min(hmin, 1) ❸ hmax = maxval(h % data) ❹ call co_max(hmax, 1) ❺ if (this_image() == 1) print '(a, i5, 2(f10.6))', & ❻ 'step, min(h), max(h):', n, hmin, hmax ❻ end do time_loop
❶ Declares temporary variables
❷ Calculates the local minimum on each image
❸ Calculates the collective minimum from hmin on each image and stores it into hmin on image 1
❹ Calculates the local maximum on each image
❺ Calculates the collective maximum from hmax on each image and stores it into hmax on image 1
❻ Prints the current time step and global minimum and maximum to the screen
To compute the global minimum of water height, we first calculate the local minimum on each image using the minval
function and store it into the temporary variable hmin
. Recall that h
is a type(Field)
instance, so we access the raw values through its component h
%
data
. Second, we use the collective subroutine co_min
to calculate the minimum value of hmin
across all images. The first argument to co_min
is an intent(in
out)
scalar, and the second argument (optional) is the number of the image on which to store the result. In this case, all images invoke co_min
, and only the value of hmin
on image 1 is modified in-place. If the image number were not specified (call
co_min(hmin)
), the value of hmin
would be updated in-place on all images. This implies that invoking the collective subroutine will inevitably overwrite the value of the input on at least one image.
We repeat the same procedure to compute the global maximum using co_max
. Finally, we report the current time step and minimum and maximum values to the screen using a modified print
statement. Here’s the sample output:
step, min(h), max(h): 1 0.000000 1.000000 step, min(h), max(h): 2 0.000000 0.996691 step, min(h), max(h): 3 0.000000 0.990097 ... step, min(h), max(h): 998 -0.072596 0.186842 step, min(h), max(h): 999 -0.072279 0.188818 step, min(h), max(h): 1000 -0.071815 0.190565
This was an introduction to the co_min
and co_max
subroutines by example. In the next section, I’ll describe the rest of the collectives and provide their general syntax.
Note Collective subroutines are built into the language and are available out of the box, just like the regular functions min
, max
, and sum
.
Fortran 2018 defines a total of five collective subroutines:
co_broadcast
--Sends the value of a variable from the current image to all others
co_max
--Computes the maximum value of a variable over all images
co_min
--Computes the minimum value of a variable over all images
co_sum
--Computes the sum of all values of a variable across all images
These cover most collective operations that you’ll likely encounter in your work. However, the language won’t stop you from implementing your own custom collectives using coarrays and synchronization, should you ever need them. The rest of this section describes co_sum
and co_broadcast
in more detail. To learn more about co_reduce
, the most complex collective subroutine, see section 12.7 for reference.
Note If you’re familiar with parallel programming using MPI, Fortran 2018 collective subroutines will look familiar, as they’re analogs to their MPI counterparts.
Figure 12.4 illustrates how co_sum
works when invoked on four images.
In this example, we invoke co_sum(a)
on each image, which triggers a summation of values of a
across all images. The exact data exchange pattern may vary depending on compilers and underlying libraries, but the point is that you can use this built-in subroutine and not worry about explicitly copying data via coarrays and synchronizing images to avoid race conditions. By default, the result of the collective sum is made available on all images, and the value of a
is updated on each image to the global sum value. However, if you need this value on only one image, you can specify it as an argument; for example, call
co_sum(a,
3)
would compute a sum over all images but update the value of a
only on image 3.
The full syntax for invoking co_sum
is
call co_sum(a[, result_image, stat, errmsg])
a
is a variable that has the same type across all images. It doesn’t need to be declared as a coarray. This is an intent(in
out)
argument, so its value may be modified in-place.
result_image
is an optional integer scalar indicating on which image to store the result. If omitted, the result is stored on all images.
stat
and errmsg
, both optional, are scalar integer and character variables, respectively. They have the same meaning as in allocate
and deallocate
statements, and allow for explicit error handling.
As you might guess, co_sum
, co_min
, and co_max
are implemented for numeric types only (integer
, real
, and complex
).
While all images must execute the call to co_broadcast
, the specified image acts as the sender, and all others act as receivers. Figure 12.5 illustrates an example of this functioning.
The inner workings of this procedure, including copying of data and synchronization of images, are implemented by the compiler and underlying libraries, so you and I don’t have to worry about them.
The full syntax for invoking co_broadcast
is similar to co_min
, co_max
, and co_sum
, except that the broadcast variable isn’t limited to numeric data types. A subtle but important point about collective subroutines is that the variables they operate on don’t have to be declared as coarrays. This allows you to write some parallel algorithms without declaring a single coarray. For an example, take a look at the source code of a popular Fortran framework for neural networks and deep learning at https://github.com/modern-fortran/neural-fortran. It implements parallel network training with co_broadcast
and co_sum
, without explicitly declaring any coarrays.
Congratulations, you made it to the end! Having now covered teams, events, and collectives, it’s a wrap. Work through the exercises, make a few parallel toy apps of your own, and you’re off to the races. You should have enough Fortran experience under your belt to start new Fortran programs and libraries, as well as to contribute to other open source projects out there. If you’d like to return to our main example, appendix C provides a recap and complete code of the tsunami simulator. It also offers ideas on where to go from here, as well as tips for learning more about Fortran. The amazing world of parallel Fortran programming is waiting for you.
This section contains solutions to exercises in this chapter. Skip ahead if you haven’t worked through the exercises yet.
Solving this exercise will require creating three new teams at the beginning of the program: hunters, gatherers, and elders. Furthermore, we’ll need to create a yet-to-be-determined number of subteams on each of the hunter and gatherer teams. Let’s tackle the first step first, as shown in the following listing.
program hunters_gatherers use iso_fortran_env, only: team_type implicit none type(team_type) :: new_team integer :: team_num integer, parameter :: elders_team_num = 1 ❶ integer, parameter :: hunters_team_num = 2 ❶ integer, parameter :: gatherers_team_num = 3 ❶ real :: image_fraction ❷ image_fraction = this_image() / real(num_images()) ❷ team_num = elders_team_num ❸ if (image_fraction > 1 / 6.) & ❸ team_num = hunters_team_num ❸ if (image_fraction > 1 / 2.) & ❸ team_num = gatherers_team_num ❸ form team(team_num, new_team) ❹ end program hunters_gatherers
❶ Sets the team numbers as compile-time parameters
❷ Calculates the fraction of the image number relative to the total number of images
❸ Based on the image number fraction, sets the team number for each image
In this part, we’re not doing anything new relative to what we learned in section 12.2, except that we’re creating three new teams instead of two. The image_fraction
variable here is used as a convenience to easily assign 1/6, 1/3, and 1/2 to the elders, hunters, and gatherers, respectively.
Now, let’s change the team to new_team
and print a message from one image on each team, as shown in the following listing.
... change team(new_team) ❶ if (team_number() == elders_team_num) then ❷ if (this_image() == 1) & print *, num_images(), 'elders stayed in the village to rest' else if (team_number() == hunters_team_num) then ❸ if (this_image() == 1) & print *, num_images(), 'hunters went hunting' else if (team_number() == gatherers_team_num) then ❹ if (this_image() == 1) & print *, num_images(), 'gatherers went foraging' end if end team ❺ end program hunters_gatherers
❷ Branch that will be executed by the elders
❸ Branch that will be executed by the hunters
❹ Branch that will be executed by the gatherers
❺ Returns context to the original team
Like we learned in section 12.2, we change the team for all images to new_team
. Depending on the image number, this will be the elder, hunter, or gatherer team. Inside the change
team
construct, we check which team we’re on by comparing the value of team_number
to our compile-time constants for team number. At this point, we only report the activity for each team.
Next, we’ll create subteams from each of the hunter and gatherer teams. Specifically for hunters, we’ll have the following snippet inside the hunters if
branch:
form team ((this_image() - 1) / 3 + 1, hunters) ❶ change team(hunters) ❷ print *, 'Hunter', this_image(), 'in team', & ❸ team_number(), 'hunting for game' ❸ end team ❹
❶ Places hunters in subteams of 3
❷ Changes context to the new subteam
❸ Each image reports from its subteam.
❹ Returns back to the hunters team
The code to create and change to subteams for gatherers is similar to that for hunters:
form team ((this_image() - 1) / 2 + 1, gatherers) change team(gatherers) print *, 'Gatherer', this_image(), 'in team', & team_number(), 'gathering fruits and veggies' end team
Place this code inside the if
branch for the gatherers team, and there you have it. If you now compile this program and run it on, say, 12 images, you’ll get output similar to this:
caf hunters_gatherers.f90 -o hunters_gatherers ❶ cafrun -n 12 --oversubscribe ./hunters_gatherers ❷ 2 elders stayed in the village to rest ❸ 4 hunters went hunting ❸ 6 gatherers went foraging ❸ Hunter 1 in team 2 hunting for game ❹ Hunter 2 in team 1 hunting for game Hunter 1 in team 1 hunting for game Gatherer 1 in team 1 gathering fruits and veggies Hunter 3 in team 1 hunting for game Gatherer 2 in team 3 gathering fruits and veggies Gatherer 1 in team 3 gathering fruits and veggies Gatherer 2 in team 1 gathering fruits and veggies Gatherer 2 in team 2 gathering fruits and veggies Gatherer 1 in team 2 gathering fruits and veggies
❶ Compiles using the OpenCoarrays compiler, caf
❸ Group activity report from each team
❹ Individual activity reports from each villager on their subteams
In this example, I chose only 12 images for brevity, but this example will work with any number of images (well, up to the limit of your computer’s RAM, as each image runs its own copy of the program). Notice that the individual hunter and gatherer activity reports aren’t in order, and they shouldn’t be--all images execute completely asynchronously, except at form
team
and end
team
statements, where they synchronize (and only with images in their own team). For example, in the outer change
team
construct, the elders, hunters, and gatherers teams run in parallel to one another, and this is the beauty of parallel programming in Fortran.
You can run this program on many images on a single-core computer, and it will run like a traditional concurrent program, which in other languages is accomplished with, say, threading or async/await
. You can also run this program (unchanged!) on many distributed-memory servers in parallel, and even on computers around the world.
Let’s start with the simulation team that’s stepping forward through the computation. Recall that in listing 12.5, we used a coarray to copy the time step count from one team to another:
if (this_image() == 1) time_step_count[1, team_number=2] = n
In this snippet, we sent the value of the local time step n
to the time_step_count
coarray on image 1 of team 2. We did that only from one image, as all images on the simulation team have the same value for the time step count. Now, if we’re implementing this using events, this first part is easy. We’ll just declare an event variable and use it in the event
post
statement from team 1 to post an event from the simulation team to the logging team:
type(event_type) :: time_step_event[*] ❶ ... if (this_image() == 1) & event post(time_step_event[1, team_number=2]) ❷
❷ Posts the event from image 1 on the current team to image 1 on team 2
That’s it as far as posting the event from the simulation team goes. Let’s see how we can receive this information from the logging team.
On the logging team, we’ll run in an infinite loop and have an event
wait
statement to block the execution. On each event intercepted, we’ll increment the counter, print the time step count to the screen, and exit the loop only if we’ve reached the end of the simulation:
... else if (team_num == 2) then n = 0 do ❶ event wait(time_step_event) ❷ n = n + 1 ❸ print *, 'tsunami logger: step ', n, & ❹ 'of', num_time_steps, 'done' ❹ if (n == num_time_steps) exit ❺ end do end if ...
❷ Blocks until the event is posted
❹ Prints the time step count to the screen
❺ Exits if we’ve reached the end of the simulation
The advantage to the approach using event
wait
is that we’re guaranteed to catch every event that’s posted. The downside is that we need to do the counting outselves (n
=
n
+
1
), and that event
wait
is blocking the execution. This is fine if counting time steps is the only thing the logging team needs to do. The event
wait
approach thus makes the logging team tightly coupled to the simulation team. Now let’s take a look at the alternative solution using event_query
.
Here’s the solution to the exercise using event_query
. Rather than blocking execution until each event is posted, we’re simply going to query the event count and print it to the screen if its value changed from the previous iteration:
... else if (team_num == 2) then n = 0 do ❶ call event_query(time_step_event, time_step_count) ❷ if (time_step_count > n) then ❸ n = time_step_count print *, 'tsunami logger: step ', n, & ❹ 'of', num_time_steps, 'done' ❹ end if if (n == num_time_steps) exit ❺ end do end if ...
❷ Blocks until the event is posted
❹ Prints the time step count to the screen
❺ Exits if we’ve reached the end of the simulation
The advantage to this approach is that the counting is handled automatically inside the time_step_event
variable. This approach is also not blocking, unlike the event
wait
approach. If we needed to, we could carry out some other tasks on the logging team, and in each iteration, the event_query
subroutine would return whatever the current value of the time_step_count
was. This approach is thus asynchronous, and some time steps may be skipped if the simulation iterations are faster than the logging.
We’ll begin with our existing code in listing 12.7 that computes the global minimum and maximum of water height:
hmin = minval(h % data) call co_min(hmin, 1) hmax = maxval(h % data) call co_max(hmax, 1) if (this_image() == 1) print '(a, i5, 2(f10.6))', & 'step, min(h), max(h):', n, hmin, hmax
To calculate the global average, we’ll follow the same procedure. However, considering that we don’t have a collective average function available out of the box, we’ll get creative with the collective sum function co_sum
. First, to calculate the local average, we’ll take the sum of the local array and divide it by the total number of elements. Your first instinct may be to do something like this:
hmean = sum(h % data) / size(h % data)
Although this is the correct approach, recall that the data
component of the Field
type is allocated with one extra row and column on each side of the array, to facilitate halo exchange with neighboring images. From the Field
type constructor function in mod_field.f 90
allocate(self % data(self % lb(1)-1:self % ub(1)+1,& ❶ self % lb(2)-1:self % ub(2)+1)) ❶
❶ Allocates the data array with an extra index on each end
Thus, if we were to compute the sum of h
%
data
as a whole, we’d also be including values from the edges of neighbor images, which isn’t what we’re looking for. Instead, we’ll slice the array to go exactly from the lower bound (lb
) to the upper bound (ub
) in each axis:
hmean = sum(h % data(h % lb(1):h % ub(1),h % lb(2):h % ub(2))) & / size(h % data(h % lb(1):h % ub(1),h % lb(2):h % ub(2)))
At this point, hmean
is the local average value of water height on each parallel image. Of course, don’t forget to declare hmean
in the declaration section of the program. Like with the collective minimum and maximum, we now apply co_sum
to hmean
to store the sum on image 1, and divide the result by the total number of images to arrive at the average value:
call co_sum(hmean, 1) ❶ hmean = hmean / num_images() ❷
❶ Computes the collective sum of hmean and stores the result on image 1
❷ Divides hmean by the total number of images to get the average value
Finally, let’s add hmean
to the print
statement and modify the format string accordingly:
if (this_image() == 1) print '(a, i5, 3(f10.6))', & 'step, min(h), max(h), mean(h):', n, hmin, hmax, hmean
If you now recompile and rerun the tsunami simulator, you’ll get output like this:
step, min(h), max(h), mean(h): 1 0.000000 1.000000 0.003888 step, min(h), max(h), mean(h): 2 0.000000 0.996691 0.003888 step, min(h), max(h), mean(h): 3 0.000000 0.990097 0.003888 ... step, min(h), max(h), mean(h): 998 -0.072596 0.186842 0.003888 step, min(h), max(h), mean(h): 999 -0.072279 0.188818 0.003888 step, min(h), max(h), mean(h): 1000 -0.071815 0.190565 0.003888
The rightmost column in the output is our newly added water height average. Its values are constant throughout the simulation, which serves as evidence that our simulator conserves water volume.
Teams, a mechanism to group images by common task:
team_type
--A new type for working with teams, available from the iso_ fortran_env
module
change
team
/end
team
--A construct to switch images to a new team
team_number
--A built-in function to get the current team number
get_team
--A built-in function to get the team variable, current or otherwise
sync
team
--A statement to synchronize images across a common, typically parent team
Events, a mechanism to organize the flow of your parallel programs around discrete events:
Collective subroutines co_broadcast
, co_max
, co_min
, co_reduce
, and co_sum
, which implement some common parallel operations
recursive
--A procedure attribute that allows a procedure to invoke itself
execute_command_line
--A built-in subroutine to run a command from the host operating system
“The new features of Fortran 2018,” by John Reid (PDF download): http:// mng.bz/EdaX
“A parallel Fortran framework for neural networks and deep learning,” by Milan Curcic: https://arxiv.org/abs/1902.06714
Fortran 2018 introduces new concepts for advanced parallel programming: teams, events, and collectives.
Teams and events are mechanisms for distribution of work and synchronization, whereas collective subroutines are used for parallel reduction operations, such as sum, minimum, and maximum.
Teams are used to form distinct groups of images and assign them different tasks.
At the beginning of the program, all parallel images start in the initial team, and you can create as many teams as you want.
When you switch images to new teams, all teams run independently from one another until explicitly synchronized.
Events allow you to express the flow of your parallel program in a more elegant, and, ahem, event-driven style: post events from one or more images, wait for events from others, or just count them asynchronously.
Collective subroutines allow you to perform some common parallel patterns without directly invoking coarrays.
3.149.214.32