1.11 APPLICATIONS OF PARALLEL COMPUTING

The availability of inexpensive yet really powerful parallel computers is expected to make a hitherto unforeseeable impact on our lives. We are used now to parallel computers helping us access any information through web search engines. In fact, the search progresses as we are typing our search key words. However, there is room for improvement and, more importantly, for innovation, as the following sections illustrate.

1.11.1 Climate Modeling

Climate simulations are used for weather forecasting as well as for predicting global climate changes based on different phenomena or human activities. As Reference 1 points out, the resolution of today’s climate models is 200 km. This is considered low resolution given the fact that some climate systems exist completely within such resolution scale.

Assume a high-resolution model for climate simulation partitions the globe using 3-D cells 1 km in size in each direction. Assume also that the total surface of the earth to be 510 × 106 km2 and the thickness of the atmospheric layer to be approximately 1,000 km. Then, we need to simulate approximately 5 × 1011 weather cells. Assume further that each cell needs to do 200 floating point operations for each iteration of the simulation. Thus, we have to perform a total of 1014 floating point operations per iteration.

Let us now assume that we need to run the simulation 106 times to simulate the climate over some long duration of the weather cycle. Thus, we have the following performance requirements for our computing system:

(1.32) c01e032

A computer operating at a rate of 109 floating point operations per second (FLOPS) would complete the operations in 1011 seconds, which comes to about 31 centuries. Assuming that all these simulations should be completed in one workday, then our system should operate at a rate of approximately 2.8 × 1015 FLOPS. It is obvious that such performance cannot be attained by any single-processor computer. We must divide this computational task among many processors. Modeling the atmosphere using a mesh or a grid of nodes lends itself to computational parallelization since calculations performed by each node depend only on its immediate six neighboring nodes. Distributing the calculations among several processors is relatively simple, but care must be given to the exchange of data among the processors. Table 1.2 compares building a parallel processor system needed to give us a performance of 2.8 × 1015 FLOPS. We assume using desktop microprocessors versus using a simple embedded microprocessor [1].

Table 1.2 Parallel Multicore Computer Implementation Using Two Types of Microprocessors Needed to Perform 2.8 × 1015 FLOPS

c01t02323tm

The power advantage of using low-power, low-performance processors is obvious from the table. Of course, we need to figure out how to interconnect such a huge system irrespective of the type of processor used. The interconnection network becomes a major design issue here since it would be impossible to think of a system that uses buses and single global system clock.

1.11.2 CT

CT and magnetic resonance imaging (MRI) are techniques to obtain a high-resolution map of the internals of the body for medical diagnosis. Figure 1.10 shows a simplified view of a CT system. Figure 1.10a shows the placement of the patient on a gurney at the center of a very strong magnet and a strong X-ray source. The gurney is on a movable table in a direction perpendicular to the page. The X-ray source or emitter is placed at the top and emits a collimated beam that travels to the other side of the circle through the patient. An X-ray detector is placed diametrically opposite to where the X-ray source is. When the machine is in operation, the source/detector pair is rotated as shown in Fig. 1.10b. After completing a complete rotation and storing the detector samples, the table is moved and the process is repeated for a different section or slice of the body. The output of a certain detector at a given time is affected by all the patient tissue that a certain X-ray beam encounters in its passage from the source to the detector. As things stand at the time of writing, the patient needs to be in this position for several minutes if not hours (personal experience).

Figure 1.10 Computerized tomography (CT) system. (a) Setup of X-ray sources and detectors. (b) Schematic of the output of each sensor when a single X-ray source is active.

c01f010

Assume the image we are trying to generate is composed of N × N pixels, where N could be approximately equal to 4,000. Thus, we have approximately 107 pixels to generate per image, or slice, of the body scan. As the table moves, more slices should be generated. This allows for 3-D viewing of the body area of concern. For a system that generates S = 1,000 successive slices, SN2 = 1010 pixels will have to be processed. A slice will require approximately N2 (log2 N)3 calculations [16]. For our case, we need approximately

(1.33) c01e033

Assume we need to generate these images in 1 second to allow for a real-time examination of the patient. In that case, the system should operate at a rate of approximately 1013 FLOPS. For an even more accurate medical diagnosis, high-resolution computerized tomography (HRCT) scans are required even at the nanoscale level where blood vessels need to be examined. Needless to say, parallel processing of massive data will be required for a timely patient treatment.

1.11.3 Computational Fluid Dynamics (CFD)

CFD is a field that is closely tied to parallel computers and parallel algorithms. It is viewed as a cost-effective way to investigate and design systems that involve flow of gas or fluids. Some examples of CFD are:

  • ocean currents,
  • our atmosphere and global weather,
  • blood flow in the arteries,
  • heart deformation during high-G maneuvers of a fighter jet,
  • air flow in the lungs,
  • design of airplane wings and winglets,
  • seat ejection in a fighter jet,
  • combustion of gases inside a car cylinder,
  • jet engine air intake and combustion chamber,
  • shape of a car body to reduce air drag, and
  • spray from nozzles such as paint guns and rocket exhaust.

Typically, the region where the flow of interest is being studied is divided into a grid or mesh of points using the finite element method. The number of grid points depends on the size of the region or the desired resolution. A system of linear equations or a set differential equations is solved at each grid point for the problem unknowns. The number of unknown might be around 103, and each variable might require around 103 floating point operations at each grid point.

The targeted region of the CFD applications ranges from 1012 to 1018 FLOPS [17]. If the computer system operates at a speed of 109 (giga) FLOPS, then CFD applications would complete a simulation in the time period that ranges between 15 minutes and 30 years. On the other hand, a parallel computer system operating at 1012 (tera) FLOPS would complete the application in a time period between 1 second and 12 days. Currently, there are few supercomputer systems that operate at the rate of 1015 (peta) FLOPS. On such a system, the larger problem would take about 3 minutes to complete.

1.12 PROBLEMS

1.1. Assume you are given the task of adding eight numbers together. Draw the DG and the adjacency matrix for each of the following number adding algorithms:

(1) Add the numbers serially, which would take seven steps.

(2) Add the numbers in a binary fashion by adding each adjacent pair of numbers in parallel and then by adding pairs of the results in parallel, and continue this process.

1.2. Derive general expressions for the number of tasks required to do the number adding algorithms in Problem 1.1 when we have N = 2n numbers to be added. What conclusion do you make?

1.3. Now assume that you have a parallel computer that can add the numbers in Problem 1.1. The time required to add a pair of numbers is assumed 1. What would be the time required to perform the two algoritnms for the case N = 2n? How much is the speedup?

1.4. Consider Problem 1.3. Now the parallel computers require a time C to obtain data from memory and to communicate the add results between the add stages. How much speedup is accomplished?

1.5. Which class of algorithms would the fast Fourier transform (FFT) algorithm belong to?

1.6. Which class of algorithms would the quicksort algorithm belong to?

1.7. The binary number multiplication problem in Chapter 2 could be considered as a RIA algorithm. Draw the dependence graph of such an algorithm.

1.8. The binary restoring division algorithm is based on the recurrence equation

c01ue002

where rj is the partial remainder at the jth iteration; qk is the kth quotient bit; and D is the denominator. It is assumed that the number of bits in the quotient is n and qn−1 is the quotient most significant bit (MSB). What type of algorithm is this division algorithm?

1.9. A processor has clock frequency f, and it requires c clock cycles to execute a single instruction. Assume a program contains I instructions. How long will the program take before it completes?

1.10. Repeat Problem 1.9 when a new processor is introduced whose clock frequency is f′ = 2f and c′ = 1.5c.

1.11. Give some examples of serial algorithms.

1.12. Give some examples of parallel algorithms.

1.13. Consider the speedup factor for a fully parallel algorithm when communication overhead is assumed. Comment on speedup for possible values of α.

1.14. Consider the speedup factor for a fully parallel algorithm when communication overhead is assumed. Comment on speedup for possible values of R.

1.15. Write down the speedup formula when communication overhead is included and the algorithm requires interprocessor communications Assume that each task in the parallel algorithm requires communication between a pair of processors. Assume that the processors need to communicate with each other m times to complete the algorithm.

1.16. Consider an SPA with the following specifications:

Number of serial tasks per stageNs
Number of serial tasks per stageNp
Number of stagesn

Now assume that we have a single processor that requires τ to complete a task and it consumes W watts while in operation. We are also given N = Np parallel but very slow processors. Each processor requires rτ to complete a task and consumes W/r watts while in operation, where r > 1 is a performance derating factor.

(1) How long will the single processor need to finish the algorithm?

(2) How much energy will the single processor consume to finish the algorithm?

(3) How long will the multiprocessor need to finish the algorithm?

(4) How much energy will the multiprocessor system consume to finish the algorithm?

(5) Write down a formula for the speedup.

(6) Write down a formula for the energy ratio of the multiprocessor relative to the single processor.

1.17. The algorithm for floating point addition can be summarized as follows:

(1) Compare the exponents and choose the larger exponent.

(2) Right shift the mantissa of the number with the smaller exponent by the amount of exponent difference.

(3) Add the mantissas.

(4) Normalize the results.

Draw a dependence graph of the algorithm and state what type of algorithm this is.

1.18. The algorithm for floating point multiplication can be summarized as follows:

(1) Multiply the mantissas.

(2) Add the two exponents.

(3) Round the multiplication result.

(4) Normalize the result.

Draw a dependence graph of the algorithm and state what type of algorithm this is.

1.19. Discuss the algorithm for synthetic apperture radar (SAR).

1.20. Discuss the Radon transform algorithm in two dimensions.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.244.44