Multicore

Recently, a number of Android devices based on a multicore architecture have come out. For example, the Samsung Galaxy Tab 10.1 and Motorola Xoom tablets both use a dual-core processor (two Cortex-A9 cores). Unlike a single-core processor, a multicore processor can execute multiple threads simultaneously, so a dual-core processor can theoretically do twice as much work as a single-core one (everything else, such as clock frequency, being equal). Although optimizing for multiple cores is not as easy as it sounds, and some caveats exist, your application can definitely leverage the additional power that today's multicore processors bring. Devices with dual-core CPUs include:

  • Samsung Galaxy Tab 10.1
  • Motorola Xoom
  • Motorola Photon 4G
  • Motorola Droid 3
  • HTC EVO 3D
  • LG Optimus 2X
  • Samsung Galaxy Nexus

In many cases, you won't have to worry about how many cores a device has. Delegating certain operations to a separate thread using a Thread object or AsyncTask is usually enough as you can still create multiple threads even on a single-core processor. If the processor has several cores, then threads will simply run on different processor units, which will be transparent to your application.
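
For instance, a minimal sketch of that approach, using a plain Thread object (the code calls recursiveFasterBigInteger from Listing 5–16, shown later in this chapter; the argument value is just an example):

import java.math.BigInteger;

public class BackgroundWork {
    public static void main(String[] args) {
        // the same code runs unchanged on single-core and multicore devices; on a
        // multicore CPU the new thread may simply be scheduled on another core
        new Thread(new Runnable() {
            public void run() {
                BigInteger f = Fibonacci.recursiveFasterBigInteger(10000);
                System.out.println("result has " + f.bitLength() + " bits");
            }
        }).start();
        // the current thread remains free to do other work (e.g., handle UI events)
    }
}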

That being said, there may be times when you really need to get the most out of the CPU to achieve an acceptable level of performance and design algorithms especially tailored for multiple cores.

To achieve the best performance, your application may first need to find out how many cores are available, simply by calling Runtime's availableProcessors() method, as shown in Listing 5–15.

Listing 5–15. Getting the Number of Processors

    // will return 2 on a Galaxy Tab 10.1 or BeBox Dual603, but only 1 on a
    // Nexus S or Logitech Revue
    final int proc = Runtime.getRuntime().availableProcessors();

Typically, the number of “available processors” is 1 or 2, although future products will use quad-core CPUs. Current Android notebooks may already use a quad-core architecture. Depending on when you plan to make your application available, you may want to focus on 1- and 2-core CPUs at first, and only later publish an update that takes advantage of more cores.

NOTE: Do not assume the number of cores is always a power of 2.

Modifying an Algorithm for Multicore

Some of the Fibonacci algorithms presented in Chapter 1 are good candidates to take advantage of multiple cores. Let's start with the divide-and-conquer algorithm whose implementation is shown in Listing 5–16 (which is the same implementation shown in Listing 1–7 in Chapter 1).

Listing 5–16. Fibonacci Divide-and-Conquer Algorithm

import java.math.BigInteger;

public class Fibonacci
{
    public static BigInteger recursiveFasterBigInteger (int n)
    {
        if (n > 1) {
            int m = (n / 2) + (n & 1);

            // two simpler sub-problems
            BigInteger fM = recursiveFasterBigInteger(m);
            BigInteger fM_1 = recursiveFasterBigInteger(m - 1);

            // results are combined to compute the solution to the original problem
            if ((n & 1) == 1) {
                // F(m)^2 + F(m-1)^2
                return fM.pow(2).add(fM_1.pow(2));
            } else {
                // (2*F(m-1) + F(m)) * F(m)
                return fM_1.shiftLeft(1).add(fM).multiply(fM);
            }
        }
        return (n == 0) ? BigInteger.ZERO : BigInteger.ONE;
    }
}

This algorithm does what divide-and-conquer algorithms do:

  • The original problem is divided into simpler sub-problems.
  • The results are then combined to compute the solution to the original problem.
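
The two cases in the code correspond to classic Fibonacci identities; with m = (n / 2) + (n & 1) as in Listing 5–16, they read:

    F(2m-1) = F(m)^2 + F(m-1)^2
    F(2m)   = (2*F(m-1) + F(m)) * F(m)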

Since the two sub-problems are independent, it is possible to execute them in parallel without much synchronization. The Java language defines the ExecutorService interface, which is implemented by several classes you can use to schedule work to be done. An example is shown in Listing 5–17, which uses the factory method from the Executors class to create a thread pool.

Listing 5–17. Using ExecutorService

import java.math.BigInteger;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Fibonacci {
    private static final int proc = Runtime.getRuntime().availableProcessors();
    private static final ExecutorService executorService = Executors.newFixedThreadPool(proc + 2);

    public static BigInteger recursiveFasterBigInteger (int n) {
        // see Listing 5–16 for implementation
    }

    public static BigInteger recursiveFasterBigIntegerAndThreading (int n) {
        int proc = Runtime.getRuntime().availableProcessors();
        if (n < 128 || proc <= 1) {
            return recursiveFasterBigInteger(n);
        }

        final int m = (n / 2) + (n & 1);
        Callable<BigInteger> callable = new Callable<BigInteger>() {
            public BigInteger call() throws Exception {
                return recursiveFasterBigInteger(m);
            }
        };
        // submit first job as early as possible
        Future<BigInteger> ffM = executorService.submit(callable);

        callable = new Callable<BigInteger>() {
            public BigInteger call() throws Exception {
                return recursiveFasterBigInteger(m-1);
            }
        };
        Future<BigInteger> ffM_1 = executorService.submit(callable); // submit second job

        // getting partial results and combining them
        BigInteger fM, fM_1, fN;

        try {
            fM = ffM.get(); // get result of first sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM in current thread
            fM = recursiveFasterBigInteger(m);
        }
        try {
            fM_1 = ffM_1.get(); // get result of second sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM_1 in current thread
            fM_1 = recursiveFasterBigInteger(m-1);
        }

        if ((n & 1) != 0) {
            fN = fM.pow(2).add(fM_1.pow(2));
        } else {
            fN = fM_1.shiftLeft(1).add(fM).multiply(fM);
        }

        return fN;
    }
}
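
One caveat worth noting: Future.get() blocks the calling thread until the result is available, so recursiveFasterBigIntegerAndThreading should itself be called from a background thread, not from the UI thread. A minimal call site could look like this (the argument is just an example value):

// must not run on the UI thread, since the get() calls inside will block
BigInteger f = Fibonacci.recursiveFasterBigIntegerAndThreading(100000);

Also note that the pool is created with proc + 2 threads rather than proc; presumably the extra threads help keep all cores busy when some pool threads are momentarily blocked.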

As you can clearly see, the code is harder to read. Moreover, this implementation is still based on low-performance code: as we saw in Chapter 1, the two sub-problems end up computing many of the same Fibonacci numbers. Better implementations used a cache to remember the Fibonacci numbers already computed, saving significant time. Listing 5–18 shows a very similar implementation, but using a cache.

Listing 5–18. Using ExecutorService and Caches

import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Fibonacci {
    private static final int proc = Runtime.getRuntime().availableProcessors();
    private static final ExecutorService executorService = Executors.newFixedThreadPool(proc + 2);

    private static BigInteger recursiveFasterWithCache (int n, Map<Integer, BigInteger> cache)
    {
        // see Listing 1–11 for implementation (slightly different though, as it was using SparseArray)
    }

    public static BigInteger recursiveFasterWithCache (int n)
    {
        HashMap<Integer, BigInteger> cache = new HashMap<Integer, BigInteger>();
        return recursiveFasterWithCache(n, cache);
    }

    public static BigInteger recursiveFasterWithCacheAndThreading (int n) {
        int proc = Runtime.getRuntime().availableProcessors();
        if (n < 128 || proc <= 1) {
            return recursiveFasterWithCache (n);
        }

        final int m = (n / 2) + (n & 1);
        Callable<BigInteger> callable = new Callable<BigInteger>() {
            public BigInteger call() throws Exception {
                return recursiveFasterWithCache (m);
            }
        };
        Future<BigInteger> ffM = executorService.submit(callable);

        callable = new Callable<BigInteger>() {

            public BigInteger call() throws Exception {
                return recursiveFasterWithCache (m-1);
            }
        };
        Future<BigInteger> ffM_1 = executorService.submit(callable);

        // getting partial results and combining them
        BigInteger fM, fM_1, fN;

        try {
            fM = ffM.get(); // get result of first sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM in current thread
            fM = recursiveFasterWithCache(m);
        }
        try {
            fM_1 = ffM_1.get(); // get result of second sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM_1 in current thread
            fM_1 = recursiveFasterWithCache(m-1);
        }

        if ((n & 1) != 0) {
            fN = fM.pow(2).add(fM_1.pow(2));
        } else {
            fN = fM_1.shiftLeft(1).add(fM).multiply(fM);
        }

        return fN;
    }
}

Using a Concurrent Cache

One thing to notice in this implementation is that each sub-problem uses its own cache object, so duplicate values are still computed. For the two sub-problems to share a single cache, we need to replace the cache with an object that allows concurrent access from different threads, which neither the HashMap used so far nor Chapter 1's SparseArray provides. Listing 5–19 shows such an implementation, using a ConcurrentHashMap object as the cache.

Listing 5–19. Using ExecutorService and a Single Cache

import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Fibonacci {
    private static final int proc = Runtime.getRuntime().availableProcessors();
    private static final ExecutorService executorService = Executors.newFixedThreadPool(proc + 2);

    private static BigInteger recursiveFasterWithCache (int n, Map<Integer, BigInteger> cache)
    {
        // see Listing 1–11 for implementation (slightly different though, as it was using SparseArray)
    }

    public static BigInteger recursiveFasterWithCache (int n)
    {
        HashMap<Integer, BigInteger> cache = new HashMap<Integer, BigInteger>();
        return recursiveFasterWithCache(n, cache);
    }

    public static BigInteger recursiveFasterWithCacheAndThreading (int n) {
        int proc = Runtime.getRuntime().availableProcessors();
        if (n < 128 || proc <= 1) {
            return recursiveFasterWithCache (n);
        }

        final ConcurrentHashMap<Integer, BigInteger> cache =
            new ConcurrentHashMap<Integer, BigInteger>(); // concurrent access ok
        final int m = (n / 2) + (n & 1);

        Callable<BigInteger> callable = new Callable<BigInteger>() {
            public BigInteger call() throws Exception {
                // first and second jobs share the same cache
                return recursiveFasterWithCache (m, cache);
            }
        };
        Future<BigInteger> ffM = executorService.submit(callable);

        callable = new Callable<BigInteger>() {
            public BigInteger call() throws Exception {
                // first and second jobs share the same cache
                return recursiveFasterWithCache (m-1, cache);
            }
        };
        Future<BigInteger> ffM_1 = executorService.submit(callable);

        // getting partial results and combining them
        BigInteger fM, fM_1, fN;

        try {
            fM = ffM.get(); // get result of first sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM in current thread (reusing the shared cache)
            fM = recursiveFasterWithCache(m, cache);
        }
        try {
            fM_1 = ffM_1.get(); // get result of second sub-problem (blocking call)
        } catch (Exception e) {
            // if exception, compute fM_1 in current thread (reusing the shared cache)
            fM_1 = recursiveFasterWithCache(m-1, cache);
        }

        if ((n & 1) != 0) {
            fN = fM.pow(2).add(fM_1.pow(2));
        } else {
            fN = fM_1.shiftLeft(1).add(fM).multiply(fM);
        }

        return fN;
    }
}

NOTE: The second parameter of recursiveFasterWithCache is a Map so that the method can be called with any cache that implements the Map interface, for example a ConcurrentHashMap or HashMap object. A SparseArray object is not a Map.
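
Since Listing 1–11 itself is not reproduced here, the following is a rough sketch of what the Map-based helper could look like. It simply adds a cache lookup to the divide-and-conquer code of Listing 5–16, and is an assumption about the structure, not the book's exact implementation:

import java.math.BigInteger;
import java.util.Map;

public class Fibonacci {
    private static BigInteger recursiveFasterWithCache (int n, Map<Integer, BigInteger> cache)
    {
        if (n > 1) {
            BigInteger fN = cache.get(n); // may already be there (possibly put by the other thread)
            if (fN == null) {
                int m = (n / 2) + (n & 1);
                BigInteger fM = recursiveFasterWithCache(m, cache);
                BigInteger fM_1 = recursiveFasterWithCache(m - 1, cache);
                if ((n & 1) == 1) {
                    fN = fM.pow(2).add(fM_1.pow(2)); // F(m)^2 + F(m-1)^2
                } else {
                    fN = fM_1.shiftLeft(1).add(fM).multiply(fM); // (2*F(m-1) + F(m)) * F(m)
                }
                cache.put(n, fN); // works with HashMap and ConcurrentHashMap alike
            }
            return fN;
        }
        return (n == 0) ? BigInteger.ZERO : BigInteger.ONE;
    }
}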

You may not always observe performance gains when dividing a problem into sub-problems and assigning each sub-problem to a different thread. When dependencies remain between the data, synchronization has to occur, and threads may spend some or even most of their time waiting for access to that data. In addition, the gains may not be as significant as you hope: even though you would theoretically expect performance to double on a dual-core processor and quadruple on a quad-core processor, reality often shows otherwise.
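
One way to quantify this is Amdahl's law: if a fraction p of the work can be parallelized across N cores while the rest stays sequential, the best possible speedup is:

    speedup = 1 / ((1 - p) + p / N)

For example, with an assumed p of 0.8 (80 percent of the work parallelizable, an illustrative figure only):

    N = 2: 1 / (0.2 + 0.4) = 1.67x, not 2x
    N = 4: 1 / (0.2 + 0.2) = 2.5x, not 4x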

In practice, it is easier to use multiple threads to perform unrelated tasks (thereby avoiding the need for synchronization), or tasks that need to synchronize only sporadically, or regularly but at a low enough frequency. For example, a video game would typically use one thread for the game logic and another thread for the rendering, as the sketch after this paragraph illustrates. The rendering thread needs to read the data manipulated by the logic thread 30 or 60 times per second (once per rendered frame), and can relatively quickly make a copy of the data it needs to render a frame, blocking access for only a very short moment.
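
Here is a rough sketch of that pattern (the GameState class and its fields are hypothetical, made up for illustration; this is not code from the book):

public class GameLoop {
    // hypothetical, minimal game state; a real one would hold positions, scores, etc.
    static class GameState {
        int frame;
        GameState() { }
        GameState(GameState other) { frame = other.frame; } // copy constructor
        void update() { frame++; }
    }

    private final Object lock = new Object();
    private final GameState state = new GameState(); // written only by the logic thread

    // called by the logic thread
    void updateLogic() {
        synchronized (lock) {
            state.update(); // mutate the shared state
        }
    }

    // called by the rendering thread, 30 or 60 times per second
    void renderFrame() {
        GameState snapshot;
        synchronized (lock) {
            snapshot = new GameState(state); // quick copy, lock held only briefly
        }
        draw(snapshot); // potentially slow rendering uses the private copy
    }

    private void draw(GameState s) { /* rendering code goes here */ }
}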
