Chapter 7. Libraries and Linking

Introduction

Linking is the final step in generating an application. This chapter discusses how to link to existing libraries and how to develop new libraries. There are two types of libraries: static and dynamic. Static libraries are typically used as part of the build process, whereas dynamic libraries are usually shipped as part of the final product. When dynamic libraries are used as part of an application, it is necessary to specify how those libraries are to be located at runtime. Another useful type of library is an interposing library. Interposing libraries fit between an application and the libraries it uses, and provide replacement code that is called instead of the original routines. By the end of the chapter, the reader will know how to generate and use both static and dynamic libraries as well as use interposition and auditing to examine the runtime use of libraries. The reader will also know some of the features of the libraries shipped as part of Solaris and the compiler suite.

Linking

Overview of Linking

Linking is the process of combining all the object files produced by the compiler with any libraries that are required to produce an executable, shared object, or even other object files. It is performed by the linker, ld. You can invoke the linker directly, but this can involve knowing some details of the compilation system. Therefore, it is strongly recommended that you perform the linking process by invoking the compiler on the object files, and allowing the compiler to call the linker with the appropriate linker options.
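
As a minimal sketch (the file names are placeholders), compiling two source files and then letting the compiler drive the final link might look like the following; the compiler silently adds the startup objects and default libraries that ld requires.

$ cc -c main.c                  # produces main.o
$ cc -c helper.c                # produces helper.o
$ cc -o myapp main.o helper.o   # compiler invokes ld with the appropriate options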

Dynamic and Static Linking

In general, you can link libraries either as part of the executable (static linking), or dynamically at runtime. It is usually preferable to use dynamic linking, because doing so provides a number of benefits.

  • Dynamic linking enables libraries (and their code) to be shared among multiple applications. The libraries can even be loaded only when needed.

  • Dynamic linking allows the system to pick the appropriate version of a library at runtime, depending on the characteristics of the platform upon which it is running. For example, an application run on an UltraSPARC II-based system can pick up libraries optimized for the UltraSPARC II, and the same application run on an UltraSPARC III-based system can pick up a version of the libraries optimized for the UltraSPARC III. The operating system uses this method to provide versions of the C runtime library that are optimized for the hardware running them.

  • With dynamic linking, you can use an interposing library to examine function calls, and to gain knowledge of the application’s runtime behavior (as demonstrated in Section 7.2.10).

  • You can replace dynamic libraries with new or debug versions without having to change the application.

Of course, there are also advantages for statically linking libraries into an application.

  • It is slightly faster to call a statically linked library than a dynamically linked library.

  • It is possible to know exactly what code will be called. With dynamically linked libraries, the code that gets run depends on what is installed on the system.

A static library is given the suffix .a, meaning “archive”—for example, libtest.a. The library will become part of the executable; the .a file will not need to be distributed or be present on the system at runtime. A dynamic library is given the suffix .so, meaning “shared object”—for example, libtest.so. The library will be loaded at runtime, so it must be available on the target system.

One of the most important libraries on the system is the C runtime library (libc). For 64-bit applications, this has always been solely available as a dynamic library. For versions of Solaris earlier than Solaris 10, this library and a number of other system libraries were provided for 32-bit applications, as both a static and a dynamic library. Unfortunately, using the static library turned out to be a source of problems. The role of the library is to provide an interface between user code and the kernel; hence, kernel changes could require changes in the system libraries. Once a static version of the library is linked into an application, it is impossible to patch the library; hence, use of the static version of the system libraries has been strongly discouraged. Starting with Solaris 10, the system libraries (including libc) are available only as dynamic versions for both 32-bit and 64-bit applications.

Linking Libraries

To link a library into an application, you must at least specify which library is required, using the -l<library> flag. A few other flags are useful as well.

  • The -L<path> flag tells the compiler where to search, at link time, for the libraries that follow it on the command line.

  • The -R<path> flag records a search path in the application; the runtime linker uses this path to locate the libraries when the application runs.

  • The -l<library> flag tells the compiler which library to link in. The linker will search for a library with a prefix of lib, so the -lmylib flag would look for a library named libmylib.so or libmylib.a (you can use the flags -Bstatic and -Bdynamic before the library appears on the command line to specify which version to use if both are available).
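
As a sketch of how the -Bstatic and -Bdynamic toggles might be used (the library names are placeholders), the following link line forces the static version of one library while leaving subsequent libraries to be resolved dynamically:

$ cc -O -o myapp myapp.c -L. -Bstatic -lmylib -Bdynamic -lm

The toggles apply to every -l flag that follows them, so it is important to switch back to -Bdynamic before any libraries that are available only as shared objects (such as libc on current Solaris releases).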

Example 7.1 shows an example of linking a library into an application. The -L flag tells the compiler that the mylib library is located in the current directory. The -R flag tells the application that the library will be located in the current directory at runtime. This is a poor way to specify the runtime location of the library, because the application would fail to locate the library if it were invoked from a different directory. I describe a better approach in Section 7.2.6.

Example 7.1. Example of Linking a Library into an Application

$ cc -O -o myapp -R. -L. -lmylib myapp.c

Creating a Static Library

The process of creating a static library is relatively straightforward. A static library is nothing more than a collection of object files. It is referred to as an archive and given the suffix .a. The tool to create an archive for C and Fortran objects is ar. Example 7.2 shows the syntax for creating an archive.

Example 7.2. Syntax for Creating an Archive

$ ar -r <archive> <object files>

For C++ object files, it is necessary to invoke the compiler to generate the archive files. The compiler needs to add more information to the object files, particularly to support templates. The C++ compiler option to generate an archive is -xar, an example of which is shown in Example 7.3.

Example 7.3. Example of Making an Archive of C++ Object Files

% CC -xar -o myapp.a myapp.cc

At link time, the linker will search any archives specified on the link line and pull in code from them as needed.

Creating a Dynamic Library

The flag to tell the linker to generate a dynamic library is -G. A dynamic library is also referred to as a shared object, and is given the suffix .so to reflect this.

By default, the compiler will produce code that is designed to reside at a fixed position in memory. When it is loaded, the library has to be updated with its actual location in memory. This is known as a position-dependent library. This can cause a significant performance hit when the library is loaded. If the test in Example 7.4 returns a result for a given library, that library is position dependent.

Example 7.4. Testing a Library for Position Dependence

$ dump -L libtest.so|grep TEXTREL
[9]     TEXTREL         0

The linker can report objects containing position-dependent code at link time using the -ztext flag, as shown in Example 7.5.

Example 7.5. Detecting Position-Dependent Code at Link Time

% cc -g -o libtest.so -G test.c -ztext
Text relocation remains                       referenced
    against symbol                  offset    in file
$XB5wWCA9MdrF3xS.r.c                0x18      test.o
$XB5wWCA9MdrF3xS.r.c                0x1c      test.o
$XB5wWCA9MdrF3xS.r.c                0x28      test.o

To avoid this startup cost, you can compile libraries in a position independent way. On SPARC, two flags enable this: -xcode=pic13 and -xcode=pic32 (these flags are equivalent to the obsolete SPARC flags -Kpic and -KPIC). The difference between the two flags is the number of relocatable symbols (variables and routines) that the library can contain; -xcode=pic13 provides fewer (2^11 symbols versus 2^30 symbols). There is also a performance difference between the two flags. -xcode=pic13 requires one instruction to load the address of a symbol, whereas -xcode=pic32 requires three. If the compile-time error too many symbols require 'small' PIC references is reported, it is necessary to build the library with -xcode=pic32. On x86, the flags -Kpic and -KPIC generate position-independent code and have the same constraint of 2^11 symbols.

Example 7.6 shows an example of a command line to build a position-independent shared library.

Example 7.6. Example Command Line to Generate a Library

$ cc -G -xcode=pic13 -o libtest.so test.c

Specifying the Location of Libraries

At runtime, when an application is loaded, it is necessary to load the required libraries. The program that does this is called the runtime linker. The runtime linker will, by default, look for 32-bit libraries in /lib and /usr/lib and 64-bit libraries in /lib/64 and /usr/lib/64. If the libraries that an application requires are not to be found there, it is necessary for the application to describe where those libraries are to be found.
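
The ldd command lists the dependencies that the runtime linker would resolve for a binary, which is a quick way to check whether the default locations are sufficient. The following is an illustrative sketch; the application name and the output are hypothetical, but a (file not found) entry indicates that one of the mechanisms described below is needed.

$ ldd myapp
        libmylib.so =>   (file not found)
        libc.so.1 =>     /usr/lib/libc.so.1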

You can specify two flags at compile time to tell the compiler and application where to find libraries. The -L flag tells the compiler where to look for libraries when it is generating the application. The -R flag tells the compiler to include a runtime path to the libraries. The application will use this path to locate the libraries at runtime.

Example 7.7 shows an example of specifying where a library is to be found at runtime and at compile time. In the example, the location of the library is set to be the current directory, which can be correct at compile time but is unlikely to be correct at runtime. It is also possible to specify a hardcoded path where the libraries will be installed. This approach enables the application to locate the libraries, but at the expense of flexibility in terms of how the application can be installed, and often whether multiple versions of the application can coexist.

Example 7.7. Example of Specifying a Compile-Time and a Runtime Library Path

$ cc -o myapp -L. -R. -lmylib myapp.c

Other similar issues need to be resolved as well. For example, how do you specify different versions of libraries for different processors?

To resolve these issues, the linker defines some symbols that simplify this task. The $ORIGIN symbol tells the linker to make the path relative to the location of the application. Example 7.8 shows an example of using the $ORIGIN symbol to specify a relative location for the library.

Example 7.8. Using $ORIGIN to Specify a Relative Path to a Library

$ cc -o myapp -L. -R'$ORIGIN/../lib' -lmylib myapp.c

Using the $ORIGIN symbol specifies that at runtime, the application should look for the appropriate libraries in a directory path relative to the location of the application. This is a convenient approach because the application and its libraries can be relocated anywhere, as long as the relative path from the application to the libraries remains the same.

You can use the $ISALIST symbol to search for instruction-set-specific versions of a library. The runtime linker will expand this symbol into a set of paths that include the various instruction set architectures (ISAs) returned by the isalist command (discussed in Section 4.2.5 of Chapter 4). Each ISA-specific version of the library is placed in a separate subdirectory. In this way, the application can exploit specific features of the hardware, while still having a default version of the library. An alternative way to achieve the same result is to use the $HWCAP symbol. This symbol tells the linker to inspect all the matching libraries in a given directory and find the library version that is most appropriate for the current hardware.
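
For example, a sketch of embedding the $ISALIST token in the runpath (the installation directory here is hypothetical; note that the $ must be protected from the shell):

$ cc -o myapp -L./lib -R'/opt/myapp/lib/$ISALIST' -lmylib myapp.c

At runtime the linker expands the token into one search path per ISA reported by isalist, so that instruction-set-specific subdirectories (for example, a sparcv8plus subdirectory on an UltraSPARC system) are searched in order of preference.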

It is possible to use the LD_LIBRARY_PATH environment variable to specify the directories in which the libraries might be located. The environment variables LD_LIBRARY_PATH_32 and LD_LIBRARY_PATH_64 exist for specifying the library search path for 32-bit and 64-bit applications, respectively. However, this approach means that the environment variable must be correctly set for the application to run; hence, its use is strongly discouraged.
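
For completeness, a sketch of how the variable would be set (the directory is hypothetical); the setting affects every application started from that shell, which is one of the reasons its use is discouraged.

$ LD_LIBRARY_PATH=/export/home/user/lib; export LD_LIBRARY_PATH
$ myapp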

Lazy Loading of Libraries

One way to improve the startup time of an application is to use lazy loading. By default, libraries are loaded into memory as soon as the application starts. However, for many applications the code in a library is not required during startup, so loading the library can be delayed until the routines in it are first used; this is called lazy loading. To specify that an application or library should use lazy loading, pass the -zlazyload flag to the linker before the libraries that are to be lazily loaded. You can use the -znolazyload flag to return to the default behavior.
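
A sketch of a link line that requests lazy loading for a library (the library name is a placeholder) might look like the following; the trailing -znolazyload restores the default so that any libraries added later on the link line are loaded normally.

$ cc -O -o myapp myapp.c -L. -R. -zlazyload -lmylib -znolazyload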

Initialization and Finalization Code in Libraries

Some libraries and applications will need to set up state before the library or application is executed. The easiest way to set up these sections is to use the compiler pragmas init and fini, which define the routines that should be called before and after the application or library executes, as shown in Example 7.9.

Example 7.9. Initialization and Finalization Code

% more init.c
#include <stdio.h>

#pragma init (start)
#pragma fini (end)

void start()
{
  printf("Started
");
}

void end()
{
  printf("Ended
");
}

void main()
{
  printf("Executing
");
}
% cc -O init.c
% a.out
Started
Executing
Ended

Symbol Scoping

By default, almost all symbols are visible outside an object file (exceptions are symbols such as static variables). However, sometimes it is useful to keep functions (or variables) local to a module. You can do this using the scoping specifier __hidden (the default is __global); mapfiles can achieve the same result, as described later in this section. Example 7.10 shows an example of using __hidden. In this example, the count variable and the calc and display routines are declared with __hidden scope. The nm tool (discussed in Section 4.5.3 of Chapter 4) shows that the __hidden symbols are still defined in the library, but they are given local scope, whereas the routines declared without scoping information have global scope. As you might expect, an application that attempts to use these local symbols fails to link.

Example 7.10. Limiting Symbol Scope

% more ex7.10a.c
#include <stdio.h>

__hidden int count=0;
__hidden void calc(int value) { count+=value; }
__hidden void display() { printf("Value = %i\n",count); }
void inc() { calc(1); display();}
void dec() { calc(-1); display();}
% more ex7.10b.c
extern int count;
extern void calc(int);
extern void inc();

void main()
{
  count=5;
  calc(7);
  inc();
}
% cc -O -G -Kpic -o libscope.so ex7.10a.c
% nm libscope.so
...
[24]    |0x00000280|0x00000030|FUNC |LOCL |0x2  |6     |calc
[26]    |0x00010428|0x00000004|OBJT |LOCL |0x2  |14    |count
...
[38]    |0x00000308|0x00000014|FUNC |GLOB |0    |6     |dec
[28]    |0x000002c0|0x00000034|FUNC |LOCL |0x2  |6     |display
[45]    |0x000002f4|0x00000014|FUNC |GLOB |0    |6     |inc
...
% cc -O -o scopetest ex7.10b.c -L. -R. -lscope
Undefined                       first referenced
 symbol                             in file
calc                                ex7.10b.o
count                               ex7.10b.o
ld: fatal: Symbol referencing errors. No output written to scopetest

It is also possible to use mapfiles to achieve the same result, but you need to specify mapfiles on the compile line rather than as part of the source code. Example 7.11 shows an example of this.

Example 7.11. Using Mapfiles to Specify Symbol Scope

% more ex7.11.map
libscope.so
{
  global:
         inc;
         dec;
  local:
         count;
         calc;
         display;
};
% cc -O -G -Kpic -o libscope.so -M ex7.11.map ex7.11.c

Library Interposition

Library interposition is a technique for finding out more information about how a program uses the routines located in libraries. The big advantage is that it does not require the program to be modified. The interposing library is loaded before the libraries the application depends on, so calls from the application resolve to the interposing library, which can then either handle the call itself or pass it on to the original library.

Using this approach, the developer can discover exactly what calls are being made and what the parameters to these calls are. For example, this can be useful in determining whether there is some pattern to the calls that can be exploited to improve the application’s performance.

As an example, consider the application shown in Example 7.12. This application makes a call to the sin function, and displays the results.

Example 7.12. Simple Application That Calls the sin Function

#include <stdio.h>
#include <math.h>
int main()
{
  double j=sin(50);
  printf("sin(50)=%5.3f
",j);
}

This application is compiled with a low level of optimization as shown in Example 7.13. Note that it is necessary to pass the -lm flag to link in the math library.

Example 7.13. Compiling the Simple Application

$ cc -O -o sin_test ex7.12.c -lm

Suppose it would be useful to know the number of times the sin function gets called by the test program in Example 7.12. One way to achieve this is to interpose on the call to the sin routine that the application makes, and count the number of times it happens. Example 7.14 shows code that does this.

Example 7.14. Example Library Interposition Code

#include <dlfcn.h>
#include <memory.h>
#include <assert.h>
#include <thread.h>
#include <stdio.h>
#include <procfs.h>
#include <fcntl.h>

static long long count=0;

void exit(int status)
{
   printf("Calls to sin = %lld
",count);
   (*((void (*)())dlsym(RTLD_NEXT, "exit")))(status); } double sin(double x) {
   static double (*func)()=0;
   double ret;
   if (!func) { func = (double(*)()) dlsym(RTLD_NEXT, "sin"); }
   ret = func(x);
   count++;
   return(ret);
}

The code in Example 7.14 defines two functions, exit and sin, which interpose on the existing exit and sin functions that reside in other libraries.

In the interposing sin function, the dlsym routine is called to locate the function that would have been called if this library were not there (this is necessary only if the original behavior is to be preserved). A counter is incremented every time the interposing sin function is called, and the original function is then called so that its result can be returned. Note that the code as written is not thread safe.
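
If the interposing library might be loaded into a multithreaded program, the counter update needs to be protected. The following is a minimal sketch of one way to do this using a mutex; it is an illustrative variation on Example 7.14 rather than part of the original example, and the library would also need to be built for multithreading (for example, with the -mt flag).

#include <dlfcn.h>
#include <pthread.h>

static long long count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

double sin(double x)
{
  static double (*func)(double) = 0;
  double ret;

  /* The race on this one-time lookup is benign: every thread resolves the same symbol. */
  if (!func) { func = (double (*)(double))dlsym(RTLD_NEXT, "sin"); }
  ret = func(x);

  pthread_mutex_lock(&count_lock);   /* serialize updates to the shared counter */
  count++;
  pthread_mutex_unlock(&count_lock);

  return ret;
}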

The interposing exit function is called when the application exits. It is important for it to call the exit routine of the next library so that all libraries can execute their cleanup code. In this example, the exit routine prints the value of the count variable before calling the original exit function.

The interposing library needs to be loaded before the libraries that are to be inspected. This enables the interposing library to get between the application and those libraries. The LD_PRELOAD environment variable is used for this purpose; it tells the runtime linker to load the interposing library ahead of the libraries that the application depends on. There are also LD_PRELOAD_32 and LD_PRELOAD_64 environment variables that allow different libraries to be preloaded for 32-bit and 64-bit applications.

Example 7.15 shows this interposing library being built and used to count the number of times the application calls the sin function.

Example 7.15. Building and Running with the Interpose Library

$ cc -O -G -Kpic -o mylib.so ex7.14.c
$ LD_PRELOAD=./mylib.so; export LD_PRELOAD; sin_test
sin(50)=-0.262
Calls to sin = 1

Using the Debug Interface

The LD_DEBUG environment variable generates debug information about the runtime linking of an application. You can use the LD_OPTIONS environment variable, which specifies additional flags to be passed to the linker, to obtain debug information when an executable is linked.

These environment variables are useful when checking whether an application is picking up the correct libraries, or to determine which libraries an application is selecting. Various settings show different amounts of detail about the process. One of the more useful settings is libs, which shows how the paths to the various libraries are resolved. Example 7.16 shows an example of using LD_DEBUG with the libs option to obtain runtime linking information for the sleep command.

Example 7.16. Using LD_DEBUG to Identify Loaded Libraries

% setenv LD_DEBUG libs
% sleep 5
06306:
06306: configuration file=/var/ld/ld.config: unable to process file
06306:
06306: find object=libc.so.1; searching
06306:  search path=/usr/lib  (default)
06306:  trying path=/usr/lib/libc.so.1
06306:
06306: find object=libdl.so.1; searching
06306:  search path=/usr/lib  (default)
06306:  trying path=/usr/lib/libdl.so.1
06306:  trying path=/usr/platform/SUNW,Sun-Blade-1000/lib/libc_psr.so.1
06306:
06306: calling .init (from sorted order): /usr/lib/libc.so.1
06306:
06306: calling .init (done): /usr/lib/libc.so.1
06306:
06306: transferring control: sleep
06306:
06306: calling .fini: /usr/lib/libc.so.1

Example 7.17 shows an example of using the LD_OPTIONS environment variable to observe the linking process. The environment variable is used to pass the -D<option> flag to the linker. In this example, setting LD_OPTIONS to -Dfiles provides more information about the files used in the linking process.

Example 7.17. Using LD_OPTIONS to Observe Linking

$ LD_OPTIONS=-Dfiles cc -O -o scopetest ex7.10b.c -L. -R. -lscope
debug:
debug: file=/opt/SUNWspro/prod/lib/crti.o  [ ET_REL ]
debug:
debug: file=/opt/SUNWspro/prod/lib/crt1.o  [ ET_REL ]
debug:
debug: file=/opt/SUNWspro/prod/lib/misalign.o  [ ET_REL ]
debug:
debug: file=/opt/SUNWspro/prod/lib/values-xa.o  [ ET_REL ]
debug:
debug: file=ex7.10b.o  [ ET_REL ]
...

Using the Audit Interface

The audit interface exists so that a library can watch as other libraries are loaded, determine how symbols are resolved, and even change the bindings of symbols. Example 7.18 shows source code for a simple audit library. The la_version function is required to identify the library as an audit library, and to ensure that the version numbers match the target platform. The la_objopen function gets called every time a library is loaded. The code in the routine selects only the libraries that are loaded as part of the base application, and prints a message for each one.

Example 7.18. Simple Audit Library

#include <link.h>
#include <stdio.h>

uint_t la_version(uint_t version)
{
  return (LAV_CURRENT);
}

uint_t la_objopen(Link_map * lmp, Lmid_t lmid, uintptr_t * cookie)
{
  if (lmid == LM_ID_BASE) {printf("file %s loaded\n", lmp->l_name);}
  return 0;
}

Example 7.19 shows the library being built and used. It is necessary to link the audit library with the mapmalloc library as well as libc. The -z defs flag will cause the linker to report an error if the linked libraries fail to satisfy all the dependencies (this is the default for applications, but not for libraries).

Example 7.19. Building and Using the Audit Library

$ cc -G -Kpic -O -o audit.so.1 ex7.18.c -z defs -lmapmalloc -lc
$ LD_AUDIT=./audit.so.1; export LD_AUDIT
$ sleep 5
file sleep loaded
file /usr/lib/libc.so.1 loaded
file /usr/lib/libdl.so.1 loaded
file /usr/platform/SUNW,Sun-Blade-1000/lib/libc_psr.so.1 loaded

Using LD_AUDIT enables an audit library to gain some insight into how an application links to and uses its libraries. For example, this interface could be used to count the number of calls to functions in a library, or even the calls made between libraries.
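
As a sketch of how such information could be gathered (this is an illustrative extension of Example 7.18, not code from the original example, and it assumes a 64-bit audit library; a 32-bit library would use the corresponding la_symbind32 routine), la_objopen can request binding notification and la_symbind64 then reports each symbol as it is bound. Note that this records symbol bindings rather than individual calls; counting every call would additionally require the architecture-specific la_pltenter routines. As with Example 7.19, the library should be linked against mapmalloc and libc.

#include <link.h>
#include <stdio.h>

uint_t la_version(uint_t version)
{
  return (LAV_CURRENT);
}

uint_t la_objopen(Link_map *lmp, Lmid_t lmid, uintptr_t *cookie)
{
  /* Request binding notification for objects on the base link map. */
  if (lmid == LM_ID_BASE) { return (LA_FLG_BINDTO | LA_FLG_BINDFROM); }
  return 0;
}

uintptr_t la_symbind64(Elf64_Sym *symp, uint_t symndx, uintptr_t *refcook,
                       uintptr_t *defcook, uint_t *sb_flags,
                       const char *sym_name)
{
  printf("binding %s\n", sym_name);     /* record or count the binding here */
  return (symp->st_value);              /* bind to the symbol's normal address */
}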

Libraries of Interest

The C Runtime Library (libc and libc_psr)

The libc library contains most of the routines an application will require at runtime. In fact, it is really two libraries: libc.so and libc_psr.so. The libc_psr.so library contains optimized routines for specific processors; the appropriate version of the library is selected at runtime depending on the hardware the application is run on. For example, memcpy and memset both reside in libc_psr.so so that these routines exploit different architectural features on different processors.

Memory Management Libraries

Memory management is often a bottleneck for programs, and the default memory allocation routines may not give the best performance for all applications. A variety of memory management libraries are therefore provided. libfast, discussed in Section 7.3.3, is a static library option for 32-bit single-threaded code. An alternative that is available in a 64-bit version, but is not thread-safe, is bsdmalloc (link using the -lbsdmalloc compiler flag).

Several others are provided as part of Solaris. An alternative malloc optimized for multithreaded applications is the multithreaded malloc (link using the -lmtmalloc compiler flag). The libumem library (link with -lumem) provides extensive debugging facilities for investigating memory allocation problems.

There are two parts to malloc performance. The first is the speed of the malloc and free operations themselves. The second is the performance that results from the layout of data in memory. libfast and bsdmalloc have very similar malloc routines, which hand out memory in power-of-two-sized chunks. Hence, these mallocs are typically quick to allocate memory, but they have a larger memory footprint. The larger footprint can also cause data to become aligned on power-of-two boundaries in memory, which can lead to poor utilization of the caches and the Translation Lookaside Buffer (TLB).

In general, if an application relies heavily on malloc, it is probably worth benchmarking the application under a range of the available mallocs to determine which gives the best performance. Example 7.20 shows an example snippet of code that calls malloc and free.

Example 7.20. Code That Calls malloc and free

#include <stdlib.h>
#include "timing.h"

void main()
{
  int i;
  char* array;
  starttime();
  #pragma omp parallel for private(array)
  for (i=0; i<100000;i++)
  {
    array=(char*)malloc(1023);
    free(array);
  }
  endtime(100000);
}

Various memory management libraries are available, and each offers a different trade-off between performance, debug capability, memory footprint, and thread safety. Runtimes from the simple code shown in Example 7.20 using a number of different libraries are shown in Example 7.21.

The code in Example 7.20 also contains an OpenMP directive to enable it to be run as a multithreaded application. The number of threads that the application will use is controlled by the OMP_NUM_THREADS environment variable. I discuss OpenMP in greater detail in Section 12.8 of Chapter 12. The compiler will recognize the OpenMP directive when the -xopenmp flag is specified. Example 7.22 shows the results of running this parallel application on a Solaris 9 UltraSPARC IIICu-based machine utilizing one and two threads. Ideally, when two threads execute the code, it should take half the time that a single thread takes, so the time per iteration should halve. The results demonstrate that on Solaris 9, the default malloc in libc takes about three times longer when several threads are contending for it. On the other hand, mtmalloc and libumem show some performance improvement as the number of threads increases. It is also worth observing that on a Solaris 9 system, the presence of the -mt compiler flag alone causes the malloc and free calls to take longer.

Example 7.21. Single-Threaded Performance of Various malloc and free Implementations

% cc -O ex7.20.c; a.out
Time per iteration 217.01 ns
% cc -O ex7.20.c -lfast; a.out
Time per iteration 45.25 ns
% cc -O ex7.20.c -lbsdmalloc; a.out
Time per iteration 118.70 ns
% cc -O ex7.20.c -lumem; a.out
Time per iteration 444.38 ns
% cc -O ex7.20.c -lmtmalloc; a.out
Time per iteration 392.52 ns

Example 7.22. Performance of malloc and free When Run with Multiple Threads

$ OMP_NUM_THREADS=1; export OMP_NUM_THREADS
$ cc -O -xopenmp -mt ex7.20.c; a.out
Time per iteration 363.19 ns
$ cc -O -xopenmp -mt ex7.20.c -lumem; a.out
Time per iteration 522.30 ns
$ cc -O -xopenmp -mt ex7.20.c -lmtmalloc; a.out
Time per iteration 593.85 ns
$ OMP_NUM_THREADS=2; export OMP_NUM_THREADS
$ cc -O -xopenmp -mt ex7.20.c; a.out
Time per iteration 1090.08 ns
$ cc -O -xopenmp -mt ex7.20.c -lumem; a.out
Time per iteration 408.13 ns
$ cc -O -xopenmp -mt ex7.20.c -lmtmalloc; a.out
Time per iteration 303.88 ns

On Solaris 10, all applications are potentially multithreaded by default, and there is no penalty for adding the -mt flag. The performance of the mtmalloc and libumem libraries has been improved for the single-threaded case, but the scaling is worse. Example 7.23 shows results from similar hardware running Solaris 10.

Example 7.23. malloc Performance on Solaris 10

$ cc -O ex7.20.c; a.out
Time per iteration 223.28 ns
$ cc -O ex7.20.c -lfast; a.out
Time per iteration 45.88 ns
$ cc -O ex7.20.c -lbsdmalloc; a.out
Time per iteration 120.47 ns
$ cc -O ex7.20.c -lumem; a.out
Time per iteration 301.13 ns
$ cc -O ex7.20.c -lmtmalloc; a.out
Time per iteration 333.14 ns
$ OMP_NUM_THREADS=1; export OMP_NUM_THREADS
$ cc -O -xopenmp -mt ex7.20.c; a.out
Time per iteration 222.11 ns
$ cc -O -xopenmp -mt ex7.20.c -lumem; a.out
Time per iteration 293.76 ns
$ cc -O -xopenmp -mt ex7.20.c -lmtmalloc; a.out
Time per iteration 336.22 ns
$ OMP_NUM_THREADS=2; export OMP_NUM_THREADS
$ cc -O -xopenmp -mt ex7.20.c; a.out
Time per iteration 1206.96 ns
$ cc -O -xopenmp -mt ex7.20.c -lumem; a.out
Time per iteration 243.44 ns
$ cc -O -xopenmp -mt ex7.20.c -lmtmalloc; a.out
Time per iteration 313.24 ns

libfast

A static library for SPARC processors, called libfast, ships with the compiler. It contains some optimized string library routines, and an optimized malloc routine. The -lfast compiler flag will cause libfast to be linked in. This flag should appear after the source files so that the library is linked in correctly (the order of linkage is important).

There are several points to observe when using libfast. It is not thread-safe, so you should use it only in single-threaded applications. It is a static library, which means that the routines it contains will always be the ones used, so the benefits of the platform-tuned versions of those routines are unavailable. Lastly, the library is available only for 32-bit applications.

In general, it should not be necessary to use libfast. The processor-specific library routines are typically of comparable performance. However, on some occasions the malloc routines may be faster, because they are simpler implementations and do not have the overhead of being thread-safe.

The Performance Library

The performance library contains a large number of routines that are optimized for the various SPARC and x64 processors. Consequently, you can realize significant performance gains from using these routines. Use of these libraries can also reduce application development time because fewer lines of code need to be written. To link the performance library into an application, use the compiler flag -xlic_lib=sunperf, as shown in Example 7.24.

Example 7.24. Linking the Performance Library into an Application

$ cc -fast -o matvec ex7.25.c -xlic_lib=sunperf

The matrix-vector multiply code shown in Example 7.25 demonstrates the benefits of using the routines provided by the performance library.

Example 7.25. Example Matrix-Vector Multiply Code

#include <sunperf.h>
#include "timing.h"

#define LENGTH 10000
static double vector[LENGTH], matrix[LENGTH][LENGTH], vector2[LENGTH];

void main()
{
  starttime();
  for (int i=0; i<LENGTH; i++)
  {
   vector2[i]=0;
   for (int j=0; j<LENGTH; j++)
    vector2[i]+=matrix[i][j]*vector[j];
  }
  endtime(LENGTH);

  starttime();
  dgemv('N',LENGTH,LENGTH,1.0,&matrix[0][0],LENGTH,vector,1,0.0,vector2,1);
  endtime(LENGTH);
}

Example 7.26 shows the output from this snippet of code. It is apparent that the performance library code is about four times faster than the manually coded version of the calculation.

Example 7.26. Timing of Matrix-Vector Multiply Code

$ matvec
Time per iteration 212962.98 ns
Time per iteration 57400.56 ns

The reason for the large difference in performance is that the compiler (in this case) has not performed a loop tiling optimization that could improve data reuse and thereby performance. The performance library contains hand-optimized code, which will often outperform versions of the same algorithm coded in a high-level language.

STLport4

The default Standard Template Library (STL) for C++ is Rogue Wave. This library is used for Application Binary Interface (ABI) compatibility reasons. However, often the STLport version of the template library is faster.

The STLport library will be used if the -library=stlport4 compiler option is specified. It is important to note the following issues regarding STLport.

  • Using STLport increases the degree of standards compliance expected by the compiler. As a consequence, code that previously compiled might need additional namespace specifiers to be compiled with STLport. In general, this requires simply specifying the std:: namespace for various functions.

  • It is not possible to use STLport if the code being developed is going to be linked with other libraries or applications that use the Rogue Wave library, or some other STL library.

Example 7.27 shows code that benchmarks the performance of the push_back method of the vector template.

Example 7.27. Benchmark for push_back Method

#include <vector>
#include "timing.h"

int main()
{
  int i;
  std::vector<int> vec;
  starttime();
  for (i=0; i<100000; i++)
  {
   vec.push_back(i);
  }
  endtime(100000);
}

You can compile the code to use the default (Rogue Wave) or STLport4 library, as shown in Example 7.28. The STLport4 library shows nearly double the performance of the Rogue Wave library for this particular method.

Example 7.28. Performance Difference of Two STL Implementations

% CC -O ex7.27.cpp; a.out
Time per iteration 144.18 ns
% CC -O ex7.27.cpp -library=stlport4; a.out
Time per iteration 79.45 ns

Library Calls

Often, a number of different library calls can achieve the same result. Consequently, it is important to consider both what work actually needs to be done and the cost of the library call that does it.

Library Routines for Timing

The most obvious use for timing routines is measuring the duration of sections of code, such as function calls, and several different calls return timing information at very different costs. Example 7.29 shows sample code that times a number of the alternative calls for obtaining timing information.

Example 7.29. Timing Various Calls

#include <stdio.h>
#include <sys/time.h>
#include <sys/timeb.h>
#include <sys/types.h>
#include <time.h>

#define RPT 1000000
#include "timing.h"

unsigned long long get_tick();

int main()
{
  long count;
  struct timeb tb;
  struct timeval tp;
  time_t tloc;

  starttime();
  for (count=0; count<RPT; count++) { ftime(&tb);}
  endtime(RPT);

  starttime();
  for (count=0; count<RPT; count++) { gettimeofday(&tp,(void*)0); }
  endtime(RPT);

  starttime();
  for (count=0; count<RPT; count++) { gethrtime(); }
  endtime(RPT);

  starttime();
  for (count=0; count<RPT; count++) { time(&tloc); }
  endtime(RPT);

  starttime();
  for (count=0; count<RPT; count++) { get_tick(); }
  endtime(RPT);

  return 0;
}

The call to get_tick is actually an inline template shown in Example 7.30, which reads the hardware tick counter on the SPARC processor.

Example 7.30. Inline Template for Reading Tick Counter

!
! unsigned long long get_tick();
!
        .inline get_tick,0
        rd      %tick,%o0
        .end

Example 7.31 shows the results from building and running the program.

Example 7.31. Results of Various Timing Functions

$ cc -O ex7.29.c ex7.30.il -o calls
$ calls
Time per iteration 958.56 ns
Time per iteration 188.84 ns
Time per iteration 145.93 ns
Time per iteration 992.62 ns
Time per iteration 10.71 ns

Inlined reading of the tick counter on the processor is the fastest way to obtain a count of elapsed time, but even that takes about 10ns. The other tested routines take significantly longer, but have return values that are related to the real time in seconds. In some cases, having the time returned in seconds might be worth the additional cost.

The use of these various timing routines demonstrates that it is important to pick the appropriate routine to use. Some routines will return more information than is necessary, and consequently they take a long time to return. Other routines may return less information, but still may not be the fastest.

Example 7.32 shows the timing harness (timing.h) used in this book. The harness uses the gethrtime call, which returns time in nanoseconds since some arbitrary point in the past. As demonstrated, it is a relatively quick call, so it should be sufficient for timing most tasks that run for a reasonable number of iterations. It is not a good timer for measuring the duration of a task that completes in a few cycles, however.

Example 7.32. Timing Test Harness: timing.h

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f ns
", (e_time-s_time)/(1.0*its));
  s_time=1.0*gethrtime();
}

Picking the Most Appropriate Library Routines

Timing is one consideration when selecting the most appropriate library calls to use. It is also worth checking whether other calls exist that return a more complete set of results in a single call. For example, the Sun Math Library (-lsunmath) contains the sincos() function, which returns both the sine and the cosine of an angle in one call that takes roughly the same time as calculating just one of them. Example 7.33 shows code that demonstrates using this function.

Example 7.33. Code for Testing sincos Function

#include <math.h>
#include <sunmath.h>

#include "timing.h"
#define RPT 1000000

void main()
{
  double a=1.3;
  double b,c;

  starttime();
  for (int i=0; i<RPT; i++) { b=sin(a);c=cos(a);}
  endtime(RPT);

  starttime();
  for (int i=0; i<RPT; i++) {sincos(a,&b,&c);}
  endtime(RPT);
}

Example 7.34 shows the results of building and running this code. The sincos function is more than twice as fast as calling the sin and cos functions separately. This is to be expected, because the two values share much of the same computation. Note that this is not a perfect test: the trigonometric functions typically use lookup tables, and these remain cached under this test harness because the test repeatedly computes results for a single input value.

Example 7.34. Building and Running sincos Test Code

% cc -O ex7.33.c -lm -lsunmath
% a.out
Time per iteration 801.13 ns
Time per iteration 286.73 ns

So, it is important to know what calls are provided by the various libraries that ship with the compiler. The Sun Math Library (-lsunmath) is one that contains a number of useful routines for mathematical code.

SIMD Instructions and the Media Library

Single Instruction, Multiple Data (SIMD) instructions are single instructions that act on multiple items of data at the same time. On SPARC, these are called the Visual Instruction Set (VIS) instructions, and on x86 these are the SSE extensions. For example, a single 8-byte register could hold eight byte-size items of data or four short-size items of data. SIMD instructions can offer a performance advantage because of their ability to parallel-process multiple items of data. The problem people often face with SIMD instructions is the overhead of converting the data into the appropriate structure for the instructions to manipulate; the cost of the conversion can easily outweigh the benefit of being able to perform multiple operations at once.

On the UltraSPARC III/IV family of processors, the VIS instructions have another performance advantage. The instructions act on the floating-point registers so that the data can be prefetched into the prefetch cache on the processor, which eliminates the cost of fetching data from the second-level cache.

One way to use routines that have been optimized using SIMD instructions is to call MediaLib. This library is available for both SPARC and x86 processors and provides a wide range of routines that handle tasks commonly found when handling media (video, images, or audio) data, or other types of data (e.g., matrix manipulation). The MediaLib library is provided as part of Solaris 10, and is available as a download for previous Solaris versions.

Searching Arrays Using VIS Instructions

SIMD instructions can be useful in areas other than the manipulation of media-type data. An example of using VIS instructions might be to determine the length of a string. The code shown in Example 7.35 is string-length code written in C, together with a harness that checks for correctness of the result.

Example 7.35. Test Harness for strlen Code (char_search.c)

#include <stdio.h>
#include <string.h>
#include "timing.h"
#define RPT 30*1024
int vis_length(char *a);
int c_length(char * a)
{
  int len=0;
  while (*a!=0){a++;len++;}
  return len;
}

void main ()
{
  char string[1024*1024];
  int index;
  for (index=0;index<1024*1024;index++){string[index]='\0';}
  starttime();
  for (index=1;index<RPT;index++)
  {
    string[index-1]='a';
    string[index]='\0';
    if (c_length(string)!=index) {printf("Error at length %i\n",index);}
  }
  endtime(RPT);
  starttime();
  for (index=1;index<RPT;index++)
  {
    string[index-1]='a';
    string[index]='\0';
    if (vis_length(string)!=index)
      {printf("Error at length %i (%i)
",index,vis_length(string));}
  }
  endtime(RPT);
  starttime();
  for (index=1;index<RPT;index++)
  {
    string[index-1]='a';
    string[index]='\0';
    if (strlen(string)!=index)
      {printf("Error at length %i (%i)
",index,strlen(string));}
  }
  endtime(RPT);
}

A VIS implementation of the code is slightly more complex, and an example is shown in Example 7.36.

Example 7.36. VIS Implementation of strlen (char_search.il)

/* Routine vis_length(char * string); */
/* %o0 = address of string */

.inline vis_length,4
  and %o0,3,%o2         /* check for 4-byte aligned*/
  cmp %o2,%g0           /* aligned so go to VIS code*/
  be 1f
  clr %o3               /* clear counter */
                        /* next block to handle misaligned data */
2:
  ldub  [%o0],%o1        /* load byte */
  cmp %o1,%g0           /* check if zero */
  be 4f                 /* found */
  add %o0,1,%o0         /* dealt with misaligned byte */
  sub %o2,1,%o2         /* count down misaligned bytes */
  cmp %o2,%g0
  bne 2b
  add %o3,1,%o3         /* compared first character */

1:
  fzero %f0             /* clear comparison word */
  ld [%o0],%f2          /* load 4 bytes */

3:
  add %o0,4,%o0         /* move pointer 4 bytes */
  fexpand %f2,%f4       /* expand 4 bytes into 4 shorts */
  fcmpeq16 %f0,%f4,%o1  /* compare 4 shorts */
  lda [%o0]%asi,%f2     /* speculative load of 4 bytes */
  prefetch [%o0+256],0  /* Prefetch four lines ahead */
  cmp %o1,0             /* check result of compare */
  be,a 3b               /* branch if not found */
  add %o3,4,%o3         /* increment count by four; annulled if found*/

                        /* At this point %o1 contains a bit pattern */
                        /* indicating which byte was zero */
  mov 2,%o4             /* set up mask for 2nd bit */
  andcc %o1,12,%g0      /* check upper 2 bits of return value from compare*/
  bz,a 5f
  add %o3,2,%o3         /* add two if the upper half did not contain zero */
  mov 8,%o4             /* set mask for first bit in upper half */
5:
  andcc %o1,%o4,%g0     /* mask uppermost bit set */
  bz,a 4f               /* if it is not set skip increment */
  add %o3,1,%o3         /* upper bit set, so increment counter */
4:
  or %o3,%g0,%o0       /* Copy counter to output */
.end

The VIS algorithm is not the most efficient possible. In particular, there are many lost cycles in the innermost loop, which could be optimized out at the expense of an increase in code size and complexity.

Here are the steps in the algorithm.

  1. Ensure that the string is aligned for 4-byte access. This is done by reading one character at a time until the alignment is correct.

  2. Once the correct alignment is achieved, you can use VIS instructions in the inner loop to fetch the data from memory. One unfortunate complexity is that the VIS instruction set does not have instructions that perform a comparison of bytes. The workaround is to expand the bytes into short ints, and do a comparison on these instead.

  3. Once the comparison determines that there is a zero byte in the four bytes that have been loaded, it is necessary to locate the particular byte and increment the string length accordingly. The approach used in this example is to check whether there is a zero in either of the two upper bits. If there is no zero there, the zero must be in the lower bits. Having determined which pair of bits contains the zero, the next step is to mask off the upper bit of the pair and see whether that is the zero byte.

Example 7.37 shows the runtime of this example. The VIS code is about twice as fast as the C-language version, but the version of strlen provided by the operating system is about 30% faster than that.

Example 7.37. Performance of VIS strlen Code

$ cc -O -xarch=v8plusa ex7.35.c ex7.36.il
$ a.out
Time per iteration 59983.66 ns
Time per iteration 34316.24 ns
Time per iteration 21401.47 ns

The purpose of the exercise was to show that it is possible to produce inline templates that use VIS instructions. It also demonstrates that even a relatively simple loop in C ends up with some initialization and cleanup code, which adds to the overhead of using VIS.
