Chapter 5: Exploring Clang's Architecture

Clang is LLVM's official frontend for C-family programming languages, including C, C++, and Objective-C. It processes the input source code (parsing, type checking, and semantic reasoning, to name a few) and generates equivalent LLVM IR code, which is then taken over by other LLVM subsystems to perform optimization and native code generation. Many C-like dialects and language extensions are also hosted in Clang, which makes their implementations easy to accommodate. For example, Clang provides official support for OpenCL, OpenMP, and CUDA C/C++. In addition to normal frontend jobs, Clang has been evolving to partition its functionalities into libraries and modules so that developers can use them to create all kinds of tools related to source code processing; for example, code refactoring, code formatting, and syntax highlighting. Learning Clang development can not only get you more engaged with the LLVM project but also open up a wide range of possibilities for creating powerful applications and tools.

Unlike LLVM, which arranges most of its tasks into a single pipeline (that is, PassManager) and runs them sequentially, there is more diversity in how Clang organizes its subcomponents. In this chapter, we will show you a clear picture of how Clang's important subsystems are organized, what their roles are, and which part of the code base you should be looking at.

Terminology

From this chapter through to the rest of this book, we will be using Clang (which starts with an uppercase C and a Minion Pro font face) to refer to the project and its techniques as a whole. Whenever we use clang (all in lowercase with a Courier font face), we are referring to the executable program.

In this chapter, we will cover the following main topics:

  • Learning Clang's subsystems and their roles
  • Exploring Clang's tooling features and extension options

By the end of this chapter, you will have a roadmap of this system so that you can kickstart your own projects and have some grounding for later chapters related to Clang development.

Technical requirements

In Chapter 1, Saving Resources When Building LLVM, we showed you how to build LLVM. Those instructions, however, did not build Clang. To include Clang in the build list, please edit the value that's been assigned to the LLVM_ENABLE_PROJECTS CMake variable, like so:

$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra"

The value of that variable should be a semicolon-separated list, where each item is one of LLVM's subprojects. In this case, we're including Clang and clang-tools-extra, which contains a bunch of useful tools based on Clang's techniques. For example, the clang-format tool is used by countless open source projects, especially large-scale ones, to impose a unified coding style in their code base.

Adding Clang to an existing build

If you already have an LLVM build where Clang was not enabled, you can edit the LLVM_ENABLE_PROJECTS CMake argument's value in CMakeCache.txt without invoking the original CMake command again. CMake should reconfigure itself once you've edited the file and run Ninja (or a build system of your choice) again.

You can build clang, which is Clang's driver and main program, using the following command:

$ ninja clang

You can run all the Clang tests using the following command:

$ ninja check-clang

Now, you should have the clang executable in the /<your build directory>/bin folder.

Learning Clang's subsystems and their roles

In this section, we will give you an overview of Clang's structures and organizations. We will briefly introduce some of the important components or subsystems, before using dedicated sections or chapters to expand them further in later parts of this book. We hope this will give you some idea of Clang's internals and how they will benefit your development.

First, let's look at the big picture. The following diagram shows the high-level structure of Clang:

Figure 5.1 – High-level structure of Clang


As explained in the legend, rectangles with rounded corners represent subsystems that might consist of multiple components with similar functionalities. For example, the frontend can be further dissected into components such as the preprocessor, parser, and code generation logic, to name a few. In addition, there are intermediate results, depicted as ovals in the preceding diagram. We are especially interested in two of them – Clang AST and LLVM IR. The former will be discussed in depth in Chapter 7, Handling AST, while the latter is the main character of Part 3, Middle-End Development, which will talk about optimizations and analyses you can apply to LLVM IR.

The following subsections will give you a brief introduction to each of these components. Let's start by looking at an overview of the driver.

Driver

A common misunderstanding is that clang, the executable, is the compiler frontend. While clang does use Clang's frontend components, the executable itself is actually a kind of program called a compiler driver, or driver for short.

Compiling source code is a complex process. First, it consists of multiple phases, including the following:

  • Frontend: Parsing and semantic checking
  • Middle-end: Program analysis and optimization
  • Backend: Native code generation
  • Assembling: Running the assembler
  • Linking: Running the linker

Among these phases and their enclosing components, there are countless options/arguments and flags, such as the option to tell compilers where to search for include files (that is, the -I command-line option in GCC and Clang). Furthermore, we hope that the compiler can figure out the values for some of these options. For example, it would be great if the compiler could include some folders of C/C++ standard libraries (for example, /include and /usr/include in Linux systems) in the header file search paths by default, so that we don't need to assign each of those folders manually in the command line. Continuing with this example, it's clear that we want our compilers to be portable across different operating systems and platforms, but many operating systems use a different C/C++ standard library path. So, how do compilers pick the correct one accordingly?

In this situation, a driver is designed to come to the rescue. It's a piece of software that acts as a housekeeper for core compiler components, serving them essential information (for example, an OS-specific system include path, as we mentioned earlier) and arranging their executions so that users only need to supply important command-line arguments. A good way to observe the hard work of a driver is to use the -### command-line flag on a normal clang invocation. For example, you could try to compile a simple hello world program with that flag:

$ clang++ -### -std=c++11 -Wall ./hello_world.cpp -o hello_world

The following is part of the output after running the preceding command on a macOS computer:

"/path/to/clang" "-cc1" "-triple" "x86_64-apple-macosx11.0.0" "-Wdeprecated-objc-isa-usage" "-Werror=deprecated-objc-isa-usage" "-Werror=implicit-function-declaration" "-emit-obj" "-mrelax-all" "-disable-free" "-disable-llvm-verifier" … "-fno-strict-return" "-masm-verbose" "-munwind-tables" "-target-sdk-version=11.0" … "-resource-dir" "/Library/Developer/CommandLineTools/usr/lib/clang/12.0.0" "-isysroot" "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk" "-I/usr/local/include" "-stdlib=libc++" … "-Wall" "-Wno-reorder-init-list" "-Wno-implicit-int-float-conversion" "-Wno-c99-designator" … "-std=c++11" "-fdeprecated-macro" "-fdebug-compilation-dir" "/Users/Rem" "-ferror-limit" "19" "-fmessage-length" "87" "-stack-protector" "1" "-fstack-check" "-mdarwin-stkchk-strong-link" … "-fexceptions" … "-fdiagnostics-show-option" "-fcolor-diagnostics" "-o" "/path/to/temp/hello_world-dEadBeEf.o" "-x" "c++" "hello_world.cpp"…

These are essentially the flags being passed to the real Clang frontend after the driver's translation. While you don't need to understand all these flags, it's true that even for a simple program, the compilation flow consists of an enormous amount of compiler options and many subcomponents.

The source code for the driver can be found under clang/lib/Driver. In Chapter 8, Working with Compiler Flags and Toolchains, we will look at this in more detail.
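To make the driver's housekeeping role more concrete, here is a minimal toy sketch of the idea: a short user command line is expanded into a longer frontend command line with OS-specific defaults filled in. This is not Clang's driver code. The HostOS enum and the defaultIncludeDirs and expandToFrontendArgs functions are hypothetical names invented for this illustration, and the paths are simply the ones that appear earlier in this section.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for what a driver knows about the host platform.
// A real driver consults far more state (target triple, SDK, sysroot...).
enum class HostOS { Linux, MacOS };

// Returns the default header search directories for the given OS.
// The paths are taken from the examples in the text, not from Clang itself.
std::vector<std::string> defaultIncludeDirs(HostOS OS) {
  switch (OS) {
  case HostOS::Linux:
    return {"/include", "/usr/include"};
  case HostOS::MacOS:
    return {"/usr/local/include"};
  }
  return {};
}

// Expands a short user command line into a frontend command line:
// the frontend mode flag first, then OS defaults, then the user's arguments.
std::vector<std::string>
expandToFrontendArgs(HostOS OS, const std::vector<std::string> &UserArgs) {
  std::vector<std::string> Args = {"-cc1"};
  for (const std::string &Dir : defaultIncludeDirs(OS))
    Args.push_back("-I" + Dir);
  Args.insert(Args.end(), UserArgs.begin(), UserArgs.end());
  return Args;
}
```

Calling expandToFrontendArgs(HostOS::Linux, {"-Wall", "hello_world.cpp"}) in this sketch yields a five-element list that starts with -cc1 and the two -I flags, so the user never has to spell out the platform defaults.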

Frontend

A typical compiler textbook might tell you that a compiler frontend consists of a lexer and a parser, which generates an abstract syntax tree (AST). Clang's frontend also uses this skeleton, but with some major differences. First, the lexer is usually coupled with the preprocessor, and the semantic analysis that's performed on the source code is detached into a separate subsystem, called the Sema. This builds an AST and does all kinds of semantic checking.

Lexer and preprocessor

Due to the complexity of programming language standards and the scale of real-world source code, preprocessing becomes non-trivial. For example, resolving included files becomes tricky when you have 10+ layers of a header file hierarchy, which is common in large-scale projects. Advanced directives such as #pragma can be challenging too, as in the case where OpenMP uses #pragma to parallelize for loops. Solving these challenges requires close cooperation between the preprocessor and the lexer, which provides primitives for all the preprocessing actions. Their source code can be found under clang/lib/Lex. In Chapter 6, Extending the Preprocessor, you will become familiar with preprocessor and lexer development, and learn how to implement custom logic with a powerful extension system.

Parser and Sema

Clang's parser consumes token streams from the preprocessor and lexer and tries to realize their semantic structures. Here, the Sema sub-system performs more semantic checking and analysis from the parser's result before generating the AST. Historically, there was another layer of abstraction where you could create your own parser action callbacks to specify what actions you wanted to perform when certain language directives (for example, identifiers such as variable names) were parsed.

Back then, Sema was one of these parser actions. However, later on, people found that this additional layer of abstraction was not necessary, so the parser only interacts with Sema nowadays. Nevertheless, Sema still retains this kind of callback-style design. For example, the clang::Sema::ActOnForStmt(…) function (defined in clang/lib/Sema/SemaStmt.cpp) will be invoked when a for loop structure is parsed. It will then do all kinds of checking to make sure the syntax is correct and generate the AST node for the for loop; that is, a ForStmt object.
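The callback-style relationship between the parser and Sema can be sketched with a toy model. Everything here (ToyNode, ToySema, parseForStmt, and the tiny token grammar) is hypothetical and far simpler than Clang's real classes; only the ActOnForStmt name is borrowed from the text, and its signature here is an assumption made for the example.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy AST node; Clang's real nodes (for example, ForStmt) are far richer.
struct ToyNode {
  std::string Kind;
  std::vector<std::unique_ptr<ToyNode>> Children;
};

// Toy semantic-analysis object exposing an "ActOn" callback.
class ToySema {
public:
  std::vector<std::string> Diagnostics;

  // Invoked by the parser once it has recognized a complete for loop.
  // Checks the pieces and, if they are acceptable, builds the AST node.
  std::unique_ptr<ToyNode> ActOnForStmt(std::unique_ptr<ToyNode> Body) {
    if (!Body) {
      Diagnostics.push_back("for loop has no body");
      return nullptr;
    }
    auto For = std::make_unique<ToyNode>();
    For->Kind = "ForStmt";
    For->Children.push_back(std::move(Body));
    return For;
  }
};

// Toy parser: recognizes the token sequence `for ( ) <body>` and hands the
// result to Sema instead of constructing the AST node itself.
std::unique_ptr<ToyNode> parseForStmt(const std::vector<std::string> &Toks,
                                      ToySema &S) {
  if (Toks.size() < 3 || Toks[0] != "for" || Toks[1] != "(" || Toks[2] != ")")
    return nullptr; // Not a for loop in this toy grammar.
  std::unique_ptr<ToyNode> Body;
  if (Toks.size() > 3) {
    Body = std::make_unique<ToyNode>();
    Body->Kind = Toks[3];
  }
  return S.ActOnForStmt(std::move(Body));
}
```

The design point the sketch tries to capture is the division of labor: the parser only decides that a for loop was seen, while the checking and the AST node creation live behind the ActOn callback.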

AST

The AST is the most important primitive when it comes to extending Clang with your custom logic. All the common Clang extensions/plugins that we will introduce operate on an AST. To get a taste of the AST, you can use the following command to print out the AST from the source code:

$ clang -Xclang -ast-dump -fsyntax-only foo.c

For example, on my computer, I have used the following simple code, which only contains one function:

int foo(int c) { return c + 1; }

This will yield the following output:

TranslationUnitDecl 0x560f3929f5a8 <<invalid sloc>> <invalid sloc>
|…
`-FunctionDecl 0x560f392e1350 <./test.c:2:1, col:30> col:5 foo 'int (int)'
  |-ParmVarDecl 0x560f392e1280 <col:9, col:13> col:13 used c 'int'
  `-CompoundStmt 0x560f392e14c8 <col:16, col:30>
    `-ReturnStmt 0x560f392e14b8 <col:17, col:28>
      `-BinaryOperator 0x560f392e1498 <col:24, col:28> 'int' '+'
        |-ImplicitCastExpr 0x560f392e1480 <col:24> 'int' <LValueToRValue>
        | `-DeclRefExpr 0x560f392e1440 <col:24> 'int' lvalue ParmVar 0x560f392e1280 'c' 'int'
        `-IntegerLiteral 0x560f392e1460 <col:28> 'int' 1

This command is pretty useful because it tells you the C++ AST class that represents certain language directives, which is crucial for writing AST callbacks – the core of many Clang plugins. For example, from the previous lines, we can know that a variable reference site (c in the c + 1 expression) is represented by the DeclRefExpr class.

Similar to how the parser was organized, you can register different kinds of ASTConsumer instances to visit or manipulate the AST. CodeGen, which we will introduce shortly, is one of them. In Chapter 7, Handling AST, we will show you how to implement custom AST processing logic using plugins.
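If the tree notation in the dump is unfamiliar, the following toy sketch may help in reading it. It models only the nesting of the c + 1 expression from the output shown previously and prints it with the same |- and `- markers. DumpNode, dump, dumpChildren, and sampleTree are hypothetical names for this illustration and have nothing to do with Clang's actual AST classes or dumper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy tree node: just a label and children, unlike Clang's typed AST nodes.
struct DumpNode {
  std::string Label;
  std::vector<DumpNode> Children;
};

// Appends one line per descendant. The last child gets "`-" and the others
// get "|-"; deeper levels inherit "| " or "  " so the vertical bars line up.
void dumpChildren(const DumpNode &N, const std::string &Prefix,
                  std::string &Out) {
  for (std::size_t I = 0; I < N.Children.size(); ++I) {
    const bool IsLast = (I + 1 == N.Children.size());
    Out += Prefix + (IsLast ? "`-" : "|-") + N.Children[I].Label + "\n";
    dumpChildren(N.Children[I], Prefix + (IsLast ? "  " : "| "), Out);
  }
}

// Renders the whole tree as text, root first.
std::string dump(const DumpNode &Root) {
  std::string Out = Root.Label + "\n";
  dumpChildren(Root, "", Out);
  return Out;
}

// The nesting of `c + 1` as it appears in the dump above (labels only).
DumpNode sampleTree() {
  DumpNode Ref{"DeclRefExpr", {}};
  DumpNode Cast{"ImplicitCastExpr", {Ref}};
  DumpNode Lit{"IntegerLiteral", {}};
  return DumpNode{"BinaryOperator", {Cast, Lit}};
}
```

Reading the result of dump(sampleTree()) from top to bottom shows why DeclRefExpr is indented under ImplicitCastExpr: it is a child of the cast, which in turn is the left operand of the BinaryOperator.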

CodeGen

Though there are no prescriptions for how you should process the AST (for example, if you use the -ast-dump command-line option shown previously, the frontend will print the textual AST representation), the most common task that's performed by the CodeGen subsystem is emitting the LLVM IR code, which will later be compiled into native assembly or object code by LLVM.

LLVM, assemblers, and linkers

Once the LLVM IR code has been emitted by the CodeGen subsystem, it will be processed by the LLVM compilation pipeline to generate native code, either assembly code or object code. LLVM provides a framework called the MC layer, in which architectures can choose to implement assemblers that have been directly integrated into LLVM's pipeline. Major architectures such as x86 and ARM use this approach. If an architecture doesn't do this, any textual assembly code that's emitted at the end of LLVM's pipeline needs to be processed by external assembler programs invoked by the driver.

Although LLVM already has its own linker, known as the LLD project, an integrated linker is not yet a mature option. Therefore, external linker programs are always invoked by the driver to link the object files and generate the final binary artifacts.

External versus integrated

Using external assemblers or linkers means invoking a separate process to run the program. For example, to run an external assembler, the frontend needs to put assembly code into a temporary file before launching the assembler with that file path as one of its command-line arguments. On the other hand, using integrated assemblers/linkers means the functionalities of assembling or linking are packaged into libraries rather than an executable. So, at the end of the compilation pipeline, LLVM will call APIs to process the assembly code's in-memory instances to emit object code. The advantage of this integrated approach is, of course, saving many indirections (writing into temporary files and reading them back right away). It also makes the code more concise to some extent.

With that, you have been given an overview of a normal compilation flow, from the source code all the way to the native code. In the next section, we will go beyond the clang executable and provide an overview of the tooling and extension options provided by Clang. This not only augments the functionalities of clang, but also provides a way to use Clang's amazing techniques in out-of-tree projects.

Exploring Clang's tooling features and extension options

The Clang project contains not just the clang executable. It also provides interfaces for developers to extend its tools, as well as to export its functionalities as libraries. In this section, we will give you an overview of all these options. Some of them will be covered in later chapters.

There are currently three kinds of tooling and extension options available in Clang: Clang plugins, libTooling, and Clang Tools. To explain their differences and provide more background knowledge when we talk about Clang extensions, we need to start from an important data type first: the clang::FrontendAction class.

The FrontendAction class

In the Learning Clang's subsystems and their roles section, we went through a variety of Clang's frontend components, such as the preprocessor and Sema, to name a few. Many of these important components are encapsulated by a single data type, called FrontendAction. A FrontendAction instance can be treated as a single task running inside the frontend. It provides a unified interface for the task to consume and interact with various resources, such as input source files and ASTs, which is similar to the role of an LLVM Pass from this perspective (an LLVM Pass provides a unified interface to process LLVM IR). However, there are some significant differences with an LLVM Pass:

  • Not all frontend components are encapsulated into a FrontendAction; the parser and Sema, for example, are not. They are standalone components that generate materials (for example, the AST) for other FrontendActions to run.
  • Except for a few scenarios (the Clang plugin is one of them), a Clang compilation instance rarely runs multiple FrontendActions. Normally, only one FrontendAction will be executed.

Generally speaking, a FrontendAction describes the task to be done at one or two important places in the frontend. This explains why it's so important for tooling or extension development – we're basically building our logic into a FrontendAction (one of FrontendAction's derived classes, to be more precise) instance to control and customize the behavior of a normal Clang compilation.

To give you a feel for the FrontendAction module, here are some of its important APIs:

  • FrontendAction::BeginSourceFileAction(…)/EndSourceFileAction(…): These are callbacks that derived classes can override to perform actions right before processing a source file and once it has been processed, respectively.
  • FrontendAction::ExecuteAction(…): This callback describes the main actions to do for this FrontendAction. Note that while no one stops you from overriding this method directly, many of FrontendAction's derived classes already provide simpler interfaces to describe some common tasks. For example, if you want to process an AST, you should inherit from ASTFrontendAction instead and leverage its infrastructures.
  • FrontendAction::CreateASTConsumer(…): This is a factory function that's used to create an ASTConsumer instance, which is a group of callbacks that will be invoked by the frontend when it's traversing different parts of the AST (for example, a callback to be called when the frontend encounters a group of declarations). Note that while the majority of FrontendActions work after the AST has been generated, the AST might not be generated at all. This may happen if the user only wants to run the preprocessor, for example (such as to dump the preprocessed content using Clang's -E command-line option). Thus, you don't always need to implement this function in your custom FrontendAction.

Again, normally, you won't derive your class directly from FrontendAction, but understanding FrontendAction's internal role in Clang and its interfaces can give you more material to work with when it comes to tooling or plugin development.
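The order in which these callbacks fire can be sketched with a toy model. ToyFrontendAction and LoggingAction below are hypothetical classes that only mirror the callback names from the list above; their signatures are simplified assumptions (each takes a file name as a string) and do not match clang::FrontendAction's real interface.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a frontend action: one task run over one source file.
class ToyFrontendAction {
public:
  virtual ~ToyFrontendAction() = default;

  // Drives the lifecycle for a single input file. Returns false if the
  // "begin" callback vetoes processing, true otherwise.
  bool run(const std::string &File) {
    if (!BeginSourceFileAction(File))
      return false;
    ExecuteAction(File);
    EndSourceFileAction();
    return true;
  }

protected:
  // Hook called right before the file is processed.
  virtual bool BeginSourceFileAction(const std::string &) { return true; }
  // The main work of the action; derived classes must provide it.
  virtual void ExecuteAction(const std::string &File) = 0;
  // Hook called once the file has been processed.
  virtual void EndSourceFileAction() {}
};

// Example action that only records which callbacks were invoked.
class LoggingAction : public ToyFrontendAction {
public:
  std::vector<std::string> Log;

protected:
  bool BeginSourceFileAction(const std::string &File) override {
    Log.push_back("begin " + File);
    return true;
  }
  void ExecuteAction(const std::string &File) override {
    Log.push_back("execute " + File);
  }
  void EndSourceFileAction() override { Log.push_back("end"); }
};
```

Running a LoggingAction on "foo.cpp" in this sketch records the begin, execute, and end steps in that order, which is the same bracketing that the real callbacks provide around the main action.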

Clang plugins

A Clang plugin allows you to dynamically register a new FrontendAction (more specifically, an ASTFrontendAction) that can process the AST either before or after, or even replace, the main action of clang. A real-world example can be found in the Chromium project, in which they use Clang plugins to impose some Chromium-specific rules and make sure their code base is free from any non-ideal syntax. For example, one of the tasks is checking if the virtual keyword has been placed on methods that should be virtual.

A plugin can be easily loaded into a normal clang by using simple command-line options:

$ clang -fplugin=/path/to/MyPlugin.so … foo.cpp

This is really useful if you want to customize the compilation but have no control over the clang executable (that is, you can't use a modified version of clang). In addition, using the Clang plugin allows you to integrate with the build system more tightly; for example, if you want to rerun your logic once the source files or even arbitrary build dependencies have been modified. Since the Clang plugin is still using clang as the driver and modern build systems are pretty good at resolving normal compilation command dependencies, this can be done by making a few compile flag tweaks.

However, the biggest downside of using the Clang plugin is API stability. In theory, you can load and run your plugin in any clang executable, but only if the C++ APIs (and the ABI) used by your plugin match those of the clang executable. Unfortunately, for now, Clang (and also the whole LLVM project) has no intention of making any of its C++ APIs stable. In other words, to take the safest path, you need to make sure both your plugin and clang are using the exact same (major) version of LLVM. This issue makes it pretty hard to release a Clang plugin as a standalone product.

We will look at this in more detail in Chapter 7, Handling AST.

LibTooling and Clang Tools

LibTooling is a library that provides features for building standalone tools on top of Clang's techniques. You can use it like a normal library in your project, without having any dependencies on the clang executable. Also, the APIs are designed to be more high-level so that you don't need to deal with many of Clang's internal details, making it more friendly to non-Clang developers.

Language servers are one of the most famous use cases of libTooling. A language server is launched as a daemon process and accepts requests from editors or IDEs. These requests can be as simple as syntax checking a code snippet or as complicated as code completion. While a language server does not need to compile the incoming source code into native code as normal compilers do, it needs a way to parse and analyze that code, which is non-trivial to build from scratch. libTooling avoids the need to reinvent the wheel in this case by taking Clang's techniques off-the-shelf and providing an easier interface for language server developers.

To give you a more concrete idea of how libTooling differs from the Clang plugin, here is a (simplified) code snippet for executing a custom ASTFrontendAction called MyCustomAction:

int main(int argc, char** argv) {
  CommonOptionsParser OptionsParser(argc, argv,…);
  ClangTool Tool(OptionsParser.getCompilations(), {"foo.cpp"});
  return Tool.run(newFrontendActionFactory<MyCustomAction>().get());
}

As shown in the previous code, you can embed this logic into your own code base like any other library call. libTooling also provides lots of nice utilities, such as CommonOptionsParser, which parses textual command-line options and transforms them into Clang options for you.

libTooling's API Stability

Unfortunately, libTooling doesn't provide stable C++ APIs either. Nevertheless, this isn't a problem since you have full control over what LLVM version you're using.

Last but not least, Clang Tools is a collection of utility programs built on top of libTooling. You can think of it as the command-line tool version of libTooling in that it provides some common functionalities. For example, you can use clang-refactor to refactor the code. This includes renaming a variable, as shown in the following code:

// In foo.cpp…
struct Location {
  float Lat, Lng;
};
float foo(Location *loc) {
  auto Lat = loc->Lat + 1.0;
  return Lat;
}

If we want to rename the Lat member variable in the Location struct to Latitude, we can use the following command:

$ clang-refactor --selection="foo.cpp:1:1-10:2" \
                 --old-qualified-name="Location::Lat" \
                 --new-qualified-name="Location::Latitude" \
                 foo.cpp

Building clang-refactor

Be sure to follow the instructions at the beginning of this chapter to include clang-tools-extra in the list for the LLVM_ENABLE_PROJECTS CMake variable. By doing this, you'll be able to build clang-refactor using the ninja clang-refactor command.

You will get the following output:

// In foo.cpp…
struct Location {
  float Latitude, Lng;
};
float foo(Location *loc) {
  auto Lat = loc->Latitude + 1.0;
  return Lat;
}

This is done by the refactoring framework built inside libTooling; clang-refactor merely provides a command-line interface for it.

Summary

In this chapter, we looked at how Clang is organized and the functionalities of some of its important subsystems and components. Then, we learned about the differences between Clang's major extension and tooling options – the Clang plugin, libTooling, and Clang Tools – including what each of them looks like and what their pros and cons are. The Clang plugin provides an easy way to insert custom logic into Clang's compilation pipeline via dynamically loaded plugins but suffers from API stability issues; libTooling has a different focus than the Clang plugin in that it aims to provide a toolbox for developers to create a standalone tool; and Clang Tools provides various applications.

In the next chapter, we will talk about preprocessor development. We will learn how the preprocessor and the lexer work in Clang, and show you how to write plugins for the sake of customizing preprocessing logic.

Further reading
