Peer reviewed by Younes Rafie.
There's a lot going on under the hood when we execute a piece of PHP code. Broadly speaking, the PHP interpreter goes through four stages when executing code: lexing, parsing, compilation, and interpretation.
This piece will skim through these stages and show how we can view the output from each stage to really see what is going on. Note that while some of the extensions used should already be a part of your PHP installation (such as tokenizer and OPcache), others will need to be manually installed and enabled (such as php-ast and VLD).
Lexing (or tokenizing) is the process of turning a string (PHP source code, in this case) into a sequence of tokens. A token is simply a named identifier for the value it has matched. PHP uses re2c to generate its lexer from the zend_language_scanner.l definition file.
We can see the output of the lexing stage via the tokenizer extension:
$code = <<<'code'
<?php
$a = 1;
code;
$tokens = token_get_all($code);
foreach ($tokens as $token) {
    if (is_array($token)) {
        echo "Line {$token[2]}: ", token_name($token[0]), " ('{$token[1]}')", PHP_EOL;
    } else {
        var_dump($token);
    }
}
Outputs:
Line 1: T_OPEN_TAG ('<?php
')
Line 2: T_VARIABLE ('$a')
Line 2: T_WHITESPACE (' ')
string(1) "="
Line 2: T_WHITESPACE (' ')
Line 2: T_LNUMBER ('1')
string(1) ";"
There are a couple of noteworthy points from the above output. The first is that not all pieces of the source code are named tokens. Instead, some symbols are considered tokens in and of themselves (such as =, ;, :, ?, etc). The second is that the lexer does a little more than simply output a stream of tokens. It also, in most cases, stores the lexeme (the value matched by the token) and the line number of the matched token (which is used for things like stack traces).
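Because named tokens arrive as arrays and single-character tokens arrive as plain strings, it can be convenient to give both the same shape before working with them. The following is a minimal sketch of that idea; the normalizeTokens() helper and the 'CHAR' label are made up for illustration and are not part of the tokenizer extension.

```php
<?php
// Hypothetical helper: gives every token the same [name, lexeme, line] shape.
function normalizeTokens(string $code): array
{
    $normalized = [];
    $line = 1; // best-effort line tracking for single-character tokens

    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Named token: [token id, lexeme, line number]
            [$id, $lexeme, $line] = $token;
            $normalized[] = [token_name($id), $lexeme, $line];
            // Advance past any newlines contained in the lexeme itself
            $line += substr_count($lexeme, "\n");
        } else {
            // Single-character token such as "=" or ";" ('CHAR' is an invented label)
            $normalized[] = ['CHAR', $token, $line];
        }
    }

    return $normalized;
}

foreach (normalizeTokens("<?php\n\$a = 1;") as [$name, $lexeme, $line]) {
    printf("Line %d: %-14s %s\n", $line, $name, json_encode($lexeme));
}
```

The line number for single-character tokens is inferred from the surrounding named tokens, so treat it as an approximation rather than something the lexer reports.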
The parser is also generated, this time with Bison via a BNF grammar file. PHP uses a LALR(1) (look ahead, left-to-right) context-free grammar. The look ahead part simply means that the parser is able to look n tokens ahead (1, in this case) to resolve ambiguities it may encounter whilst parsing. The left-to-right part means that it parses the token stream from left to right.
The generated parser stage takes the token stream from the lexer as input and has two jobs. It firstly verifies the validity of the token order by attempting to match them against any one of the grammar rules defined in its BNF grammar file. This ensures that valid language constructs are being formed by the tokens in the token stream. The second job of the parser is to generate the abstract syntax tree (AST) - a tree view of the source code that will be used during the next stage (compilation).
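The first of those two jobs can be observed from userland without any extra extensions. As a rough sketch: passing the TOKEN_PARSE flag to token_get_all() should make the tokenizer run the token stream through the parser as well, which raises a ParseError when no grammar rule matches. The syntaxError() helper name is an assumption made for this example.

```php
<?php
// Hypothetical helper: returns null when the code parses,
// or the parser's error message when it does not.
function syntaxError(string $code): ?string
{
    try {
        // TOKEN_PARSE asks the tokenizer to also run the parser over the tokens
        token_get_all($code, TOKEN_PARSE);
        return null;
    } catch (ParseError $e) {
        return $e->getMessage();
    }
}

var_dump(syntaxError('<?php $a = 1;')); // valid token order
var_dump(syntaxError('<?php $a = ;'));  // "=" must be followed by an expression
```

The exact wording of the error message differs between PHP versions, so it is better to check for null than to compare message strings.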
We can view a form of the AST produced by the parser using the php-ast extension. The internal AST is not directly exposed because it is not particularly "clean" to work with (in terms of consistency and general usability), and so the php-ast extension performs a few transformations upon it to make it nicer to work with.
Let's have a look at the AST for a rudimentary piece of code:
$code = <<<'code'
<?php
$a = 1;
code;
print_r(ast\parse_code($code, 30));
Output:
ast\Node Object (
    [kind] => 132
    [flags] => 0
    [lineno] => 1
    [children] => Array (
        [0] => ast\Node Object (
            [kind] => 517
            [flags] => 0
            [lineno] => 2
            [children] => Array (
                [var] => ast\Node Object (
                    [kind] => 256
                    [flags] => 0
                    [lineno] => 2
                    [children] => Array (
                        [name] => a
                    )
                )
                [expr] => 1
            )
        )
    )
)
The tree nodes (which are typically of type ast\Node) have several properties:
kind - An integer value to depict the node type; each has a corresponding constant (e.g. AST_STMT_LIST => 132, AST_ASSIGN => 517, AST_VAR => 256)
flags - An integer that specifies overloaded behaviour (e.g. an ast\AST_BINARY_OP node will have flags to differentiate which binary operation is occurring)
lineno - The line number, as seen from the token information earlier
children - Sub nodes, typically parts of the node broken down further (e.g. a function node will have the children: parameters, return type, body, etc)
The AST output of this stage is handy to work off of for tools such as static code analysers (e.g. Phan).
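To make those properties a little more tangible, here is a small sketch of a recursive walk over such a tree. The outlineAst() helper is invented for this example. It relies only on the kind, lineno, and children properties described above, and uses ast\get_kind_name() for readable names when php-ast is loaded. The version number passed to ast\parse_code() must be one that the installed php-ast release supports; 30 is simply the value used earlier.

```php
<?php
// Hypothetical helper: renders a node and its children as an indented outline.
function outlineAst($node, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);

    // Leaves (such as the name "a" or the literal 1) are plain PHP values, not nodes
    if (!is_object($node)) {
        return $indent . var_export($node, true) . PHP_EOL;
    }

    // Prefer the readable constant name when php-ast is available
    $kind = function_exists('ast\get_kind_name')
        ? ast\get_kind_name($node->kind)
        : (string) $node->kind;

    $out = "{$indent}{$kind} (line {$node->lineno})" . PHP_EOL;
    foreach ($node->children as $child) {
        $out .= outlineAst($child, $depth + 1);
    }

    return $out;
}

// Only attempt to parse when the php-ast extension is installed
if (function_exists('ast\parse_code')) {
    echo outlineAst(ast\parse_code("<?php\n\$a = 1;", 30));
}
```

Tools that analyse code tend to follow the same pattern: recurse through children and branch on kind.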
The compilation stage consumes the AST, where it emits opcodes by recursively traversing the tree. This stage also performs a few optimizations. These include resolving some function calls with literal arguments (such as strlen("abc") to int(3)) and folding constant mathematical expressions (such as 60 * 60 * 24 to int(86400)).
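The idea behind constant folding can be shown with a toy example. The sketch below is not how the Zend compiler is implemented; the nested-array expression format and the foldConstants() function are assumptions made purely to illustrate the technique of evaluating an operation ahead of time when all of its operands are already known.

```php
<?php
// Toy expression tree: a node is either a literal or ['*', left, right].
// Hypothetical illustration only; the real compiler works on its internal AST.
function foldConstants($node)
{
    if (!is_array($node)) {
        return $node; // literals and unknown operands are left untouched
    }

    [$op, $left, $right] = $node;
    $left = foldConstants($left);
    $right = foldConstants($right);

    // Both operands are known integers, so multiply now instead of at runtime
    if ($op === '*' && is_int($left) && is_int($right)) {
        return $left * $right;
    }

    return [$op, $left, $right];
}

var_dump(foldConstants(['*', ['*', 60, 60], 24]));       // int(86400)
var_dump(foldConstants(['*', ['*', 60, 60], '$hours'])); // only 60 * 60 is folded
```

In the second call the variable is not known ahead of time, so only the inner 60 * 60 collapses and the outer multiplication is left for runtime.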
We can inspect the opcode output at this stage in a number of ways, including with OPcache, VLD, and PHPDBG. I'm going to use VLD for this, since I feel the output is more friendly to look at.
Let's see what the output is for the following file.php script:
if (PHP_VERSION === '7.1.0-dev') {
    echo 'Yay', PHP_EOL;
}
Executing the following command:
php -dopcache.enable_cli=1 -dopcache.optimization_level=0 -dvld.active=1 -dvld.execute=0 file.php
Our output is:
The opcodes sort of resemble the original source code, enough to follow along with the basic operations. (I'm not going to delve into the details of opcodes here, since that could be a book in itself.) No optimizations were applied at the opcode level in the above script - but as we can see, the compilation phase has made some by resolving the constant condition (PHP_VERSION === '7.1.0-dev') to true.
OPcache does more than simply cache opcodes (thus bypassing the lexing, parsing, and compilation stages). It also comes with many different levels of optimization. Let's turn up the optimization level to four passes to see what comes out:
Command:
php -dopcache.enable_cli=1 -dopcache.optimization_level=1111 -dvld.active=-1 -dvld.execute=0 file.php
Output:
We can see that the constant condition has been removed, and the two ECHO instructions have been compacted into a single instruction. These are just a taste of the many optimizations OPcache applies when performing passes over the opcodes of a script. I won't go through the various optimization levels here, though.
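Before comparing opcode dumps, it can help to confirm what the current PHP binary is actually configured to do. The following is a small sketch using only extension_loaded() and ini_get(); the opcacheSummary() helper name is made up for this example.

```php
<?php
// Hypothetical helper: summarises the OPcache settings used in the commands above.
function opcacheSummary(): array
{
    return [
        'loaded'             => extension_loaded('Zend OPcache'),
        // ini_get() returns false for settings that are not registered
        'enable_cli'         => ini_get('opcache.enable_cli'),
        'optimization_level' => ini_get('opcache.optimization_level'),
    ];
}

print_r(opcacheSummary());
```

If loaded is false, the opcache.* settings passed with -d will have no effect on the opcodes that get dumped.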
The final stage is the interpretation of the opcodes. This is where the opcodes are run on the Zend Engine (ZE) VM. There's actually very little to say about this stage (from a high-level perspective, at least). The output is pretty much whatever your PHP script outputs via commands such as echo, print, var_dump, and so on.
So instead of digging into anything complex at this stage, here's a fun fact: PHP requires itself as a dependency when generating its own VM. This is because the VM is generated by a PHP script, due to it being simpler to write and easier to maintain.
We've taken a brief look through the four stages that the PHP interpreter goes through when running PHP code. This has involved using various extensions (including tokenizer, php-ast, OPcache, and VLD) to manipulate and view the output of each stage.
I hope this piece has helped to provide you with a better holistic understanding of PHP's interpreter, as well as shown the importance of the OPcache extension (for both its caching and optimization abilities).