12.8 Code Generation

We generate three kinds of outputs – raw assembly language code, reverse Polish notation (RPN) Intermediate Representation (IR) and 4-tuple IR. The code generation is achieved through yacc action terms.

Due to space constraints we do not propose to give the complete listings of the yacc action terms, but give a few indicative examples below.

12.8.1 Arithmetic Expression

We give here the yacc action terms for expr ‘+’ expr:

| expr ‘+’ expr { fprintf(fout1,″+
″); $$ = $3;
                  fprintf(fout2,″%d: ADD T%d T%d T%d
″, ++mcount,
                           tpop(), tpop(), ++tcount); tpush(tcount);
                  efadd();}

The files fout1 and fout2 get the RPN IR and 4-tuple IR outputs, respectively. The raw assembly language code is generated by the function efadd() in the file codegen.c, which is:

efadd(){
  emit(″ faddp
″,0);
}

It outputs the FPU instruction faddp to file fout.

12.8.2 Assignment

Here are the action terms for a typical assignment:

asgn: VAR ‘=’ expr{fprintf(fout1,″%s
 = 
″, name($1));
                   fprintf(fout2,″%d: = T%d %s ––
″, ++mcount,
                                  tcount, name($1));tpush(tcount);
                   pfassgn($1);
                  }

The corresponding function pfassgn() for generating the raw assembly code is:

pfassgn(Node *v){
  char str[40];
  sprintf(str,″ fstps %s
″, name(v));
  emit(str,0);
}

Notice that the floating-point value is being stored single-precision and the FPU stack is popped.

12.8.3 Comparison with Logical Result

Result of an arithmetic expression comparison is a logical value, which we represent as False = 0 and True = 1. We set a byte register as per the result of the companion with one of these two values. Then we can uniformly use a JZ instruction as the building block of all our control constructs. The following sample code which does expr > expr comparison illustrates this idea.

| expr GT expr {efgt(); }

The corresponding raw assembly code generator function – efgt() – looks like

efgt(){
  emit(″ fcomip %%st(1)
″,0);   // want ″st(1) > st(0)?″
  emit(″ setc %%al
″,0);        // i.e.  TOS   < st(1)
  emit(″ andl $1, %%eax
″,0);   //        1st     2nd
  emit(″ fstp FPacc
″,0);
}

The floating point compares the instruction sets two flags C and Z. We set al to 1 if the C is set, i.e. TOS is less than the second operand. Remember we want to check if al is 0 or 1, which is achieved by the andl instruction. This will set up the flags again for a subsequent JZ instruction.

Other comparisons are implemented in similar ways.

12.8.4 Integer Increment and Decrement

A very common operation in C, incrementation or decrementation of an integer variable, requires some a careful implementation. When we write a variable within an expression (on the right-hand side of an assignment, for example), such as a, its current value is expected to be inserted in that place. When an incremented variable is used, for example, ++a, two operations are involved – first increment a and then insert that incremented value at that point in the expression. If it were a++, first insert the value in the expression and then increment the variable.

The yacc action terms are:

| INC IVAR         {efincv($2); pivar($2); $$ = $2; } // ++a
| DEC IVAR         {efdecv($2); pivar($2); $$ = $2; } // ––a
| IVAR INC         {$$ = $1; pivar($1); efvinc($1); } // a++
| IVAR DEC         {$$ = $1; pivar($1); efvdec($1); } // a––

The corresponding raw assembly generation functions are:

efincv(Node *p){
  char str[40];
  sprintf(str,″ incl %s
″, name(p));
  emit(str,0);
}
efdecv(Node *p){
  char str[40];
  sprintf(str,″ decl %s
″, name(p));
  emit(str,0);
}
efvinc(Node *p){
  char str[40];
  sprintf(str,″ incl %s
″, name(p));
  emit(str,0);
}
efvdec(Node *p){
  char str[40];
  sprintf(str,″ decl %s
″, name(p));
  emit(str,0);
}

12.8.5 IF-THEN-ELSE Construct

Actions for IF-THEN-ELSE construct.

if: IF{Symbol *i; i = tree_root(″dope″); $$ = i; ++lcount; lpush(lcount);
                                                  ++lcount; lpush(lcount);}
cond: expr{int l; fprintf(fout1, ″Label%d
BZ
″, l = lpop()); lpush(l); 
                  ($$) –> y.I = ++mcount;
                  fprintf(fout2, ″%d: BZ %d -- --
″, mcount, 0);
                  ethen(l);}
stmt:
      | if’(’ cond ‘)’ stmt end {int l;                            /* else-less if */
        fprintf(fout1, ″Label%d
:
″, l = lpop()); lpop(); /*label1 */
        fprintf(fout2, ″%d: LABI %d -- --
″ , ++mcount, ($3)–>y.I);
        eendif(l);
      }     /* end, if cond fails */
| if’(’ cond ‘)’ stmt end ELSE{int l,l1 ;
      l = lpop(); /* label2 */
          fprintf(fout1, ″Label%d
BR
″, (l1 = lpop())); lpush(l1);
          fprintf(fout1, ″Label%d
:
″, l); /* label2 */
          fprintf(fout2, ″%d: BR %d -- --
″, ++mcount, 0);
          ($3)–>x.I = mcount;
          fprintf(fout2, ″%d: LABI %d -- --
″, ++mcount, ($3)–>y.I);
          eelse(l,l1);
          }
          stmt end{int l;        /* if with else */
            fprintf(fout1, ″Label%d
:
″, l = lpop()); /* label2 */
            fprintf(fout2, ″%d: LABI %d -- --
″, ++mcount,($3)–>x.I);
            eendif(l);
          }

The variables fout1 and fout2 are handles for the Intermediate Representation output files *.AST and *.MAT containing the RPN and 4-tuple outputs, respectively. The raw assembly code is output to a file *.S. All these three files have names derived from the source file myfile.miniC.

When if is parsed, two label numbers and a node in Symbol Table are created. When the “cond” is parsed, a conditional branch is generated. The RPN IR and 4-tuple IR are generated within the action terms themselves, but the raw assembly code is generated by functions in the file codegen.c. The three functions used for IF-THEN-ELSE construct are:

eendif(int l){
  emit(″ Label%d:
″,l);
}
ethen(int l){
  emit(″ jz Label%d
″, l);
}
eelse(int l, int l1){
  emit(″ jmp Label%d
″ , l1);
  emit(″ Label%d:
″, l);
}

WHILE-DO and FOR loops are handled in a similar manner.

12.8.6 Function Definition and Call

The code for user-defined functions is to be generated in a slightly special way. The raw assembly code output for the functions is written, as an intermediate step, to a temporary file, with file handle fout3. All the code generated within a function definition is to be put in this file. To manage that, we have a switch variable:

int definition = 0;

For normal code, this variable has a value zero, but when the code for a function definition is being generated its value is set to 1.

The grammar and yacc action terms associated with function definition, function call and formal arguments of a function are:

defn: FUNC procname{definition = 1; type($2) = FUNCTION;
                    fprintf(fout1,″%s
DFN
″,name($2));
                    fprintf(fout2,″%d: DFN %s -- --
″, ++mcount,
                                                          name($2));
                   ($2)–>u.I = mcount;
                    edefn($2);}
      ‘(’ ‘)’ stmt{fprintf(fout1,″DFE
″);
                    fprintf(fout2,″%d: RET T%d -- --
″,++mcount,
                                                               tcount);
                    efstmt(); definition = 0;}
        ;
procname: VAR       { $$ = $1;}
        | FUNCTION  { $$ = $1;}
        ;

An identifier is initially assigned a type VAR by the Scanner, but when it appears within a valid function definition as the function name, its type is set to FUNCTION. The function prologue code is generated by edefn(), each statement within the definition generates its own code, but remember that now it goes to the temporary file. The function epilogue is generated by efstmt().

Within the function definition, its formal arguments are written as $1, $2, etc. The Scanner detects such arguments, assigns them type ARG and returns the argument number as the value of yylval:

if(c == ‘$’)  { /* argument? */
	        int n = 0;
	        while(isdigit(c = getc(fin)))
                n = 10 * n + c – ‘0’;
                ungetc(c, fin);
                if(n == 0)
                       execerror(″strange $…″, (char *)0);
                yylval.narg = n;
                return ARG;
        }

When a formal argument appears within a function definition, the code generator function parg() pushes an appropriate value on the computation stack. If a value is assigned to a formal argument, it is handled by function pargassgn(). Both these functions use stack addressing with respect to the Base Pointer ebp.

parg(int i){
  emit(″ flds %d(%%ebp)
″, (i + 1) * 4);
}
pargassgn(int i){
  char str[40];
  emit(″ fstps %d(%%ebp)
″, (i + 1) * 4);
}

A function call is handled by the following action terms:

 | FUNCTION begin ‘(’ arglist ‘)’
    { fprintf(fout1,″%s
″, name($1));
      fprintf(fout2,″%d: CALL %s T%d %d
″, ++mcount, name($1),++tcount, 4 * $4);
      efcall($1,$4);
}

arglist: /* nothing */   {$$ = 0; }
       | expr            {fprintf(fout2,″%d: ARG T%d -- --
″,++mcount,
                          tpop()); $$ = 1; eargpush($1); }
       | arglist ‘,’ expr {fprintf(fout2,″%d: ARG T%d -- --
″,++mcount,
                          tpop()); $$ = $1 + 1; eargpush($3);}
       ;

The function call arguments are pushed on the computation stack by code generator function eargpush() and efcall() generates the function invocation code.

efcall(Node *p, int q){
  char str[40];
  sprintf(str,″ call %s
″, name(p));
  emit(str, 0);
  emit(″ addl $%d, %%esp
″, q * 4);
}
eargpush(Node *p){
  char str[40];
  if(type(p) == INT || type(p) == IVAR){
    sprintf(str,″    pushl %s
″, name(p));
  }
  if(type(p) == NUMBER || type(p) == VAR){
    sprintf(str,″    pushl %s
″, name(p));
  }
  if(type(p) == STRING || type(p) == SVAR){
    sprintf(str,″    pushl %s
″, name(p));
  }
  emit(str,0);
}

Finally, the emit() function switches the output depending upon the value of definition.

emit(char *s, int i){
  if(definition == 0)
    fprintf(fout, s, i);
  else if(definition == 1)
    fprintf(fout3, s, i);
}

As a small example, consider the following miniC program:

float: j
func my(){ $1 = $1 + 1.0
  return $1
}
j = 5.0
printr(my(j))
prints(″\n″)
end

The code for the user-defined function my() is:

.text
  .globl my
  .type my, @function
my:
  pushl %ebp
  movl %esp, %ebp
  flds 8(%ebp)
  flds FC1
  faddp
  fstps 8(%ebp)
  flds 8(%ebp)
  movl %ebp, %esp
  pop %ebp
  ret

The “main” program is:

.data
Iacc: .int 0
.text
.globl  _start
_start: nop
    flds   FC2
    fstps  j
    flds   j
    pushl  j
    call   my
    addl   $4, %esp
    fstps  FPacc
    pushl  FPacc
    call   printr
    addl   $4, %esp
    movl   $SC3, %eax
    pushl  %eax
    call   prints
    addl   $4, %esp
    movl   %eax, %ebx
    movl   $1, %eax
    int    $0x80
.section .data
DEG: .float 57.295780
E:   .float 2.718282
FC1: .float 1.000000
FC2:   .float 5.000000
FPacc: .float 0.000000
GAMMA: .float 0.577216
PHI:   .float 1.618034
PI:    .float 3.141593
PREC:  .float 7.000000
SC3:   .ascii ″
″
float: .float 0.000000
int:   .int 0
j:     .float 0.000000
       .lcomm string, 4

When this program was assembled, linked and executed, it gave:

$test14
  6.00000E + 00

as it should.

12.8.7 Assembly Language Macros

Several assembly language macros are defined, mainly for interfacing with the Linux system calls. These macros are in a file named marocs.S. For example, the following macros are used to do read and write operations to the system console:

.macro read buff, buff_size
  movl $3, %eax
  movl $0, %ebx
  movl uff, %ecx
  movl uff_size, %edx
  int    $0x80
.endm
.macro write str, str_size
  movl $4, %eax
  movl $1, %ebx
  movl str, %ecx
  movl str_size, %edx
  int    $0x80
.endm

12.8.8 Built-in Functions Library

It was decided that we shall not use the standard C library for linking with the final compiled miniC program, though it is possible to do so, if we take care to use proper command-line arguments while linking the object files. This was demonstrated in Section 12.4.3.

Several functions, called built-in functions, which are callable from a miniC program, are defined. As the calling conventions of miniC is compatible with the standard C calling conventions, these functions can be tested independently of a compiled miniC program.

The following functions are available at present in the “built-in” library:

printi   Print an integer.

getint   Read an integer.

mystrlen   Find length of a string.

prints   Print a string.

gets   Read a string.

printf   Print a floating-point number.

getf   Read a floating-point number.

myfabs   Floating-point absolute value.

myfcos   Cosine of an angle (radians).

myfsin   Sine of an angle (radians).

myfsqrt   Square-root.

extract   Exponent of the floating-point argument.

atofproc   Convert a string to a floating-point number.

ftoaproc   Convert a floating-point number to a string.

 

We give here a small example of the function strlen which finds the length of the given null-terminated string.

#  find the length of a string
#  uses %ecx, %edi
#  %edi contains address of the string and %eax the result at end
.globl strlen
       .type strlen @function
strlen:
       pushl %ebp
       movl %esp, %ebp
       movl (maxstr), %ecx
       xorl %eax, %eax
       movl 8(%ebp), %edi
       repnz scasb (%edi)
       movl (maxstr), %eax
       subl %ecx, %eax
       decl %eax
       movl %ebp, %esp
       popl %ebp
       ret

Here, maxstr is the address of a global constant specifying the maximum string length that is handled, which is tentatively and arbitrarily set as 255.

12.8.9 A Few Example miniC Programs

Here, we present some example programs written in miniC to illustrate the syntax.

Example 1

A while loop plus floating point and string output.

float: a
a = 0.0
while(a < 10.0){
   printr(a)
   prints(″\n″)
   a = a + 1.0 } end
end

When compiled, assembled, linked and executed it gave the following output:

0.00000E + 00
1.00000E + 00
2.00000E + 00
3.00000E + 00
4.00000E + 00
5.00000E + 00
6.00000E + 00
7.00000E + 00
8.00000E + 00
9.00000E + 00

Example 2

Boolean conditions and unary-minus test:

float: a
float: b
b = 1.2
a = 2.3
if(a > b) prints(″a greater than b\n″)
if(a >= b) prints(″a greater/equal b\n″)
if(a == a) prints(″a equal to a\n″)
if(a != b) prints(″a not equal to b\n″)
if(b <= a) prints(″b less/equal a\n″)
if(b < a) prints(″b less than a\n″)
b = –a
printr(b)
prints(″\n″)
printr(–99.99)
prints(″------ That's all.\n″)
end

The output is:

a greater than b
a greater/equal b
a equal to a
a not equal to b
b less/equal a
b less than a
–2.30000E + 00
–9.99900E + 01----  That's all.

Example 3

Test pre- and post-increment and decrement:

int : a
a = 7
printi(++a)
prints(″\n″)
printi(a++)
prints(″\n″)
printi(a)
printi(––a)
prints(″\n″)
printi(a––)
prints(″\n″)
printi(a)
prints(″ that's all.\n″)
end

The result output is:

8
8
9            8
8
7 that's all.

12.8.10 Assembly Language Idioms

We present here some assembly language idioms which may help the reader in understanding and extending the miniC language.

IF-THEN-ELSE Construct

In C, an if-statement consists of three parts – the condition, the true branch and the false branch. However, since assembly language is not a block-structured language, you have to work a little to implement the block-like nature of C. For example, look at the following C code:

if(a == b){
  /* True Branch Code Here */
} else {
  /* False Branch Code Here */
}
  /* At This Point, Reconverge */

In assembly language, this can be rendered as:

# Move a and b into registers for comparison
  movl a, %eax
  movl b, %ebx
# Compare
  cmpl %eax, %ebx
# If True, go to true branch
  je true_branch
false_branch:
# This label is unnecessary,
# only here for documentation
# False Branch Code Here
# Jump to recovergence point
  jmp reconverge
true_branch:
# True Branch Code Here
reconverge:
# Both branches recoverge to this point

Since assembly language is linear, the blocks have to jump around each other. A case statement is written just like a sequence of if-statements.

Function Call

A function call in assembly language simply requires pushing the arguments to the function onto the stack in reverse order and issuing a call instruction. After the call is executed, the arguments are then popped back off of the stack or the stack pointer is adjusted. For example, consider the C code:

printf(″The number is %d″, 88);

In assembly language, this can be implemented as:

.section .data
text_string:
  .ascii ″The number is %d″
.section .text
  pushl $88
  pushl $text_string
  call printf
  popl %eax
  popl %eax
#%eax is just a dummy variable.

Variables and Assignments

Global and static variables are declared using .data or .bss entries. Local variables are declared by reserving space on the stack at the beginning of the function. This space is given back at the end of the function. Interestingly, global variables are accessed differently than local variables in assembly language. Global variables are accessed using direct addressing, whereas local variables are accessed using base pointer (%ebp) addressing. For example, consider the following C code:

int global_var;
int foo(){
  int local_var;
  local_var = 1;
  global_var = 2;
  return 0;
}

This can be implemented in assembly language as:

.section .data
  .lcomm global_var, 4
  .type foo, @function
foo:
  pushl %ebp                         # Save old base pointer
  movl %esp, $ebp                    # make stack pointer base pointer
  subl $4, %esp                      # Make room for local_var
  .equ local_var, –4                 # we can now use local_var to
                                     # find the local variable
movl $1, local_var(%ebp)
movl $2, global_var
movl %ebp, %esp                      # Clean up function and return
popl %ebp
ret

What may not be obvious is that accessing the global variable takes fewer machine cycles than accessing the local variable. However, that may not matter because the stack is more likely to be in physical memory (instead of swap-disk) compared with the global variable, which might have been swapped out.

Loops

Loops work somewhat like IF-statements in assembly language – the blocks are formed by jumping around. In C, a WHILE loop consists of a loop body, and a test to determine whether it is time to exit the loop. A FOR loop is similar, with optional initialization and counter-increment sections. In C, a while loop looks like this:

while(a < b){
  /* Do stuff here */
}
/* Finished Looping */

This can be implemented in assembly language as:

loop_begin:
  movl a, %eax
  movl b, %ebx
  cmpl %eax, %ebx
  jge     loop_end
loop_body:
# Do stuff here
  jmp loop_begin
loop_end:
# Finished looping

The x86 assembly language has some direct support for looping as well. The %ecx register can be used as a counter that ends with zero. The loop instruction will decrement %ecx and jump to a specified address unless %ecx is zero. For example, if you wanted to execute a statement 100 times, you would do this in C:

for(i = 0; i < 100; i++){
/* Do process here */
}

In assembly language, it can be written as:

loop_initialize:
  movl $100, %ecx
loop_begin:
  #
  #Do Process Here
  #
  #Decrement %ecx and loops if not zero
  loop loop_begin
rest_of_program:
  #Continues on to here

One thing to notice is that the loop instruction requires you to be counting backwards to zero. If you need to count forwards or use another ending number, you should use the loop form which does not include the loop instruction. For really tight loops of character string operations, there is also the rep instruction, which we have already used in strlen() implementation.

Structs

Structs are simply descriptions of memory blocks. For example, in C you can say:

struct person{
  char firstname[40];
  char lastname[40];
  int age;
}

This does not do anything by itself, except giving you ways of elegantly using 84-bytes of data. You can do basically the same thing using . equ directives in assembly language, like this:

.equ PERSON_SIZE, 84
.equ PERSON_FIRSTNAME_OFFSET, 0
.equ PERSON_LASTNAME_OFFSET, 40
.equ PERSON_AGE_OFFSET, 80

When you declare a variable of this type, all you have to do is reserve 84-bytes of space. So, if you have this in C:

void foo(){
  struct person p;
  /* Do stuff here */
}

In assembly language, you would have:

foo:
# Standard header beginning
  pushl %ebp
  movl %esp, %ebp
# Reserve our local variable
  subl $PERSON_SIZE, %esp
# This is the variable's offset from %ebp
  .equ P_VAR, 0 – PERSON_SIZE
# Do Stuff Here
# Standard function ending
  movl %ebp, %esp
  popl %ebp ret

To access structure members, you just have to use base pointer addressing with the offsets defined above. For example, in C you could set the person's age like this:

p.age = 30;

In assembly language, it looks like this:

movl $30, P_VAR + PERSON_AGE_OFFSET(%ebp)

Pointers

First, we take a look at global variables. For example,

int global_data = 30;
int *a = &global_data;

In assembly language, this can be written as:

 .section .data
global_data:
  .long 30
   movl $global_data, %eax

With assembly language you are almost always accessing memory via pointers. Direct addressing does that. To get the pointer itself, you just have to go with immediate-mode addressing.

Local variables are slightly more difficult. Here is how you take the address of a local variable in C:

void foo(){
  int a;
  int *b;
  a = 30;
  b = &a;
  *b = 44;
}

The same code in assembly language looks like:

foo:
# Standard opening
  pushl %ebp
  movl %esp, %ebp
# Reserve two words of memory
  subl $8, $esp
  .equ A_VAR, –4
  .equ B_VAR, –8
# a = 30
  movl $30, A_VAR(%ebp)
# b = &a
  leal A_VAR(%ebp), %eax
  movl %eax, B_VAR(%ebp)
# *b = 44
  movl B_VAR(%ebp), %eax
  movl $44, (%eax)
# Standard closing
  movl %ebp, %esp
  popl %ebp
  ret

As you can see, to take the address of a local variable, the address has to be computed in the same way the computer computes the addresses in base pointer addressing. There is an easier way – the processor provides the instruction leal, which stands for “load effective address”. This lets the computer compute the address, and then load it wherever you want.

Getting GCC to Help

One of the nice things about GCC is its ability to output assembly language code. To convert a C language file to assembly, give the command:

gcc –S file.c

The output will be in file.s. In it most of the variable names have been removed and replaced either with locations with respect to the base pointer %ebp or references to automatically generated labels. You should turn off optimizations with -00 so that the assembly language output will follow your source code more closely. Something else you might notice is that GCC reserves more stack space for local variables than we do, and then use an AND instruction to adjust the address. This is for a double-word aligning of variables, which is expected to increase memory and cache efficiency.

12.8.11 Linux System Calls

Some of the more important system calls are given in Table 12.6. Remember that %eax holds the system call numbers, and that the return values and error codes are also stored in %eax.

 

Table 12.6 Important linux system calls

 

Important linux system calls
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.99.71