Jobs to be done by an assembler: An assembler has to do the following at the minimum:
We shall now consider various types of assemblers and their major characteristics. We then discuss working of a two-pass assembler for the example assembly language.
Such an assembler basically cannot handle forward references. For example, consider the code:
DCR B
JZ FRWRD
... ...
... ...
FRWRD: ... ...
Here, the one-pass assembler will not have address for label FRWRD and so will not be able to assemble the JZ instruction. There are three options for such an assembler:
<symbol> <address> <ptr> ––> <obj code pos #1> <obj code pos #2> …
When a forward reference is not yet defined, <address> is zero. When it gets defined later in the code, the assembler inserts its address value at the code positions according to the link list. This assumes that either the complete object code is held in memory or a “random-load” loader is required.Such an one-pass assembler was relevant when in old days the secondary storage units were slow and expensive. Sometimes it is used as back-end of a compiler, which maintains its own symbol table, though the tendency is to use a “standard” two-pass assembler even there.
We shall not discuss one-pass assembler any further.
The assembler has two passes or scans over the input source code. During the first pass, the emphasis is on building up a symbol table apart from checking the lexical validity of the code. During the second pass, it replaces all references with their equivalent addresses or values, assembles the code and outputs it.
The assembler uses the following data structures:
<symbol> <value>
<Mnemonic> <Hex value> <no. of oprnds> <Type op1> <Type op2> <length>
<Reg name> <1st value> <2nd value>
The algorithms for Pass-I and Pass-II are given as Algorithms B.2.1 and B.2.2, respectively.
One of the most frequent operations and one which consumes major portion of execution time in an assembly operation is Symbol Table building and look-up. A real-life assembly language program may have hundreds or thousands of symbols in use. The data structure used to store the table and the algorithms used to access it contribute significantly to the speed of an assembler. The operations done on a symbol table are:
Note that deleting a symbol is not a required operation. The data structure should be selected keeping these points in mind.
What are the possible options for the data structure? They are:
Simple array: Each entry is a (symbol, value) pair. As the array size would be limited and fixed, useful only for experimental work, because the number of symbols that can be allowed in the source will be limited by the array size. Also, if the symbol names are stored directly in the array, there will be an upper limit on the size of the names also, considerable wastage of memory.
Array with pointers to symbol names: The symbol names are stored separately in a character array and within the main symbol table we put the starting index and length of the symbol name. This will allow considerable variation in the size of the symbol names, at the same time saving memory (see Fig. B.3). Even with this enhancement, the major problem of limited number of entries remains. If the implementation language is C, we can use ‘0’ terminated C strings in the name storage. This avoids storing of length value in the symbol table entry, but the implementor must take care to use the character array as readonly once a symbol name is entered in it, otherwise there is a danger of disturbing the array.
Sorted array: If the entries are put in the symbol table in an arbitrary order, then for each look-up, time O(n) will be taken, where n is the number of symbols in the table. We can sort the array every time a new symbol is inserted and use the Binary search method to look-up the table. As there is only one time insertion for each symbol, but there may be several look-ups for it, this seems to be a viable approach. As sorting does take time, one alternative is to sort the table only after 10 new symbols are inserted and use an auxiliary linked list to keep new symbols. Look-up operation checks the linked list first and then the main sorted array. When linked list has 10 symbols in it, all of them are put in the array and the array is resorted.
Linked list: If the symbol table is constructed as a singly linked list, one of the major objections to the array storage does not apply. The loop-up times will be, though, large and unacceptable for large symbol tables. One way out of this is to use multiple linked lists, one each for symbol name starting with a particular character. The link head pointers are kept in an array.
Binary search tree: Provide considerable improvement, at the cost of complexity of handling routines, over the above methods. Here, assumption is that the new symbols arrive in a random lexical order, so that the BST is more or less balanced.
Hash table: Provide possibly the best performance, especially for large source codes having thousands of symbols. Here, the main idea is that each symbol name is converted to a corresponding hash value by a fast hashing function. The hash value is a positive integer ≤ the size of the hash table. The symbol data is inserted in the hash table at that index position (see Fig. B.4).
Unfortunately, the symbol name space will generally be very large compared to the number of entry slots available in the hash table. This results in what is known as collision. One of the popular methods to handle these inevitable collisions is to use linked list for each of the hash value, as shown in Fig. B.5.
It is usual to have a very useful facility, known as Macro facility, in an assembler. Though macro facility can be made quite complex, and many real-life assembler do have sophisticated macro facilities, even simple macro implementation can be of considerable help to an assembly language programmer.
A Macro simply means a name given to some pattern. It refers to a mapping macro : symbol ⇒ pattern. We say that the pattern defines the macro denoted by the symbol. There are two steps in using a macro: first, define a macro, by associating a symbol with an arbitrary pattern, then second, use the macro, by simply using the symbol. When we use the macro, macro expansion or replacement takes place, i.e. wherever the symbol is occurring, it is replaced by the pattern.
This simple idea of replacement of a symbol by a pattern is very powerful. It is used in assemblers, compilers, spreadsheets, text editors, etc. In fact, the software that the authors of this book used to compose it –TEX and LATEX uses macros very heavily. In Section B.3, we discuss several such macro processors.
An assembly language programmer finds that he uses certain sequences of assembly language statements repeatedly, with possibly minor variations.
For example, assume that adding 4 to the value of H register occurs several times in our program. We may define a macro as follows:
ADD4 MACRO
MOV A,H
ADI 4
MOV H,A
ENDM
Here, ADD4 is the name of the macro, i.e. the symbol to be replaced during macro expansion, MACRO and ENDM are macro assembler pseudo-operations denoting beginning of macro definition and end of definition respectively.
If this macro definition is included in an earlier part of our program, then later on we can use it simply as:
... ...
ADD4
... ...
and it will result in the following code, ADD4 being replaced by 3 lines of code.
... ...
MOV A,H
ADI 4
MOV H,A
... ...
Note well that pseudo-operations MACRO and ENDM are not inserted. This insertion happens every time the symbol ADD4 is used.
Let us try to define another macro – Load accumulator indirect, which is a very common operation. We would like to load data from a specified address into the A register. This requires that we use a macro argument to be able to specify any address.
LDIND MACRO ADDRS
LHLD ADDRS MOV A,M
ENDM
The symbol ADDRS is the local macro argument. While using the macro, it will be replaced by a use, call or invocation argument. Thus,
... ...
LDIND 1234H
... ...
Note that we have supplied the call argument as 1234H. The code inserted will be:
... ...
LHLD 1234H
MOV A,M
... ...
As a further example, the bytes copy operation as shown in Fig. B.1 can be implemented as a macro:
MOVBT MACRO BLK1,BLK2,COUNT
LXI H,BLK1
LXI D,BLK2
MVI B,COUNT
LOOP: MOV A,M
STAX D
INX H
INX D
DCR B
JNZ LOOP
ENDM
Note the three macro arguments BLK1, BLK2 and COUNT. A difficulty arises if we try to use this macro in our program at more than one place. The label LOOP will be defined multiple times.
The first thing to note is that the macro arguments BLK1, BLK2 and COUNT are local to the macro and they will be replaced by the call arguments, so there is no problem with them. Can we tell the macro processing functions within the macro assembler to create unique labels on fly? Yes, if we write $<number> instead of LOOP as the label, the macro assembler will generate unique labels for each of them every time the macro is used and insert them in the symbol table.
... ...
MVI B,COUNT
$1: MOV A,M
STAX D
... ...
JNZ $1
ENDM
A macro assembler may allow calling a previously defined macro within the definition of another macro (nested macro call). It may further allow definition of a macro within definition of another macro (nested macro definition.)
We shall assume that the macro processing is done independently of the assembler, as a preprocessing step. Also, we discuss only a simplified version, without nested macro calls or nested macro definitions. We have two passes – Pass-I will search for macro definitions and store them; Pass-II detect macro calls and expand them. The data structures used are:
MDT Macro Definition Table – stores the body of macro definitions.
MNT Macro Name Table – stores names of defined macros.
MDTC MDT counter – indicates the next available entry in MDT.
MNTC MNT counter – indicates the next available entry in MNT.
ALA Argument List Array – stores macro arguments and call arguments.
MDTP MDT pointer – indicates the next line of text during expansion.
The algorithms are given as Algorithms B.2.3 and B.2.4.
Seemingly, from the viewpoint of the programmer, macros provide the same kind of facilities as subroutines. Then how do they compare?
Macro: Complete macro body code gets substituted every time it is used, so increases the code length. As it is in-line code, save on the subroutine call and return instruction time and stack load overheads, and thus works fast.
Subroutine: Code is included only once in the program. Time and stack-space overheads for each call and return.
In summary, we can say that if the job can be done with a few instructions, say 5 to 10, then it may be advisable to implement it as a macro, otherwise it should be implemented as a subroutine.
3.12.136.186