I wanted to improve my knowledge of compiling and link editing. Writing an assembler seemed like a good way to do this, since an assembler operates at an interesting juncture in compilation. Assemblers must read and decipher text input. Assemblers also generate relocatable object code that the link-editor merges into a larger program entity. Writing an assembler requires some of the lexical analysis and parsing of a compiler proper. It also requires the ability to generate binary instructions and linking information.
The task of writing an assembler also possesses a definable ending: you complete the task when your assembler can replace the stock, system-provided assembler.
This assembler transforms "suggested syntax" SPARC Version 8 assembly language into object modules in the NetBSD a.out format. I tried to retain all the features of Sun Microsystems' SunOS assembler, and to include as much as I could decipher of the GNU SPARC assembler's behavior.
I used The SPARC Architecture Manual, version 8, Prentice Hall, 1992, ISBN 0-13-825001-4 for the definition of the SPARC instruction set, and for the suggested assembly language syntax.
One obvious question arises: why did I decide to spend time on such an elaborate project? I could learn about parsing, assembly and linking without writing an entire assembler plug-compatible with GNU as. I could have stopped after writing a version that could generate link-editable object code for a few simple cases and said I had achieved my goals. When I finished that much of my assembler, I realized that my employer's programming standards prevented me from writing good programs. I decided to finish my assembler in a professional fashion to perform penance.
I wrote this two-pass assembler in a language as close to ANSI C as I could manage. It completely parses the input for each pass. It uses a program-generated lexical analyzer and a program-generated parser to recognize and structure its textual input. The grammar includes productions that specify assembler directives. The inclusion of assembler directives allows the assembler to recognize and evaluate algebraic expressions in directive arguments. The assembler recognizes and removes comments during lexical analysis.
The assembler determines the size of the assembled machine code and initialized data during its first pass over the input file. It fills a symbol table with all labels that refer to data or instructions in the input. Between passes over the input, it pre-allocates a file of the size required for instructions and initialized data. It does not pre-allocate space in that file for symbol table and relocation information.
The assembler's second pass creates and writes instructions and initialized data into the pre-allocated file. It also adds any special purpose debugging symbols that appear in the input to the symbol table during the second pass. The assembler ignores debugging symbols ("stabs") during the first pass, since debugging symbols can refer to other symbols, but nothing refers to the debugging symbols. The second pass creates relocation information that allows link-editing the final object module with other object modules and library code. After the second pass, the assembler appends symbol table and relocation information to the file containing the object code.
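The two-pass flow described above can be sketched as follows. This is a minimal sketch, not the actual implementation: pass1() and pass2() here are trivial stubs that pretend every input line assembles to one 4-byte instruction, just to show the pre-allocation between passes.

```c
#include <stdio.h>
#include <stdlib.h>

/* pass 1 stub: measure the code and data, collect labels (omitted) */
static long pass1(FILE *in)
{
    char line[256];
    long size = 0;
    while (fgets(line, sizeof line, in))
        size += 4;                  /* pretend each line is one instruction */
    return size;
}

/* pass 2 stub: emit instructions into the pre-allocated region */
static void pass2(FILE *in, FILE *out)
{
    char line[256];
    while (fgets(line, sizeof line, in))
        fwrite("\0\0\0\0", 1, 4, out);   /* placeholder instruction words */
}

int assemble(FILE *in, FILE *out)
{
    long size = pass1(in);          /* first pass: compute sizes */
    rewind(in);                     /* re-read the same input for pass 2 */

    fseek(out, size - 1, SEEK_SET); /* pre-allocate the object code region */
    fputc(0, out);
    rewind(out);

    pass2(in, out);                 /* second pass: write the real bytes */
    return 0;
}
```

Symbol table and relocation information would then be appended past the pre-allocated region after the second pass finishes.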
I chose NetBSD/sparc version 1.1 for my development platform. I currently own a functional NetBSD/sparc 1.1 computer. A few other reasons seem important.
I perceived some drawbacks when choosing this platform. The developers designated NetBSD/sparc an experimental operating system. Developing or running software may expose bugs in the operating system or system-provided software.
I wanted to put a limit on this project. I wanted to scope it so that it would not drag on for a long time. I also wanted some way to decide whether to include or exclude changes.
I decided to consider my assembler complete when it correctly assembles the lcc 3.5 ANSI C compiler, and when it correctly assembles the SPARC assembly language portion of the NetBSD/sparc 1.1 operating system kernel. I later added two more tests: participating in recompiling itself, and correctly assembling the SPARC assembly code for the SML/NJ 0.93 interpreter.
An expert SPARC programmer hand-coded the assembly language portion of the NetBSD kernel, using all the odd features of SPARC assembly. The kernel assembly code includes many privileged-mode instructions that compilers don't generate. Another expert wrote the assembly code for the SML/NJ interpreter, so it possesses the same properties.
Building the lcc ANSI C compiler verifies that the assembler correctly handles compiler output. The complete build of lcc compiles three versions, two of which get compiled by the previous version. Each version potentially generates different assembly code. lcc also uses floating-point arithmetic to choose among alternatives during its code generation phase.
Substituting my assembler for the NetBSD assembler in the compilation process fulfills the traditional requirement that a programming language possess the ability to compile itself.
I wrote this assembler in the style of complete compilers. It uses a lexical analyzer to break the input text stream into tokens and mark each token with a type. It uses a parser that receives tokens and their type markers from the lexical analyzer. The parser builds small parse trees from the tokens based on their types, one parse tree (a true tree, not a DAG) per assembly language statement.
I used the freely-available lexical-analyzer-generator flex, version 2.5.2, to generate the lexical analyzer. I used the similarly freely-available LALR(1) parser generator byacc, version 1.9, to generate the parser.
flex seemed like a wise choice because of word-of-mouth recommendations. I can't complain about the results of this choice.
I chose byacc because the suggested assembly language syntax given in The SPARC Architecture Manual intrigued me. The suggested syntax looked like a natural fit with byacc's input.
The suggested syntax did lead to a small grammar. Many of the productions of the grammar exist to enforce the elaborate C-language-style operator precedence rules for algebraic expressions, not for dealing with the assembly language proper. The grammar also encouraged association of small, manageable semantic actions with each production.
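A stratified expression grammar of the sort described looks something like this in byacc input. The nonterminal names and the mk_op() semantic-action helper are illustrative, not taken from my actual grammar; each layer of productions enforces one C-style precedence level.

```yacc
/* one layer of productions per precedence level, lowest first */
expr     : expr '+' mult_ex        { $$ = mk_op('+', $1, $3); }
         | expr '-' mult_ex        { $$ = mk_op('-', $1, $3); }
         | mult_ex
         ;
mult_ex  : mult_ex '*' unary_ex    { $$ = mk_op('*', $1, $3); }
         | mult_ex '/' unary_ex    { $$ = mk_op('/', $1, $3); }
         | unary_ex
         ;
unary_ex : '-' unary_ex            { $$ = mk_op(NEG, $2, NULL); }
         | NUMBER
         | SYMBOL
         | '(' expr ')'            { $$ = $2; }
         ;
```

Grammars like this need several productions per operator precedence level, which accounts for much of the grammar's bulk relative to the assembly language proper.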
This elaborate method of recognizing input seems like overkill for the simple SPARC assembly language. I suspect that using a full-blown byacc-style parser might cause the assembler to execute more slowly than an assembler using a manually written, recursive-descent parser. The assembler generates simple parse trees that amount to little more than a direct transliteration of a line of assembly into a C struct. This makes me think that a simpler parser might function equally well.
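That "direct transliteration" looks roughly like this. The struct layout below is a simplified, hypothetical version of what the assembler builds, shown with one statement transliterated by hand:

```c
#include <string.h>

/* simplified, hypothetical per-statement parse tree */
struct operand {
    enum { OP_REG, OP_IMM, OP_SYM } kind;
    int         reg;    /* valid when kind == OP_REG */
    long        imm;    /* valid when kind == OP_IMM */
    const char *sym;    /* valid when kind == OP_SYM */
};

struct stmt {
    const char    *label;     /* optional "foo:" label, or NULL */
    const char    *mnemonic;  /* e.g. "add" */
    struct operand op[3];     /* SPARC statements name at most 3 operands */
    int            nop;       /* number of operands present */
};

/* what "add %o1, 4, %o2" might transliterate into */
struct stmt example(void)
{
    struct stmt s;
    memset(&s, 0, sizeof s);
    s.mnemonic = "add";
    s.op[0].kind = OP_REG; s.op[0].reg = 9;   /* %o1 is register 9 */
    s.op[1].kind = OP_IMM; s.op[1].imm = 4;
    s.op[2].kind = OP_REG; s.op[2].reg = 10;  /* %o2 is register 10 */
    s.nop = 3;
    return s;
}
```

A hand-written recursive-descent parser could fill in a struct like this just as easily as the generated parser does.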
I based the assembler's symbol table on the linear dynamic hash table described in "Dynamic Hash Tables" by Per-Ake Larson, Communications of the ACM, April 1988, volume 31, number 4. I suspect I expended more than minimal effort on the hash table, but it should yield decent performance for both small and large symbol tables. The hash table uses symbol names (ASCII strings) as the single key to retrieve struct nlist symbol table entries. I used the venerable "PJW" hash function on the symbol name strings to generate hash values.
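For reference, the "PJW" function hashes a name like this. This is the textbook version of the function, essentially what the assembler applies to symbol name strings:

```c
/* the venerable PJW string hash: shift in each character, then fold
   any bits that reach the top nibble back into the low bits */
unsigned long pjw_hash(const char *s)
{
    unsigned long h = 0, g;

    while (*s != '\0') {
        h = (h << 4) + (unsigned char)*s++;
        if ((g = h & 0xf0000000UL) != 0) {
            h ^= g >> 24;   /* fold the top nibble back in */
            h &= ~g;        /* and clear it */
        }
    }
    return h;
}
```

The returned value gets reduced modulo the current table size by the linear dynamic hash table code.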
Handling the old SunOS 4.x assembler's "local labels" complicated the symbol table. In the assembly language input, local labels consist of a single digit (0 through 9) followed by a colon. Any digit can label any quantity of locations in the code. In the context of a branch destination or symbolic name in an instruction mnemonic, a local label consists of a digit and a letter ('f' or 'b'). The letters distinguish forward or backward reference. The reference evaluates to the address of the nearest digit-labeled symbol. Even worse than permitting multiple uses of the same label, the assembly language specifies the "nearest digit-labeled symbol" on a source-code-textual basis, not on a memory address basis. In a code sequence like:
.data
1:      .asciiz "some stuff"
.text
        [...input without local labels...]
        set     1b,%o1

The reference to 1b in the .text segment correctly corresponds to the label 1: in the .data segment. The label referred to in the set instruction corresponds to the immediately previous "1" in the input text, not in the assembled object code. This seems like a misfeature to me.
To pass my self-imposed acceptance tests, I needed this assembler to correctly handle "local labels". I ended up generating temporary labels when a local label appeared in the input, using a counter to keep the "nearest" label accessible.
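My handling amounts to renaming: each definition of a local label gets a unique generated name, and a reference picks the generated name on either side of a per-digit counter. A minimal sketch of the idea (the generated-name format and function names here are hypothetical, not my actual code):

```c
#include <stdio.h>
#include <string.h>

static int seen[10];    /* definitions of each digit seen so far in the input */

/* a definition "N:" becomes a uniquely named ordinary label */
void local_def(int digit, char *buf, size_t len)
{
    seen[digit]++;
    snprintf(buf, len, "L%d_%03d", digit, seen[digit]);
}

/* a reference "Nb" names the most recent textual definition of N;
   "Nf" names the next definition to appear in the input */
void local_ref(int digit, char dir, char *buf, size_t len)
{
    int which = (dir == 'b') ? seen[digit] : seen[digit] + 1;
    snprintf(buf, len, "L%d_%03d", digit, which);
}
```

Because the counter advances with the input text, "nearest" gets resolved on the source-code-textual basis the language demands, regardless of which segment the label lands in.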
I used the following software and systems of software to write this assembler. I extend thanks to everyone who wrote the following things.

NetBSD/sparc 1.1
Since I don't suffer from illusions about getting money for my assembler, I believe my assembler falls under the non-commercial-use clauses of any encumbered software above.
A Retargetable C Compiler: Design and Implementation (Addison-Wesley, 1995, ISBN 0-8053-1670-1) describes the lcc 3.5 ANSI C compiler.
Get the really great multi-platform NetBSD operating system from the NetBSD home page.
My first versions performed a single pass over the input assembly language. A single-pass assembler must use and maintain several symbol and relocation information tables during assembly to reconcile forward references, and after that reconciliation it must rewrite the instructions that made the forward references. I re-wrote the assembler to do two passes.
The intricacies of the a.out object file format make a single-pass assembler difficult in practice. The worst difficulties I encountered involved source-level debugging information. Special symbol table entries contain source-level debugging information for high-level languages. These debugging-information symbols can potentially make the same kind of forward references that "branch" instructions make. I believe this can lead to a problem of symbols that refer to other symbols, potentially circularly. I didn't want to deal with detecting and resolving problems like that.
Relocation information that refers to unresolved symbols presents another problem for a single-pass assembler. An index into an array (r_index field of struct relocation_info) associates relocation information and symbol information. This indexing forces an assembler into (re-)formatting the symbol table as an array before completing the relocation information.
The byacc grammar did lead to some difficulty in writing the code that generates the instruction bit patterns. Given the SPARC instruction formats, I'd hoped to write only a few routines that formatted instructions. The parser calls the appropriate routine after it builds up the parse tree for an assembly language statement.
I found it impossible to write 5 or 6 routines (one for each instruction format) and reuse those few functions to assemble many different instructions. The suggested assembly language describes arrangements like this:

        ld      [%r1+%r2], %r3
        st      %r3, [%r1+%r2]
Both ld and st mnemonics assemble to identical machine code instruction formats. The parser can distinguish between these based on location of square brackets. It rearranges the tokens to a conventional order (source register 1, source register 2, destination register). The parser can't make that distinction for some "synthetic" instructions.
        tst %o5         ! assembles to: orcc %g0, %o5, %g0
        dec %o5         ! assembles to: sub %o5, 0x1, %o5
        clr %o5         ! assembles to: or %g0, %g0, %o5
The suggested syntax doesn't contain enough information for the parser to decide to use the operand register for one or both of the source registers, or for the destination register.
Sometimes, one human-readable mnemonic assembles to several distinct instructions. The selection of instruction depends on the register named in the assembly language statement. For instance, for the following instructions, the assembler selects the actual bit-pattern for the instruction by examining the operand registers.
        rd %y,%r15              ! RDY instruction
        rd %psr,%r15            ! RDPSR instruction
        st %f1,[%r1+%r2]        ! STF instruction
        st %r3,[%r1+%r2]        ! ST instruction
I ended up writing 5 or 6 routines that handle the bulk of the instruction mnemonics, and about 10 more routines that handle the irregular synthetic instructions. I could accomplish some or all of the work done by those 10 extra routines by re-writing the lexical analysis part of the assembler to recognize the special cases. The lexical analysis code could give the parser extra or different tokens in those special cases. I could then enlarge the parser to recognize the additional token types and write more productions to handle the special cases.
More instruction formats exist in the SPARC ISA than The SPARC Architecture Manual explicitly specifies. For example, the various trap-on-condition-code instructions, nominally assembled in format 3, use the destination register field (rd) of the instruction as the distinguishing code for the trap condition (e.g. not equal, positive, not zero, etc., relating to the integer condition code bits of the Processor Status Register).
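For example, a trap-on-condition instruction in its immediate form can be assembled like this. The field layout follows the V8 manual's Ticc description; encode_ticc() itself is just an illustrative helper, not one of my actual output routines:

```c
/* Ticc, format 3, immediate form: the four-bit condition code sits in
   bits 28:25, where an ordinary format-3 instruction keeps its rd field */
unsigned long encode_ticc(unsigned cond, unsigned rs1, unsigned sw_trap)
{
    return (2UL << 30)                            /* op  = 10 (format 3)  */
         | ((unsigned long)(cond & 0xf) << 25)    /* cond replaces rd     */
         | (0x3aUL << 19)                         /* op3 = 111010: Ticc   */
         | ((unsigned long)(rs1 & 0x1f) << 14)    /* rs1                  */
         | (1UL << 13)                            /* i = 1: immediate     */
         | (sw_trap & 0x7f);                      /* software trap number */
}
```

With cond = 8 ("trap always"), rs1 = %g0 and trap number 5, this yields the bit pattern for ta 5.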
Both the SunOS assembler and the NetBSD version of the GNU assembler accept undocumented opcode mnemonics. For example, the SunOS assembler accepts ld2 and st2 mnemonics. I can't find documentation for these anywhere.
I only found weak documentation on the ancient BSD a.out object file format. I found the relocation information particularly tricky to get right.
The disk file representations of the symbol table (an array of struct nlist) and the relocation information (two arrays of struct relocation_info_sparc) seem designed to minimize the amount of disk space consumed, yet the structs contain some odd redundancies. The r_extern field of struct relocation_info_sparc duplicates the information in the N_EXT bit of the n_type field of the struct nlist it refers to.
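The declarations in question look roughly like this. These are abbreviated forms, not the exact NetBSD headers (see <a.out.h> and the machine-dependent reloc header for the real ones), and reloc_consistent() is a hypothetical helper that just exhibits the redundancy:

```c
/* abbreviated symbol table entry */
struct nlist {
    long          n_strx;    /* offset of the name in the string table */
    unsigned char n_type;    /* N_TEXT, N_DATA, ... plus the N_EXT bit */
    char          n_other;
    short         n_desc;
    unsigned long n_value;
};
#define N_EXT  0x01
#define N_TEXT 0x04

/* abbreviated SPARC relocation entry */
struct relocation_info_sparc {
    unsigned long r_address;       /* offset of the word needing relocation */
    unsigned int  r_index  : 24;   /* index into the symbol table array */
    unsigned int  r_extern : 1;    /* duplicates N_EXT of that symbol */
    unsigned int           : 2;
    unsigned int  r_type   : 5;    /* RELOC_32, RELOC_WDISP30, ... */
    long          r_addend;
};

/* the redundancy: r_extern must agree with the symbol's N_EXT bit */
int reloc_consistent(const struct relocation_info_sparc *r,
                     const struct nlist *symtab)
{
    return r->r_extern == ((symtab[r->r_index].n_type & N_EXT) != 0);
}
```

The 24-bit r_index is also what forces the symbol table into array form before relocation information can be completed, as described above.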
While developing this assembler, I collected a series of files of assembly language. I used this series of files to ensure that further development didn't cause previously correct assembly to fail. I compared the disassembled outputs of the GNU assembler and my assembler for the same input.
Some of these files exercise particular blocks of code in the assembler, some contain exhaustive enumerations of categories of instructions (e.g. Bicc - branch on integer condition code) and some contain compiler outputs that triggered bugs.
I "expect to pass" all the regression tests. If I wrote better tests, I would collect or write a mix of cases "expected to succeed" and cases "expected to fail".
Regression tests don't check that symbol table and relocation information get generated correctly. I wrote a series of C programs that cause compilers to generate specific assembly language or specific types of relocation information and symbol table entries. Assembling the compiler-generated code for these programs verifies the format of the link-editing information and checks that information contains correct values. Running the compiled programs exercises the assembled machine code.
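A tiny example of the sort of program I mean: each declaration below forces the assembler to emit a particular kind of symbol table entry or relocation. The comments name what I'd expect on SPARC a.out; external_counter is defined here only so this fragment links on its own, whereas in a real test its definition would live in a separate file, making it an undefined external reference.

```c
int external_counter = 1;   /* initialized global: N_DATA | N_EXT symbol */
static int hidden;          /* file-local symbol, no N_EXT bit */

int bump(void)              /* global text symbol: N_TEXT | N_EXT */
{
    hidden++;
    return external_counter + hidden;   /* the address of external_counter
                                           gets formed by a sethi/or-style
                                           pair, emitting RELOC_HI22 and
                                           RELOC_LO10 relocation entries */
}
```

Disassembling and dumping the resulting object file with nm(1) shows whether the assembler got the entries right; running the linked program exercises the machine code itself.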
I also wrote assembly language functions that link with C language functions. Compiling and linking the mixed assembly and C programs also verifies the link-editing information and contents. Running those mixed language programs verifies that subroutine-linkage works correctly.
I used several malloc debuggers in a continual effort to detect memory leaks. I used dmalloc version 3.1.3 by Gray Watson, and Electric Fence 2.0.1 by Bruce Perens. This experience nearly convinced me to use garbage collection instead of explicit allocation and de-allocation.
I tried to reduce compiler dependencies in the code of the assembler itself. During development I alternated between compiling the assembler with the GNU C compiler provided with NetBSD/sparc 1.1 (GCC 2.4.5) and compiling with Hansen and Fraser's lcc ANSI C compiler. I ran all my regression and integration tests against assembler executables compiled with both compilers.
I used compiler warnings to determine ANSI C compliance. GNU C and lcc report different problems. For example, lcc reports on "compiler dependent" behaviors. The GNU C compiler reports when all members of an enum aren't labels of cases in a switch. I tried to eliminate all compiler warnings.
I used the lcc compiler to measure branch coverage. Compiling with the proper options causes lcc to produce instrumented code that counts executions of basic blocks of code. Assembling a large amount of compiler-generated code that triggers no assembly language syntactic or semantic errors causes my assembler to execute about 50% of its basic blocks of code. Assembling code specifically written to contain semantic errors raises this to about 75% of its basic blocks. Error outputs for failure of C library calls like malloc(3) constitute the remainder of the unexecuted blocks of code.
If I wrote a special version of malloc() to randomly fail, I could use this hypothetical tool to increase branch coverage. It would also test the robustness of the program in the face of memory allocation failures.
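Such a malloc() might look like the wrapper below. It fails deterministically every Nth call rather than randomly, so test runs stay reproducible; the wrapper and its knobs are hypothetical, not something I wrote.

```c
#include <stdlib.h>

static unsigned alloc_calls;      /* allocations attempted so far */
static unsigned fail_every = 7;   /* 0 disables injected failures */

/* drop-in replacement for malloc() that fails every Nth call,
   exercising error paths that ordinary coverage runs never reach */
void *failing_malloc(size_t n)
{
    if (fail_every != 0 && ++alloc_calls % fail_every == 0)
        return NULL;              /* simulated allocation failure */
    return malloc(n);
}
```

Linking the assembler against a wrapper like this, then assembling the regression suite repeatedly, would drive execution into the malloc(3)-failure error outputs that make up most of the unexecuted basic blocks.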
I think that writing ports of this software (in the sense of getting it to run on another hardware platform or under another operating system) wouldn't require much effort, given a set of NetBSD/sparc header files. I tried to write the code in ANSI C, using compilers and compiler options to enforce this. The program uses only stock "stdio" for input and output, although it does do fseek()s to write instructions and data in the same file.
I believe I should address two specific portability issues.
Further development might pay off in two areas: working to speed it up, and adding support for SPARC V9 instructions.
I never profiled execution of this assembler, nor made any quantitative determination of speed. I estimate that NetBSD's standard assembler assembles any particular input file in half the time my assembler takes. Two speed-ups come to mind. The assembler does a binary search of an array to find opcode mnemonics and assembler directives. A pre-hashed table, or an Aho-Corasick type finite automaton, might cut search time. The assembler lexes and parses the input during both passes. I could eliminate some amount of execution time by re-writing the first pass to save the parse trees, and re-writing the second pass to work from these trees instead of re-parsing the input.
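The current lookup amounts to a bsearch() over a sorted table, along these lines. The table entries here are a hypothetical fragment: the "bits" values are base bit patterns per the V8 instruction formats, and the real table holds hundreds of mnemonics and directives.

```c
#include <stdlib.h>
#include <string.h>

struct opcode {
    const char   *name;
    unsigned long bits;   /* base bit pattern before operand fields */
};

static const struct opcode opcodes[] = {   /* must stay sorted by name */
    { "add", 0x80000000UL },
    { "ld",  0xc0000000UL },
    { "or",  0x80100000UL },
    { "sub", 0x80200000UL },
};

static int opcode_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key,
                  ((const struct opcode *)elem)->name);
}

/* binary search over the sorted table, once per mnemonic encountered */
const struct opcode *find_opcode(const char *name)
{
    return bsearch(name, opcodes,
                   sizeof opcodes / sizeof opcodes[0],
                   sizeof opcodes[0], opcode_cmp);
}
```

A perfect or pre-computed hash over the fixed mnemonic set would replace the log-n string comparisons with one hash computation and one comparison per lookup.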
This new approach requires an indeterminate amount of memory, roughly proportional to the input assembly language file size. My current approach uses a minimum of dynamically-allocated memory, very nearly a deterministic, finite amount.
I did not add support for SPARC V9 opcode mnemonics since my computer (a SPARCstation IPC) executes SPARC V7 instructions: I felt I stretched the truth by writing a SPARC V8 assembler.
Adding SPARC V9 instruction mnemonics wouldn't require much work. The suggested assembly language syntax looks identical. I feel certain that a few kinks exist, but I can't tell without writing the code.
SPARC V9 support would require the lexer code to recognize the new register names, and the parser productions to handle the additional branch opcode modifiers. The V9 specification includes a few more instruction formats, which mandates more low-level instruction output routines.
$Id: sparc_asm.html,v 1.3 1999/03/28 06:14:44 bediger Exp bediger $