This is my solution to the final assignment of the C-Course (20465) at The Open University.
It should be noted that I am not affiliated with OpenU, and I do not attend as a student there. I was given this project by a coworker who attended OpenU.
This project has not been formally graded.
This project was written on the Ubuntu distro, but it may run on all other distributions of Linux.
To build the solution, open the terminal in the solution folder and run:
> make
To run the solution, open the terminal in the build folder and run:
> ./assembler [input file path]
The input file path is relative to the current working directory.
For example, if you're running this project from: home/uname/Assembler
:
> ./assembler test
Will look for the file: home/uname/Assembler/test.as
.
Every line in Assembly code is, at maximum, 80 characters long.
Every line is of the form:
[OPTIONAL LABEL]: [INSTRUCTION|DIRECTIVE] [OPTIONAL SOURCE OPERAND],[OPTIONAL TARGET OPERAND] The number of operands varies between every instruction and directive.
Labels are the Assembly parallel to variables in C.
Labels are a string of up to 30 alphanumeric characters, with the first character being a letter.
The value stored in labels is the address to the first line in the translated machine code that the instruction they point at is written.
MAIN: mov -13, @r3 ; (ic = 0 -> MAIN = 0)
jmp END ; (ic = 3, 5 is substituted for END)
END: stop ; (ic = 5 -> END = 5)
A label may be local or external.
Labels marked as entries in one file may be used as external labels in other files.
An external label in one file takes the value of where it was defined as an entry. See 2.3 Directives to see how this is done.
Instructions control the flow of the program and instruct the machine what to do.
The following is a table of all instructions, their opcode (see 3.2 Instructions), and how many operands they take.
Instruction | Opcode | # Operands |
---|---|---|
MOV | 0 | 2 |
CMP | 1 | 2 |
ADD | 2 | 2 |
SUB | 3 | 2 |
NOT | 4 | 1 |
CLR | 5 | 1 |
LEA | 6 | 2 |
INC | 7 | 1 |
DEC | 8 | 1 |
JMP | 9 | 1 |
BNE | 10 | 1 |
RED | 11 | 1 |
PRN | 12 | 1 |
JSR | 13 | 1 |
RTS | 14 | 0 |
STOP | 15 | 0 |
For instructions that take 2 operands, the first operand is the source operand and the second is the target operand.
For instructions that take 1 operand, the first operand is always the target operand.
For more information about each instruction, please visit this page.
Directives are data and storage instruction providers. They are always initiated by a dot.
Directives take either one or an unlimited number of operands:
Directive | # Operands | Purpose | Usage |
---|---|---|---|
.string | 1 | Store a string* of characters in memory | .string "abcdef" |
.data | Unlimited | Store integers** in memory | .data 6, 7, -13, 25 |
.extern | Unlimited | Mark a label as an external (see 2.1 Labels) | .extern LENGTH, L1 |
.entry | Unlimited | Mark a label as an entry (see 2.1 Labels) | .entry MAIN, END |
* Recall that a string is terminated by the null-terminator: '\0'
.
** Integers are converted to binary using the Two's complement method.
Macros are the Assembly parallels to functions in C.
A macro is declared using the mcro
keyword, and terminated using the endmcro
keyword.
mcro <macro name>
...
...
endmcro
You can paste the block of code between mcro
and endmcro
around your program by writing the macro's name. See 4.1 Preassembler.
Program.as
File:
mcro m1
inc K
sub @r1,@r2
endmcro
...
m1
...
m1
END: stop
K: .data 6
Program.am
File:
...
inc K
sub @r1,@r2
...
inc K
sub @r1,@r2
END: stop
K: .data 6
A machine does not understand ASCII symbols, it understands 1's and 0's. More specifically, it understand 1's and 0's that are arranged in a predetermined order.
The point of the Assembler is to translate Assembly code into Machine code.
The following is the first 10 lines of the test.ob
file outputted by running the Assembler:
101000001100
000110000000
000010000110
000100101100
000001000110
111111101100
000101001100
101001110100
000000000001
000101001100
Just like you (probably) can't read this, the computer also couldn't before it was told how to read it.
Instructions and directives are translated into words.
Words come in many different formats, but the first word of an instruction is always the same.
The first word of an instruction is a 12-bit-long binary stream that is arranged in the following order:
SOURCE OPERAND METHOD | OPCODE | TARGET OPERAND METHOD | A, R, E |
---|---|---|---|
3 bits | 4 bits | 3 bits | 2 bits |
Other than telling the machine what to do (opcode), it also tells the machine how to read the next words, which describe the source and target operands.
There are 3 operand methods:
METHOD | BINARY REPRESENTATION | DESCRIPTION |
---|---|---|
Immediate | 001 | Integer Value |
Direct | 011 | Local/External Address |
Register Direct | 101 | Register Address |
If an operand is of an immediate method, it will be written as:
Integer in Binary | A, R, E |
---|---|
10 bits | 2 bits |
If an operand is of a direct method, it will be written as:
Address | A, R, E |
---|---|
10 bits | 2 bits |
If an operand is of a register direct method, it will be written as:
Address of Source Register | Address of Target Register | A, R, E |
---|---|---|
5 bits | 5 bits | 2 bits |
Every operand adds 1 word, unless both operands are of the register direct method.
In that case, both words are merged into one.
A, R, E is used for the linkage process - it describes where the content of the word is located.
METHOD | BINARY REPRESENTATION | DESCRIPTION | |
---|---|---|---|
A | Absolute | 00 | Immediate Values (Integer/Register Addresses) |
R | Relocatable | 10 | Local Variable Address |
E | External | 01 | External Variable Address |
Consider the following assembly code:
mov @r3, LENGTH
add @r1, @r2
LENGTH: .data 6
This code generates the following machine code:
IC | Assembly Code | Description | Machine Code |
---|---|---|---|
0 | mov @r3, LENGTH | First word | 101000001100 |
1 | Source @r3 | 000110000000 | |
2 | Target LENGTH | 000000010110 | |
3 | add @r1, @r2 | First Word | 101001010100 |
4 | Source @r1 AND Target @r2 | 000010001000 | |
5 | LENGTH: .data 6 | 6 in Binary | 000000000110 |
If the first word does not follow the first instruction word format, it should be read as a .data
or .string
directive.
Data and string directives take all 12 bits and write the binary representation of that character.
The number of words produced by a .data
or .string
directive is equal to the amount of integers/characters wanting to be stored.
Note: .extern
and .entry
directives do not translate into machine code. They are only to be used by the assembler.
Consider the following assembly code:
.data 6, 13, -7
.string "abc"
This code generates the following machine code:
DC | Assembly Code | Description | Machine Code |
---|---|---|---|
0 | .data 6, 13, -7 | 6 | 000000000110 |
1 | 13 | 000000001101 | |
2 | -7 | 111111111001 | |
3 | .string "abc" | 'a' = 97 | 000001100001 |
4 | 'b' = 98 | 000001100010 | |
5 | 'c' = 99 | 000001100011 | |
6 | '\0' = 0 | 000000000000 |
Now that we understand assembly and machine code, it's time to make the connection between the two.
There are a few problems we will need to face, which we will get into in every stage.
The preassembler's job is to set-up the input file.
The assembler itself doesn't read macros. It wants to receive pure assembly code so it can execute it line by line. The preassembler's purpose is to spread these macros where they are called.
The work done by the preassembler is written to a Program.am
file, that is passed to the First-Pass and Second-Pass functions.
To see an example of the output of the preassembler, please refer to 2.4 Macros.
As the name suggests, the first-pass routine is the first routine to properly read through the assembly file.
Here, we tackle a new problem: Labels do not have to be defined before they are used.
This assembly code, for example, is completely valid:
mov @r3, LENGTH
LENGTH: .data 6
Although LENGTH
was defined after the mov
instruction, the assembler will translate this into valid machine code.
But the assembler reads the assembly code line-by-line, how does it know what to write as the value of LENGTH
to the output Program.ob
file?
Well, here's the trick: it doesn't. The main purpose of the first-pass routine is to write down information about local and external labels. What is the label's address? Is it local or external? Should it be used as an entry for other files? Everything is written down and memorized in the first-pass routine.
Along the way, first-pass also starts writing machine code for values and addresses it already holds. Those are:
- First words
- Immediate and Register Direct operands
- Data and Strings
Additionally, the first-pass also allocates space for where Direct operand words should go.
Finally, the first-pass also writes the .ext
file, see 5.3 Program.ext
File.
Consider the following code:
.extern W
mov @r3, LENGTH
inc W
.entry LENGTH
add @r1, @r2
LENGTH: .data 6
After the first-pass, the following machine code is generated:
IC | Assembly Code | Description | Machine Code |
---|---|---|---|
0 | .extern W | No machine code is generated | |
0 | mov @r3, LENGTH | First word | 101000001100 |
1 | Source @r3 | 000110000000 | |
2 | Target LENGTH | ? | |
3 | inc W | First word | 000011101100 |
4 | Target W | ? | |
5 | .entry LENGTH | No machine code is generated | |
5 | add @r1, @r2 | First Word | 101001010100 |
6 | Source @r1 AND Target @r2 | 000010001000 | |
7 | LENGTH: .data 6 | 6 in Binary | 000000000110 |
After the first-pass, the following label table has been written down:
Label | Address | Is Instruction? | Is Data? | Is External? | Is Entry? |
---|---|---|---|---|---|
W | -1 | 0 | 0 | 1 | ? |
LENGTH | 7 | 0 | 1 | 0 | ? |
Now that we're aware of all of the labels and the addresses they correspond to, the second-pass routine's objective is to fill in for the missing holes in the machine code.
Additionally, the labels can now be marked as entry/non-entry to prepare the .ent
file, see 5.4 Program.ent
File.
Consider the code from before.
.extern W
mov @r3, LENGTH
inc W
.entry LENGTH
add @r1, @r2
LENGTH: .data 6
After the second-pass, the following machine code is generated:
IC | Assembly Code | Description | Machine Code |
---|---|---|---|
0 | .extern W | No machine code is generated | |
0 | mov @r3, LENGTH | First word | 101000001100 |
1 | Source @r3 | 000110000000 | |
2 | Target LENGTH | 000000011110 | |
3 | inc W | First word | 000011101100 |
4 | Target W | 000000000001 | |
5 | .entry LENGTH | No machine code is generated | |
5 | add @r1, @r2 | First Word | 101001010100 |
6 | Source @r1 AND Target @r2 | 000010001000 | |
7 | LENGTH: .data 6 | 6 in Binary | 000000000110 |
After the second-pass, the following label table has been written down:
Label | Address | Is Instruction? | Is Data? | Is External? | Is Entry? |
---|---|---|---|---|---|
W | -1 | 0 | 0 | 1 | 0 |
LENGTH | 7 | 0 | 1 | 0 | 1 |
The Program.am
file is the result of the work done by the preassembler routine.
To see an example of what the program.am
looks like, please refer to 2.4 Macros.
The Program.ob
file is the translated machine code.
This, for example, is the first 10 lines of code generated by running the example test file included in this repository:
101000001100
000110000000
000010000110
000100101100
000001000110
111111101100
000101001100
101001110100
000000000001
000101001100
To see the rest, feel free to run the program yourself.
The Program.ext
file provides information about external variables used by the Program.as
file.
Such information is limited to what these variables' names are and where they are used in the machine code.
This information is crucial for the linking stage, which is not part of this project.
This, for example, is the Program.ext
file generated by running the example test file included in this repository:
W 8
L3 12
W 24
The Program.ent
file provides information about entry variables defined in the Program.as
file.
Such information is limited to what these variables' names are and what their address is.
This information is crucial for the linking stage, which is not part of this project.
This, for example, is the Program.ext
file generated by running the example test file included in this repository:
LENGTH 33
LOOP 3