Build Your Own Programming Language: the Blog #2

Adding a VM Instruction to the Unicon Virtual Machine

by Jonathan Carsten and Clinton Jeffery, last edit 4/23/2022

We were surprised to discover that the task of adding a new instruction to the Icon/Unicon virtual machine is under-documented in the Icon and Unicon implementation documentation. It was likely spelled out at one point in Ralph Griswold's Icon Implementation graduate course, for which lecture notes might still exist somewhere. While we are looking for those, here are some notes in case you need to perform this admittedly rare task.

If you add a new instruction, old virtual machines will not know how to run the new opcode. The first thing to do may well be to change the version number of the ucode and icode files. Changing version numbers is a bit of a gnarly thing to do, and you may want to keep a duplicate copy of your source tree around to mitigate bootstrapping issues when you do it. The file that holds the version number is src/h/version.h. It defines a number of macros that need to be updated:

VersionNumber
VersionDate
DVersion
UVersion
IVersion

VersionNumber and VersionDate are human-readable and straightforward. DVersion denotes the rtt runtime system "database version". UVersion is the version of the human-readable text ucode files, which serve as Unicon's object format and VM bytecode assembler format. It will definitely need to change if you add a new instruction in ucode. IVersion is the version of icode, the platform-dependent binary VM bytecode format that is executed by iconx.

While a new instruction in theory requires only new UVersion and IVersion numbers, it is typical to update all the version numbers together. Version numbers get used in various files that may have to be regenerated after this change; for example, the file src/runtime/rt.db holds a database of type information for the runtime system. Be sure the version number at the top of that file gets updated. You might get away with modifying this file manually, but rebuilding rt.db from scratch will probably update it automatically.
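
To make the role of the version strings concrete, here is a minimal sketch in C of the kind of check that lets a runtime refuse a file built for a different instruction set. The macro values and the function name version_matches are hypothetical stand-ins; the real definitions live in src/h/version.h, and the real check may well be structured differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for macros in src/h/version.h; the real values differ. */
#define UVersion "U-example-2"
#define IVersion "I-example-2"

/*
 * Returns 1 when the version string found in a file header is exactly the
 * one this build expects, and 0 otherwise.
 */
int version_matches(const char *found, const char *expected) {
    assert(found != NULL && expected != NULL);
    return strcmp(found, expected) == 0;
}
```

Under this sketch, a file stamped with an older string such as "I-example-1" is rejected up front rather than failing later on an opcode the VM does not recognize.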

Next you'll want to define a new opcode for your instruction. The file src/h/opdefs.h holds the macro definitions for each VM instruction. Pick a number that is not used by any other instruction and define a new macro for your instruction. The name doesn't really matter, but you should follow the naming convention used by the other instructions (Op_<your instruction's name>).

Next navigate to the Unicon translator directory at src/icont and find the opcode.c file. This file holds the opcode table, which defines the string that corresponds to each opcode. The table is sorted alphabetically, so work out where your instruction belongs in it and add a new entry there.
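
The two steps above can be sketched in C as follows. Everything here is illustrative: the opcode numbers, the names Op_Alpha, Op_Mynew and Op_Zeta, the struct layout, and lookup_opcode are all made up for this example and are not copied from opdefs.h or opcode.c. The binary search is only an assumption about why alphabetical order might matter; check opcode.c to see how the real table is searched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical opcode numbers; the real ones live in src/h/opdefs.h. */
#define Op_Alpha 1
#define Op_Mynew 2   /* the new instruction: a number no other opcode uses */
#define Op_Zeta  3

struct opentry {
    const char *op_name;   /* string that names the instruction in ucode */
    int op_code;           /* matching Op_ number */
};

/* Kept in alphabetical order by op_name so that it can be binary-searched. */
static const struct opentry optable[] = {
    { "alpha", Op_Alpha },
    { "mynew", Op_Mynew },  /* new entry, inserted at its sorted position */
    { "zeta",  Op_Zeta  },
};

/* bsearch comparator: the key is a string, the element is a table entry. */
static int cmp_name(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const struct opentry *)elem)->op_name);
}

/* Returns the opcode for name, or -1 if name is not in the table. */
int lookup_opcode(const char *name) {
    const struct opentry *e;
    assert(name != NULL);
    e = bsearch(name, optable, sizeof optable / sizeof optable[0],
                sizeof optable[0], cmp_name);
    return e ? e->op_code : -1;
}
```

If the real table is searched this way, an entry added out of order could be silently missed by the lookup, which is a good reason to respect the ordering either way.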

Now that the opcode is defined, we move on to some more complicated stuff. The file src/icont/tcode.c is the translator file that traverses the syntax tree and writes out VM code. It contains the traverse() function, along with some helper functions for the more complicated syntax structures. traverse() is a recursive function that walks the syntax tree and executes a switch for every node it encounters. Each case of the switch has some code to write VM instructions using the emit family of functions. Remember the string representation of your opcode? If your new instruction is to be produced as part of the code generation for some piece of Unicon syntax, you'll need to emit that new opcode string somewhere in this switch as part of the code generated for one of the syntax tree nodes. Where you place your emit depends on the context of your VM instruction. You may also need to use some of the helper functions to emit the code you want. (Note: use grep to find the source code for unfamiliar functions. Grep is your friend!)
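
As a rough picture of that structure, here is a toy recursive code generator in C. It is not the code in tcode.c: the node types, struct node, struct codebuf, the emit helpers, and the name traverse_sketch are all invented for this example, and the real emit functions take different arguments and write to the ucode output file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum node_type { N_INT, N_PLUS, N_MYNEW };   /* hypothetical node kinds */

struct node {
    enum node_type type;
    int value;                  /* used by N_INT only */
    const struct node *left;    /* first operand, or NULL */
    const struct node *right;   /* second operand, or NULL */
};

struct codebuf {                /* generated text is collected here */
    char text[256];
    size_t len;
};

/* Appends "opname\n" to the buffer; a line that does not fit is dropped. */
static void emit(struct codebuf *out, const char *opname) {
    size_t room = sizeof out->text - out->len;
    int n = snprintf(out->text + out->len, room, "%s\n", opname);
    if (n < 0 || (size_t)n >= room)
        out->text[out->len] = '\0';
    else
        out->len += (size_t)n;
}

/* Same as emit, for an instruction that carries one integer operand. */
static void emit_int(struct codebuf *out, const char *opname, int operand) {
    char line[64];
    snprintf(line, sizeof line, "%s %d", opname, operand);
    emit(out, line);
}

/* Visits operands first, then emits the instruction for the node itself. */
void traverse_sketch(struct codebuf *out, const struct node *t) {
    assert(out != NULL);
    if (t == NULL)
        return;
    switch (t->type) {
    case N_INT:
        emit_int(out, "int", t->value);
        break;
    case N_PLUS:
        traverse_sketch(out, t->left);
        traverse_sketch(out, t->right);
        emit(out, "plus");
        break;
    case N_MYNEW:               /* case that produces the new instruction */
        traverse_sketch(out, t->left);
        emit(out, "mynew");
        break;
    }
}
```

The point of the sketch is the placement: the new opcode string is emitted from the case for the syntax node that needs it, after the code for that node's operands.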

Now we need to make some changes to the linker so that the human-readable ASCII ucode gets translated to binary icode. The file you want is src/icont/lcode.c. This file also consists of a bunch of helper functions and a main function called gencode(). The gencode() function reads the ASCII ucode and generates icode. It has a loop containing another big switch statement that switches on the opcode of each instruction. Luckily, if your instruction falls into a common category, like a new operator or a simple instruction, you probably won't have to write any complex code. The cases are nicely organized into groups. If your instruction falls into one of those groups, you can probably just add another case at the bottom of the group for your instruction and call it a day. If your instruction is more complicated, you may have to look into how the helper functions work to implement your icode translation.
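
The "add a case at the bottom of the group" idea looks roughly like this in C. This is a toy translator, not gencode(): the opcode numbers, the text format, the output as an array of int words, and the names name_to_op and gencode_sketch are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical opcode numbers for this sketch only. */
#define Op_Int   1
#define Op_Plus  2
#define Op_Minus 3
#define Op_Mynew 4

/* Maps an instruction name to its opcode, or -1 if the name is unknown. */
static int name_to_op(const char *name) {
    if (strcmp(name, "int") == 0)   return Op_Int;
    if (strcmp(name, "plus") == 0)  return Op_Plus;
    if (strcmp(name, "minus") == 0) return Op_Minus;
    if (strcmp(name, "mynew") == 0) return Op_Mynew;
    return -1;
}

/*
 * Translates n text instructions into binary words in out (capacity cap).
 * Returns the number of words written, or -1 on any error.
 */
int gencode_sketch(const char *const *lines, size_t n, int *out, size_t cap) {
    size_t used = 0, i;
    assert(lines != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        char name[16];
        int arg = 0;
        int fields = sscanf(lines[i], "%15s %d", name, &arg);
        int op;
        if (fields < 1)
            return -1;
        op = name_to_op(name);
        switch (op) {
        /* instructions with one operand: opcode word, then operand word */
        case Op_Int:
            if (fields < 2 || used + 2 > cap)
                return -1;
            out[used++] = op;
            out[used++] = arg;
            break;
        /* simple operators: opcode word only */
        case Op_Plus:
        case Op_Minus:
        case Op_Mynew:   /* new instruction, added at the bottom of its group */
            if (used + 1 > cap)
                return -1;
            out[used++] = op;
            break;
        default:
            return -1;
        }
    }
    return (int)used;
}
```

Because Op_Mynew shares its handling with the other simple operators, supporting it costs a single case label.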

After the changes to the linker are made, you can finally move on to the main interpreter. The main interpreter loop code is located in src/runtime/interp.r. In interp.r, there is an infinite loop (for(;;){...}), containing another big switch statement. This one also has a case for every opcode. This is where you'll want to add the main code to implement your instruction. Just add a new case for your opcode and write your code underneath.
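
For orientation, the shape of such a loop can be sketched in plain C as below. The real interp.r is written in the runtime's own C-based language and works on Unicon descriptors, so treat this only as a picture of the control flow. The opcode numbers, the integer stack, the name interp_sketch, and the behavior chosen for Op_Mynew (doubling the top of the stack) are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode numbers for this sketch only. */
#define Op_Int   1
#define Op_Plus  2
#define Op_Mynew 3
#define Op_Quit  4

#define STACK_MAX 64

/*
 * Runs code until Op_Quit and returns the value on top of the stack
 * (0 if the stack is empty). An unknown opcode stops the loop with -1,
 * which this toy does not distinguish from a computed result of -1.
 */
int interp_sketch(const int *code) {
    int stack[STACK_MAX];
    size_t sp = 0;              /* number of values currently on the stack */
    const int *ipc = code;      /* pointer to the next word to execute */

    for (;;) {
        int op = *ipc++;
        switch (op) {
        case Op_Int:            /* push the operand word that follows */
            assert(sp < STACK_MAX);
            stack[sp++] = *ipc++;
            break;
        case Op_Plus:           /* pop two values, push their sum */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case Op_Mynew:          /* new instruction: double the top value */
            assert(sp >= 1);
            stack[sp - 1] *= 2;
            break;
        case Op_Quit:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            return -1;
        }
    }
}
```

Running the words Op_Int 2, Op_Int 3, Op_Plus, Op_Mynew, Op_Quit through this loop pushes 2 and 3, adds them to get 5, doubles that to 10, and returns 10.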