This "no-frills" blog will contain thoughts on programming language design and implementation. It may be viewed as a supplement to the book "Build Your Own Programming Language", and content herein may occasionally refer to and/or serve as fodder for a future edition of that book.
Usually the topics chosen will be motivated by whatever I am working on at
the time. For that reason, articles will often reference the implementation
of the Unicon language, and some material here may supplement available
implementation references for that language, such as "The Implementation of
Icon and Unicon".
On the Implementation of Switch/Case Statements, Part 1
by Clinton Jeffery, last edit 4/23/222
In C-based languages, a switch
statement corresponds to a
case
expression in Unicon. While this article will talk about
case
expressions, most of the techniques would be applicable to
switch
statements and similar constructs in other languages.
The value used to select which case is executed will be called the
selector value, while the values that are selected from will be
called the case values.
The current implementation of Icon and Unicon executes case
expressions about the same as it would a corresponding sequence of
if-then-else
statements to check each case, comparing the
selector value against each case value, one after another. This gets slower
(O(n)) as the number of cases increases, which becomes not OK as n gets
large. Study of this subject starts with ways to implement case
expressions with simple, contiguous sequences of integer case values, and
then considers gradually more difficult case
expressions. The
article will only consider optimizing a subset of case
expressions in which all the case values are constants known at
compile time, which is the most common occurrence and for which the problem
is obviously tractable.
Space is cheap, and if you want a fast-ish implementation for case expressions with a large number (> 10, probably >> 10) of cases, one option may be to throw a bit of space at the problem. In the Unicon language as of April 2022, this is a real need. Unicon programs such as the Unicon translator itself, and the collaborative virtual environment CVE, contain code with a large number of case values to select between. For example, there are approximately 150 different cases in the CVE server's protocol command processor.
case
expressions from Icon, which are described a little bit in
the Implementation Compendium. Arbitrary expressions including generators
may be used as case values. The implementation works for arbitrary types of
values. Cases are evaluated one at a time, in order, in the usual way. Each
in turn is compared against the selection value with semantics of the ===
operator. The notation is concise and general but its performance is similar
to a long chain of if-then-else-if statements, and in short programs that
was just fine. But we can do better for the common special cases. Most case
expressions in real Unicon programs use either all integer labels, or all
string labels. In Icon and Unicon's graphics facilities, the Event() function
returns integers for mouse or special events, while it returns one-letter
strings for
regular keyboard events, so this language has a third common pattern,
in which case
expressions have integer and string constants
as labels.
case x of { 1: write("one") 2: write("two") 3: write("three") }We might want to also include case labels that contain several integer literals in alternation expressions such as (1|2|3) or (3 to 11 by 2) or (!10).
case x of { 1: write("one") 2|3|5|7|11: write("prime") 4|6|8|10: write("even") 9: write("multiple of three") default: write("too large to handle") }Basic properties of integer case expressions include:
tableswitch
instruction for simple/contiguous cases, and a lookupswitch
instruction for other more general cases. My impression is
that Java's switch VM instructions have a good design.
evaluate selector value to top of stack subtract min from selector value gocase ; pop stack, advance program counter N instructions goto X goto Y goto ZThis representation depends on an instruction (
gocase
) that increments the program counter by some value it
pops off the top of the stack. If min is not zero, subtract or add as
necessary so that the first goto instruction refers to the minimum case
value. The instructions that follow waste some space on redundant opcodes;
if you implement an instruction that indexes directly into
an array of labels, such as Java's tableswitch
instruction,
you can save some space on the goto opcodes.
If not all values in the range have code to execute, the jump table will contain jumps to either a default case, or to the instruction after the case expression. If density is low, most of the jump table might consist of redundant jumps to this default target.
In a toy virtual machine, adding a new virtual machine instruction is pretty
easy. Adding a VM instruction to Unicon almost never happens and is so rare
that it is not really documented. So, the next article will walk through
that process.
Adding a VM Instruction to the Unicon Virtual Machine
by Jonathan Carsten and Clinton Jeffery, last edit 4/23/22
We were surprised to discover that the task of adding a new instruction to the Icon/Unicon virtual machine was under-documented in Icon or Unicon implementation documentation. It was likely spelled out in Ralph Griswold's Icon Implementation graduate course at one point, for which lecture notes might still exist somewhere. While we are looking for that, if you need to perform this admittedly rare task, here are some notes.
If you add a new instruction, old virtual machines will not know how to run
the new opcode. The first thing to do may well be to change the version
number of the ucode and icode files. Changing version numbers is a bit of a
gnarly thing to do, and you may want to keep a duplicate copy of your source
tree around to mitigate bootstrapping issues when you do it. The file that
holds the version number is src/h/version.h
. There are a number
of macros that need to be updated.
VersionNumber VersionDate DVersion UVersion IVersion
VersionNumber
and VersionDate
are human readable
and straightforward. DVersion
denotes the rtt runtime system
"database version". UVersion
refers to the version for the
human-readable text ucode files, which serve as Unicon's object and VM
bytecode assembler format. It will definitely need to change if you add a
new instruction in ucode. IVersion
specifies the icode version,
the platform-dependent binary VM bytecode format that is executed by iconx.
While a new instruction in theory requires only new UVersion and IVersion
numbers, it is typical to update all the version numbers together. Version
numbers get used in various files that may have to be regenerated after this
change; for example there is a file src/runtime/rt.db
that
holds a database of type information for the runtime system. Be sure the
version number at the top of that file gets updated. You might get away with
modifying this file manually, but re-building rt.db from scratch will
probably do it automatically.
Next you'll want to define a new opcode for your instruction. The file
src/h/opdefs.h
holds the macro definitions for each VM
instruction. You'll want to pick a number that is unused by another
instruction and define a new macro for your instruction. The name doesn't
really matter but you should follow the naming convention used by
the other instructions (Op_<your instruction's
name>
). Next navigate to the Unicon translator directory at
src/icont
and find the opcode.c
file. This file
holds the opcode table. This table defines the string that corresponds with
each opcode. The optable
is sorted alphabetically, so find out
where your instruction should belong in the table and add a new entry for
your instruction.
Now that the opcode is defined, we move on to some more complicated
stuff. The file src/icont/tcode.c
is the translator file that
traverses the syntax tree and writes out VM code. The file contains some
functions to assist with more complicated syntax structures and the
traverse()
function. The traverse()
function is a
recursive function that traverses the syntax tree and executes a switch for
every node it encounters. Each case of the switch has some code to write VM
instructions using the emit family of functions. Remember the string
representation of your opcode? If your new instruction is to be produced as
part of the code generation for some piece of Unicon syntax, you'll need to
emit that new opcode string somewhere in this switch as part of the code
generated for one of the syntax tree nodes. Where you place your emit depends on the context of your VM
instruction. You may also need to use some of the helper functions to emit
the code you want. (Note : use grep to find the source code for unfamiliar
functions. Grep is your friend!)
Now we need to make some changes to the linker so that the ASCII readable
icode gets translated to binary. The file you want is
src/icont/lcode.c
. This file also consists of a bunch of helper
functions and a main function called gencode()
. The
gencode()
function reads
the ASCII ucode and generates icode. It has a loop with another big switch
statement that switches on each opcode. Luckily, if your instruction falls
into a common category, like a new operator, or a simple instruction, you
probably won't have to write any complex code. The cases are nicely
organized into groups. If your instruction falls into one of those groups,
you can probably just add another case at the bottom of the group for your
instruction and call it a day. If your instruction is more complicated, you
may have to look into how the helper functions work to implement your icode
translation.
After the changes to the linker are made, you can finally move on to the
main interpreter. The main interpreter loop code is located in
src/runtime/interp.r
. In interp.r
, there is an
infinite loop (for(;;){...}
), containing another big switch
statement. This one also has a case for every opcode. This is where you'll
want to add the main code to implement your instruction. Just add a new case
for your opcode and write your code underneath.