Build Your Own Programming Language: the Blog

BYOPL Blog Table of Contents

Introduction

by Clinton Jeffery

This "no-frills" blog will contain thoughts on programming language design and implementation. It may be viewed as a supplement to the book "Build Your Own Programming Language", and content herein may occasionally refer to and/or serve as fodder for a future edition of that book.

Usually the topics chosen will be motivated by whatever I am working on at the time. For that reason, articles will often reference the implementation of the Unicon language, and some material here may supplement available implementation references for that language, such as "The Implementation of Icon and Unicon".

On the Implementation of Switch/Case Statements, Part 1

by Clinton Jeffery, last edit 4/23/222

In C-based languages, a switch statement corresponds to a case expression in Unicon. While this article will talk about case expressions, most of the techniques would be applicable to switch statements and similar constructs in other languages. The value used to select which case is executed will be called the selector value, while the values that are selected from will be called the case values.

The current implementation of Icon and Unicon executes case expressions about the same as it would a corresponding sequence of if-then-else statements to check each case, comparing the selector value against each case value, one after another. This gets slower (O(n)) as the number of cases increases, which becomes not OK as n gets large. Study of this subject starts with ways to implement case expressions with simple, contiguous sequences of integer case values, and then considers gradually more difficult case expressions. The article will only consider optimizing a subset of case expressions in which all the case values are constants known at compile time, which is the most common occurrence and for which the problem is obviously tractable.

Space is cheap, and if you want a fast-ish implementation for case expressions with a large number (> 10, probably >> 10) of cases, one option may be to throw a bit of space at the problem. In the Unicon language as of April 2022, this is a real need. Unicon programs such as the Unicon translator itself, and the collaborative virtual environment CVE, contain code with a large number of case values to select between. For example, there are approximately 150 different cases in the CVE server's protocol command processor.

The base (Icon) implementation

Up through Version 13, Unicon inherits an implementation of case expressions from Icon, which are described a little bit in the Implementation Compendium. Arbitrary expressions including generators may be used as case values. The implementation works for arbitrary types of values. Cases are evaluated one at a time, in order, in the usual way. Each in turn is compared against the selection value with semantics of the === operator. The notation is concise and general but its performance is similar to a long chain of if-then-else-if statements, and in short programs that was just fine. But we can do better for the common special cases. Most case expressions in real Unicon programs use either all integer labels, or all string labels. In Icon and Unicon's graphics facilities, the Event() function returns integers for mouse or special events, while it returns one-letter strings for regular keyboard events, so this language has a third common pattern, in which case expressions have integer and string constants as labels.

Categorizing Case Expressions According to Their Case Value Properties

To begin, we consider only case expressions with labels consisting only of integer literals. An example would be
case x of {
1: write("one")
2: write("two")
3: write("three")
}
We might want to also include case labels that contain several integer literals in alternation expressions such as (1|2|3) or (3 to 11 by 2) or (!10).
case x of {
1: write("one")
2|3|5|7|11: write("prime")
4|6|8|10: write("even")
9: write("multiple of three")
default: write("too large to handle")
}
Basic properties of integer case expressions include:
size
the number of case values used in the selection
min and max
the minimum and maximum case values
range
max-min
density
size/range
Calculate these properties and use them to select between the default implementation, and one or more optimized implementations.

Related Work: Java

Java has VM instructions that might be easily serve as an inspiration for folks building their own VMs. There is a tableswitch instruction for simple/contiguous cases, and a lookupswitch instruction for other more general cases. My impression is that Java's switch VM instructions have a good design. There are probably other/better references and explanations available for Java's approach to this problem. At least for starters, this article considers a more basic implementation and leave closer mimicry of Java's VM instructions for switches as an exercise for the reader.

A Contiguous Jump Table Implementation

Unless the range is excessive and the density low, you can build a jump table to select which case branch is taken. There are many possible representations of a jump table. One of the simplest might be:
	evaluate selector value to top of stack
	subtract min from selector value
	gocase	; pop stack, advance program counter N instructions
	goto X
	goto Y
	goto Z
This representation depends on an instruction (gocase) that increments the program counter by some value it pops off the top of the stack. If min is not zero, subtract or add as necessary so that the first goto instruction refers to the minimum case value. The instructions that follow waste some space on redundant opcodes; if you implement an instruction that indexes directly into an array of labels, such as Java's tableswitch instruction, you can save some space on the goto opcodes.

If not all values in the range have code to execute, the jump table will contain jumps to either a default case, or to the instruction after the case expression. If density is low, most of the jump table might consist of redundant jumps to this default target.

In a toy virtual machine, adding a new virtual machine instruction is pretty easy. Adding a VM instruction to Unicon almost never happens and is so rare that it is not really documented. So, the next article will walk through that process.

Adding a VM Instruction to the Unicon Virtual Machine

by Jonathan Carsten and Clinton Jeffery, last edit 4/23/22

We were surprised to discover that the task of adding a new instruction to the Icon/Unicon virtual machine was under-documented in Icon or Unicon implementation documentation. It was likely spelled out in Ralph Griswold's Icon Implementation graduate course at one point, for which lecture notes might still exist somewhere. While we are looking for that, if you need to perform this admittedly rare task, here are some notes.

If you add a new instruction, old virtual machines will not know how to run the new opcode. The first thing to do may well be to change the version number of the ucode and icode files. Changing version numbers is a bit of a gnarly thing to do, and you may want to keep a duplicate copy of your source tree around to mitigate bootstrapping issues when you do it. The file that holds the version number is src/h/version.h. There are a number of macros that need to be updated.

VersionNumber
VersionDate
DVersion
UVersion
IVersion
VersionNumber and VersionDate are human readable and straightforward. DVersion denotes the rtt runtime system "database version". UVersion refers to the version for the human-readable text ucode files, which serve as Unicon's object and VM bytecode assembler format. It will definitely need to change if you add a new instruction in ucode. IVersion specifies the icode version, the platform-dependent binary VM bytecode format that is executed by iconx.

While a new instruction in theory requires only new UVersion and IVersion numbers, it is typical to update all the version numbers together. Version numbers get used in various files that may have to be regenerated after this change; for example there is a file src/runtime/rt.db that holds a database of type information for the runtime system. Be sure the version number at the top of that file gets updated. You might get away with modifying this file manually, but re-building rt.db from scratch will probably do it automatically.

Next you'll want to define a new opcode for your instruction. The file src/h/opdefs.h holds the macro definitions for each VM instruction. You'll want to pick a number that is unused by another instruction and define a new macro for your instruction. The name doesn't really matter but you should follow the naming convention used by the other instructions (Op_<your instruction's name>). Next navigate to the Unicon translator directory at src/icont and find the opcode.c file. This file holds the opcode table. This table defines the string that corresponds with each opcode. The optable is sorted alphabetically, so find out where your instruction should belong in the table and add a new entry for your instruction.

Now that the opcode is defined, we move on to some more complicated stuff. The file src/icont/tcode.c is the translator file that traverses the syntax tree and writes out VM code. The file contains some functions to assist with more complicated syntax structures and the traverse() function. The traverse() function is a recursive function that traverses the syntax tree and executes a switch for every node it encounters. Each case of the switch has some code to write VM instructions using the emit family of functions. Remember the string representation of your opcode? If your new instruction is to be produced as part of the code generation for some piece of Unicon syntax, you'll need to emit that new opcode string somewhere in this switch as part of the code generated for one of the syntax tree nodes. Where you place your emit depends on the context of your VM instruction. You may also need to use some of the helper functions to emit the code you want. (Note : use grep to find the source code for unfamiliar functions. Grep is your friend!)

Now we need to make some changes to the linker so that the ASCII readable icode gets translated to binary. The file you want is src/icont/lcode.c. This file also consists of a bunch of helper functions and a main function called gencode(). The gencode() function reads the ASCII ucode and generates icode. It has a loop with another big switch statement that switches on each opcode. Luckily, if your instruction falls into a common category, like a new operator, or a simple instruction, you probably won't have to write any complex code. The cases are nicely organized into groups. If your instruction falls into one of those groups, you can probably just add another case at the bottom of the group for your instruction and call it a day. If your instruction is more complicated, you may have to look into how the helper functions work to implement your icode translation.

After the changes to the linker are made, you can finally move on to the main interpreter. The main interpreter loop code is located in src/runtime/interp.r. In interp.r, there is an infinite loop (for(;;){...}), containing another big switch statement. This one also has a case for every opcode. This is where you'll want to add the main code to implement your instruction. Just add a new case for your opcode and write your code underneath.