Build Your Own Programming Language: the Blog #1

On the Implementation of Switch/Case Statements

by Clinton Jeffery, last edit 4/23/2022

In C-based languages, a switch statement corresponds to a case expression in Unicon. While this article will talk about case expressions, most of the techniques would be applicable to switch statements and similar constructs in other languages. The value used to select which case is executed will be called the selector value, while the values that are selected from will be called the case values.

The current implementation of Icon and Unicon executes case expressions about the same as it would a corresponding sequence of if-then-else statements to check each case, comparing the selector value against each case value, one after another. This gets slower (O(n)) as the number of cases increases, which becomes not OK as n gets large. Study of this subject starts with ways to implement case expressions with simple, contiguous sequences of integer case values, and then considers gradually more difficult case expressions. The article will only consider optimizing a subset of case expressions in which all the case values are constants known at compile time, which is the most common occurrence and for which the problem is obviously tractable.

Space is cheap, and if you want a fast-ish implementation for case expressions with a large number (> 10, probably >> 10) of cases, one option may be to throw a bit of space at the problem. In the Unicon language as of April 2022, this is a real need. Unicon programs such as the Unicon translator itself, and the collaborative virtual environment CVE, contain code with a large number of case values to select between. For example, there are approximately 150 different cases in the CVE server's protocol command processor.

The base (Icon) implementation

Up through Version 13, Unicon inherits an implementation of case expressions from Icon, which are described a little bit in the Implementation Compendium. Arbitrary expressions including generators may be used as case values. The implementation works for arbitrary types of values. Cases are evaluated one at a time, in order, in the usual way. Each in turn is compared against the selection value with semantics of the === operator. The notation is concise and general but its performance is similar to a long chain of if-then-else-if statements, and in short programs that was just fine. But we can do better for the common special cases. Most case expressions in real Unicon programs use either all integer labels, or all string labels. In Icon and Unicon's graphics facilities, the Event() function returns integers for mouse or special events, while it returns one-letter strings for regular keyboard events, so this language has a third common pattern, in which case expressions have integer and string constants as labels.

Categorizing Case Expressions According to Their Case Value Properties

To begin, we consider only case expressions with labels consisting only of integer literals. An example would be
case x of {
1: write("one")
2: write("two")
3: write("three")
}
We might want to also include case labels that contain several integer literals in alternation expressions such as (1|2|3) or (3 to 11 by 2) or (!10).
case x of {
1: write("one")
2|3|5|7|11: write("prime")
4|6|8|10: write("even")
9: write("multiple of three")
default: write("too large to handle")
}
Basic properties of integer case expressions include:
size
the number of case values used in the selection
min and max
the minimum and maximum case values
range
max-min
density
size/range
Calculate these properties and use them to select between the default implementation, and one or more optimized implementations.

Related Work: Java

Java has VM instructions that might be easily serve as an inspiration for folks building their own VMs. There is a tableswitch instruction for simple/contiguous cases, and a lookupswitch instruction for other more general cases. My impression is that Java's switch VM instructions have a good design. There are probably other/better references and explanations available for Java's approach to this problem. At least for starters, this article considers a more basic implementation and leave closer mimicry of Java's VM instructions for switches as an exercise for the reader.

A Contiguous Jump Table Implementation

Unless the range is excessive and the density low, you can build a jump table to select which case branch is taken. There are many possible representations of a jump table. One of the simplest might be:
	evaluate selector value to top of stack
	subtract min from selector value
	gocase	; pop stack, advance program counter N instructions
	goto X
	goto Y
	goto Z
This representation depends on an instruction (gocase) that increments the program counter by some value it pops off the top of the stack. If min is not zero, subtract or add as necessary so that the first goto instruction refers to the minimum case value. The instructions that follow waste some space on redundant opcodes; if you implement an instruction that indexes directly into an array of labels, such as Java's tableswitch instruction, you can save some space on the goto opcodes.

If not all values in the range have code to execute, the jump table will contain jumps to either a default case, or to the instruction after the case expression. If density is low, most of the jump table might consist of redundant jumps to this default target.

In a toy virtual machine, adding a new virtual machine instruction is pretty easy. Adding a VM instruction to Unicon almost never happens and is so rare that it is not really documented. So, the next article will walk through that process.