More about tiny scripting engines

27 November 2019

Quick! You have to add a command language to your Python app, but the cmd module isn't enough because you also need to let the user make the occasional calculation. And letting them eval arbitrary Python code is out of the question, for security reasons. Besides, you might want to port your app later, but keep the command language.

You could write an expression parser. It's easy enough. Also laborious enough to basically take over the project. No, what you need is Reverse Polish Notation, RPN for friends (not that it has many friends).

Never had the pleasure? It involves rearranging your code some. Instead of this:


	print(3+2, 3*2, 3**2-2)

you're going to have this:


	3 2 + . space 3 2 * . space 3 2 ** 2 - . cr

Yeah, yeah, it's kinda backwards. That's why they call it "reverse", or postfix notation. On the plus side, you can now handle any code with only three rules:

Split the input string into words separated by whitespace and deal with them one by one in order.
If it looks like a number, push it onto a stack. I'll assume you know data structures.
Otherwise look it up in a dictionary. If found, call the associated function.

The trick? Functions can just take any arguments they need off the stack, and put any results right back there. In fact it's not too different from what the Python interpreter does internally. And it only takes a little code:


	stack = []
	words = {}

	while True:
		for w in input("? ").split():
			if w in words:
				words[w]()
			else:
				stack.append(float(w))
		print("ok")

We're talking literally 10 lines of code, how cool is that? Only we haven't yet defined any words. Let's fix the omission:


	words["+"] = lambda: stack.append(stack.pop() + stack.pop())
	words["*"] = lambda: stack.append(stack.pop() * stack.pop())
	words["."] = lambda: print(stack.pop(), end="")
	words["space"] = lambda: print(" ", end="")
	words["cr"] = lambda: print()

Five more lines, and our mini-language can run most of the original example. The last two operations are more tricky, however. You see, addition can take its arguments in any order. Subtraction, not so much. And on the stack, they are reversed: last in, first out.


	def subtract():
		b = stack.pop()
		a = stack.pop()
		stack.append(a - b)
	def power():
		b = stack.pop()
		a = stack.pop()
		stack.append(a ** b)
	words["-"] = subtract
	words["**"] = power

Now those took as much code as the interpreter loop did in the first place. But you know what? We're still at the 25-line mark overall, and 60% of it directly performs the work we need. Call it a glorified calculator if you like; but you can make it do anything else as well by defining more words in the dictionary. For instance, a 3D modeler could understand vocabulary like:


	1 0 0 rgb -5 -2 -3 mark 5 2 3 box

We already have our command language, and we barely had to do anything. Even better, compared to a more traditional interpreter, if things look confusing we can always enter words one by one and see what happens. The stack isn't going anywhere; all we need is a word to examine its contents:


	words[".s"] = lambda: print(stack)
	words["depth"] = lambda: stack.append(len(stack))
	words["words"] = lambda: print(words.keys())

and a couple of other things while we're at it. Power to the user!

So far so good. It would be a shame to stop here though. And one thing we need jumps out: variables. Sounds simple, but we'll hit a snag right away: you can't just type a variable name on the command line, because our interpreter would try to call its attached function. We could make some magic involving a class with a __call__ method:


	class Variable:
		def __init__(self, value = None):
			self.value = value
		def __call__(self):
			stack.append(self)

whose function is to place itself on the stack. From there, a pair of ordinary words, say @ and !, could get and set its value, respectively. But that leaves the problem of declaring it in the first place. We could have a variable declaration that grabs the next word from the input before our interpreter sees it, but we'd have to rewrite the main loop, and then it would open a can of worms.

So let's try something else. Change the main loop such that if a word isn't in the dictionary, it looks up just the first character in another one:


	while True:
		try:
			for w in input("? ").split():
				if w in words:
					words[w]()
				elif w[0] in sigil:
					sigil[w[0]](w[1:])
				else:
					stack.append(float(w))
			print("ok")
		except ValueError as e:
			print("Unknown word:", w, file=sys.stderr)
		except IndexError as e:
			print("Stack underflow", file=sys.stderr)
		except EOFError as e:
			break

Make sure to import sys at the top, and declare "sigil = {}" somewhere. While we're at it, add a bye word that exits cleanly, though now you can also do it with Control-D without the interpreter crashing. Which by the way should happen a lot less. The code to look up sigils is cryptic because I didn't want to use much space, but maybe actual definitions will help with that:


	sigil[">"] = lambda n: words.__setitem__(n, stack.pop())
	sigil["$"] = lambda n: stack.append(words[n])

It's now possible to enter code like:


	? 3 2 + >a
	ok
	? $a . cr
	5.0
	ok

At which point you can see why I picked those symbols. The dollar sign is used for the same purpose in many languages, while the right angle bracket pretty much says "in here". Feel free to go with longer, human-readable alternatives like get: and set: instead; you'll be the one typing them over and over.

For those keeping the score at home, our interpreter just doubled in size from the first version, meaning the engine proper is still only 20 lines of code! How long can we keep going?

Well, I have good news and bad news. The bad news is, we still have the most advanced features ahead of us: comments, strings, control structures and user-defined words. The good news is we can implement all of them through a single unified mechanism.

You see, so far we've handled each word by itself, with only the stack to provide context. It's just that sooner or later we have to start grouping them together, not acting on them until we have a complete group. And that requires yet another change to the interpreter. Add the following at the top:


	buffer = []
	buf_to = None

and alter the code to handle words like this:


	for w in input("? ").split():
		if buf_to != None:
			if w != buf_to:
				buffer.append(w)
			else:
				words[w]()
				buf_to = None
		elif w in words:
			words[w]()
		elif w[0] in sigil:
			sigil[w[0]](w[1:])
		else:
			stack.append(float(w))
	print("ok")

This done, we can define pairs of words such that one sets up the buffer (usually by clearing it), and states which word will end buffering. Any other in-between will be added to the buffer. Once the ending word occurs, call it to do something with the buffer's contents! Which can be as simple as discarding it all:


	def slash_star():
		global buf_to
		buf_to = "*/"
	def star_slash():
		buffer.clear()
	words["/*"] = slash_star
	words["*/"] = star_slash

Yep, that's a C-style comment, and we got here with just 15 more lines of code. A few more and we can also do lists:


	def open_bracket():
		global buf_to
		buffer.clear()
		buf_to = "]"
	def close_bracket():
		stack.append(buffer.copy())
	words["["] = open_bracket
	words["]"] = close_bracket

What about strings? I was about to apply a clever trick, but then realized it's perfectly possible to have one delimiting word pull double duty:


	def double_quote():
		global buf_to
		if buf_to == None:
			buffer.clear()
			buf_to = '"'
		else:
			stack.append(" ".join(buffer))
	words['"'] = double_quote

Just remember that the ending word is still a word, so it has to be separated by whitespace. If that trips you up, use something else, such as parentheses. It's not like we need them for anything else, did you notice? Postfix notation makes them superfluous.


	? " (2 + 3) * 5 = " . space 2 3 + 5 * . cr
	(2 + 3) * 5 = 25.0
	ok

Another issue is that strings aren't really strings: whitespace isn't preserved verbatim inside them. You could do something about it, by using Python's standard shlex module to split input, but you'd still need some way to make the interpreter buffer quoted strings; it has no way to know they are special. I chose to keep things simple.

Luckily, that's not a problem for control structures. After all, the buffer accumulates words we'd otherwise interpret. When the closing bracket pushes a copy thereof to the stack, what you end up with is effectively a block of code. All you need is a way to interpret it instead of the input. Time to factor out part of the main loop (exception handling can remain in place; just remember that w is now a local variable of interpret):


	def interpret(block):
		global buf_to
		for w in block:
			if buf_to != None:
				if w != buf_to:
					buffer.append(w)
				else:
					words[w]()
					buf_to = None
			elif w in words:
				words[w]()
			elif w[0] in sigil:
				sigil[w[0]](w[1:])
			else:
				stack.append(float(w))

Amazingly, that's all we need to build loops and conditionals:


	def times():
		n = int(stack.pop())
		code = stack.pop()
		for i in range(n):
			words["i"] = i
			interpret(code)
	def iftrue():
		code = stack.pop()
		if stack.pop():
			interpret(code)
	words["times"] = times
	words["iftrue"] = iftrue
	words["="] = lambda: stack.append(stack.pop() == stack.pop())

With the code above and nothing else, we can finally write something like:


	? 1 1 = [ " Yes! " . ] iftrue
	Yes!ok
	? [ " La! " . ] 3 times
	La!La!La!ok

It even reads naturally! And you can probably figure out how to add what's missing. Meanwhile, our interpreter has crossed the hundred-line mark, though only 35 of those are part of the core (including blanks), and that won't grow much anymore. So let's move on to user-defined words.

Since a block of code is simply a list of words on the stack, it's easy enough to store it in a variable, get it back and tell the interpreter to run it. It's also ugly. We want the ability to define words that look just like those provided by the interpreter. We could special-case lists in interpret, but that's even uglier. Fortunately, there's a better way.

Earlier I showed you a magic class for variables, that ended up unused. The idea was good though, and it can be recycled:


	class UserWord:
		def __init__(self, code):
			self.code = code
		def __call__(self):
			interpret(self.code)

Enter an instance in the dictionary and the interpreter will happily call it like any old function. But where to get a name from? Why, a sigil:


	sigil["/"] = lambda n: words.__setitem__(n, UserWord(stack.pop()))

Which is all we need to write code like:


	? [ . cr ] /say
	ok
	? " Hello, world! " say
	Hello, world!
	ok

This is backwards however, but the bigger problem is, our lists can't be nested, so how do we code a word containing control structures? Well, for one thing you can add another pair of delimiters, say curly braces, that do the exact same thing as square brackets, except now we can alternate them. As for the other issue, did you notice a sigil can start buffering too?


	def colon(n):
		global buf_to
		stack.append(n)
		buffer.clear()
		buf_to = ";"
	sigil[":"] = colon
	words[";"] = lambda: words.__setitem__(
		stack.pop(), UserWord(buffer.copy()))

Rather more code, but in return we can define words like:


	? :say . cr ;
	ok
	? " Hi! " say
	Hi!
	ok

So now we have two different ways to extend the interpreter's vocabulary from within, plus two styles of code blocks for maximum expressivity!

More importantly, our little toy now has most elements expected of a real scripting language. Adding the rest is largely a matter of patience. And we did it all with roughly 40 lines worth of engine, plus 80 of definitions that any plausible script is going to need. There goes my old record...

Oh, just in case you thought I made it all up, the bulk of this language is based on Forth. Sigils in particular are inspired by the RetroForth dialect (thanks, Kilgore!) with a few more ideas lifted from PostScript; square brackets are used to delimit blocks of code in many stack-based and concatenative languages. I didn't want to mention all this before to avoid intimidating anyone. But as you can see there's nothing to fear.

For further reading, try the concatenative.org wiki, and/or my old tutorial called Make Your Own Programming Language. I'm a much better programmer now, but ten years ago I had the patience to explain in more detail.

Either way, now you can see there's nothing mysterious about a scripting engine. And that should help even if you decide to embed Lua rather than make your own.