how to write a programming language

(3080 words)

 

introduction

writing a programming language is a science, like math– or an art, like writing a play. one day theyll have a word for things that are both an art and a science.

you can learn how i wrote a very simple demo programming language in 25 minutes just by taking it apart, or if you prefer you i can take you on a tour of the features and talk about how to generalize the concepts. (thats what this post is for.)

programming is fractal in nature– the way you write a complex program isnt fundamentally different from the way you write a simple one. its true, that as you write more complex programs you will incorporate little tricks that you didnt need for simpler programs– they dont change the fundamental nature of the task but they refine it. writing a programming langauge is the same way.

but remember that all programming becomes these tiny, very simple little steps, so fundamentally all programs (and programming) work the same way. talking about one way to do it can help you understand a wide variety of ways, minus less significant details you can pick up along the way.

one of the most important differences between programming language designs is “interpreted or compiled?” these dont have to be as different as they seem:

if your target language already has all the features you need, then writing a compiled language will probably be easier– just translate the input code into output code and youre done (if that sounds like an oversimplification, it will be addressed in slightly more detail to demonstrate.)

certainly if youre compiling down to machine code, youre going to want to know at least two “languages” (other than the one youre writing) including the machine code itself, and the higher-level language youre probably writing the compiler in.

the thing is, compiling to machine code isnt the rule it used to be. coffeescript, as one example– compiles to javascript. (“compiles” is the right word– you can just as easily call it “translating” which is very common, and also correct.)

fig is written (implemented) in python and compiles to python. the demo language shown in this post is also written (implemented) in python.

technically speaking, the language is the language and the program that makes it possible to use is the “implementation.” fig is the name of my programming language, and it is also the name of the implementation (the translator or “compiler”) which is written in python. a good reason to distinguish a language from an implementation, is that you can have implementations for the same language, written in more than one language–

the most popular implementation of python is written in c, and there are other implentations of python– pypy is another typical implementation of python, written in a simplified version of python!

 

compiling, or?

but speaking in fundamentals, the difference between implementing a compiler and implementing an interpreter is this:

an interpreter will actually run code, after translating it.

a compiler will take the translated code and write it to a file instead.

if you think thats a huge difference, think about it this way:

for every line of code your intepreter runs, you can convert the part that runs it into a print statement and compile to stdout. then instead of running the interpreter this way:

$ name-of-interpreter name-of-inputfile

 

you would run it this way:

$ name-of-compiler name-of-inputfile > outputfile.targetlanguage

 

that will be demonstrated with a “real language” in this post.

one of the biggest differences between a simple language and a sophisticated one, is the complexity of the parser (and so, the complexity of the source it can parse.) people keep implementing variations of lisp, because its probably the easiest (least complicated) recursive language you can parse. basic is more complicated (to parse.) so is javascript.

if you want to learn how to write a complex language first, get a big, big book on implementing programming languages, and get to work. people do, its probably how youd go about it if you had a college course in the subject and wanted your first language to look and walk and quack like the big languages you use everyday.

if, on the other hand you want to write a programming language that is simple enough to understand in a week or so, and start working your way up to more complex languages after having an easy one under your belt– well, keep reading. after i wrote fig, i started looking at how more complex languages were written, hoping to find the “secret” of making them far more complicated.

the secret is, there isnt one. learn how to get a more complicated parser, and youll have a more complicated language.

you can do it the following ways:

  • get a parser generator and learn and use its syntax (its supposed to be easier, especially if your target language is c.)
  • make your language free software and adapt an existing free software parser to your needs
  • learn how to write recursive parsers, perhaps starting as so many do with a parser for lisp-like languages

a big, big book can help you with any of those. as i said, i skip all that and go for:

  • write a simpler language to parse, and a simpler parser to parse it

writing an interpreter is easier if your language is going to have simple features. if its not going to have loops and conditionals, then an interpreter is simply easier to implement– you might even do it without it feeling like youre writing a programming language.

a couple tips for starters:

you can have limited loop functionality (the easy way) in an interpreted language, by putting the entire interpreter in a big loop, and implementing an exit command to stop looping. i watched one person with a tiny interpreter do this, and its one way to have some looping functionality without a lot of trouble.

you can have limited conditional functionality by creating functions that are conditional based on input– use your imagination there.

what makes an interpreter difficult to implement with full loop and conditional functionality is that you have to implement branching– you have to be able to move from one part of your program to another, rather than simply intepreting each line. in the old days, this was done with an intermediate translation to a “bytecode.” (this isnt a dead practice, as java shows, but i think it has probably evolved.)

im going to walk you through a simple interpreter, tell you how to avoid bytecode, and then offer you the option of compiling to a high-level language as the easiest way to have all the complex branching features you like. (thats what fig does.)

in other words, fig (as a language) offers the functionality of conditionals and loops, by translating (compiling) to python conditionals and python loops. thats the easiest way to have that– write a source-to-source compiler. but if you dont need that functionality (no loops or conditionals in the language i will walk you through) then an interpreter is even easier.

 

too easy:

if you really want to “cheat” then you can implement at the very least, a pseudo-language interpreter by abusing the exec() command in python. python has a ton of features that are wonderful for language implementation, and exec() is the one i deliberately avoided in fig, both for security and educational purposes.

wanna see a print command implemented using exec in python 2? you can laugh:

x = raw_input() ; x = x.split()
if len(x) > 1:
    if x[0].lower() == "print":
        exec("print " + " ".join(x[1:]))

 

im not saying this is useless or should never be attempted, but i suspect the language you create will be famous one day for programs with security problems, (exec is simply too powerful and easy to abuse, though someone with more experience can perhaps easily prove me wrong, and that would be pretty cool) and to me this is so close to “cheating” that youd have to do something really novel with it to make it worthwhile.

other than that, i dont want to discourage you. but you may really prefer source-to-source compiling, over this.

 

variables

a python feature i do recommend for intepreters however, is hash tables (aka python dictionaries.) theyre a great place to put all the variables for your interpreter. so the first line of the demo interpreter creates a dictionary for storing variables:

variables = {}

now if you want your intepreted program to set a variable, all the code you need to set it is, run this code:

variables["identifier"] = value

 

this also isolates your languages variables from the rest of the program. if we cheat using exec instead:

exec("x = 5")
exec("print x")

 

that actually works, but youre storing x right in the same variable space / scope that the rest of the program is working in. that could lead to some fun issues later! (but if your language design is complex enough, you could have some real sciencey fun playing with that sort of thing. exec isnt all bad, its just more powerful than you probably even want.)

if you want to print your variable the dictionary way instead, run this code:

print variables["identifier"]

 

copy a variable? sure:

variables["2ndvariableidentifier"] = variables["1stvariableidentifier"]

 

now be careful here, if youre using python– because what i just showed you will have the same features (and issues) of copying a variable in python using =

for a string or numeric, theres no problem. if youre copying a list however, youll be linking to it, not cloning an independent copy. i get out of this by using type():

if type(variables["1stvariableidentifier"]) == list:
   variables["2ndvariableidentifier"] = variables["1stvariableidentifier"][:]
else:
   variables["2ndvariableidentifier"] = variables["1stvariableidentifier"]

 

if your language doesnt have python lists as a feature (hint: tuples of constants dont have this issue) then you can just use the single-line variable copy.

for a source-to-python compiler, if i want to set x to 5, i just do more or less this:

newline = "\n"
outfile.write("x = 5" + newline)

 

so its more like exec, except that its safer and (because it doesnt run until the compiler exits) it doesnt clash with the variable space of the translator program. (and yes, i still use the trick i showed you for dealing with list copying.)

 

you can teach yourself by trying to do it

now you may wonder how i learned to implement features like this. i honestly didnt learn from anything except learning to write programs– and this is a secret i want to pass along to you–

i wrote a programming language exactly the same way i would write any other program:

* i thought about what i wanted to accomplish
* i broke those steps down into individual tasks
* i implemented each of those tasks using code, using what i know about how the implementation language (python) works– and occasionally looking up things

(but for example, i already knew you could clone a list using [:] and i vaguely knew type() existed.)

the goal of your language implementation (program) should be:

1. take text (or numbers, or even pixels or sound data– but information) as input
2. break text down into pieces that the program understands
3. do things (intepret) with those pieces, or:
4. write things to a file (compile) based on those pieces

 

when you understand the simplest notion of programming, you can figure out yourself how to write a programming language. or as ive heard it put very, very simply:

“read text and do stuff based on the contents.”

 

function calls

probably half the trouble of implementing a simple language is parsing. you can separate that into lexing and parsing, etc. etc. fine– lets just say “parsing” for now because thats what its called when you “parse” phrases in english. (in programming languages, you might call it lexing.)

but almost half the work (for me) is a parser, and at least half the work is writing functions.

in fig, a function is what you call a command. you can call it a command, but a function is a function, a sub is a function, a command is a function, etc. this is largely due to the python underneath:

def figprint(p): print p

 

thats not a whole lot more complicated than using exec, is it? in 30+ versions of fig, i never needed more for the “print” command. the “prints” command is a little more involved:

def figprints(p): stdout.write(str(p)) ; sys.stdout.flush()

 

and what about the “left” command?

def figleft(p, s): return p[:s]

 

these fig commands illustrate the point, but our 25-minute demo language doesnt use them. it implements these 4 commands, all of which have 2 parameters each (that makes parsing a little easier. i was trying to be quick!)

def repeatprint(p, x): print ((p + " ") * int(x)).rstrip()

def printsum(p, x): print float(p) + float(x)

def setvariable(p, x): global variables ; variables[p] = x

def repeatvariable(p, x): print ((str(variables[p]) + " ") * int(x)).rstrip()

 

do functions have to be one-liners? of course not. some functions in fig are at least several lines long.

but our little demo language (based on its function calls) can do 4 things, using 4 commands:

repeatprint string count

printsum numeric numeric

setvariable identifier constant

repeatvariable identifier count

 

of course “identifier” is a fancy way of saying “name.”

 

demo program

now that we know what the demo language can do, lets implement a demo program:

* set the variable d to 12
* repeat “hello” 5 times, and print 5 + 9
* print the variable d, 9 times

putting the program in a multiline string instead of bothering to load a text file, heres what it looks like:

demoprogram = """
setvariable d 12
repeatprint hello 5 printsum 5 9
repeatvariable d 9"""

 

a ridiculously simple parser

now it pays off that every function has exactly 2 parameters (to make it easier to parse.)

lets process all the excess spaces (this is to make the language easier to use, but you could demand strict interpretation of spaces instead)

xp = []
for p in demoprogram.replace("\r", " ").replace("\n", " ").split(" "):
    p = p.strip()
    if len(p): xp += [p]

 

that just turned our multiline string demo into this:

[‘setvariable’, ‘d’, ’12’, ‘repeatprint’, ‘hello’, ‘5’, ‘printsum’, ‘5’, ‘9’, ‘repeatvariable’, ‘d’, ‘9’]

 

which is even easier to process. so (keeping it simple,) we go through the list xp, and add to another list called “build.” when its length is 3, we do a function call with the name, the second list item as a parameter, and the third list item as a parameter:

build = []
for p in xp:
    build += [p]
    if len(build) == 3: 
        if build[0].lower() == "repeatprint": repeatprint(build[1], build[2])
        if build[0].lower() == "printsum": printsum(build[1], build[2])
        if build[0].lower() == "setvariable": setvariable(build[1], build[2])
        if build[0].lower() == "repeatvariable": repeatvariable(build[1], build[2])
        build = []

 

so here is our demo program:

setvariable d 12
repeatprint hello 5 printsum 5 9
repeatvariable d 9

 

here is the expected functionality:

* set d to 12
* print hello 5 times
* print the sum of 5 and 9 as a float
* print the value of d, 9 times

here is what it actually does when you run the program:

it puts “hello hello hello hello hello” on the screen
it puts “14.0” (the sum of 5 and 9) on the screen
it puts “12 12 12 12 12 12 12 12 12” on the screen

 

compiling

now what about compiling instead of interpreting?

well the way fig works, is instead of declaring these functions:

def repeatprint(p, x): print ((p + " ") * int(x)).rstrip()
def printsum(p, x): print float(p) + float(x)
def setvariable(p, x): global variables ; variables[p] = x
def repeatvariable(p, x): print ((str(variables[p]) + " ") * int(x)).rstrip()

 

it puts them in a multiline string:

f = """
def repeatprint(p, x): print ((p + " ") * int(x)).rstrip()
def printsum(p, x): print float(p) + float(x)
def setvariable(p, x): global variables ; variables[p] = x
def repeatvariable(p, x): print ((str(variables[p]) + " ") * int(x)).rstrip()
"""

 

and writes that string to a the file it compiles. (you can get fancy and only write the functions that are called by the program, but i didnt want to– better imo to let people edit the python code it output. if you werent compiling to source, youd definitely want to do a pass to count the functions actually called, and only compile those!)

then instead of calling a function when its encountered like this:

if build[0].lower() == "printsum": printsum(build[1], build[2])

 

write it to the same file that the functions are defined in, like this:

if build[0].lower() == "printsum": outfile.write("printsum (" + str(build[1]) + ", " + str(build[2]))

 

improving the language

one very trivial way to make this existing language more sophisticated, is to have it allow more or fewer parameters for each command. you can do this by making it known where the end of each command is.

instead of always looking for two parameters, have it look for a separator of lines (logical lines of code, or lloc.) a popular way to do this is to look for a semicolon:

setvariable d 12
repeatprint hello ; 5 printsum 5 9
repeatvariable d 9

 

you dont need one at the end of every line, if a newline also counts.

in javascript, you need one at the end of every line– in python, you dont.

that doesnt matter, since youre translating things to python, not from it– you decide how the translation (and thus the language youre implementing) works.

one thing about this program and that extra sophistication is:

setvariable d 12
repeatprint hello ; 5 printsum 5 9
repeatvariable d 9

 

now you can have differing parameter counts, but you cant offer a semicolon as a string. you can get around this in several ways:

* strings have to go in quotes
* escaped semicolons \; dont separate lloc
* the only way to print a semicolon is to put a numeric in a variable, then convert using a chr command or something
* instead of using semicolons, demand a one-command-per physical line (ploc) convention

when youre creating your own language, you get to decide which of those (or other) solutions are most reasonable to you, and to the goals of the language.

 

what fig does with this

just so you know, fig addresses the matter this way:

block commands and function definitions have to go on their own line (“own-line” commands)

most commands (“shared-line” commands) start with a “main variable” and share the line– each command in fig has a fixed number of parameters, and the translator has a dictionary that tells it how many parameters to expect for each command

“own-line” commands can have any (unexpected) number of parameters. the only command in fig that takes advantage of this is “function”, which lets you define a function in fig like this:

function identifier

 

or this:

function identifier paramname

 

or this:

function identifier paramname paramname

 

…etc. this is like using def identifier(paramname, paramname): #in python

as a result, fig does not require punctuation for most code– but it does tolerate some. it requires “quotes for strings” and # hashes for comments and decimals do what you would tend to expect (and dont use them unless you need them for something.)

 

 

Advertisements

16 thoughts on “how to write a programming language

        1. id be happy to offer input or advice. if you want code, id probably like to offer code contributions via blog posts. but we can see how that works out.

          for one thing, i like to license my code in cc0. that should be compatible with any license youre using– you can just fold my code right in, no problem. we can talk about that, but suggestions might be the best thing for me to offer you. and feel free to keep me posted on what youre looking for or would like to know– i cant promise you anything yet (i dont know what you even need yet.) –no worries.

          Liked by 4 people

            1. cc0 is a public domain dedication (by default) and a license (where one is needed– such as finland.) its a way of making something more universally public. its from the creative commons people, and its listed on the gpl-compatible page on the fsf website.

              now whatever license you normally use (as long as its a free/floss software license) is probably the one you should use for your language implementation. since cc0 is compatible with basically everything, you can take my cc0 code and put it in your gpl code (or mit code, etc.) and its fine. i cant do the same– i cant make your gpl code cc0, but you can use my code in your gpl code, so i should be able (imo) to use cc0 for my contributions. does that make any sense? you can view the cc0 license by following the link at the bottom of almost all the entries on this blog, including this entry!

              Liked by 4 people

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s