Wednesday, 21 December 2011

Documenting Lua

Why is Documentation so Lacking?

As Barbie the Software Engineer might say: Documentation is Hard. It's not intrinsically fun like coding, and often gets written badly as an afterthought. In open source projects, where fun is so often the guiding principle, it can get badly neglected.

There are basically two kinds of comment: one aimed at other developers of the code, and the other aimed at the users of a library. The latter is particularly hard and important, since we are building abstractions for other people to use, and if they have to read the code to work out how to use the library, that abstraction has broken. This is a common problem with language bindings to existing libraries; they will be announced as a 'thin wrapper around libfoo', and it's assumed that you can work out how to use the bindings from the libfoo documentation or from libfoo examples in C. Well, not everyone can (or needs to) read C well enough to make that practical. We know the classic rebuke, "RTFM!", but that assumes there is a manual - it's often "RTFS!" these days.

Testing has become a minor religion in the last ten years, and mostly it's a good thing, especially when using dynamic languages. Tests are supposed to be the best kind of documentation, because they can automatically be shown to be correct. But if the scope of the tests is too narrow, then they can be harder to read than the source. The moral: also write examples that can serve as tests.

We cannot ascribe poor documentation to laziness, since these projects are otherwise well-constructed and may have exhaustive tests. The developers clearly don't think documentation correlates with wide use by the community; perhaps communication is not a priority. After all, research mathematicians write papers and do not immediately think about how to explain the results to undergraduates; they are writing for peers.

This might seem an indirect criticism, but it isn't intended as such. (I resisted the urge to point to some representative Github projects for this reason.) The Open Source universe needs all kinds, from researchers to popularizers. In the 'social coding' universe, fork has become a verb of respect, so documenting an existing project and doing a pull request is a polite way to contribute to the usefulness of a project.

You might think that elaborate conventions and markup could get in the way of documentation. The Go standards simply require you to preface the package and any exported items with a plain comment; just about the only convention is that indented lines are shown preformatted. And still those plain comments are often not written. Part of the problem is a variant on the "increment i by one" kind of comment, which is just an awkward paraphrase of the code. So a Stack.Push method might end up with the blindingly obvious "Push an element onto the stack". Anybody with any self-insight is going to find that silly, and so they don't do it. I have noticed that coding libraries and documenting their interfaces are modal activities; it's useful to approach documentation with a fresh head (much as writing and proof-reading are separate but essential steps in preparing an article.)

The key thing is to care enough to return to the code with "beginner's mind", and finish the job.

The xDoc Tradition of API Documentation

APIs are rarely as self-documenting as their creators might think. Javadoc is probably the grand-daddy of a per-function style of API documentation, and it has inspired numerous xDoc tools for other languages. Here is an example of LuaDoc markup:

 --- foostr stringifies a list of Foo objects.
 -- @param foolist a list of <i>Foo</i> objects
 -- @param delimiter string, as in table.concat
 -- @return a string
 -- @see Foo, table.concat
 function foostr(foolist,delimiter)

Limited code inference occurs, so that we do not have to specify the function name, but otherwise everything is explicit. Dynamic languages pose a particular challenge here because we're forced to specify the parameter types in an informal way. Projects may define their own conventions, but there's no standard way to write 'list of Foo'. You may use see-references to other documented entities but they have to be in a @see section.

People have recognized for a while the limitations of this way of documenting APIs, which focuses on the individual pieces and can be silent on how to put them together, but it's much better than nothing. So the debate is often about making these comments easier to write. For instance, in a comparison between Doxygen and JavaDoc, Gushie says:

The most important problem with the Javadoc comment in the comparison is how much I need to concentrate on formatting issues while writing it. When writing Javadoc I am constantly thinking about what should and shouldn't be linked, whether the list will look right, etc. This is frustrating because, while I do want to document my code well I also want to focus on coding.

And that's it: a tool that forces you out of code flow is not going to make you happy about documentation.

One of the problems I have with LuaDoc is that HTML is awkward to type and ugly to look at, especially when you want to provide something as simple as a list. So one of the main requirements was that Markdown could be used instead. I try to be a good boy and not reinvent too many wheels, but LuaDoc proved fairly resistant to this change, since it helpfully stripped out the blank lines and line feeds from the comments at an early stage. But the tipping point was when I moved the Penlight codebase away from using the module function; LuaDoc loves the module function and it's often necessary to put it in a comment so that it will recognize something as a module. So ultimately the issue was that LuaDoc wanted code to be written in a particular way; furthermore, we all know that 'Lua does not have classes' but it's straightforward to work in an OOP style; again, there was no provision for such a choice. The guiding principle should be: a documentation tool should not limit a developer's choices.

So, wheel-reinvention happened, leading to LDoc. In a few days, I had something that did the job equally well, for myself. Seasoned campaigners will know of course that this is the easy bit; the hard bit is getting a tool that will work on other people's codebases. Thanks to some painstaking testing by Lorenzo Donatti and others, LDoc is now at the point where it's ready for prime time testing.

The lack of inline entity references in LuaDoc means that your references have to be footnotes to your comments. This style felt better:

 --- foostr stringifies a list of Foo objects.
 -- @param foolist a list of @{Foo} objects
 -- @param delimiter string, as in @{table.concat}
 -- @return a string
 function foostr(foolist,delimiter)

Using Markdown helps formatting lists:

 -- creates a new @{Boo} object.
 -- @param spec a table of:
 --  - `age` initial age of object (optional, defaults to 0)
 --  - `name` descriptive name used in caption
 --  - `foolist` list of @{Foo}
 -- @return @{Boo}
 function Boo:create(spec)

(The blank lines are somewhat irritating, but I chose eventually to work with standard Markdown here; it is certainly less awful than the equivalent HTML.)

This kind of 'ad-hoc' table structure is a common pattern in Lua, and I've argued before that it's a good idea to have named types so that documentation can refer to them. An in-between option is to create a dummy type which just exists for documentation purposes.
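A sketch of that in-between option, using LDoc's @table and @field tags (the BooSpec name and its fields are invented for illustration):

```
 --- a dummy type, existing purely so other comments can refer to it.
 -- @table BooSpec
 -- @field age initial age of the object (optional, defaults to 0)
 -- @field name descriptive name used in caption
 -- @field foolist a list of @{Foo} objects
```

A @param can then simply say 'a @{BooSpec} table' instead of repeating the field list everywhere.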

Plain and Simple

Not everyone likes this 'tag soup' style of documentation. For instance, this is the style followed in the Lua manual:

 -- Receives zero or more integers. Returns a string with length equal to
 -- the number of arguments, in which each character has the internal numerical
 -- code equal to its corresponding argument.
 -- Note that numerical codes are not necessarily portable across platforms.
 function string.char(...) end

This can be very effective with documenters who have a good clear prose style.

LDoc can support this by making the usual 'Summary' and 'Parameters/Returns' sections in the HTML template optional.

It was always awkward to configure LuaDoc, so I felt that LDoc needed a configuration file. As is often the case, Lua itself is an excellent format for such files.

 -- config.ld for Lua libraries
 project = 'Lua'
 description = 'Lua Standard Libraries'
 full_description = [[
 These are the built-in libraries of Lua 5.1
 Plus documentation for lpeg and luafilesystem.
 ]]
 file = {'builtin',exclude = {'builtin/globals.lua'}}
 format = 'discount' -- use lua-discount for formatting
 -- switch off the top summary and 'Parameters/Returns' in the template
 no_summary = true
 no_return_or_parms = true

(You can in fact even define your own template and/or stylesheet and reference them in the config.ld file.)

And here is the result, thanks to the API files from mitchell's TextAdept editor project.

The principle followed here is: allow the documenter as much control as possible, but make it straightforward for end-users to build the documentation. So a simple invocation of 'ldoc .' in the right directory will find the configuration file and do the rest.

Type Annotations

On the other end of the scale, it's useful to have type annotations for Lua functions. This is a requirement for full IDE support in the Eclipse Koneki project. Fabien Fleutot provided support for tag modifiers, so that one could say things like @param[type=string] name. I have provided convenient alias tags so one can say:

 -- list available servers providing a service
 -- @tparam string service name
 -- @treturn {Server,...} list of servers
 function list_available(service)

Beyond the built-in Lua types ('string', 'number', etc.) there is no agreed-upon way to specify types, so some invention is necessary. The convention is that {T,...} is a list-like table containing T objects (see the documentation for further discussion). LDoc will attempt to look up all identifiers within such a type specification, so that Server becomes a link to that type's definition.

Examples, Readmes and Command-Line Help

Per-function API documentation is often not enough. In particular, good examples can clarify the intended use, and narrative documentation can introduce the concepts and fill in the background. Complementary 'modalities' of explanation and extra redundancy allow end-users of different backgrounds and thinking styles to have various ways of understanding the topic.

So LDoc allows integrating the API documentation with examples and Markdown text as linked documentation. The best example I currently have is the winapi documentation. This is actually a C Lua extension, which is another LDoc feature. Here is the config.ld:

 file = "winapi.l.c"
 output = "api"
 title = "Winapi documentation"
 project = "winapi"
 readme = ""
 examples = {'examples', exclude = {'examples/slow.lua'}}
 description = [[
 A minimal but useful binding to the Windows API.
 ]]
 format = 'markdown'

Those of us who live in the command-line like to look up documentation quickly without messing around with a browser. Generating man-pages is an interesting and very do-able goal for LDoc, but man is showing its age and is pretty Unix-centric. Some great discussions with Norman Clarke inspired an LDoc feature where it looks up libraries on the module path and parses doc information.

So if you have an installed library with documentation, then it's easy to look up a particular function:

 $ ldoc -m pl.pretty.dump
 function        dump(t, ...)
 Dump a Lua table out to a file or stdout.
 t        {table} The table to write to a file or stdout.
 ...      {string} (optional) File name to write to. Defaults to writing
                 to stdout.

Thanks to mitchell's work at luadoc-ifying the Lua manual, this works for the standard Lua libraries (plus lfs and lpeg) as well:

 $ ldoc -m lfs.currentdir
 function        currentdir()
 Returns a string with the current working directory or nil plus an error

(ldoc -m lfs would list all the available functions.)

Further Work

It's possible to process the inner representation generated by LDoc using a custom Lua module, allowing multi-purpose pipelines. We're still working on use cases for this, but the basic machinery is there.

Other output formats could be supported; the program is meant to be extendable and I think generating LaTeX could be an excellent way of getting good-quality output.

Lua is surprisingly flexible at expressing concepts like modules and classes, so simple lexical code inference is probably not going to scale. Eventually we will have to use full static code analysis like David Manura's LuaInspect.

Thursday, 15 December 2011

MoonScript: A Lua-Based Language

Let a Thousand Flowers Bloom

There are literally hundreds of programming languages, most of which are historical curiosities, research tools or personal projects. You would think that most of the important innovations have happened, and that mature tools are available for all tastes. That is true, to a point: but we are still exploring the space of expressive notation for solving our programming problems. Here the practical programmer and the academic researcher diverge, of course; the latter sees most new languages as rehashes of things first done in the golden age of PL development.

The last few years have been a surprisingly fertile time for new programming languages. They range from Go (native code + new runtime) and Scala (bytecode + JVM) to languages like CoffeeScript or Dart which translate to JavaScript. The trend is mostly towards static typing, but with type inference helping to keep the syntax clear and uncluttered, with functional influences. There is some TIOBE and anecdotal evidence that dynamic language usage has peaked, possibly because of the maintenance difficulties of large codebases (an explicit motivation for Dart's development, which has optional typing.)

MoonScript is inspired by CoffeeScript, but generates Lua code instead. JavaScript and Lua share many similarities, like first-class functions, the equivalence of and m["name"], and no conventional OOP constructs. People like the convenience and familiarity of class, however, so CoffeeScript, Dart and MoonScript all provide explicit support.

Trying it Out

When reviewing a new language, it's useful to remember that programming languages take time and exposure to become mature. MoonScript is now on its second release, three months after the first, and getting out of its proof-of-concept phase. Leaf Corcoran is doing all the right things: he has provided syntax files for some common editors, a binary for the notoriously compiler-shy Windows community, and a good-looking website with comprehensive documentation. And he is building on a very solid base, the battle-tested Lua 5.1 release.

Also, reviews are not tutorials; I won't try to cover all the syntax, but will try to give the flavour of the language.

MoonScript could generate Lua bytecode directly, but would hit the basic issue that the "Lua VM" does not have standardized bytecode. In particular, this would exclude MoonScript from using the excellent LuaJIT implementation, which is a known problem with the otherwise powerful Metalua dialect. And having a readable high-level language as the compilation output makes it easier to develop and trouble-shoot.

Although not compiler-shy, I am lazy and just downloaded the Windows binary (it's hard to resist a language delivered as a 212KB zip file. Besides, the 'non-POSIX' world doesn't get enough attention these days.) If you want to avoid the 'moon script.moon' incantation, then it's straightforward to tell Windows how to handle MoonScripts:

 ftype moonscript=d:\utils\bin\moon.exe %1 %*
 assoc .moon=moonscript

(These make global permanent changes, so you need to run as Administrator). Then add .moon to the system PATHEXT variable. The incantation is now just 'script'.

 -- script.moon
 print "Hi Ma", "No Parentheses!"

Wish-list Driven Design

In many ways, MoonScript is like a collection of features from a Lua user's wishlist: list comprehensions, shorthand for lambdas, switch statements, explicit classes, statements-as-expressions, local-as-default. But Lua is the C of the dynamic languages; it has a small, very powerful core and 'conventional' syntax which has made it popular as an extension language. It moves slowly and carefully, and does not pander to 'Python envy'. MoonScript, on the other hand, will appeal to people with Python exposure, since indentation is used in the same way to delimit blocks.

I would not recommend MoonScript to a beginner in the same way I could for Lua itself, since it has a very concise syntax that appeals more to seasoned professionals. There is a lot more to remember, and always a number of ways to do similar things. This example shows the conciseness of expression:

 sqr = (x) -> x*x
 print sqr 4, sqr 5

The last statement in a function becomes its value; there's still return available but usually it's only needed explicitly when leaving a function before its end. print is still the same function it is in Lua, but parentheses are usually optional.

Well, cool: we've saved a lot of typing. That translates to something like this:

 local sqr
 sqr = function(x) return x*x end
 print (sqr(4),sqr(5))

Now saving on typing is an ambiguous virtue; if we wanted the ultimate compression, we would all be using APL and playing Code Golf. Classic Lua programmers prefer keywords to sigils, and fear that their language would end up like Perl. They say that reading code is a much more common operation than writing it. And that is fine and dandy (I mostly agree), which is why we need projects like MoonScript and Metalua to get a feeling for new notation.

Personally I like how MoonScript makes the functional aspects of Lua stand out more clearly. Higher-order functions are less cluttered with keywords:

 compose = (f,g) ->
   (...) -> f g ...
 printf = compose print, string.format
 printf "the answer is %d, '%s'", 42, "Everything"

List comprehensions are another favourite functional idiom:

 range = [i for i = 1,10,0.1]
 squares = [sqr(x) for x in *range]

Note the * syntax, which satisfies the common need to iterate directly over the elements of an array. (You can still say for _,x in ipairs range.) These constructs result in inlined Lua code, which is a good decision given the very fast translation times for Lua and its modest resource needs.
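As a rough sketch (not the compiler's exact output), the squares comprehension above inlines to something like this plain Lua:

```lua
-- hand-written Lua equivalent of: squares = [sqr(x) for x in *range]
local function sqr(x) return x * x end
local range = {1, 2, 3}

local squares = {}
for i = 1, #range do
  squares[#squares + 1] = sqr(range[i])
end
```

No intermediate iterator function is called; the loop is generated directly.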

Multiple for clauses are allowed:

 points = [{x,y} for x in *x_coords for y in *y_coords]

That generates about 17 lines of Lua output; hand-written would be about 10, which would still slow readers down and make them have to guess the programmer's intention.

Statements can be used as expressions, which is useful in functions:

 choose = (c,a,b) -> if c then a else b

In fact, there's a clever feature where for statements return tables:

 show = (t) ->
     for k,v in pairs t
         print k, v
 res = for i = 1,10 do 2*i
 show res

This is sufficiently clever to cause unintended consequences, so the rule is that it does not happen inside function expressions, which avoids needless table generation:

 loop = -> for i = 1,10 do i
 print loop()

It's probably too clever to be truly wise, given that list comprehensions do the job more explicitly. However, it can be used to go beyond the list comprehension syntax:

 values = {1,4,3,5,2}
 v = for v in *values do switch v
     when 1 then 'bad'
     when 4 then 'cool'
     when 5 then 'excellent'
     else "uh-huh"
 -- v is now {'bad','cool','uh-huh','excellent','uh-huh'}

New with this release are table comprehensions, which are a generalization to map-like tables. These use curly braces to distinguish themselves, and the expression must be double-valued, with the first value becoming the new key and the second the associated value. So this expression turns an array into a set: t["hello"] is true if ls contains the value "hello".

 t = {k,true for k in *ls}
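In plain Lua, that set-building comprehension corresponds to something like:

```lua
-- plain-Lua equivalent of: t = {k,true for k in *ls}
local ls = {"hello", "world"}

local t = {}
for i = 1, #ls do
  t[ls[i]] = true  -- first value is the key, second the value
end
```

Membership is then just an index: `t["hello"]` is true, anything absent is nil.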

At this point, it's good to remember that we are still in Lua land, and there is no automatic pretty-printing of tables - which is wise because tables are the ultimate flexible data structure. (You can choose the string representation of your own types easily enough.) There is a small library included which provides a generic table dumper:

 require 'moon'
 moon.p {
     one: 'hello',
     10,
     20
 }

(For those used to Lua, note the use of : for key-value assignments in table constructors; the key may also be a keyword. For those who are not, note that tables can both be 'arrays' and 'hashes' - they are general associative arrays which are particularly efficient at regular array indexing. Like electrons, they are both waves and particles.)
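For those meeting Lua tables for the first time, a minimal illustration of that dual nature:

```lua
-- one table acting as both an array and a hash
local t = { "one", "two", colour = "blue" }

-- array part, indexed from 1
local first = t[1]          -- "one"
-- hash part, indexed by key
local c = t.colour          -- "blue"
-- the length operator only counts the array part
local n = #t                -- 2
```

Regular integer indexing is particularly efficient, so the same type serves for both roles.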

However, despite the perception, there are batteries for Lua and MoonScript can consume them happily.

Using Existing Lua Libraries

MoonScript already includes LuaFileSystem and LPeg as standard libraries.

 total = 0
 for f in lfs.dir "."
     if string.match f, "%.lua$"
         size = lfs.attributes(f).size
         print f, size
         total += size
 print 'total', total

(It also comes with lua-getopt for argument parsing, but that's an awkward, no-documentation kind of library.)

My favourite command-line parser is Lapp, which is part of the Penlight suite of libraries, since it parses the actual flag names and types from the usage string:

 -- mhead.moon
 require 'pl'
 args = lapp [[
     A simple head utility
     -n,--lines (default 10) number of lines to print
     <file> (default stdin) file to print; can be stdin
 ]]
 for i = 1,args.n
     line = args.file\read!
     if not line then break
     print line

On Windows, the directory for Lua packages is relative to the executable. So if I install moon.exe to d:\utils\bin, then Lua packages must go in d:\utils\bin\lua. To install Penlight simply involves copying its pl directory to this lua directory. For other systems, MoonScript shares the usual global Lua module path.

To read configuration files (both Unix and INI-style) we can use pl.config:

 require 'pl'
 test = [[
 # test.config
 # Read timeout in seconds
 read.timeout = 10
 # Write timeout in seconds
 write.timeout = 5
 # acceptable ports
 ports = 1002,1003,1004
 ]]
 t = stringio.open test
 pretty.dump t
 --[[ --output--
 {
     ports = {
         1002,
         1003,
         1004
     },
     write_timeout = 5,
     read_timeout = 10
 }
 ]]


By the time people get to Lua (or Go) they have been sufficiently indoctrinated to instantly miss classes. Lua prefers to define more general metaprogramming techniques, which allows a competent user to roll any kind of custom OOP framework. (Unfortunately, too many people do this, so interoperability becomes a problem.) But ultimately all that matters is that the duck quacks. Lua does require you to indicate that the quacking applies to a particular duck, or to all ducks, much as a C++ programmer uses -> and ::.

In Lua, a method call is:

 duck:quack()
And in MoonScript:

 duck\quack()
(Idiomatic MoonScript would actually be duck\quack! but let's change one variable at a time.)

I'm not 100% crazy about the backslash here, but : has now got a more conventional job defining value/key pairs in MoonScript (and there was no other punctuation available ;))

So here is a MoonScript class, bit by bit:

 import insert,concat,remove from table

The import statement is one of those nice-to-haves which would be very useful in Lua itself, since the equivalent is noisy and redundant:

 local insert,concat,remove = table.insert, table.concat, table.remove

A class is an indented block containing methods:

 class List
     new: (t) =>
         @ls = t or {}

Note the colon, instead of =; this reminds us that we are putting these functions into a table. The fat arrow => means that the function has an implicit self argument; here it's equivalent to (self,t) ->.

@ls is just sugar for self.ls, as in Ruby. (The new method will be indirectly called as a constructor.)

Our chosen representation for a list is a table field ls. This can be manipulated by the usual Lua table functions, which we previously imported:

 add: (item) =>
     insert @ls,item
 insert: (idx,item) =>
     insert @ls,idx,item
 remove: (idx) => remove @ls,idx
 len: => #@ls
 __tostring: => '['..(concat @ls,',')..']'

table.insert is basically two functions wrapped up as one - it both appends and inserts. #@ls is one of those collisions of punctuation that one just has to get used to in any language - it helps to read it as #self.ls (and you can use self explicitly if you like.)

Any Lua metamethod can be specified in a class. A MoonScript class is the metatable for all its instance objects, with __index pointing back at the metatable so the object can find its methods. This is the most common way of implementing 'classes' in Lua.
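That pattern, stripped of MoonScript sugar, looks roughly like this in Lua (a sketch of the idiom, not MoonScript's exact generated code):

```lua
-- the common Lua 'class' pattern: the class table doubles as the
-- metatable, and __index points back at it so instances find methods
local List = {}
List.__index = List

function List.new(t)
  return setmetatable({ls = t or {}}, List)
end

function List:add(item)
  table.insert(self.ls, item)
end

function List:len()
  return #self.ls
end

local l =
l:add(10)
l:add(20)
```

Looking up `l.add` fails on the instance itself, so Lua consults `__index` and finds the method in the class table.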

Unfortunately, # cannot be overloaded in Lua 5.1 for tables; __len only applies to userdata defined by C extensions. (This is one of the welcome Lua 5.2 improvements)

 find: (item) =>
     for i = 1,#@ls
         if @ls[i] == item then return i
 remove_value: (item) =>
     idx = self\find item
     self\remove idx if idx
 remove_values: (items) =>
     for item in *items do self\remove_value item

if statements come in two flavours: multiline, where the following block is indented, and single statements that use then. The same applies to for, where the single statement version has an explicit do.

Any statement/expression can also have an if clause, called a 'line decorator'.

Now list comprehensions get their turn:

 index_by: (indexes) =>
     List [@ls[idx] for idx in *indexes]
 copy: => List [v for v in *@ls]

And finally we define extension, iteration and concatenation:

 extend: (list) =>
     other = if list.__class == List then list.ls else list
     for v in *other do self\add v
     self
 __concat: (l1,l2) -> l1\copy!\extend l2
 iter: =>
     i,t,n = 0,@ls,#@ls
     ->
         i += 1
         if i <= n then t[i]

Any instance of a MoonScript class will have a __class field; I make this check because it's useful to allow a list to be extended by a plain vanilla table.

The iter method needs some explanation: in Lua, the for in statement expects a function that it can repeatedly call to get each successive value of the sequence, until that function returns nil. This method returns a closure with upvalues i, t and n.
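In plain Lua, the same closure-based iterator looks like this:

```lua
-- a generic 'for in' iterator: a closure over the upvalues i, t and n
local function iter(t)
  local i, n = 0, #t
  return function()
    i = i + 1
    if i <= n then return t[i] end
    -- falling off the end returns nil, which terminates the loop
  end
end

local sum = 0
for v in iter {10, 20, 30} do
  sum = sum + v
end
```

The generic for calls the returned function repeatedly until it yields nil.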

And, exercising List:

 ls = List!
 ls\add 10
 ls\add 20
 ls\remove_value 10
 ls\remove_value 15
 ls\add 30
 ls\remove 1
 for v in *{1,2,3} do ls\add v
 print ls, ls\len!
 print ls\index_by {2,3}
 for v in ls\iter!
     print v
 print (List{'a','b'})..(List{'A','B'})

No type is complete without a definition of equality. == in MoonScript, as in Lua, compares simple values, not references. Objects are equal only if they are the same object (string equality works as expected because all Lua strings are guaranteed to be internally unique.) Generally, this is a good idea (none of JavaScript's 'helpfulness' here!). The relevant metamethod is __eq:

 __eq: (l1,l2) ->
     return false if l1\len! != l2\len!
     l1, l2 = l1.ls, l2.ls
     for i = 1,#l1
         return false if l1[i] != l2[i]
     return true
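The underlying Lua rule: __eq is only consulted when both operands are tables sharing the metamethod; otherwise == falls back to reference comparison. A small plain-Lua illustration:

```lua
-- __eq fires only when both operands carry the metamethod
local mt = {__eq = function(a, b) return a.x == b.x end}
local p1 = setmetatable({x = 1}, mt)
local p2 = setmetatable({x = 1}, mt)

local by_value = (p1 == p2)       -- true: compared via __eq
local by_ref   = (p1 == {x = 1})  -- false: plain table, reference compare
```

This is why the List __eq above can assume both arguments are Lists.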

As another example of a Lua wish-item, note that ls\add without arguments is a valid expression, called a 'function stub':

 ls = List {1,2}
 add = ls\add
 add 3
 add 4
 assert ls == List {1,2,3,4}

ls\add 3 is short for ls.add ls, 3; ls\add is a partial function application that binds the self argument.
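In plain Lua, a function stub is just a closure that captures the receiver:

```lua
-- hand-rolled equivalent of MoonScript's ls\add function stub
local ls = {}
function ls:add(x) self[#self + 1] = x end

-- bind the self argument once, then call like a free function
local add = function(...) return ls.add(ls, ...) end
add(3)
add(4)
```

The stub costs one closure; the bound calls themselves are ordinary function calls.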

See here for the complete code listing.

Use as a DSL

One of the things I look for in new languages is how amenable their syntax is to the creation of embedded domain-specific languages.

The MoonScript author has a static website generator called sitegen where you specify your configuration with a site.moon file:

 require "sitegen"
 site = sitegen.create_site =>
     deploy_to "", "www/sitegen"
     @title = "Sitegen"
     @url = ""
     add "index.html"
     add "doc/"

We've met the fat arrow (=>) before in methods, but it can be used anywhere. Here => is shorthand for (self) ->. Now the @title shortcut makes sense, since it's just self.title.

The Lua equivalent is certainly more verbose, although I'm not sure we're losing much on readability:

 require "sitegen"
 local site = sitegen.create_site(function(self)
     self.title = "sitegen"
     self.url = ""
     add "index.html"
     add "doc/"
 end)

sitegen is the best existing example of real working MoonScript. As the author says modestly, "apart from the compiler". (That's right, MoonScript is now self-hosting.)

Problems and Irritations

You'll notice the necessary parentheses when concatenating two List expressions. Although we are encouraged to leave parentheses out unless they are necessary, the rules are not as straightforward as they could be. For instance, these two lines are parsed quite differently; note how whitespace matters:

 L"frodo" + L"bilbo" == L("frodo") + L("bilbo")
 L "frodo" + L "bilbo" == L("frodo" + L("bilbo"))

This comes from a Lua convention that parens are not needed if the function has a single string argument, but results in a special case that is likely to bite more than one person. It's easy enough to learn to read indentation as syntax, but making a single space syntactically significant is going too far. I would see this as destructive interference from the base language.

There are other inconsistencies, like you cannot say @\f to call another method when inside a method body. (This would lead to serious punctuation clusters like @\f! so perhaps this is a feature ;))

Generally, the compiler is not so very helpful and gives non-specific parse errors, and the line it does complain about is often not the problem line anyway. So I found myself having to read the manual carefully to work out issues. It reminded me of my Seventies experience with mainframe Fortran, where the result of painstaking work at the card puncher would lead to the equivalent of 'Que?'. And run-time errors would usually end up as a six-digit error code which you would then have to look up in the IBM manuals, which were bolted down like Medieval manuscripts. We've come a long way, and it shows how important good diagnostics are to learning a system: the compiler as tutor.

But hey, it's a young language, these are all fixable faults. It's a must-play for anybody interested in the future evolution of Lua, and could find an audience among CoffeeScript enthusiasts who might appreciate a lighter, faster runtime, especially with more generally useful batteries.

Friday, 9 December 2011

Ten Programming Languages

This is a personal list of programming languages that influenced me. It isn't a representative survey, nor is it complete (being something of a programming language slut I've tried everything, once). There are several common threads in this discussion. One is that I've always enjoyed interactive prompts; there is nothing better for exploring a language or a new library. (And any language can be made interactive.) The other is that I'm allergic to long build times, which probably indicates borderline ADD or at least a lack of rigour.

Another theme is over-specialization, like FORTRAN, or paradigm-driven design, like Java. Such languages find it hard to move with the times and can end up in evolutionary cul-de-sacs.

One language-nut thing I never did was design a new language. No doubt very entertaining, but it feels like adding a new brick to the Tower of Babel. The hardest part of writing a program is to make it clear to other humans, since computers aren't fussy once you feed them correct syntax. Every good programmer needs to be a polyglot and pick the right language for the job; there is no one programming language that stretches comfortably across all domains.


FORTRAN

This is the grand-daddy, and I can even give you a pop culture reference.

This was the language we were taught for first year CS at Wits University, Johannesburg. In fact, it was its last year before Pascal became the teaching language, but FORTRAN remained in my life as a scientific programmer for many years. It's widely considered a relic of the computing Stone Age, a dinosaur, although dinosaurs were sophisticated animals that dominated the world for many millions of years, and are still represented in their modern form as birds. So the clunky FORTRAN IV of my youth became Fortran 90. The old reptile is still used for high-performance computing.

This is the style of code which got people onto the Moon. The fixed format comes straight from the limitations of punch cards, with code having to start in column seven and end at column 72.

      C = 5.0 * (F - 32) / 9.0
      F = F + FSTEP

It has evolved into a modern-looking language (programmers have since learnt not to shout).

 real f,low,high
 low = 0
 high = 100
 step = 10
 f = low
 do while ( f <= high )
    c = 5.0 * (f - 32) / 9.0
    write(*,fmt="(2F8.3)") f,c
    f = f + step
 end do

Lots of old dinosaur code is still found in the wild, however! Trying to discover what it does in order to reimplement it is a challenging exercise, which is why keeping algorithms only in code is a bad idea, particularly if the code is written in Olde Fortran by scientists and engineers.

It remains popular, because many scientists still learn it at their mother's knee and it's a simpler language than C (despite the best efforts of the Fortran 90 committee.) This simplicity, plus built-in array operations, makes it easier for the compiler to optimize common numerical operations. And there's always the inertia of the numerical community and the availability of so many libraries in Fortran.


Judy Bishop's second-year course in Pascal helped me to recover from Fortran and taught me structured programming.

Niklaus Wirth's book was probably my first intellectual experience in the field. Pascal was always intended to be a teaching language, and Modula was supposed to be the version for grown-ups. Classic Pascal was a (deliberately) restricted language, and Brian Kernighan famously wrote Why Pascal is Not My Favorite Programming Language. But then you must remember that compared to FORTRAN IV it was like awakening from the primal ooze.

I remember reading somewhere (probably from the LOGO experiments) that when kids are exposed to a programming environment, they do not spontaneously structure their programs. They readily grasp the idea of giving an action a name, and then repeating it, but do not refactor complicated actions into a series of simpler actions. Such things have to be taught, and this was the single biggest thing I learnt about programming at university. After all, students have to be taught to write, since organizing language in the form needed to communicate effectively does not emerge from their native spoken language skills.

Subsequent paradigm shifts in programming education have happened, like object-orientation and the move to Java, but I can't help feeling that basic structuring skills are neglected when students' minds are immediately poured into classes. OOP downplays functions, but they are important building blocks and people need to know how to work with them. Perhaps the mental development of programming students should follow historical development more closely?

In the late Eighties, Borland sprang onto the scene with Turbo Pascal, which was a marvelous, fast and practical implementation. This was my workhorse until I discovered C, but later I used Delphi for Windows GUI development and still consider it to be one of the best options available.

One of Borland Pascal's distinctive features was a very smart linker. The equivalent of object files had all the necessary type information, so that compiling and linking programs was fast even on modest hardware.


The only interactive language on the Wits mainframe was LISP, which started a lifelong love of conversational programming languages and introduced me to the notion that programs could manipulate symbols. (And by extension, manipulate programs.)

I found an old copy of John McCarthy's LISP 1.5 manual in the library, and my head was seriously opened. Recursion is a beautiful thing, and I first saw it in LISP.

Unlike others, I didn't pick up LISP later, but it planted seeds that flowered when I used Lua much later.


My first C experience resulted in a crashed machine, which happens when you are spoilt by Pascal's run-time array bounds checking. C is the ultimate adult programming language, and does not get in your way if you insist on shooting yourself in the foot. Again, Turbo C was the entry drug and I've been hooked ever since; I could replace my mixture of Pascal and x86 Assembly with one language.

C is not a particularly good language for GUI development. I've used it for both Windows and GTK+ development, and it gets tedious (and remains unforgiving.)

A characteristic feature of C is that the language itself is very small; everything is done with libraries. With earlier languages, doing I/O was built in; Fortran's write and Pascal's writeln are part of the language and are statements; printf is a library function. So C is the language of choice for the embedded world, where programmers still worry about getting a program into 4K.

The joy of C is knowing pretty much exactly what each bit of code does, without worrying about hidden magic or operator overloading.


As a meta-observation, none of the languages in this list are considered particularly cool or state-of-the-art, except LISP which has continuously evolved since 1962. This is because they are pragmatic tools for getting things done. In this respect, C++ gets a lot of respect, because it raises C to a higher level, while keeping the abstraction penalty as low as possible. Learning the language was challenging (using Stanley Lippman's book) but very rewarding. Building GUIs is a good fit to object-oriented design and I went through the usual rite of passage, which was creating a minimal Windows API binding.

A fair amount of my work was maintaining C++ MFC systems. Microsoft's MFC framework is widely considered one of the worst class libraries in the known universe, and for many years the word 'Visual' in 'Visual Studio' was a cruel joke. One of its many sins was encouraging storing data as persistent object binary archives; this breaks any attempt at easy continuous evolution of a program's structure by tying it directly to the structure of the persistent store.

However, bad programs and frameworks can happen to good languages, and I developed an increasingly obsessive relationship with C++. The core idea of the obsession was that people learn best by conversational interaction with a programming language, and it led to the development of UnderC and a book, C++ by Example, organized around this principle. This gave me the interactive prompt that I had been missing, and that I hypothesized students were missing too.

The other idea is that a good deal of the awkwardness of big C++ development comes from an antiquated build model inherited from C. C++ compilers aren't actually slow, but they have to make a heroic effort to cope with enormous header files. With an interpreter you can keep a program alive, and only recompile the bits that need to change, reducing the need for frequent rebuild-the-world steps and the necessity to get back to the program state you need to test. Furthermore, with an interpreter a language like C++ can serve many of the roles occupied by 'scripting languages', so that there is no need for Ousterhout's Dichotomy. (There are big assumptions running around wild in this paragraph, and they should be the subject of a more specialized article.)

Getting the interpreter to parse the full STL was my Waterloo, and I subsequently developed a dislike for the language and its sprawling complexity. In truth it was too big a job for one primary developer, and some UnderC design deficiencies proved too great an obstacle to progress. In particular, I was not storing an internal AST representation of templates, but rather keeping their bodies as text. Blame that on my youthful decision to do physics rather than CS, I suppose.

UnderC remains bound to x86 by some assembler shenanigans, and a logical next step would be to use libffi to get more platform independence, at least for linking to C. The grand vision that it could consume any C++ shared library was briefly viable, but the continuous changes in the C++ ABI and name-mangling schemes makes that an endless Red Queen race.

I became increasingly out of step with the direction that modern C++ was taking, with growing use of templates leading to a language that was increasingly slow to compile, and often generating some of the most obscure error messages ever presented to unfortunate users.

C++ remains the hardcore choice for many applications (although I suspect that often it's premature optimization) and seems to have a long life ahead of it. I may even forgive it enough to give it another chance one day.


C# is an interesting language, because it is the direct result of a commercial dispute with religious undertones. Microsoft had brought over Anders Hejlsberg (the chief architect of Delphi) from Borland, to work on their Visual J++ product. This dialect implemented delegates, which are method pointers similar to Delphi's events. This was not only heretical, but against the Java licensing conditions, and for this and more egregious sins Sun won a lawsuit against Microsoft.

By this time, Microsoft had the Java bug, but to keep moving in the direction they wanted to go they needed to fork the language, and C# was born.

C# felt like a natural fit to GUI programming for me since I knew Delphi well, and anybody who has used both the Delphi VCL and System.Windows.Forms will know the family resemblance.

Unlike its estranged cousin Java, C# has kept evolving and has never been particularly afraid of breaking binary compatibility. Its model of genericity is efficient for native types (unlike collections in Java which can only contain boxed primitives), it has type inference and anonymous functions.

A consequence of this constant evolution is that .NET people are not seeking 'a better C#' in the same way that programmers in the JVM ecosystem are.

The flip side is that the .NET world has always been hooked into Microsoft's own ADD and need to sell new tools, with old acronyms dying like flies and the constant need to chase new novelties.

All in all, a great girl. But a pity about the family, as they say.


Boo isn't a well-known language, but has some interesting features. Syntactically rather like Python, except it is a statically-typed compiled .NET language with aggressive use of type inference. It also has support for compile-time metaprogramming and so can do 'smart macros'.

I did a Scintilla-based editor in Boo, and learnt its strengths and weaknesses well. (Writing editors is a common hobby among programmers, since we have such a personal relationship with our tools.)

Ultimately the disillusioning realization was that the extra expressiveness of the language did not particularly bring much advantage over using C# directly. To do Sciboo in C# would have required maybe five times as much code, but that extra code would have compiled a good deal faster than the equivalent Boo code and would have come with much better tool support.

Writing a program in a new programming language is taking a bet on an outsider, and Sciboo would probably have caught more traction if it had been done in C#.

Still, Rodrigo "Bamboo" de Oliveira's Boo remains the second-greatest programming language to come out of Brazil.


The Java moment for programmers is when they feel the advantages of a simplified C++ with an extensive library and GUI toolkit, without the need to worry about memory management and having rich run-time type information available. I had already had this epiphany with C#, but increasingly appreciated the cross-platform JVM and the excellent tooling provided by Eclipse. Ultimately, it's the ecosystem that matters, the language, libraries and tools.

Java is not a fashionable language, for various reasons. It was deliberately designed as a restricted language built around object-oriented principles, a purified C++, much as Pascal was a simplification of earlier big languages like Algol 68. A programming language built around a particular programming paradigm is like an overspecialized species in a particular ecological niche. The big dinosaurs flourished in a different climate, hotter, wetter and with higher oxygen levels. The little rats running around their feet adapted to the changed conditions and eventually conquered the world. Together with its dogmatic design (the fundamental unit was always the class) came an attitude that the designers knew best, and that is fundamentally off-putting to hackers. Modern (or at least more recently mainstream) ideas like functional programming are inherently hard to implement on such a platform.

Some of Scala's woes come from the fact that a simple little anonymous function has to be implemented as a separate class, and classes don't come cheap on the JVM; even the tiniest classes take up about half a kilobyte, mostly metadata. (The slowness of scalac means that I personally would be unlikely to enjoy Scala, despite its attractive features, as I concluded with Boo.)

If one works within the limitations, exploits its dynamic features, and ignores the endless moralizing and pontification, Java can still be a useful and efficient solution to many problems. 'Expressiveness' is not a virtue that can be used in isolation.


Lua and LISP are the only dynamic languages in this list. Of the dynamic languages, Lua has the smallest implementation and the fastest performance. Lua was designed to be easily embeddable in a host C/C++ program, and often used in big commercial games. The language implementation is about 160K, and is about two times faster than Python. LuaJIT is a heavily optimized JIT implementation that can often give compiled C a good run, which is worth bringing up when people repeat the old assertion that statically typed languages have an inherent performance advantage.

I'm not a natural candidate for learning a 'gaming' scripting language (since computers are too interesting without needing games) but I learnt Lua scripting for the SciTE editor, and naturally moved onto the desktop.

In many ways, Lua is the C of dynamic languages: lean, fast and without batteries. The team at PUC-Rio do the core language, and the community provides the batteries, much as Linus Torvalds does the Linux kernel and GNU, etc provide the userland. The analogy is not exact, since core Lua provides math, io and a very capable string library. Lua is based on the abstract platform defined by C89, and so if you need things like directory operations or sockets you need extensions outside the core. Lua Distributions are available that provide similar experience to the 'big' scripting languages, but they are not 'canonical' or 'blessed'.

One reason that Lua is used as an extension language is that the syntax is conventional and unthreatening. I often come across engineers or scientists who just cannot read curly-brace C-like languages, and they naturally start counting at one.

A common criticism from people used to Python or Ruby is that there is no actual concept of 'class' in the language. Lua provides more general metaprogramming mechanisms that make it easy to implement a 'class' system in a few lines of code. There are in fact a number of distinct (and incompatible) ways to do OOP, which does lead to interoperability issues.

But trying to shoehorn everything into an existing concept like 'classes' is a sign of not 'getting' a language. In Lua functions are first-class citizens and are proper closures, making a more functional style possible. Lua's single data structure is the table, which is an associative array; like JavaScript, m["key"]==m.key so that you can use tables like structs.

LuaJIT is concrete proof that serious performance and dynamic typing are not mutually exclusive. However, the main challenge for dynamic languages is what used to be called 'programming-in-the-large'; how to compose large systems from subsystems and reason about them. It's easy to produce mad unmaintainable messes in dynamic languages.

It's obvious that dynamic function signatures contain less information than their static equivalents: sqr(x) is less self-documenting than sqr(double x). So for dynamic languages, there is a need to document the actual types so that users of a module can use it effectively. The compiler never reads comments, of course, and no compiler in the world can make useful comments mandatory. It's a matter of discipline and having structured ways to express types as documentation.

Having explicit types also allows tools like IDEs to provide intelligent browsing and code completion. A surprising amount of information can be extracted by static analysis (for instance, David Manura's LuaInspect), but you still need, in addition, a machine-readable way to provide explicit type annotations.

A very interesting generalization of Lua is Fabien Fleutot's Metalua, an implementation that allows language extensions to be written in Metalua itself. For instance, this is a Metalua module which allows the following code:

 --- normalized length of foo string.
 function len(s :: string) :: number

This inserts type assertions into the code (although this can be disabled). This is reminiscent of the direction taken by Google Dart.

My own modest approach with LDoc involves extended JavaDoc-style tags:

 --- normalized length of foo string.
 -- @tparam string s the foo string
 -- @treturn number length
 function len(s)


Go is a modern, bold attempt to build a better C. It is garbage-collected, but compiles to native standalone executables, and has built-in strings, maps and CSP-style concurrency organized around 'goroutines' and channels. Go functions may return multiple values, like Lua. Type inference means that code outside function and struct declarations rarely needs explicit types.
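Both features can be seen in a few lines. This is just a sketch; divmod is an invented helper, not from any code discussed here:

```go
package main

import "fmt"

// divmod returns two values, quotient and remainder, like a Lua function.
func divmod(a, b int) (int, int) {
	return a / b, a % b
}

func main() {
	q, r := divmod(17, 5) // the types of q and r are inferred
	fmt.Println(q, r)     // prints: 3 2
}
```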

Such boldness is unlikely to be universally popular, of course. For diehard C fans, 'a better C' is a nonsensical idea. For Java/C++ fans, the lack of classic OOP is a problem. The advocates of Go consider this lack to be a strength; polymorphism is supported by interfaces (as in Java) but there's no inheritance in the usual sense; you can 'borrow' implementation from a type using embedding. Inheritance has its own issues, after all, such as the Fragile Base Class problem, where a class is at the mercy of its subclasses (or vice versa, depending on the interpretation.)

Likewise, classic exception handling is replaced by handle-errors-at-source and panicking (with optional recover) when that is not feasible.

It's natural to miss the comforts of home, but learning a new language often involves unlearning the mental habits of an earlier language.

Go feels like a dynamic language, not only in the time getting from Save to Run, but in the lack of explicit typing. Interfaces work as a user of a dynamic language would expect: if a function expects an interface that only contains a Read() []string method, then any object that provides such a method is acceptable. It is rather like compile-time duck-typing, and eases the interdependency problem that plagues languages like Java. But these things are all vigorously checked (Go does not do implicit typecasts); typos tend not to blow up programs at run-time.
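A minimal sketch of that compile-time duck-typing (all the names here are invented for illustration): any type with the right Read method satisfies the interface, with no declaration linking the two.

```go
package main

import "fmt"

// StringSource matches the Read() []string example above.
type StringSource interface {
	Read() []string
}

// fixed is an arbitrary type; it never mentions StringSource.
type fixed struct{ rows []string }

func (f fixed) Read() []string { return f.rows }

// first accepts anything that structurally satisfies StringSource.
func first(s StringSource) string { return s.Read()[0] }

func main() {
	fmt.Println(first(fixed{[]string{"alpha", "beta"}})) // prints: alpha
}
```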

The language fits easily in the head. Goroutines+channels are easier and less problematic than traditional threading models. Go is clearly designed to do serious work on the server, and the standard library provides good support for networking and protocols, so that's where most of the adoption will take place.

A criticism from the C++ perspective is that too much functionality is built-in, and the language lacks the abstraction mechanisms to further extend itself. For instance, Go's maps cannot be defined using Go itself. But making maps a primitive type remains a good pragmatic decision for Go, because std::map and friends in C++ impact end users in terms of the large number of template headers that must be parsed and the resulting unhelpful errors if anything goes wrong.

I've personally used it for processing binary data files where I would otherwise use C, and it feels productive in this role. I miss the standard C promotion rules: an int will not be promoted to a float64 even in the obvious contexts, which means a lot of explicit typecasting. Also, 'variable not used' errors are irritating when prototyping. But mostly I'm enjoying myself, and learning a new language. Which ultimately is what this article is all about.
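The missing promotion looks like this in practice (a trivial sketch):

```go
package main

import "fmt"

func main() {
	n := 3       // int
	scale := 0.5 // float64
	// scale * n does not compile: Go never promotes int to float64,
	// so the conversion must be written out explicitly.
	v := scale * float64(n)
	fmt.Println(v) // prints: 1.5
}
```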

Friday, 2 December 2011

Reading CSV Data in Go using Reflection

Filling in Structs

A common data format is CSV (Comma-Separated Values). Although simple in principle, there are some tricky quoting issues fully discussed in the relevant RFC. Fortunately, the Go csv package handles all of this decoding for you.

For instance, this reads CSV data from standard input:

 r := csv.NewReader(os.Stdin)
 row,err := r.Read()
 for err == nil {
     // ... do something with row, which is a slice of strings ...
     row,err = r.Read()
 }
 if err != os.EOF {
     // ... handle a genuine read error ...
 }

The delimiter need not be a comma; for instance you can set r.Comma to '\t' for tab-separated fields before reading.

What you then do with the row is up to you; this article shows how to map selected columns onto the fields of a struct, performing conversions as needed. This provides a declarative way of describing your data.

The assumption is that your data has an explicit header giving column names:

 First Name,Second Name,Age

The corresponding struct definition looks like this:

 type Person struct {
    FirstName string `field:"First Name"`
    Second_Name string
    Age int
 }

We are going to use reflection to discover the names and types of any public fields of the struct, and use optional field tags to find the corresponding column name if it isn't the same as the field name. (Note that as a further convenience, underscores in field names will correspond to spaces in the column name.)

Tags are only accessible using reflection, but provide a simple mechanism to annotate struct fields with key/value pairs inside Go raw quotes (backticks).

This is a common pattern used by those Go packages which can read structures from text encodings, for instance xml and json.

Reflection in Go is fairly straightforward if you get the concepts first. In particular, if you are used to Java-style reflection, you have to distinguish between types and pointers to types. We are going to create an iterator that runs over all rows in the file, and pass it a pointer to our struct, which will have its values rewritten each time. This creates less unnecessary garbage, and allows for some optimizations. The pointer is passed as interface{} and we get the actual run-time type of the struct itself like so:

 st := reflect.TypeOf(ps).Elem()

(The Elem() is necessary because the type of ps is a pointer).

Once we have the type of the struct, then st.NumField() is the number of fields and st.Field(i) is the type of each field in the struct. The tag 'field' is accessed using st.Field(i).Tag.Get("field"); if this is an empty string, then we have to use the name of the field. The column index of the field in the data can then be looked up in the first row, which has to be the column names.

Reading and converting each line proceeds as follows: we know the 'kind' of the field type and switch on five possibilities: string, integer, unsigned integer, float and value. We can then safely use the appropriate strconv function to convert the string and the appropriate reflect.Value method to set the field's value; for instance, all integer fields are set with the SetInt method, which is passed the largest integer (int64) for all integer types.

The use of this API is intended to be as simple as possible. As usual with Go code, there is very little explicit typing of variables needed.

 r := csv.NewReader(in)
 p := new(Person)
 rs,_ := NewReadIter(r,p)
 for rs.Get() {
     // each iteration fills in the fields of p
 }
 if rs.Error != nil {
     // handle any conversion error
 }

The first argument to NewReadIter involves a particularly Go-like solution to the problem of over-specific types. When I was first refactoring test code into a general solution, my structure looked like this:

 type ReadIter struct {
     Reader *csv.Reader
 }

Well, that's a very C-like way of doing things. We actually don't need to have an explicit dependency on the csv package, since all we require of our reader is that it has a Read method like csv.Reader:

 type Reader interface {
     Read() ([]string,os.Error)
 }

 type ReadIter struct {
     Reader Reader
 }

This use of interfaces will be familiar to Java programmers, but there is a big difference. If we had a Java package called csv, then we would have to use the same interface as csv.Reader. And if this class was not designed generally enough and did not implement an interface, then we would have to write a bridging class that implemented our interface for csv.Reader. It either leads to close coupling or lots of adapter code. Go interfaces, however, can be considered as 'compile-time duck typing'. All our Reader interface requires of a type is that it implements particular methods. This csvdata package does not depend on the csv package in any way, and you could use any object that knew how to feed us string slices with a Read method. This is similar to the concept of protocols in Objective-C.

Arbitrary Types

So far, the reflection code handles basic types. It's not hard however to find examples of data containing dates and other custom types. You could keep dates as strings and parse them as needed, but it's better to keep a canonical internal form.

A solution is provided by the flag package for command-line parsing, where you can handle new types by making them implement the Value interface:

 type Value interface {
    String() string
    Set(string) bool
 }

Interfaces that support String are ubiquitous in Go (in particular fmt.Println knows about them). The Set method is for custom parsing of strings into your type.

Parsing and displaying times is handled by the time package. It's a little eccentric but easy once you get the pattern. The date/time format is provided as a string that looks just like the format you want, but with standard values for the various parts. So, month is '01', day is '02', hour is '03', minute is '04', second is '05' and year is '06'. Hour may also be '15' for 24-hour format, and '06' may be '2006' for four-digit year format.

 const isodate = "2006-01-02"

This will format and parse dates in ISO order.

A date type that implements Value is straightforward - the only constraint on Value is that the methods must be defined on a pointer value, so that the Set method can modify the value. We can extend the time.Time struct by embedding it in our Date struct:

 type Date struct {
    *time.Time
 }

Note that this field has no name, just a type! Writing it like this means that all methods and fields of *time.Time are available to *Date.

 func (this *Date) String() string {
    return this.Format(isodate)
 }

To explicitly access the implicit Time field, use its type name:

 func (this *Date) Set(sval string) bool {
    t, e := time.Parse(isodate, sval)
    this.Time = t
    return e == nil
 }

This technique is discussed in Effective Go. Extending a type in this way is not inheritance in the classic sense. If I defined the method Format on Date then this.Format(isodate) would call the local method, and the embedded type's Format method would have to be explicitly called as this.Time.Format(isodate); the new method does not override the original method.
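The shadowing behaviour can be shown in miniature (the types here are invented, purely for illustration):

```go
package main

import "fmt"

type Base struct{}

func (Base) Format() string { return "base" }

type Derived struct {
	Base // embedded: Derived borrows Base's methods
}

// A local Format shadows the embedded one rather than overriding it;
// the embedded method stays reachable through the type name.
func (Derived) Format() string { return "derived" }

func main() {
	var d Derived
	fmt.Println(d.Format(), d.Base.Format()) // prints: derived base
}
```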

So given data that looks like this:

 Joined,First Name,"Second Name",Age

the struct will be

 type Person struct {
    Joined *Date
    FirstName string `field:"First Name"`
    Second_Name string
    Age int
 }

and reading code will look like this:

 p := new(Person)
 p.Joined = new(Date)
 rs,_ := NewReadIter(r,p)
 for rs.Get() {
    fmt.Println(p, p.Joined.Year)
 }

Again, there is no dependency introduced on the flag package. But we get bonus functionality if we do use flag, since command-line parameters can now be defined using the same Date type.

 var date Date = Date{time.LocalTime()}
 flag.Var(&date, "d", "date to filter records")

This is all straightforward when the underlying value is already a pointer. But sometimes you need to extend a primitive type to customize how it is parsed and presented. An example is the abomination known as the 'thousands separator', where a numerical field may look like "22,433".

 type Float float64

 func (this *Float) String() string {
    return strconv.Ftoa64(float64(*this),'f',10)
 }

 func (this *Float) Set(sval string) bool {
    sval = strings.Replace(sval,",","",-1)
    v, e := strconv.Atof64(sval)
    *this = Float(v)
    return e == nil
 }

(I'll leave the job of printing out the values in the abominated form as an exercise)

The problem is that these values are not pointers. The reflection trick is to do the type assertion val.Addr().Interface().(Value) on each value and if it succeeds, then we know that the pointer to the value (Addr()) satisfies the Value interface. Just checking the kind of the type is not sufficient, because this Float type still has kind reflect.Float64.

Issues and Possibilities

Using reflection is always going to be slower than direct manipulation but in practice this does not make that much difference. (It takes less than a second to process a data set with 23000 records with 20 columns on this modest machine and doing direct conversion is not significantly faster.)

Keeping up with a rapidly evolving target like Go is always a challenge, and as I was writing this article a new weekly came out that changed the meaning of time.Time - in short, it's not a pointer to a structure anymore (so lose the stars, basically.)

An orthogonal feature that I would like is the ability to read and convert columns of data from CSV sources. It's common enough to be useful, and not difficult to do, but would not need any reflection techniques.

An obvious extension is writing structures out in CSV format. This should be straightforward and there are lots of opportunities for code reuse, but this code is intended primarily as an example. Without doc comments, it's only about 190 lines of code, which is within the magic number limit defining 'example'.

If it's useful to people and they need these features, then I'll promote csvdata to a proper goinstallable package.

For now, the package and some examples are here.

Thursday, 15 September 2011

Strict Tables in Lua

Spelling Mistakes are Runtime Errors

Undeclared variables in dynamic languages are a well-known pain point. The compiler has to make an assumption about the scope of any undeclared variable; in Python, such variables are assumed to be local, and in Lua they are assumed to be global. (This is not entirely correct because of function environments, but will do for a first pass.) Furthermore, Lua does not distinguish between 'undefined' and 'null'; the value of such variables will always be nil, which is of course a perfectly valid value for a variable.

The consequence is that misspelt variables will evaluate to nil; if you are lucky the code will crash nearby with an error, but the nil can be passed along until it's not clear who to blame. So serious Lua programmers are strict about detecting undefined globals.

There is a dynamic approach provided by etc/strict.lua, which is part of the standard Lua distribution. Globals are looked up in the global table _G, which is otherwise a normal Lua table. You can give it a metatable and a __index metamethod, which is only called if the variable was not already present in the table. By keeping a table of known declared globals, this can throw an error on any undeclared variable access. This is definitely a case where you need to fail as hard and as early as possible!

(Another approach is static analysis, for instance David Manura's LuaInspect, which has the advantage of making misspellings into compile-time errors. It has experimental integration into the SciTE editor.)

The pattern I will describe brings stricter run-time checking to other Lua tables using the same principle. But first let's examine some practices that are common in Lua code. Named arguments to functions are not supported, but it's easy to use tables to get the same result:

 function named(t)
   local name = t.name or 'anonymous'
   local os = t.os or '<unknown>'
   local email = t.email or name..'@'..os
 end

 named {name = 'bill', os = 'windows'}

Although this is not the most efficient way to pass arguments to a function, it serves the purpose well, particularly with Lua's function-call syntax sugar that lets us leave out the parentheses in this case.

This idiomatic use of anonymous structures is common in Lua code. Need to keep a few things together in one place? Make a table! The pattern also appears in JavaScript which has the same identity between t['key'] and t.key in dictionaries. And LuaJIT can optimize field accesses in inner loops to C-like speeds.
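That identity between the two access forms is easy to check directly:

```lua
-- t['key'] and t.key name the same slot; the dot form is sugar for
-- indexing with a constant string key.
local t = {}
t.key = 'value'
assert(t['key'] == t.key)
t['key with spaces'] = 1   -- bracket form is needed for non-identifier keys
```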

Such ad-hoc tables are always open to extension, so people add extra fields as the need occurs. At this point strong-typing people are starting to hyperventilate: how can such code be ever reasoned about correctly? The answer is, fine - up to a certain point of program complexity. Thereafter you need to get disciplined.

Lua is not inherently a weakly-typed language, but there is a tendency to use it that way. One problem is how to document this pattern: for instance, how do we document the function named above? If functions take specific named types for their arguments, then we can refer to the definition of those types in their documentation. Type identity is also very useful for debugging and dynamic type-checking.

Give a Dog a Name ....

I'll show how the idea of 'strict tables' helps with this structuring problem and also catches another class of misspellings accurately. First, a strict struct has a known set of fields with default values (which also establish the types of the fields):

 Strict.Alice {
     x = 1;
     y = 2;
     name = '';
 }
This statement causes a constructor Alice to be introduced into the current environment:

 a = Alice {x = 10, y = 20, name = "alice"}

Any of these fields may be absent in the constructor call, and in particular Alice() will give you a new Alice-struct with the initial default values. However, setting an unknown field or setting a known field to the wrong type is a run-time error:

 b = Alice {x = 10, Y = 20, name = "bob"} -- > error: field 'Y' is not in Alice
 c = Alice {x = 10, y = 20, name = 44} -- > error: field 'name' must be of type string

What does 'wrong type' mean? Here the convention is that we compare the type names, but if the types are tables or userdata we compare their metatables.
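That convention can be sketched as a small helper (the name `same_type` is mine, not part of the library):

```lua
-- Compare a value against a field's default: primitive values must match
-- by type name; tables and userdata must share a metatable.
local function same_type(value, default)
    local tv, td = type(value), type(default)
    if tv ~= td then return false end
    if tv == 'table' or tv == 'userdata' then
        return getmetatable(value) == getmetatable(default)
    end
    return true
end
```

Comparing metatables is what gives each strict type its identity: two tables only 'have the same type' if they were made by the same constructor.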

Trying to access an unknown field of an Alice-object will likewise result in an error:

 print(a.Y) -- > error: field 'Y' is not in Alice
 a.Y = 2  -- > error: field 'Y' is not in Alice

The second failed assignment shows that Alice-objects are closed, which is a further restriction. (In the case of the original strict.lua, the global table is always open to new variables.) A little less freedom often relieves us of later anxiety, and here allows for checking field spelling mistakes immediately when they happen.

We basically incur no extra run-time overhead for this checking, because __index (and its cousin __newindex) only fires if normal lookup fails. (In Lua 4, such functions were called fallbacks, which remains a fine bit of descriptive terminology.)
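A bare-bones illustration of such a closed table; the `close` helper here is my own sketch, not the article's implementation:

```lua
-- Fields already present in the object are found by plain lookup; the
-- metamethods only fire for unknown keys, so normal access pays no penalty.
local function close(obj, typename)
    return setmetatable(obj, {
        __index = function(_, key)
            error("field '"..key.."' is not in "..typename, 2)
        end,
        __newindex = function(_, key)
            error("field '"..key.."' is not in "..typename, 2)
        end,
    })
end

local a = close({x = 10, y = 20, name = 'alice'}, 'Alice')
```

Reading or writing a.x goes straight to the table; only a misspelling like a.Y reaches the metamethods and fails loudly.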

By default, these objects know how to convert themselves to strings sensibly:

 print(a) --> Alice {y=20,x=10,name=alice}

They directly solve the documentation problem by being themselves documentable:

 --- Check alices.
 -- @param anAlice an @{Alice} object
 -- @see Alice
 function check_alice(anAlice)
 end

 --- Alice.
 -- @field x
 -- @field y
 -- @field name
 -- @table Alice

The benefits of named, unique types are well known:

  • error messages are explicit (we know what kind of dog caused the problem)
  • debugging is easier, since these dogs know their name and their kind
  • type identity makes type checking possible (is this dog an Alsatian?)

This will be self-evident to any Java (or Python) programmer; why isn't this pattern more widely used in Lua? First, anonymous types are very good at creating recursive structures akin to S-expressions - which will be self-evident to any Lisp programmer. Second, the idea of extended types is not built into the language, resulting in multiple ways of declaring dogs and interoperability issues. And finally, because it can be so convenient to play fast and loose with your data. For instance, you want to deliver something to another part of the system and there's no 'official' way to do it. But you know that there's an object which will end up there later, so you attach some data to the object and make a special case at the receiver. It's a bit like the informal arrangements people make, like 'when you're in Singapore, can you pick up that new toy for me?', rather than using a courier. (I have enough disciplined instincts to know that this is not the way to produce software that's clear to others, but it's a convenient way to get what you need.)

Although not strictly necessary, it was hard to resist adding a few features. Consider this:

 Strict.Boo {
     name = 'unknown';
     age = 0;
     __tostring = function(self) -- can override __tostring
         return '#'..self.name
     end;
     __eq = true;   -- default member-equals comparison
     __lt = 'age';  -- use this field for less-than comparison
 }

The default string conversion may not be desired, so you may override it using the __tostring metamethod. Note in fact that the 'specification' table becomes the metatable for all named objects, so you can specify any of the available metamethods. (In Lua 5.2 you can even specify a finalizer using __gc.)

As a convenience, setting __eq to true means that you want a plain naive field-by-field comparison (otherwise objects are all unique table values.) And __lt can be given the name of the field for < comparisons.
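Here is one way those flags could be translated into real metamethods. This is a sketch of the idea only; `add_comparisons` is my own name and this is not the article's actual code:

```lua
-- Turn the __eq/__lt conventions into actual metamethods on a spec table.
local function add_comparisons(spec, fields)
    if spec.__eq == true then
        spec.__eq = function(a, b)   -- naive field-by-field equality
            for f in pairs(fields) do
                if a[f] ~= b[f] then return false end
            end
            return true
        end
    end
    if type(spec.__lt) == 'string' then
        local f = spec.__lt          -- compare on this one named field
        spec.__lt = function(a, b) return a[f] < b[f] end
    end
end
```

Since the spec table doubles as the shared metatable, installing __eq and __lt there makes == and < work on every object of the type.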

Defining the < operator means that you can directly sort lists of these objects:

 ls = {Boo{age=13,name='F13'},Boo{age=5,name='F5'},Boo{age=20,name='F20'}}
 table.sort(ls)
 for _,b in ipairs(ls) do print(b) end
 --> output: the Boos, in ascending age order (F5, F13, F20)

A Question of Notation

The notation for defining new strict tables I've chosen needs some elaboration. Without any sugar, it would look like this:

 Strict("Person",{ name = '', age = 0 })

We get closer to the desired notation by currying (that is, expressing a function of several arguments as a chain of single-argument functions):

 function Strict(name)
     return function(spec)
         -- ... construct and return the strict type from 'spec' ...
     end
 end

Which would allow

 Strict "Person" {name = '', age = 0}

To get Strict.Person {...} is easy: Strict is a table which is given an __index metamethod; accessing any new name causes that metamethod to be called, and it then returns the actual constructor function as above:

 Strict = setmetatable({},{
     __index = function(self,name)
         return function(spec)
             -- ... construct and return the strict type from 'spec' ...
         end
     end
 })

Further Elaborations

There are other approaches to the problem of checking the structure/type of Lua tables, of course: have a look at how LuaRocks does a sanity check on the rockspec format. This is akin to the idea of defining a 'schema' for XML, although a good deal more readable.

The implementation discussed here focuses on creating a closed, named type with declared fields and their types. It's currently just over a hundred lines, which is my self-imposed limit for these article demonstrations, but there are a number of cool features that could be added. Any library has a real implementation and an imaginary part, which is the space of possibilities; it would be boring to neglect the latter.

If we wanted to enforce type-checking on assignment to these fields (such as catching a.name = 42) then we need an extra indirection using the proxy-table pattern. That is, the Alice-object itself would contain nothing except a table holding its values; then __index will always fire and can look up the key in this value table, after performing some sanity checks. In that case, performance would be affected (although if such lookups do not happen in inner loops, the penalty may well not be measurable anyway.)
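The proxy-table pattern can be sketched like this. The `proxy` helper is my own illustration, and it assumes fields never legitimately hold nil:

```lua
-- Proxy pattern: the object itself stays empty, so __index/__newindex
-- fire on *every* access and can check against the hidden value table.
local function proxy(values, typename)
    return setmetatable({}, {
        __index = function(_, key)
            local v = values[key]
            if v == nil then
                error("field '"..key.."' is not in "..typename, 2)
            end
            return v
        end,
        __newindex = function(_, key, value)
            local old = values[key]
            if old == nil then
                error("field '"..key.."' is not in "..typename, 2)
            elseif type(value) ~= type(old) then
                error("field '"..key.."' must be of type "..type(old), 2)
            end
            values[key] = value
        end,
    })
end

local a = proxy({x = 1, y = 2, name = ''}, 'Alice')
```

The cost is an extra table per object and a metamethod call on every access, which is the trade-off mentioned above.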

It could be useful to optionally allow for field ordering:

 Strict.OrderedPerson {
     "Represents a Person for our catalog";
     {name = ''; "full name of person"},
     {age = 0; "age in years"}
 }

which would allow the constructor syntax OrderedPerson("Dilbert",38). The doc-strings would not be obligatory but can be used (for instance) to automatically generate a basic entry dialog for the data.

It would also be possible to add methods to these strict tables, but they are really meant to represent classic dumb data; for more conventional OOP in Lua there are many options. (As the great Grace Hopper said, the wonderful thing about standards is that there are so many of them to choose from; my personal favorite, which I use in Penlight, is here.)

An extended concept of 'type' is problematic precisely because there are so many object-oriented schemes possible. For instance, there are a number of incompatible inheritance schemes that make asking the question 'is A a subtype of B?' particularly tricky.

But we could extend the type specifications used by strict tables in useful ways. Currently the type of a field is the type of its default; what if the type was a Lua string pattern? So '^%S+$' would mean 'a string with no spaces'. Or if {0} meant 'an array of numbers'?

 Strict.Data {
     name = '^%S+$';  -- or give the pattern a name, e.g. Strict.no_spaces
     factor = Strict.range(0,1);
     children = {Child()}
 }

(This would further enable our hypothetical data-to-dialog generator to give more specific validation of user input.)

The idea of 'documentability' can be taken further; if I had a function that received an object that had a read method, then the obvious equivalent to strict tables would be interfaces:

 Interface.Readable {
     read = true
 }

My function could then say Interface.check(1,rdr,Readable) to give a meaningful assertion on the first argument rdr. Even if I didn't use these assertions, simply making the argument more documentable could be a sufficient reason to have named interfaces. The contract can be specified and optionally enforced; it cannot be specified in as much detail as in a statically-typed language, but it's still useful.
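A minimal sketch of what such interface checks might look like; the Interface table here is entirely my own illustration, not the article's code:

```lua
-- Interfaces as sets of required member names; check verifies an argument.
local Interface = {}

function Interface.new(name, spec)
    spec.__name = name
    return spec
end

-- Assert that 'obj' (argument number 'argn') has every field the
-- interface requires.
function Interface.check(argn, obj, intf)
    for field in pairs(intf) do
        if field ~= '__name' and obj[field] == nil then
            error(("argument %d does not implement %s: missing '%s'")
                  :format(argn, intf.__name, field), 3)
        end
    end
end

local Readable = Interface.new('Readable', {read = true})
```

This is duck typing made explicit: the check only asks whether the required members exist, not where the object came from.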

Finally, not everyone is as lax as I am about creating new globals. Strict returns the new type, but also creates it in the current environment. A more careful person would probably want to remove the line that creates this global. Maybe Stricter?

This article was based on the original lua-users wiki entry; here is the code.

Thursday, 8 September 2011

Comparing Go with Lua

The Go language, developed over the last few years at Google by Rob Pike and Ken Thompson, has gathered a lot of interest, particularly for its built-in concurrency support. Anything released by these gentlemen (who did the first Unix and later Plan 9, which introduced UTF-8 to the world) is naturally going to cause a stir, since it's a little like a new album by Led Zeppelin. And unlike the reappearance of most rock dinosaurs, Go does not disappoint.

Most blogs naturally focus on the concurrency support, but here I'd like to point out some striking similarities with Lua. It may be convergent evolution, much as sharks and dolphins look similar to non-zoologists, or it may be a case of unconscious influence.

Both languages have anonymous functions, and these are proper closures, which is a fantastic thing to have. In Lua, closures reduce the need for classical classes enormously. This Go code works like the equivalent in Lua or JavaScript - the function will keep its own reference to name when it is later called.

 var name = "dolly"
 DoSomethingLater(func() {
     fmt.Println("hello " + name) // still sees name when called
 })

But that is just common sense if your functions are first-class values, and applies to Scheme and JavaScript as well (Scheme is an acknowledged influence on the design of both Lua and JavaScript). Go also has coroutines, although 'goroutines' are more than the single-OS-thread cooperative-scheduling model of Lua, being scheduled onto OS threads as needed and having a language mechanism called channels for communicating between them. This is an area of weakness for Lua, where multi-threading can be an exercise in masochism.

But the big similarity is this: functions can return more than one value. This is quite distinct from returning tuples, as Python does, and is very much more efficient (Bruce Eckel errs in speaking of tuples in Go). So the following style is common: a function returning a value plus an optional error:

 var f, err = os.Open(filename)
 if err != nil {
     return "", err
 }

Although statically-typed, Go uses static type inference to make life easier for developers so we don't have to remember the exact types (*File,os.Error) of these values.

The whole exception-handling movement was a reaction against the sorry state of error handling. A C function like fopen will return NULL on error, but then you have to look up the actual error in errno which often was implemented as a global variable. Not cool in any way! This was replaced by a mechanism in which a function could return anything it liked, but also return an error non-locally. The vision was that you could then write code in an optimistic way, the so-called 'happy path' and put it all in a try block and catch the exceptional cases.

Things fail all the time; not being able to open a file is not really exceptional. Handling multiple return values makes it possible to go back to an older way of dealing with errors directly, but this time doing it properly with the means to return both results and errors cleanly.

The equivalent Lua idiom is similar, practically identical if you regard curly-braces vs keywords as superficial syntax differences:

 local f, err = io.open(filename)
 if err ~= nil then
     return "", err
 end

Both Lua and Go generalize this to multiple assignment; for instance, this statement swaps two values:

 x,y = y,x

They both allow multiple return values to be consumed by another function which takes the same values as arguments:

 func sum (x, y int) int {
     return x + y
 }

 func multiple () (first, second int) {
     return 10, 20
 }

 s := sum(multiple())  // == 30

Lua goes a little further and will add the multiple returned values to the end of the argument list of a function; that is, this will also work:

 function multiple() return 10, 20 end
 print('hello', multiple())
 --> hello  10  20

Go and Lua provide mechanisms for protected calls which are similar in spirit. Something bad will happen in a function, and the idea is to prevent that badness from bringing down the process.

For Go, the panic() function throws an error, which can be of any type; the recover() function allows you to stop that error from propagating. For instance, an example from the language specification: this function is passed a function of no arguments and guarantees that any panics happening in the function will be caught:

 func protect(g func()) {
     defer func() {
         if x := recover(); x != nil {
             log.Printf("run time panic: %v", x)
         }
     }()
     g()
 }

 protect(func() {
     panic("something bad happened")
 })

The defer mechanism is very cool (and Lua could benefit from something similar); it is used to ensure that an expression is always executed, no matter how we leave the function (a common use is to close a file safely, e.g. defer f.Close()). g() executes and panics, which calls all the deferred functions until we get to the one defined in protect, which uses recover to get the original value passed to panic.

The equivalent in Lua is error() and pcall(), which stands for 'protected call':

 local ok, err = pcall(function()
     error("more bad things!")
 end)
 -- ok will be false, err will be the error string

The strength of the Go approach is that there is an explicit mechanism for calling arbitrary code when a function stack frame goes out of scope. With Lua (as with most garbage-collected languages) the best you can do is define finalizers on objects and wait for the garbage collector to call them (eventually.)

To generalize protect above, you need to use reflection to call the function; see this implementation. With that, apart from arbitrary protected calls like Lua's pcall, it's possible to construct an equivalent of try/catch in Go:

 P := new(Protect)
 s := make([]int,3)
 if ! P.Try(func() { // try execute some code
     s[5] = 2 // this will panic: index out of range
     println("never get here")
 }) { // and catch...
     println("error:",P.Error) // runtime error: index out of range
 }

Continuing with the list of resemblances: both languages use _ in similar ways. Compare:

 func sumv(values []float64) float64 {
     sum := 0.0
     for _, v := range values {
         sum += v
     }
     return sum
 }


 function sumv(values)
     local sum = 0.0
     for _,v in ipairs(values) do
         sum = sum + v
     end
     return sum
 end

The difference is that with Go you have to use an underscore if you want to ignore the value (it is very strict about unused variables) whereas in Lua it is merely a convention. Note also the similar for-iterator syntax which naturally flows out of multiple return values.

Compare the way that the languages attach methods to a type:

 type Map map[string]int

 func (self Map) set(key string, value int) {
     self[key] = value
 }

 var mm = make(Map)

Compare this to:

 Map = {}
 Map.__index = Map

 function Map:set(key,value)
     self[key] = value
 end

 local mm = setmetatable({},Map)

There's a little more ceremony with the Lua example, which does not directly support 'classes', but the style is similar. Types in Go and Lua are always 'open' in this way, although obviously this freedom has to be used in a disciplined fashion.

Update: Thanks to Gé Weijers for pointing out the influential CLU programming language, developed at MIT in the seventies by Barbara Liskov and her students. Lua's multiple returns came from CLU, according to The Evolution of Lua. It is interesting how long it can take for new programming language concepts to reach the mainstream.

I've taken out an over-generalized paragraph about languages and the restrictions they impose: Ruben Thomas' comment seems fair but hinges on the definition of 'restriction' which is an important topic, but not particularly relevant to this comparison.

Finally, Rob Pike has answered the question: it's convergent evolution. (He likes the classic rock revival reference.)

(The updated script for generating pretty HTML for Blogger now has support for multiple languages in the same document.)