Thursday 15 September 2011

Strict Tables in Lua

Spelling Mistakes are Runtime Errors

Undeclared variables in dynamic languages are a well-known pain point. The compiler has to make an assumption about the scope of any undeclared variable; in Python, such variables are assumed to be local, and in Lua they are assumed to be global. (This is not entirely correct because of function environments, but will do for a first pass.) Furthermore, Lua does not distinguish between 'undefined' and 'null'; the value of such variables will always be nil, which is of course a perfectly valid value for a variable.

The consequence is that misspelt variables will evaluate to nil; if you are lucky the code will crash nearby with an error, but the nil can be passed along until it's not clear who to blame. So serious Lua programmers are strict about detecting undefined globals.

There is a dynamic approach provided by etc/strict.lua, which is part of the standard Lua distribution. Globals are looked up in the global table _G, which is otherwise a normal Lua table. You can give it a metatable and a __index metamethod, which is only called if the variable was not already present in the table. By keeping a table of known declared globals, this can throw an error on any undeclared variable access. This is definitely a case where you need to fail hard and early as possible!

(Another approach is static analysis, for instance David Manura's LuaInspect which has the advantage of making mispellings into compile-time errors. It has experimental integration into the SciTE editor.)

The pattern I will describe brings stricter run-time checking to other Lua tables using the same principle. But first let's examine some practices that are common in Lua code. Named arguments to functions are not supported, but it's easy to use tables to get the same result:

 function named(t)
   local name = t.name or 'anonymous'
   local os = t.os or '<unknown>'
   local email = t.email or t.name..'@'..t.os
   ...
 end
 named {name = 'bill', os = 'windows'}

Although this is not the most efficient way to pass arguments to a function, it serves the purpose well, particularly with Lua's function-call syntax sugar that lets us leave out the parentheses in this case.

This idiomatic use of anonymous structures is common in Lua code. Need to keep a few things together in one place? Make a table! The pattern also appears in JavaScript which has the same identity between t['key'] and t.key in dictionaries. And LuaJIT can optimize field accesses in inner loops to C-like speeds.

Such ad-hoc tables are always open to extension, so people add extra fields as the need occurs. At this point strong-typing people are starting to hyperventilate: how can such code be ever reasoned about correctly? The answer is, fine - up to a certain point of program complexity. Thereafter you need to get disciplined.

Lua is not inherently a weakly-typed language, but there is a tendency to use it like so. One problem is how to document this pattern, for instance, how to document the function named above? If functions take specific named types for their arguments, then we can refer to the definition of those types in their documentation. Identity is also very useful for debugging and dynamic type-checking.

Give a Dog a Name ....

I'll show how the idea of 'strict tables' helps this structuring problem and also how to catch another class of misspellings accurately. First, a strict strict has a known set of fields with default values (which also establish the type of the fields):

 Strict.Alice {
     x = 1;
     y = 2;
     name = '';
 }

This statement causes a constructor Alice to be introduced into the current environment:

 a = Alice {x = 10, y = 20, name = "alice"}

Any of these fields may be absent in the constructor call, and in particular Alice() will give you a new Alice-struct with the initial default values. However, setting an unknown field or setting a known field to the wrong type is a run-time error:

 b = Alice {x = 10, Y = 20, name = "bob"} -- > error: field 'Y' is not in Alice
 c = Alice {x = 10, y = 20, name = 44} -- > error: field 'name' must be of type string

What does 'wrong type' mean? Here the convention is that we compare the type names, but if the types are tables or userdata we compare their metatables.

Trying to access an unknown field of an Alice-object will likewise result in an error:

 print(a.Y) -- > error: field 'Y' is not in Alice
 a.Y = 2  -- > error: field 'Y' is not in Alice

The second failed assignment shows that Alice-objects are closed, which is a further restriction. (In the case of the original strict.lua, the global table is always open to new variables.) A little less freedom often relieves us of later anxiety, and here allows for checking field spelling mistakes immediately when they happen.

We basically incur no extra run-time overhead for this checking. This is because __index (and its cousin '__newindex`) only fires if lookup fails. (In Lua 4, such functions were called fallbacks, which remains a fine bit of descriptive terminology.)

By default, these objects know how to convert themselves to strings sensibly:

 print(a) --> Alice {y=20,x=10,name=alice}

They directly solve the documentation problem by being themselves documentable:

 -------
 -- Check alices.
 -- @param anAlice an @{Alice} object
 -- @see Alice
 function check_alice(anAlice)
 ...
 end
 -----
 -- Alice.
 -- @field x
 -- @field y
 -- @field name
 -- @table Alice

The benefits of named, unique types are well known:

  • error messages are explicit (we know what kind of dog caused the problem.)
  • debugging is easier, since these dogs know their name and their kind
  • type identity makes type checking possible (Is this dog an Alsatian?)

This will be self-evident to any Java (or Python) programmer; why isn't this pattern more widely used in Lua? First, anonymous types are very good at creating recursive structures akin to S-expressions - which will be self-evident to any Lisp programmer. Second, the idea of extended type is not built-in to the language, resulting in multiple ways of declaring dogs and interoperability issues. And finally, because it can be so convenient to play fast and loose with your data. For instance, you want to deliver something to another part of the system and there's no 'official' way to do it. But you know that there's an object which will end up there later, so you attach some data to the object and make a special case at the receiver. It's a bit like the informal arrangements people make like 'when you're in Singapore can you pick up that new toy for me?' rather than using a courier. (I have enough disciplined instincts to know that this is not the way to produce software that's clear to others, but it's a convenient way to get what you need.)

Although not strictly necessary, it was hard to resist adding a few features. Consider this:

 Strict.Boo {
     name = 'unknown';
     age = 0;
     __tostring = function(self) -- can override __tostring
         return '#'..self.name
     end;
     __eq = true;   --- default member-equals comparison
     __lt = 'age';  --- use this field for less-than comparison
 }

The default string conversion may not be desired, so you may override it using the __tostring metamethod. Note in fact that the 'specification' table becomes the metatable for all named objects, so you can specify any of the available metamethods. (In Lua 5.2 you can even specify a finalizer using __gc.)

As a convenience, setting __eq to true means that you want a plain naive field-by-field comparison (otherwise objects are all unique table values.) And __lt can be given the name of the field for < comparisons.

Defining the < operator means that you can directly sort lists of these objects:

 ls = {Boo{age=13,name='F13'},Boo{age=5,name='F5'},Boo{age=20,name='F20'}}
 table.sort(ls)
 for _,b in ipairs(ls) do print(b) end
 --- > output:
 #F5
 #F13
 #F20

A Question of Notation

The notation for defining new strict tables I've chosen needs some elaboration. Without any sugar, it would look like this:

 Strict("Person",{ name = '', age = 0 })

We get closer to the desired notation by currying (that is, expressing a function of several arguments as a chain of single-argument functions):

 function Strict(name)
     return function(spec)
       ...
     end
 end

Which would allow

 Strict "Person" {name = '', age = 0}

To get Strict.Person {...} is easy; Strict is a table which is given a __index metamethod; accessing any new name causes it to be called, and it will then return the actual function as above:

 Strict = setmetatable({},{
     __index = function(self,name)
         return function(spec)
         ...
         end
     end
 })

Further Elaborations

There are other approaches to the problem of checking the structure/type of Lua tables, of course: have a look at how LuaRocks does a sanity check on the rockspec format. This is akin to the idea of defining a 'schema' for XML, although a good deal more readable.

The implementation discussed here focuses on creating a closed, named type with declared fields and their types. It's currently just over a hundred lines, which is my self-imposed limit with these article demonstrations, but there's a number of cool features which could be added. Any library has a real implementation and an imaginary part, which is the space of possibilities; it would be boring to neglect the latter.

If we wanted to enforce type-checking on these fields (such as a.name = 42) then we need an extra indirection using the proxy-table pattern. That is, the Alice-object itself would not contain anything except a table containing its values; then __index will always fire and would look up the key in this value table, after performing some sanity checks. In that case, performance would be affected (although if such lookups did not happen in inner loops then the penalty may well be not measurable anyway.)

It could be useful to optionally allow for field ordering:

 Strict.OrderedPerson {
     "Represents a Person for our catalog";
     {name = ''; "full name of person"},
     {age = 0; "age in years"}
 }

which would allow the constructor syntax OrderedPerson("Dilbert",38). The doc-strings would not be obligatory but can be used (for instance) to automatically generate a basic entry dialog for the data.

It would be possible to also add methods to these strict tables but they are really meant to represent classic dumb data; for more conventional OOP in Lua there are many options (As the great Grace Hopper said, the wonderful thing about standards is that there's so many of them to choose from; my personal favorite which I use in Penlight is here).

An extended concept of 'type' is problematic precisely because there are so many object-oriented schemes possible. For instance, there are a number of imcompatible inheritance schemes that make asking the question 'is A a subtype of B?' particularly tricky.

But we could extend the type specifications used by strict tables in useful ways. Currently the type of a field is the type of its default; what if the type was a Lua string pattern? So '^%S+$' would mean 'a string with no spaces'. Or if {0} meant 'an array of numbers'?

 Strict.Data {
     name = '^%S+$';  --- or give the pattern a name, e.g. Strict.no_spaces
     factor = Strict.range(0,1);
     children = {Child()}
 }

(This would further enable our hypothetical data-to-dialog generator to give more specific validation of user input.)

The idea of 'documentability' can be taken further; if I had a function that received an object that had a read method, then the obvious equivalent to strict tables would be interfaces:

 Interface.Readable {
     read = true
 }

My function could then say Interface.check(1,rdr,Readable) to give a meaningful assertion on the first argument rdr. Even if I didn't use these assertions, simply making the argument more documentable could be a sufficient reason to have named interfaces. The contract can be specified and optionally enforced; it cannot be specified in such great detail as in statically-typed language, but it's still useful.

Finally, not everyone is as lax as I am about creating new globals. Strict returns the new type, but also creates it in the current environment. A more careful person would probably want to remove the line that creates this global. Maybe Stricter ?

This article was based on the original lua-users wiki entry; here is the code.

Thursday 8 September 2011

Comparing Go with Lua

The Go language developed in the last few years by Rob Pike and Ken Thompson from Google has gathered a lot of interest, particularly for its built-in concurrency support. Anything released by these gentlemen (who did the first Unix and later Plan9, which introduced UTF-8 to the world) is naturally going to cause a stir, since it's a little like a new album by Led Zeppelin. And unlike the reappearance of most rock dinosaurs, Go does not disappoint.

Most blogs naturally focus on the concurrency support, but here I'd like to point out some striking similarities with Lua. It may be convergent evolution, much as sharks and dolphins look similar to non-zoologists, or it may be a case of unconscious influence.

Both languages have anonymous functions, and these are proper closures, which is a fantastic thing to have. In Lua, closures reduce the need for classical classes enormously. This Go code works like the equivalent in Lua or JavaScript - the function will keep its own reference to name when it is later called.

 var name = "dolly"
 DoSomethingLater(func() {
     fmt.Println("hello",name)
 })

But that is just common sense if your functions are first-class values, and applies to Scheme and Javascript as well (Scheme is an acknowledged influence on the design of both Lua and JavaScript). Go also has coroutines, although 'goroutines' are more than the single-OS-thread cooperative-scheduling model of Lua, being scheduled as OS threads whenever needed and having a language mechanism called channels for communicating between them. This is an area of weakness for Lua, where multi-threading can be an exercise in masochism.

But the big similarity is this: functions can return more than one value. This is quite distinct from returning tuples, as Python does and is very much more efficient. (Bruce Eckel errs in speaking of tuples in Go) So the following style is common: a function returning a value plus an optional error:

 var f, err = os.Open(filename)
 if err != nil {
     return "", err
 }

Although statically-typed, Go uses static type inference to make life easier for developers so we don't have to remember the exact types (*File,os.Error) of these values.

The whole exception-handling movement was a reaction against the sorry state of error handling. A C function like fopen will return NULL on error, but then you have to look up the actual error in errno which often was implemented as a global variable. Not cool in any way! This was replaced by a mechanism in which a function could return anything it liked, but also return an error non-locally. The vision was that you could then write code in an optimistic way, the so-called 'happy path' and put it all in a try block and catch the exceptional cases.

Things fail all the time; not being able to open a file is not really exceptional. Handling multiple return values makes it possible to go back to an older way of dealing with errors directly, but this time doing it properly with the means to return both results and errors cleanly.

The equivalent Lua idiom is similar, practically identical if you regard curly-braces vs keywords as superficial syntax differences:

 local f, err = io.open(filename)
 if err ~= nil then
     return "",err
 end

Both Lua and Go generalize this to statements like this, which swaps two values:

 x,y = y,x

They both allow multiple return values to be consumed by another function which takes the same values as arguments:

 func sum (x, y int) int {
     return x + y
 }
 func multiple () (first, second int) {
     return 10, 20
 }
 ....
 s = sum(multiple())  // == 30

Lua goes a little further and will add the multiple returned values to the end of the argument list of a function; that is, this will also work:

 print('hello',multiple())
 -->
 hello  10  20

Go and Lua provide mechanisms for protected calls which are similar in spirit. Something bad will happen in a function, and the idea is to prevent that badness from bringing down the process.

For go, the panic() function throws an error, which can be of any type; the recover() function allows you to stop that error from propgating. For instance, an example from the language specification: this function is passed a function of no arguments and guarantees that any panics happening in the function will be caught:

 func protect(g func()) {
     defer func() {
         if x := recover(); x != nil {
             log.Printf("run time panic: %v", x)
         }
     }()
     g()
 }
 protect(func() {
     panic("something bad happened")
 })

The defer mechanism is very cool (and Lua could benefit from something similar); it is used to ensure that an expression is always executed, no matter how we leave the function (a common use is to close a file safely, e.g. defer f.Close()). g() executes and panics, which calls all the deferred functions until we get to the one defined in protect, which uses recover to get the original value passed to panic.

The equivalent in Lua is error() and pcall(), which stands for 'protected call':

 local ok, err = pcall(function()
     error("more bad things!")
 end)
 -- ok will be false, err will be the string

The strength of the Go approach is that there is an explicit mechanism for calling arbitrary code when a function stack frame goes out of scope. With Lua (as with most garbage-collected languages) the best you can do is define finalizers on objects and wait for the garbage collector to call them (eventually.)

To generalize protect above, you need to use reflection to call the function; see this implementation. With that, apart from arbitrary calls like pcall, it's possible to construct an equivalent to try/catch in Go:

 P := new(Protect)
 s := make([]int,3)
 if ! P.Try(func() { // try execute some code
   println(s[5])
   println("never get here")
 }) { // and catch...
   println("error:",P.Error) // runtime error: index out of range
 }

Continuing with the list of resemblances: both languages use _ in similar ways. Compare:

 func sumv(values []float64) float64 {
     sum := 0.0
     for _,v := range(values) {
         sum += v
     }
     return sum
 }

with:

 function sumv(values)
     local sum = 0.0
     for _,v in ipairs(values) do
         sum = sum + v
     end
     return sum
 end

The difference is that with Go you have to use an underscore if you want to ignore the value (it is very strict about unused variables) whereas in Lua it is merely a convention. Note also the similar for-iterator syntax which naturally flows out of multiple return values.

Compare the way that the languages attach methods to a type:

 type Map map[string]int
 func (self Map) set(key string, value int) {
     self[key] = value
 }
 ...
 var mm = make(Map)
 mm.set("hello",42)

Compare this to:

 Map = {}
 Map.__index = Map
 function Map:set(key,value)
     self[key] = value
 end
 ...
 local mm = setmetatable({},Map)
 mm:set("hello",42)

There's a little more ceremony with the Lua example, which does not directly support 'classes', but the style is similar. Types in Go and Lua are always 'open' in this way, although obviously this freedom has to be used in a disciplined fashion.

Update: Thanks for Gé Weijers for pointing out the influential CLU programming language, developed at MIT in the Seventies by Barbara Liskov and her students. Lua's multiple returns came from CLU according to The Evolution of Lua. It is interesting how long it can take for new programming language concepts to reach the mainstream.

I've taken out an over-generalized paragraph about languages and the restrictions they impose: Ruben Thomas' comment seems fair but hinges on the definition of 'restriction' which is an important topic, but not particularly relevant to this comparison.

Finally, Rob Pike has answered the question: it's convergent evolution. (He likes the classic rock revival reference.)

(The updated script for generating pretty HTML for Blogger now has support for multiple languages in the same document.)

Monday 5 September 2011

Htmlification with Lua

Most developers hate typing HTML; we know it well, but will go on elaborate detours to avoid actually writing it.

The normal way to generate HTML dynamically is to use some template engine. Like any modern language, Lua has a number of these, reflecting the fact that template syntax is a matter of taste; Cosmo is probably the most well thought-out one. However, there is another approach which is to use the flexible data syntax of Lua itself to generate HTML.

The word htmlify is used by the Orbit web framework to describe generating HTML using Lua code. This example is taken from the LuaNova introductory article:

 -- html1.lua
 require"orbit"
 function generate()
     return html {
         head{title "HTML Example"},
         body{
             h2{class="head","Here we go again!"}
         }
     }
 end
 orbit.htmlify(generate)
 print(generate())
 ==>
 C:\basic>lua html1.lua
 <html ><head ><title >HTML Example</title>></head>
 <body ><h2 class="head">Here we go again!</h2></body></html>

You will notice that this script has no declarations for tags like html, and the orbit module does not define them either: instead, the function environment of generate is modified with htmlify so that any unknown symbol is converted into a function which generates the text for an HTML element. Again, Lua's flexible table syntax is used to its best advantage; we don't need parentheses for functions that have a single string or table argument, and these tables can have both array-like and hash-like sections. Any key-value pair becomes an attribute of the element, and the array items become the contents of the element.

There are two obvious issues with this implementation; first, it will cheerfully generate HTML with misspelt tags, and second, the output is not very readable, and needs to go through a prettifier.

Another way of getting the same results is to treat (X)HTML as XML and use XML tools to generate and pretty-print it. The LuaExpat binding defines a standard called LOM for expressing XML as Lua data structures:

 <abc a1="A1" a2="A2">inside tag `abc'</abc>

is expressed as:

 {
     attr = {
         [1] = "a1",
         [2] = "a2",
         a1 = "A1",
         a2 = "A2",
     },
     tag = "abc",
     "inside tag `abc'"
 }

LuaExpat does not provide a pretty-printer, so LuaExpatUtils came into being, adapted from stanza.lua from the Prosody IM server by Mathew Wild and Waqas Hussain. That's the joy of Open Source; with a little care about licensing and giving everyone their due, there is rarely need to write library code ab initio.

LuaExpatUtils provides a single module `lxp.doc':

 local doc = require 'lxp.doc'
 local lom = require 'lxp.lom'
 local d = lom.parse '<abc a1="A1" a2="A2"><ef>hello</ef></abc>'
 print(doc.tostring(d,'','  '))
 --->
 <abc a1='A1' a2='A2'>
   <ef>hello</ef>
 </abc>

It also provides what we might call 'xmlification', directly inspired by Orbit:

 local children,child,toy,name,age = doc.tags 'children,child,toy,name,age'
 d2 = children {
     child {name 'alice', age '5', toy {type='fluffy'}},
     child {name 'bob', age '6', toy {type='squeaky'}}
 }

The key difference in usage is that there is no function environment magic going on; the doc.tags function gives us an explicit list of tag constructor functions. The big implementation difference is that d2 is a LOM document, not a string, and so we can flexibly convert into good-looking XML text later.

Here, I'm going to be less strict, and do the kind of global namespace modification which should always come with a mental health warning. It's just as easy to 'monkey-patch' Lua as it is Ruby, but the community feels it should not be used by libraries. Any global modification changes the environment that all code sees, and breaks the fragile set of expectations that users of the language share. And considering the maintainance nightmares that Ruby programmers routinely inflict on each other, this seems a sensible attitude.

But here I just want to write standalone Lua scripts that generate HTML without too much ceremony:

 doc = require 'lxp.doc'
 setmetatable(_G,{
   __index = function(t,name)
     _G[name] = doc.tags(name)
     return _G[name]
   end
 })

Everything is stored in tables in Lua, and _G is the value of the global table. We want to catch any undefined global symbols, and this is exactly what the __index metamethod does for us here. Any undefined name such as 'div' will be made into a tag constructor, which we store back in the global table so that we don't create a new instance each time (that is, once 'div' has been encountered, this metamethod will not be fired for any subsequent div) That's basically it; (there are some details, of course; table and select are existing Lua globals, for instance, so we explicitly pull these tags in as table_ and select_)

 -- test2.lua
 require 'html'
 html {
     body { title 'Test' },
     h2 'title',
     p 'first para',
     p 'second para'
 }

html is specialized; it constructs the html tag but also writes out the pretty-printed to a file called test2.xml. Findiing the name of the current script is straightforward; this is passed as as the zero-th element of the global arguments array arg; the name can then be extracted from the path.

 if arg then -- not interactive
   name = arg[0]:match('([%w_%-]+)%.lua')
 end

And test2.html looks like this:

 <html>
  <body>
   <title>Test</title>
  </body>
  <h2>title</h2>
  <p>first para</p>
  <p>second para</p>
 </html>

You may consider a little script like this to be another form of HTML template. It is less redundant and easier to type, and editor-friendly (editors like SciTE are good at finding matching braces). It is trivial to parameterize this code. And (nota bene) we have the full power of the language available to make shortcuts for common constructions.

For instance, HTML lists are ubiquitous and follow a very simple pattern:

 function list (t)
     for i = 1,#t do
         t[i] = li(t[i])
     end
     return ul(t)
 end

So list{'one','two','three'} will make an unordered list.

Obviously this is a job for a library, and Orbiter provides orbiter.html that gives us more sophisticated versions of the above function and others like HTML tables. But that project is sufficiently interesting to need its own article.

The code for html.lua is here.