Wednesday, 2 October 2013

C++: Some Consequences of a Design Decision

Top Dog

C++ has a very solid position as the programming language which makes the least performance compromises while providing good abstraction mechanisms that allow code to be written at a high level. In many ways, it's an easier and safer language to learn than C, if you stick to the same imperative style but use std::string and the provided containers.

Both of these statements are of course open to debate. The first statement is true, if we look at the usage of this language in performance-critical applications. The second is often challenged. To quote Andrei Alexandrescu's comment in this Reddit comment:

In my opinion C++ is a language for experts and experts only [...] It has enough pitfalls in virtually all of its core constructs to make its use by benevolent amateurs virtually impossible except for the most trivial applications.

Ouch! That's fair enough; we were comparing it to C anyway (which is definitely not for sissies). It is not really a programming language for civilians, and not a good first language for anyone other than a would-be professional. (In fact I'd say you would get a better all-round education in C, even if later you turn in relief to non-military languages; learning C++ is mostly good for ... programming in C++.)

Error, Error on the -Wall

The big hurdle is the first one, and that's making sense of the error messages:

 #include <iostream>
 #include <string>
 #include <list>
 using namespace std; // so sue me
 int main()
     list<string> ls;
     string s = "hello";
     ls.append(" world");
     cout << ls << endl;
     return 0;

This isn't a bad attempt at a C++ program at all, leaving aside the pedantic belief that using namespace std is bad. (I take the pragmatic view that anybody is free to inject whatever namespaces they care to within their files, and not take away this freedom from others by injecting namespaces within header files. C++ is very good at resolving ambiguous name references and everyone should know the contents of std anyway.)

We get nearly four hundred lines of error messages, full of implementation details. In this case, the abstractions are leaking all over the user!

Verity Stob once suggested that the thing to do was to write a Perl script to parse the error output. This was very funny and true, but using Perl would increase the number of problems. My practical way of realising Verity's joke is to use lake and a suitable plugin:

 C:\Users\steve\dev\dev>lake -L filter.cpp error.cpp
 g++ -c -O2 -Wall -MMD  error.cpp -o error.o
 error.cpp: In function 'int main()':
 error.cpp:10:8: error: 'list<string >' has no member named 'append'
      ls.append(" world");
 error.cpp:11:10: error: no match for 'operator<<' (operand types are 'ostream {a
 ka ostream}' and 'list<string >')
      cout << ls << endl;
 lake: failed with code 1

Now our noob has a fighting chance, and can now go to the reference and actually find the appropriate method.

Templates Considered Harmful

The real issue is that the C++ standard libraries over-use generics. std::string and std::stream could be plain classes, as they once were. At this point, there will be someone suggesting that I am a plain ASCII bigot and forgetting the need for wstring and so forth. Fine, let them be plain classes as well. An incredible amount of ingenuity went into making templated string types work, and the library designers could have made their life easier by using a low-tech solution. Generally, we should not pander to library designers and their desires, since they chose the hard road: their job is to use the right level of abstraction and not complicate things unnecessarily.

C++'s standard generic containers are fantastically useful, but their design is overcomplicated by being also parameterized by an allocator. This is a useful feature for those that need it, but there could be two versions of (say) std::list overloaded by template parameters, which can be done in C++11 with variadic templates. This makes life a bit harder for library implementers, but they are precisely the people who can manage complexity better than users.

The Standard is the Standard, no point in moaning. But let's do an experiment to see what the consequences of a simplified standard library. I emphasize that tinycpp is an experiment, not a proposal (modest or otherwise). It originally was done for the UnderC project, since the interpreter could not cope with the full STL headers, and I've since filled in a few gaps. Here it's purpose is allow some numbers to be generated, since qualitative opinion is all too common.

These simplified 'fake' classes directly give us better error messages, especially if the compile bombs out on the first error. (Often after the first error the compiler is merely sharing its confusion.)

 $ g++ -Wfatal-errors -Itiny error.cpp tiny/iostream.o tiny/string.o
 error.cpp: In function 'int main()':
 error.cpp:10:8: error: 'class std::list<string>' has no member named 'append'
      ls.append(" world");
 compilation terminated due to -Wfatal-errors.

It's easy to forget the initial difficulty of learning to ride the bicycle, and to scorn training wheels as a useful means to that end.

Templates Slow you Down

People say 'C++ compiles slowly' but this not really true. A little C++ program will involve in about 20Kloc of headers being processed, a lot of that being template code. Using the tinycpp library that goes down to 1.4Kloc.

The three compilers tested here are mingw 4.8 on Windows 7 64-bit, MSVC 2010 on the same machine, and gcc 4.6 in a Linux Mint 32-bit VM.

Here is a comparison of build times for standard vs tinycpp:

  • mingw 0.63 -> 0.33
  • gcc 0.60 -> 0.20
  • msvc 0.82 -> 0.17

As always, gcc works better on Linux, even in a VM, and it's no longer slower than MSVC. In all cases the tinycpp version compiles significantly faster.

C++ programmers can get a bit defensive about compile times, and often end up suggesting throwing hardware at the problem. There seems to be a "You're in the Marines now boy!" macho attitude that wanting to build faster is a sign of civilian weakness and poor attention span. This attitude is off-putting and gets in the way of better engineering solutions. Most people just suck it up and play with light sabres.

Templates Make you Fat

With small programs, these compilers produce small executables when they are allowed to link dynamically to the C++ library. This is not considered a problem on Linux, because obviously everyone has upgraded to the latest distro. But if you want to chase cool new C++11 features you may find that most of your users don't have the cool new libstdc++ needed to run your program.

It is (curiously enough) easier to get a new shiny GCC for Windows precisely because it's not considered part of the system. Executables rule in Windows, so it's alarming to find that a small program linked statically against libstc++ is rather large, nearly 600kb for Windows. And since libstc++ is not part of Windows you (again) have to suck it up. (And this is definitely what Alexandrescu would consider a 'trivial application'.)

You can get down to 174Kb using the fake tinycpp libraries, which suggests that an up-to-date and properly engineered version of std-tiny would be useful for delivering executables, not just for speed and noob-friendliness.

MSVC does static linking much more efficiently; the numbers are 170Kb (std) and 95Kb (tiny). The resulting executables have no C runtime dependencies whatsoever. Which suggests that MSVC is (at least) a good choice for building releases for distribution. Using a cross-platform compiler-aware tool like CMake or Lake can make that less painful. Not an ideologically comfortable recommendation to accept, true, but whatever works best. (The command-line version of MSVC 2010 is freely available.)

This preoccupation with executable sizes seems last-century by now (after all, Go users are fine with megabyte executables since they see that as the price of no other runtime dependencies.) And large executables are not slower, providing the actual code that's executing at any point is compact enough to be cache-friendly. So perhaps I'm just showing my age at this point, although please note that resource-limited devices are much more common than desktop computers.

No Free Lunches

C++ programmers like the phrase 'abstraction overhead' because C++ is very good at reducing this to zero in terms of run-time. Often this is at the cost of executable size, compile time and confusing errors. This may be an acceptable price, but it is not free.

C++ is what it is; it is unlikely to change that much, except get even slower to compile as the Boost libraries move into the Standard library. But I think that there are some lessons to be learned for new languages:

  • keep the standard library as simple as possible: library developers should not have too much fun (They should write cool applications that use their libraries instead to get excess cleverness out of their system.)
  • error messages should not burden the user with implementation details; this means that the abstraction is leaking badly.
  • compile time still matters. Perhaps the people who use C++ more regularly are more likely to be those who like to think upfront (like embedded programmers) but this is not the only cognitive style that flourishes in programming. It is a mistake to think that long build times are a necessary evil, since with C++ they largely come from an outdated compilation model. New languages can do better than that.

Monday, 2 September 2013

Nimrod: The Return of Pascal

Why learn Another Language?

The first answer is: because it's fun. Just as a botanist is excited to find a new plant, programming language nerds like trying out new languages. Secondly, any new language uses new strategies for dealing with the basic problems of communicating algorithms to computers and intents to other programmers. So it is the most sincere form of criticism: a working implementation to constrast with the approaches taken by other languages. There's far too much armchairing and bikeshedding involved in discussions about languages, and you have to admire a guy who has spent a sizeable chunk of his life trying something new like Nimrod's author, Andreas Rumpf.
If you're not a language nerd, a new language might provide a solution to an actual computing problem you are facing. (Who would have guessed?)

Hello, Nimrod

For this exercise, I'm assuming a Unix-like system, but pre-compiled installers for Nimrod on Windows are available.
First, download and build Nimrod from here. It only takes a few minutes, and after making the suggested symlink nimrod will be on your path. In that directory, you will find a most useful examples folder, and the documentation is doc/manual.html for the manual,doc/tut1.html for the tutorial,2doc/lib.html for the standard library.

Here is a slightly non-trivial Hello-world application, just to test the compiler:

 # hello.nim: Hello, World!
 var name = "World"
 echo("Hello " & name & '!')

Compiling involves the simple incantation nimrod c hello.nim, which will generate a very chatty record of the compilation, and an executable hello. This has no external dependencies apart from libc and comes at about 130Kb; with nimrod c -d:release hello.nim the compiler agressively removes unneeded code and we are down to 39Kb.

This is the first take-home about Nimrod; it compiles into native code using the available C compiler and so can take advantage of all the optimization work that's gone into beasts like GCC and MSVC. There is no special runtime, so these executables can be shared with your colleagues without fuss. In the library documention doc/lib.html, 'pure' libraries will not introduce extra dependencies. Whereas (for instance) the re regular expression library currently implies an external dependency on PCRE.
The verbosity is interesting the first few times, and thereafter becomes tedious. I've defined these bash aliases to get cleaner output:

 $ alias nc='nimrod c --verbosity:0'
 $ alias ncr='nimrod c -d:release --verbosity:0'

Training a programmer's editor to present Nimrod code nicely is not difficult; using Python highlighting works well since the languages share many keywords. The main thing to remember is that Nimrod does not like tabs (thank the Gods!) here are some SciTE property definitions which you can put into your user configuration file (Options|Open User Options File); now F5 means 'compile if needed and run' and F7 just means 'compile'.

After a few invocations to get all the tools in memory, this compilation takes less than 200ms on this rather elderly machine. So the second take-home is that the compiler is fast (although not as fast as Go) and definitely faster than C++ or Scala. In particular, syntax errors will be detected very quickly.

A First Look

This code looks very much like a typical 'scripting' language, with hash-comments, explicitly-declared variables and string operations like concatenation (&). (A separate concatenation operator is a good decision, by the way; Javascript has a number of famous ambiguities that come from + meaning two very different things.)

However, this is not dynamic typing:

 # types.nim
 var s = "hello"
 var i = 10
 s = i
 $ nc types
 examples/types.nim(4, 5) Error: type mismatch: got (int) but expected 'string'

So s is statically-typed as 'string', i is typed as 'int', and no sane conversion should ever make an integer into a string implicitly. Nimrod does local type inference which examines the expression on the right-hand side and uses that type for the declared variable, just like the same statement would do in Go. Another good thing, since a variable cannot change type underneath you and you really need as many errors to happen at compile-time. The resulting code is also much more efficient than dynamically-typed code.

The next program looks very much like Python:

 # args.nim
 import os
 for i in 0..ParamCount():
 $ nc args
 Hint: operation successful (14123 lines compiled; 0.374 sec total; 12.122MB) [SuccessX]
 $ ./args one two three

But beware of surface resemblences; sharks and orcas look much the same, but are very different animals. The language that Nimrod reminds me of here is Rodrigo 'Bamboo' de Oliveira's Boo, the second-greatest programming language to come from Brazil. His comment is "We also love the Monty Python TV show! - but Boo is not Python". So Pythonistas should not assume that they can automatically skip the first semester with Nimrod. The first difference to note is that import brings all functions from the module into the current scope.

Apart from basic syntax, built-in functions like len and repr work mostly as you would expect from Python. Slicing is supported, but note the different syntax:

 var S = "hello dolly"
 var A = [10,20,30,40]
 var B = A[1..2]
 echo(len(A)," ",len(B)," ",len(S))
 for x in B: echo(x)
 # --->
 4 2 11
 [10, 20, 30, 40]

Type inference is fine and dandy, but is not letting us have the full picture. The 'lists' in square brackets are arrays, and they are fixed size.

The Return of Pascal

To a first approximation, an orca is a wolf in shark's clothing. Simularly, the language that Nimrod most matches in nature is Pascal:

 # pascal.nim
     TChars = range['A'..'C']
     TCharArray = array[TChars,int]
 var ch: TCharArray
 for c in 'A'..'C':
     ch[c] = ord(c)
 for c in low(ch)..high(ch):
     echo("char ",c,' ',ch[c])
 # --->
 char A 65
 char B 66
 char C 67

Paws have become flippers (= instead of :=, no semicolons or begin..end blocks) but this is classic Pascal typing, with subranges and array types declared over arbitrary ordinal types. So accessing ch['Z'] is a compile error 'index out of bounds'. Also, 'Z' is of type char and "Z" is of type string - quite distinct as they are in C as well. Like Pascal, arrays are always bounds checked, but this can be disabled by pragmas. The T convention for naming types should be familiar with anyone who was a Borland fan.

Please note that variables are case-insensitive! Underscores are ignored as well. (This may well change.)

Another indication that Nimrod comes from the Niklas Wirth school is that functions are called procedures, whether they return something or not.

 # proc.nim
 proc sqr(x: float): float = x*x
 # -->

You should not assume that float means 32-bits; the manual says "the compiler chooses the processor's fastest floating point type" and this usually is float64; there is also float32 if you wish to be explicit, just as in Go. (The usual conversions between integers and floats are allowed, since they are widening.) In a similar way, int always has the size of a pointer on the system (which is not true for C), and there is intXX where XX is 8,16,32 or 64.
Also as with Pascal, arguments may be passed by reference:

 # var.nim
 proc modifies (i: var int) =
     i += 1
 var i = 10
 for k in 1..3:
 # --->

This is a procedure that returns nothing. Every language draws a line in the sand somewhere and says "I don't think you should do that, Dave". One of Nimrod's rules is that you cannot just discard the results of a function that returns a value, unless you use the keyword discard before it like discard fun(), rather as we say (void)fun(); in C.

There is fairly standard exception handling. A cool novelty is that finally or except can be used as standalone statements:

 proc throws(msg: string) =
     raise newException(E_base, msg)
 proc blows() =
     finally:   echo "got it!"
     echo "pre"
     throws("blew up!")
     echo "post"
 # --->
 got it!
 Traceback (most recent call last)
 finally.nim(10)          finally
 finally.nim(7)           blows
 finally.nim(2)           throws
 Error: unhandled exception: blew up! [E_Base]

This is very similar in effect to Go's defer mechanism, and allows for deterministic cleanup.

Tuples are Structs

It's often better to take the Python strategy and return multiple results using a tuple

 # tuple.nim
     TPerson = tuple [name: string, age:int]
 proc result(): TPerson = ("Bob",42)
 var r = result()
 echo(r)             # default stringification
 echo (       # access by field name
 var (name,age) = r  # tuple unpacking
 echo (name,"|",age)
 # --->
 (name: Bob, age: 42)

Different tuple-types are equivalent if they have the same fields and types in order ('structural equivalence'). Nimrod tuples are mutable, and you should think of them more as akin to C's struct.
Functions defined over a type have a most curious and interesting property. Contining with tuple.nim we write a silly accessor function:

 proc name_of(t: TPerson): string =
 # --->

That last line is something to think about: we've got something like 'methods' by just using the function after the dot, as if it were a field; in fact you typically leave off the () in this case and have something very much like a read-only property.

'List' was a Bad Name Anyway...

I mentioned that [10,20] is a fixed-size array, which is the most efficient representation. Sequences can be extended, like C++'s vector or Python's List types.

 # Using seq constructor and append elements
 var ss = @["one","two"]
 # using newSeq, allocate up front
 var strings : seq[string]
 newSeq(strings, 3)
 strings[0] = "The fourth"
 strings[1] = "assignment"
 strings[2] = "would crash"
 #strings[3] = "out of bounds"

Using sequences of strings and the parseopt module, here is a simple implementation of the BSD head utility. The release executable is 58Kb, which is an order of magnitude smaller than the equivalent Go stripped executable. It's only 54 lines, but a little big to be an inline example. The case statement is very Pascal-like:

 case kind
 of cmdArgument:
 of cmdLongOption, cmdShortOption:
   case key
   of "help", "h": showUsage(nil)
   of "n":
     n = parseInt(val)
   of "version", "v":

parseopt isn't fully GNU compatible: in particular, you have to say ./head -n=3 head.nim rather than -n3 or -n 3. The code style is a bit low-level for my taste; compare with lapp; a well-behaved command-line tool must always provide its usage, so why not reuse that text to describe flags, arguments and their types? This style works particularly well with dynamic languages, but it can be done with Nimrod. Here is head, revised:

 # head.nim
 import lapp
 let args = parse"""
 head [flags] filename
   -n: (default 10) number of lines
   -v,--version: version
   <files> (default stdin...)
     n = args["n"].asInt
     files = args["files"].asSeq
 proc head(f: TFile, n: int) =
     var i = 0
     for line in f.lines:
         i += 1
         if i == n: break
 if len(files) == 1:
     for f in files:
         echo("----- ",f.fileName)

Associative arrays are the key here, plus a variant value type. lapp ensures that numerical flags are correctly converted, files are opened (and afterwards closes them on a exit hook set with addQuitProc). There are some conventions to be followed:

  • flags may have a short alias; the long name is always used to access the value
  • flags are bool values that default to false
  • parameters are enclosed in <...> and are string values with no default
  • you can specify the type explicitly: bool,int,float,string,infile',outfile, or set the default and have the type infered from that:stdinandstdout` have their usual meanings

One of the really cool things about type inference is that so many of the implementation details are hidden from users of a library. This is obviously good for the user, who has less to remember, but also for the library implementer, who has freedom to change the internal details of the implementation. It leads to a style which looks and feels like dynamic code, but is strictly typed with meaningful compile-time errors.

Here the type of args is irrelevant; it is an associative array between flag/argument names and some unspecified value type, which has known fields. (In fact, this version of lapp only exports parse, the fields, and a suitable definition of [] from 'tables')

'class' is not a transferable idea

People tend to reason from simularity, so the naive nature watcher constructs a false homomorphism between sharks and orcas. I fell into this trap, assuming 'inheritance' means 'polymorphism using virtual method tables'. Nimrod's optimization attitude is showing here: "Nimrod does not produce a virtual table, but generates dispatch trees. This avoids the expensive indirect branch for method calls and enables inlining". That's right, procedures are always statically dispatched. If you want methods, you need a different construct, multi-methods:

 # class.nim
     TPerson = object of TObject
         name: string
         age: int
 proc setPerson(p: ref TPerson, name: string, age:int) = = name
     p.age = age
 proc newPerson(name: string, age:int): ref TPerson =
 method greeting(p: ref TPerson):string = "Hello " & & ", age " & $p.age
     TGerman = object of TPerson
 proc newGerman(name: string, age:int): ref TGerman =
 method greeting(p: ref TGerman):string = "Hallo " & & ", " & $p.age & " Jahre alt"
 var bob = newPerson("Bob",32)
 var hans = newGerman("Hans",30)
 proc sayit(p: ref TPerson) = echo p.greeting
 # --->
 Hello Bob, age 32
 Hallo Hans, 30 Jahre alt

Here we are making objects which are references (by default they are value types, like tuples, unlike java), initialized with the standard procedure new. Note the Pascal-like special variable result in procedures!

As expected, you may pass Germans to sayit, because a German is a person, but greeting has to be declared as a method for this to work; if it were a proc, we would get a warning about the second greeting being unused, and Germans are then addressed in English.
The cool thing about multi-methods is that they restore symmetry; a traditional polymorphic call is only polymorphic in a. This makes sense in a language where dot method notation is just sugar for procedure calls where the first argument matches the type.

Generics Everywhere

Consider this, where no type is given for the argument of sqr:

 proc sqr (x): auto = x*x
 echo sqr(10)
 echo sqr(1.2)
 # -->
sqr is implicitly generic, and is constructed twice, first for int and then for float. Comparing a similar thing in Boo reveals a key difference:
 def sqr (x):
     return x*x

Here the type of x is duck, where Boo switches to late binding.

Both archieve the same result; sqr can be passed anything that knows how to multiply with itself, but Nimrod wants to generate the best possible code, at the cost of more code generation. The more general way of declaring generic functions goes like:

 proc sqr[T] (x: T): T = x*x

Another example of Nimrod being conservative about your memory needs would be declaring a very large array of strings. In languages where string is a value type like C++ and Go, this would contain valid strings, but in Nimrod the entries are nil until explicitly initialized. So string values can be nil (like Java) which can be a fertile source of run-time errors, but the decision on how much heap to throw at the data structure is left to you, which is a very C-like design decision. Strings in Nimrod (however) are mutable and do copy by value.

Generics make it easy to write operations over containers. Here is map with an anonymous procedure:

   a = [1, 2, 3, 4]
   b = map(a, proc(x: int): int = 2*x)
 for x in b: echo x
 # --->

Anonymous procedures are a little verbose (as they are in Go), but there is a trick. We use a template which is a higher-order generic that rewrites expressions, much like a preprocessor macro in C/C++:

 template F(T: typedesc, f:expr):expr =
     (proc(x:T):T = f)
 b = map(a, F(int, 2*x))

Nimrod achieves the power of the C preprocessor in an elegant fashion, integrated into the language itself. The when statement works with compile-time constants and only generates code for the correct condition, much like a #if chain.

 when sizeof(int) == 2:
   echo("running on a 16 bit system!")
 elif sizeof(int) == 4:
   echo("running on a 32 bit system!")
 elif sizeof(int) == 8:
   echo("running on a 64 bit system!")
   echo("cannot happen!")

LIke if, it can be used in an expression context:

 const dirsep = when hostOS == "windows": '\\' else: '/'

A clever use is to conditionally add testing code to a module when it's compiled and run as a program. These tests can be as detailed as you like, because they will not bloat the compiled library.

 # sqr.nim
 proc sqr *[T](x: T): T = x*x
 when isMainModule:  # predefined constant
     assert(sqr(10) == 100)

As you might expect by now, Nimrod does not provide run-time reflection like Java or Go because it would burden code that does not need it - again, this is C++'s "Don't Pay for what you Don't use". But there is compile-time reflection, implemented by the typeinfo module, which acts as a static equivalent of Go's reflect package.

Second Impressions

There's no doubt that finding errors as early as possible using a compiler (or some other static code analysis tool) is better than finding them later as run-time errors. In dynamic languages we are always at the mercy of a spelling mistake. But static compilation has a cost in time (build times do matter) and in complexity.

Having done about a thousand lines of working Nimrod code, I feel I can express an opinion on the language. Most code is straightforward and free of explicit type annotations, and the compiler quickly gives good error messages. Run-time errors come with a useful stack trace, and mostly come from nil references. It's commonly thought that nillable values are a big mistake (C.A.R Hoare once called it his "billion dollar mistake") but a nil string value is much better at trashing a program. And this is good - fail hard and early!

However, you do need to understand some things to interpret the error messages correctly:

 let a = [1,2,3]
 echo a
 nimex/errors.nim(2, 6) Error: type mismatch: got (Array constructor[0..2, int])
 but expected one of:
 system.$(x: TEnum): string
 system.$(x: int64): string
 system.$(x: string): string
 system.$(x: uint64): string
 system.$(x: T): string
 system.$(x: int): string
 system.$(x: char): string
 system.$(x: T): string
 system.$(x: bool): string
 system.$(x: cstring): string
 system.$(x: float): string

You have to know that echo uses the 'stringify' operator $ on its arguments - then we can interpret this error as being "I don't know how to make a string from an array". The compiler then helpfully presents all the overloaded versions of $ active in this program. Of course, this is scary to people from a dynamic background who were beguiled by Nimrod's surface 'Python-like' syntax. Coming from a C++ background, I'm prepared for this way of doing things, and know that the solution looks like this (quote operators in backticks to define them):

 proc `$`[T](a: openarray[T]): string =
     var s = ""
     for e in a:
         s.add(' ')
     return s

(Mutable strings take some getting used to). This solution will work for any arrays or sequences with elements that understand $, and is very efficient, because Nimrod iterators over sequences are zero-overhead - effectively plain loops over elements.

There is a non-trivial learning curve; a motivated polyglot can learn enough in a week to be productive, but polyglots aren't so common. A new language comes with a shortage of tutorial material/books and a small community. This means that Google is not your friend, and last I checked there were two questions on Stackoverflow, one of which concerned a Brainfuck interpreter. There does however seem to be some action on Github.

A language thrives (like any life form) when it finds a niche in which it is competitive. For Lua, that has been providing a lightweight, powerful yet accessible embeddable scripting language. It has been adopted by many game developers as a way of not writing everything in C++, which is productive in two important ways: small changes to game logic do not need expensive rebuilds and don't require restarting the game; plus lesser mortals can contribute. Professional game programmers tend not to do things simply because they are cool, and so there is a market for Lua skills.

Nimrod is a good fit where C and C++ are often used. We've seen that 'userland' utilities like head can be efficiently implemented in Nimrod, and the resulting executables are typically less than a 100kb and usually have no external dependencies. This makes it a good fit for CGI since they will load as fast as C. With Go, people found statically-linked executables a good way to manage the problem of managing dependencies on deployed machines. Nimrod provides this without Go's approach of reimplementing the whole C runtime.

But the server niche requires well-tested frameworks and libraries, which can only happen with wider adoption. Thus there is a vicious circle that any new language must face; use comes from maturity, and maturity comes from use.
It's well suited to data processing and numerical tasks; operator overloading makes standard mathematical notation possible, and generics make algorithms efficient. Here again having some choice of existing libraries would make all the difference. However, it is relatively easy to bind to C libraries (since the compiler output is C) and there is a c2nim tool for processing C headers.

A particularly cool application is for embedded systems. Here the realities are merciless; embedded processors are everywhere and need to be cheap, and you can't get much memory for pennies. As a consequence, C dominates this field, and it's nasty. I can honestly say that this is my least favourite kind of programming; the preprocessor hacks alone would make Dijkstra lie down and cry. Here Andreas describes how Nimrod can be compiled with a stripped-down library with no OS support, and compiled on a 16bit AVR processor. Nimrod is probably the only new language which has the minimal attitude and metaprogramming capability to be an effective contender in this space, which is traditionally the last bastion of C.

Garbage collection is something that's often used to separate system and application languages. It's hard to add it to an existing language, and hard to remove it from a language, since it is so damn convenient.
A kernel has to manage every byte so that the userland can afford to waste memory; game programmers hate compulsory 'stop the world' GC which tends to happen when you're doing something more important. And embedded controllers often don't even have malloc. See how Nimrod's Garbage Collector works; it is low-overhead, uses reference counting and can be switched off temporarily (unlike with the Dalvik VM on Android)

In summary, Nimrod is a very rich and powerful statically-typed language which relentlessly uses compile-time metaprogramming to achieve its goals of delivering compact and efficient programs. Whether it finds its niche is an open question, but it deserves to be given a chance, and is well worth exploring further.

Update: The title is definitely a mistake, because Pascal represented a simplification of existing practice and was intended as a teaching language.  If I had said Object Pascal then it wouldn't be so bad, since that grew into a genuinely useful language for building large systems.  But Nimrod is influenced by many other languages, so any 'Nimrod is like X'  will always be a simplfication; it is what it is.

It's been pointed out that Javascript's problem is not lack of a concatenation operator, but implicit conversions: C++ lacks the separation also but would never confuse concatenating strings with adding numbers.  There is a similar implicit conversion in Lua (one of the few warts in its design) but the operators are separate, as they should be.

Monday, 12 August 2013

A Question of Notation: Revisiting Moonscript

Growing Up Nicely

Since the last time I reviewed Moonscript here it has matured nicely in ways that make it easier to use. Some nasty gotchas (like a[i-1] being misinterpreted) have gone away and there is better error reporting.

It has found its way into my toolbox for quick utilities and prototypes that I don't need to share with others. (That may change; my colleagues know Python, not Lua, and I suspect that they will find Moonscript easier to read.)

I hope to make the point that even people who use Lua and don't wish to bet on an 'experimental' language can benefit from learning a little Moonscript, since it makes an excellent notation for expressing Lua programs. In particular, its terse syntax is well suited to interactive exploration.

Get Interactive

Out of the box, there is no REPL, but writing a sufficiently-good one was not difficult. mooni was the result. It does not try to solve the tricky problem of when to start a block; you indicate this by ending a line with a backslash.

moon-article$ mooni
MoonScript version 0.2.3
Note: use backslash at line end to start a block
> ls = {10,20,30}
> ls
> m = one:1, two:2
> m
> for x in *ls \
>>  print x

Moonscript works with tables in the same way as Lua, except that the more conventional colon is used for associative key-value pairs. You don't always have to use curly brackets for map-like tables (e.g. the assignment to m above). Since all statements in Moonscript can have a value, we don't need any special way to indicate that a expression is being evaluated. mooni also does pretty-printing of tables.

A language with no libraries is a no-starter, but a language which is essentially a new notation for an existing language starts off with a ecosystem. For instance, we have Penlight available.

> require 'pl'
> utils.split 'one two three'
> utils.printf "hello '%s'\n", 'world'
hello 'world'
> utils.import math
> cos(pi/8) + sin(pi/2)

Function calls don't need parentheses, except when you need to force a function call to bind with an argument - in that case, the opening paren must not be separated by space from the function. (A more formal way of stating Moonscript's semantics here is that the call operator has a much lower precedence)

This sensitivity to whitespace takes a little getting used to, since Lua has practically none, but the payoff is that most function calls require fewer keystrokes, which matters in interactive mode.

The first great thing about an interactive mode is that the beginner has a chance to try statements out one by one, and gets their rewards ('it works!') and punishments ('why did that not work?') in little incremental steps.

The second great thing comes from the fact that we are often beginners; testing out a new library, exploring an API, trying out one-liners before inserting them into thousand-line programs, etc. So even if you are a Lua programmer, using a Moonscript REPL will allow you to experiment with your Lua modules in a less tedious way.

The function notation is very compact, which makes a functional style more pleasant:

> -- 'return' is implicit here
> sqr = (x) -> x^2
> sqr 10
> -- functions of no arguments do not need argument lists
> f = -> 42
> f()
> add = (x,y) -> x + y
> -- partial application
> add1 = (x) -> add x, 1
> add1 10
> ls = List{'alpha','beta','gamma'}
> -- mapping a function over a Penlight list
> ls\map => @sub 1,1
> ls\filter => #@ > 4

The fat arrow is short for a function with an implicit self; @ is a shorthand for self.

Everything is a Value

The use of indentation to indicate blocks is now firmly associated with Python, so people with a Python background might feel superficially at home. But consider this:

> f = -> 10,20
> {f()}
> if 0 then 'ok'
> if 1 > 2 then 'ok' else 'nope'
> ls = for i = 1,3 do i
> ls
> [i for i = 1,3]

As in Lua, functions are first-class values, and they can return multiple values. (In Python there is the illusion of multiple return values, but really they are packed into a tuple and then unpacked in any assignment, which is a lot less efficient). If such a function is evaluated as the last value in a table constructor, all the values are captured. This is all standard Lua semantics. The if construct should come as a surprise to both Lua and Python users, since it returns a value. (The gotcha for a Python user is that 0 is not false; only false and nil evaluate as false in Lua)

for statements are expressions that collect their values into a table, only if they are in an assignment context. So they don't needlessly create tables! For this simple task, list comprehensions are better, but consider this example from the manual:

doubled_evens = for i=1,20
  if i % 2 == 0
    i * 2

(The general rule is that the block versions of if,for and while leave out the then or do keyword. There is no end keyword!)

The with statement works like the similar statement in VB or Pascal; a dot is needed to indicate the fields that will be set within the table. And it's no surprise that it returns that table as a value:

> with {} \
>>  .a = 1
>>  .b = 2

This gives us a really cool way to write modules, because (again) the value of loading a file is the value of the last expression.

-- util.moon
with {}
    .greeting = (name) -> print "Hello ", .quote name

    .quote = (s) -> string.format('%q',s)
> u = require 'util'
> u.greeting 'Dolly'
Hello     "Dolly"

(Note how we can use the dot for reading fields as well as writing them.)

Doing Programs Bit by Bit

require in Moonscript works just like in Lua, except it will load any *.moon files on the module path as well. But require is not so useful for incremental and interactive development, because it will only load once and cache the result. Which is why we will rather use dofile for this purpose - but the global dofile from Lua and only loads Lua scripts. It is easy to make a Moonscript-aware dofile using the built-in moonscript module.

> moon = require 'moonscript'
> dofile = (x) -> assert(moon.loadfile x)()
> u = dofile 'util.moon'
> u
{greeting:function: 0xfd9610,quote:function: 0xfbc140}

So now it's possible to reload a module and try out the changes. The finger-friendly syntax makes interactive use easier. If I had a module which controlled a robot, then Moonscript provides a nice command prompt:

> turn left
> speed 2
> obstacle -> speed -2

Now this isn't such an imaginary scenario. PbLua is a Lua port for the Lego Mindstorms NXT kit, and it has a Lua prompt. Getting Moonscript to work with pbLua does not require that the micro actually runs Moonscript! A custom mooni could translate everything into plain Lua and push that up, ditto for scripts.

This point needs emphasizing - moonc compiles Moonscript to Lua. The Lua environment that then runs that code could be stock Lua 5.1, 5.2, LuaJIT or whatever. The decision to use Lua as the intermediate language has given us a lot more flexibility in applications.

In mooni, if you want to see what Lua code was generated by the last executed statement, use this common expression of puzzlement:

> t = one:1, two:2
> ?que
t = {
  one = 1,
  two = 2

Making up New Constructs

For instance, it seems self-evident to most people that a modern language should have syntax for exception handling. Coating Lua's pcall in some convenient sugar is very straightforward:

try = (block) ->
    ok,err = pcall block
    if not ok then err\gsub '^[^:]+:%d+: ',''
    else nil, err

test = (foo) ->
  err,val = try ->
    if foo then return 2*foo
    print a.x
  if err
    print 'error!',err

print test nil
--> error!  attempt to index global 'a' (a nil value)
print test 2
--> 4

There is still an issue if the function did want to return nil, but I'll leave the resolution of this to any interested parties. (hint: use select)

This kind of thing has been possible in Lua for a long time now, but people get put off by the necessity for (function() ... end) here, and anywhere where we need a lazy way to get 'lazy evaluation'.

For instance, when working with many GUI toolkits it's useful to schedule an action to be run later on the main thread. This could be expressed as later 300,-> do_something(). GUI toolkits are all about firing events; for instance in AndroLua one can create a button and attach it to an action in two lines:

@button "Please Click Me!",->
    @toast "Thanks!"

The equivalent Java code is a lesson in how boilerplate obscures otherwise straightforward code, and explains why Java simply has to get lambdas to compete.

Moonscript's syntax can play nicely in the niche established by Ruby. For instance, this is a rakefile.

task :codeGen do
  # do the code generation

task :compile => :codeGen do
  #do the compilation

task :dataLoad => :codeGen do
  # load the test data

task :test => [:compile, :dataLoad] do
  # run the tests

And here is the equivalent lakefile for Lake

-- lakefile.moon
task = target

task.codeGen nil, ->
    print 'codeGen'

task.compile 'codeGen',->
    print 'compile'

task.dataLoad 'codeGen',->
    print 'dataLoad'

task.test 'compile dataLoad',->
    print 'test'

-- without any explicit targets, lake fires this ....
default 'test'

moonc lakefile.moon would creae lakefile.lua, which lake understands. If anything, the syntax is even cleaner - I've cheated slightly by passing dependencies to the targets as space-separated strings; they can also be written as tables like {'compile','dataLoad'} which is the internal representation anyway.

Imagine a hypothetical environmental monitoring system:

rule.too_hot -> temp > 37 and humid > 80
--- handle the rule asynchronously...
If.too_hot -> print 'get them out!'

Which suggests that if I were designing a DSL (Domain Scripting Language) for such a rule-based application then my users might find Moonscript easier than plain Lua. (Embedding Moonscript in an application is not difficult, since it's mostly Lua code with dependencies on LPeg and LuaFileSystem. The Windows binary has already compiled this all into a DLL.)

Tooling and Documentation

This is something that separates the goats from the sheep when evaluating a new language. Fortunately, leaf has provided editor support - the repackaged SciTE is a good choice for Windows users. You will probably have a better experience if you edit the configuration file `Options|Open Global Options' and put these lines at the end:


Assuming that you do want other people to use your modules, it helps to have a documentation tool that understands the language. This simple List class has basic LDoc markup. Note the use of the @within tag to put the metamethods in their own section.

The output from ldoc -f markdown -S List.moon is here.

(This is all hot off the presses so you'll have to grab from the LDoc master. I'm considering whether metamethods should automatically be put into their own section by default.)

Differences and Distinctions

Moonscript is compiled to reasonably straightforward Lua, and its detailed semantics are a superset of Lua so it finds easily into the existing Lua ecosystem. The surface syntax is different, but comes from several design choices

  • indentation is syntax - no more end
  • : for key-value pairs in tables; purely map-like tables often don't need curly brackets
  • line ends become significant, so commas are not needed in multiline tables
  • function(self,y) becomes (self,y) -> or (y) => depending on taste
  • function calls have very low precedence, hence less parens needed
  • every statement can return a value; explicit return is usually not needed
  • local-by-default means local is only needed to make scoping explicit
  • there is sugar for list comprehensions, classes and extended assignment operators like += and *=. != is a synonym for ~=

In other words, it is a determined attempt to reduce the typing needed for common operations in Lua, at the cost of more rules. This makes it a good notation for interactive work, even if your work remains mostly in Lua.

Could a person learn Moonscript without knowing Lua? No reason why not, but it will require a good self-contained tutorial and there are more syntactical gotchas. It could make a good educational language, since there you do not necessarily want a language that some of the class already know; familiarity breeds conplacency, if not actual brain damage (as Dijkstra asserted about Basic.)

Moonscript is available using LuaRocks or Windows binary - On Unix, sudo luarocks install mooni will bring in Moonscript as well, since it's a dependency. mooni itself is a single Moonscript file and can be found here.

Thursday, 5 January 2012

Building Programs with Lake

Carpenters Bitching About Tools

One of the irritating things about programming is the set of available tools for building software. A lot of the appeal of dynamic languages comes from the simple fact that you don't build, you just run. A great deal of the fuss is accidental complexity, which is Fred Brooks' term for the gap between the complexity of the task and the actual complexity of the solution.

In the begining, there was Stu Feldman's make. People have subsequently wondered what deep reason there was behind the need for tabs, but the referenced quote from The Art Of Unix Programming shows us that it was an unhappy accident:

Why the tab in column 1? Yacc was new, Lex was brand new. I hadn't tried either, so I figured this would be a good excuse to learn. After getting myself snarled up with my first stab at Lex, I just did something simple with the pattern newline-tab. It worked, it stayed. And then a few weeks later I had a user population of about a dozen, most of them friends, and I didn't want to screw up my embedded base. The rest, sadly, is history.

make is a fantastic tool, firmly in the Unix tradition of doing one thing well. It runs commands when any of their inputs are more recent than their output. Its power comes from all the other powerful little tools that come with a traditional Unix environment. (This naturally becomes an issue on Windows, where you basically have to mimic that environment closely enough, for instance MSYS.) Its pain comes from having to cope with all the weirdness of different systems, even within the POSIX world, with tools of inadequate expressiveness and power. In other words, it's a bad programming language.

There are many alternative build systems which came out of this frustration. And an entertainingly profane plea on Reddit to simply stop building new ones.

Human nature being what it is, it did not stop me. I have no particular great hopes of it gaining any fans, but Lake does provide a working example of how Lua is suited for embedded Domain Specific Languages (DSLs).

Using Lake to cope with the C

Say we have the canonical first program, hello.c. Running the lake command works as expected; there must be a lakefile:

 c.program 'hello'

and then

 $> lake
 gcc -c -O1 -Wall -MMD  hello.c
 gcc hello.o  -o hello.exe

Running lake again will give the message 'lake: up to date'. If hello.c changes (or we deleted hello.o) then things will rebuild.

This seems fairly underwhelming at first, but then we knew this was a trivial program to build in the first place. Now, if Lake finds the Microsoft command-line compiler cl.exe on the path, then it changes its tune:

 $> lake
 cl /nologo -c /O1 /WX /showIncludes  hello.c
 link /nologo hello.obj  /OUT:hello.exe

(This is what will happen on Windows if you execute Lake inside a Visual Studio command Prompt)

The whole idea about lakefiles is that they express the build on a higher level, and let the tool decide on the incantation. This is particularly useful if you aren't familiar with cl.exe, for instance.

But there is more. This simple lakefile provides:

  • cross-platform, compiler-agnostic builds (On Unix it knows to drop off the '.exe')
  • an automatically generated 'clean' target, so lake clean will do its job
  • a debug build by saying lake -g or lake DEBUG=1

I do work on embedded Linux sometimes. If I wanted my hello to work on a Gumstix then the incantation would simply be lake PREFIX=arm-linux and the correct compiler and linker would be invoked.

OK, let's get more fancy. The hello program has a second file utils.c and a shared header common.h. The lakefile now looks like this:

 c.program{'hello', src = 'hello utils'}

Please note the curly braces: program is a function of one argument, which is here a Lua table constructor. You can put parentheses around the table, but it isn't required. The sources are provided as a simple space-separated list of names; Lake already knows that the extension must be .c.

This build works as desired; if any of the two C files change then they will be recompiled, and the program linked. Lake knows that hello.exe depends on hello.o and utils.o, and it knows that these in turn depend on the corresponding source files. But it even knows that compiling the source files depends on the shared header - so that editing common.h (or just updating its timestamp with touch) will cause both of them to be rebuilt. Both of these compilers can be told to show what include files they depend on, using the -MMD and /showIncludes flags respectively. Lake uses this output to add extra dependencies to the compile rule for the files. Managing the dependencies manually is irritating, and easy to get wrong.

Say utils.c had a reference to sqrt. The lakefile should now be:

 c.program{'hello', src = 'hello utils', needs = 'math'}

Now for Unix builds, the math library will be linked in with -lm; on Windows, this is unncessary since the runtime already includes the math library.

Everyone has Needs

Lake uses this idea of needs for builds to specify their requirements on a higher level.

pkg-config is a marvelous utilty that provides exactly what is required here. Unfortunately, it is not used widely (or consistently) enough to be a one-stop shop for providing the gory details about every library. But Lake will try to use pkg-config if it is available to match needs. So a simple GTK+ C program can be built like so:

 c.program{'button',needs = 'gtk+-2.0'}

Lake resolves needs like this: first, whether it is built-in (like 'socket' or 'lua'), second, whether there are pkg-config aliases like 'gtk' available, and third, whether suitable global variables have been defined. So in resolving the unknown need 'foo' Lake will see if the globals FOO_INCLUDE_DIR, FOO_LIB_DIR and FOO_LIBS have been defined and point to existing directories. (Thereafter, it will try pkg-config.)

A lakefile is ultimately just a Lua script, and can have code that sets these variables explicitly.

 FOO_LIBS = 'foo3'
 if WINDOWS then
     FOO_DIR = 'c:\\foolib'
     FOO_INCLUDE_DIR = '/usr/include/foo3'

Explicit Rules: running Tests

Say I have a little Lua C extension. That's straightforward because Lake knows about Lua:

 mylib = c.shared{'mylib',needs = 'lua'}

Now I wish to run some Lua test files against the generated or mylib.dll. For this, we make an explicit rule that connects Lua source files with an output file like so:

 lt = rule('.lua','.output','lua $(INPUT) > $(TARGET)')
 // populate the rule with targets; it depends on mylib
 lt ('test/*',mylib)
 // the default target depends on both the library and the test targets

One important take-home here is that Lake works with targets in a very similar way to Make; the first target defined in a lakefile becomes the default, but if there are multiple targets then we have to define a dummy target that depends on these targets.

Now, maybe there is also a requirement that tests can always be run directly using lake tests. So we have to create a target dependent on the test targets, which first resets the tests by deleting the fake targets:

 target.tests {
   action(utils.remove, '*.output'),

Depending on an unconditional action does the job. (However, this is not entirely satisfactory, since in an ideal world the order of dependencies being resolved should not matter, but this will do for now.)

Making the World a Better Place, one Semicolon at a Time

I remember an entertaining article by the famous Verity Stob on what to do when encountering C++ errors. One of the options was to write a Perl script to unmangle the errors so that they could be read by humans. She was writing satire, but like most good humour it was more than just a joke.

For instance, here is a wrong C++ program. Not terribly wrong, in fact almost competent:

 // errors.cpp
 #include <iostream>
 #include <string>
 #include <list>
 using namespace std;
 int main()
   list<string> ls;
   cout << "that's all!" << endl;
   return 0;

The response is pretty scary:

 errors.cpp:9: error: 'class std::list<std::basic_string<char,
 std::char_traits<char>, std::allocator<char> >,
 std::allocator<std::basic_string<char, std::char_traits<char>,
 std::allocator<char> > > >' has no member named

Seasoned C++ programmers learn to filter their error messages mentally, but this is the kind of initial experience that drives kids to sniffing glue.

lake provides the ability to filter the output of a compiler, and reduce irrelevant noise. Here is the lakefile:

 if CC ~= 'g++' then quit 'this filter is g++ specific' end
   return line:gsub('std::',''):
     gsub('class ',''):gsub('struct ','')
 cpp.program {'errors'}

And now the error is reduced to:

 errors.cpp:9: error: 'list<string >' has no member named 'append'

And another case of rampant template trickery gone bad has been tamed, and our hypothetical beginner gets to Nirvana quicker.

This was, incidently, an accidental feature. I needed to parse the output of cl.exe to get the header dependency information (it is not written to a .d file like with gcc) so a postprocessing hook was needed.

What Next?

Naturally, this is not a new idea in the Lua universe. PrimeMover is similar in concept, and also has a bootstrap stage to construct a completely self-contained interpreter, which is definitely a strategy worth emulating.

I haven't dealt with topics like dependency-based programming because this is not intended as a manual (which is to be found here) This article is more about showing the advantages of a higher-level, needs-based build system based on a real programming language, which is compact enough that a fully self-contained Lake executable would be less than 300K.

A number of kind people have pointed out that 2,500 lines of code is a bit much for a single script, which is true, and of course I know better. Unfortunately I have too many projects and they keep me awake at night, demanding to be fed; the next evolution of Lake will have to take its turn.

A single file does make installing Lake easier; it just needs Lua and LuaFileSystem (known as the lua5.1 and liblua5.1-filesystem0 packages in the Debian/Ubuntu world) and for the script to be made executable and put on the path. If you have installed LuaRocks (also available on Debian/Ubuntu) then installing Lake is as simple as sudo luarocks install lake.

The priority is a sound system that is flexible enough to meet working programmer's needs, to get the right balance between declarative/dependency-driven and imperative. It is already possible to provide new needs for Lake by defining Lua modules that look like 'lake.needs.NAME', which can then be easily installed by LuaRocks or some more ad-hoc delivery system.

Using all the processing cores that developers have available is also a priority, which requires some interesting work in a language that does not do the necessary concurrency out of the box. The best cross-platform candidate would be Lua Lanes which provides a non-shared concurrency model with explicit data messaging using 'Lindas'.

Wednesday, 21 December 2011

Documenting Lua

Why is Documentation so Lacking?

As Barbie the Software Engineer might say: Documentation is Hard. It's not intrinsically fun like coding, and often gets written badly as an afterthought. In open source projects, where fun is so often the guiding principle, it can get badly neglected.

There are basically two kinds of comment; one that is aimed at other developers of the code and the other which is aimed at the users of a library. The latter is particularly hard and important, since we are building abstractions for other people to use, and if they have to actually read the code to work out how to use the library, that abstraction has broken. This is a common problem with language bindings to existing libraries; they will be announced as a 'thin wrapper around libfoo' and it's assumed that you can work out how to use the bindings from the libfoo documentation or from libfoo examples in C. Well, not everyone can (or needs) to read C well enough to make that practical. We know the classic rebuke, "RTFM!" but that assumes that there is a manual - it's often "RTFS!" these days.

Testing has become a minor religion in the last ten years, and mostly it's a good thing, especially when using dynamic languages. Tests are supposed to be the best kind of documentation, because they can automatically be shown to be correct. But if the scope of the tests is too narrow, then they can be harder to read than the source. That is, also write examples that can serve as tests.

We cannot ascribe poor documentation to laziness, since these projects are otherwise well-constructed and may have exhaustive tests. The developers clearly don't think documentation correlates with wide use by the community; perhaps communication is not a priority. After all, research mathematicians write papers and do not immediately think how to explain it to undergraduates; they are writing for peers.

This might seem an indirect criticism, but it isn't intended as such. (I resisted the urge to point to some representative Github projects for this reason.) The Open Source universe needs all kinds, from researchers to popularizers. In the 'social coding' universe, fork has become a verb of respect, so documenting an existing project and doing a pull request is a polite way to contribute to the usefulness of a project.

You might think that elaborate conventions and markup could get in the way of documentation. The Go standards simply require you to preface the package and any exported items with a plain comment; just about the only convention is that indented lines are shown preformated. And still those plain comments are often not written. Part of the problem is a variant on the "increment i by one" kind of comment, which is just an awkward paraphrase of the code. So a Stack.Push method might end up with the blindingly obvious "Push an element onto the stack". Anybody with any self-insight is going to find that silly, and so they don't do it. I have noticed that coding libraries and documenting their interfaces are modal activities; it's useful to approach documentation with a fresh head (much as how writing and proof-reading are separate but essential steps in preparing an article.)

The key thing is to care enough to return to the code with "beginner's mind", and finish the job.

The xDoc Tradition of API Documentation

APIs are rarely as self-documenting as their creators might think. Javadoc is probably the grand-daddy of a per-function style of API documentation, and it has inspired numerous xDoc tools for other languages. Here is an example of LuaDoc markup:

 --- foostr stringifies a list of Foo objects.
 -- @param foolist a list of <i>Foo</i> objects
 -- @param delimiter string, as in table.concat
 -- @return a string
 -- @see Foo, table.concat
 function foostr(foolist,delimiter)

Limited code inference occurs, so that we do not have to specify the function name, but otherwise everything is explicit. Dynamic languages pose a particular challenge here because we're forced to specify the parameter types in an informal way. Projects may define their own conventions, but there's no standard way to write 'list of Foo'. You may use see-references to other documented entities but they have to be in a @see section.

People have recognized for a while the limitations of this way of documenting APIs, which focuses on the individual pieces and can be silent on how to put them together, but it's much better than nothing. So the debate is often about making these comments easier to write. For instance, in a comparison between Doxygen and JavaDoc, Gushie says:

The most important problem with the Javadoc comment in the comparison is how much I need to concentrate on formatting issues while writing it. When writing Javadoc I am constantly thinking about what should and shouldn't be linked, whether the list will look right, etc. This is frustrating because, while I do want to document my code well I also want to focus on coding.

And that's it: a tool that forces you out of code flow is not going to make you happy about documentation.

One of the problems I have with LuaDoc is that HTML is awkward to type and ugly to look at, especially when you want to provide something as simple as a list. So one of the main requirements was that Markdown could be used instead. I try to be a good boy and not reinvent too many wheels, but LuaDoc proved fairly resistant to this change, since it helpfully stripped out the blank lines and line feeds from the comments at an early stage. But the tipping point was when I moved the Penlight codebase away from using the module function; LuaDoc loves the module function and it's often necessary to put it in a comment so that it will recognize something as a module. So ultimately the issue was that LuaDoc wanted code to be written in a particular way; futhermore, we all know that 'Lua does not have classes' but it's straightforward to work in an OOP style; again, there was no provision for such an choice. The guiding principle should be: a documentation tool should not limit a developer's choices.

So, wheel-reinvention happened, leading to LDoc. In a few days, I had something that did the job equally well, for myself. Seasoned campaigners will know of course that this is the easy bit; the hard bit is getting a tool that will work on other people's codebases. Thanks to some painstaking testing by Lorenzo Donatti and others, LDoc is now at the point where it's ready for prime time testing.

The lack of inline entity references in LuaDoc means that your references have to be footnotes to your comments. This style felt better:

 --- foostr stringifies a list of Foo objects.
 -- @param foolist a list of @{Foo} objects
 -- @param delimiter string, as in @{table.concat}
 -- @return a string
 function foostr(foolist,delimiter)

Using Markdown helps formatting lists:

 -- creates a new @{Boo} object.
 -- @param spec a table of:
 --  - `age` initial age of object (optional, defaults to 0)
 --  - `name` descriptive name used in caption
 --  - `foolist` list of @{Foo}
 -- @return @{Boo}
 function Boo:create(spec)

(The blank lines are somewhat irritating, but I chose eventually to work with standard Markdown here; it is certainly less awful than the equivalent HTML.)

This kind of 'ad-hoc' table structure is a common pattern in Lua, and I've argued before that it's a good idea to have named types so that documentation can refer to them. An in-between option is to create a dummy type which just exists for documentation purposes.

Plain and Simple

Not everyone likes this 'tag soup' style of documention. For instance, this is the style followed in the Lua manual:

 -- Receives zero or more integers. Returns a string with length equal to
 -- the number of arguments, in which each character has the internal numerical
 -- code equal to its corresponding argument.
 -- Note that numerical codes are not necessarily portable across platforms.
 function string.char(...) end

This can be very effective with documenters who have a good clear prose style.

LDoc can support this by making the usual 'Summary' and 'Parameters/Returns' sections in the HTML template optional.

It was always awkward to configure LuaDoc, so I felt that LDoc needed a configuration file. As is often the case, Lua itself is an excellent format for such files.

 -- config.ld for Lua libraries
 project = 'Lua'
 description = 'Lua Standard Libraries'
 full_description = [[
 These are the built-in libraries of Lua 5.1
 Plus documentation for lpeg and luafilesystem.
 file = {'builtin',exclude = {'builtin/globals.lua'}}
 format = 'discount' -- use lua-discount for formatting
 -- switch off the top summary and 'Parameters/Returns' in the template
 no_summary = true
 no_return_or_parms = true

(You can in fact even define your own template and/or stylesheet and reference them in the config.ld file.)

And here is the result, thanks to the API files from mitchell's TextAdept editor project.

The principle followed here is: allow the documenter as much control as possible, but make it straightforward for end-users to build the documentation. So a simple invocation of 'ldoc .' in the right directory will find the configuration file and do the rest.

Type Annotations

On the other end of the scale, it's useful to have type annotations for Lua functions. This is a requirement for full IDE support in the Eclipse Koneki project. Fabien Fleutot provided support for tag modifiers, so that one could say things like @param[type=string] name. I have provided convenient alias tags so one can say:

 -- list available servers providing a service
 -- @tparam string service name
 -- @treturn {Server,...} list of servers
 function list_available(service)

Beyond the built-in Lua types ('string','number',etc) there is no agreed-upon way to specify types, so some invention is necessary. The convention is that {T,...} is a list-like table containing T objects; (see the documentation for further discussion.) LDoc will attempt to do lookup on all identifiers within such a type specification, so that Server becomes a link to that type's definition.

Examples, Readmes and Command-Line Help

Per-function API documentation is often not enough. In particular, good examples can clarify the intended use, and narrative documentation can introduce the concepts and fill in the background. Complementary 'modalities' of explanation and extra redundancy allow end-users of different backgrounds and thinking styles to have various ways of understanding the topic.

So LDoc allows integrating the API documentation with examples and Markdown text as linked documentation. The best example I currently have is the winapi documentation. This is actually a C Lua extension, which is another LDoc feature. Here is the config.ld:

 file = "winapi.l.c"
 output = "api"
 title = "Winapi documentation"
 project = "winapi"
 readme = ""
 examples = {'examples', exclude = {'examples/slow.lua'}}
 description = [[
 A minimal but useful binding to the Windows API.
 format = 'markdown'

Those of us who live in the command-line like to look up documentation quickly without messing around with a browser. Generating man-pages is an interesting and very do-able goal for LDoc, but man is showing its age and is pretty Unix-centric. Some great discussions with Norman Clarke inspired an LDoc feature where it looks up libraries on the module path and parses doc information.

So if you have an installed library with documentation, then it's easy to look up a particular function:

 $ ldoc -m pl.pretty.dump
 function        dump(t, ...)
 Dump a Lua table out to a file or stdout.
 t        {table} The table to write to a file or stdout.
 ...      {string} (optional) File name to write to. Defaults to writing
                 to stdout.

Thanks to mitchell's work at luadoc-ifying the Lua manual, this works for the standard Lua libraries (plus lfs and lpeg) as well:

 $ ldoc -m lfs.currentdir
 function        currentdir()
 Returns a string with the current working directory or nil plus an error

(ldoc -m lfs would list all the available functions.)

Further Work

It's possible to process the inner representation generated by LDoc using a custom Lua module, allowing multi-purpose pipelines. We're still working on use cases for this, but the basic machinery is there.

Other output formats could be supported; the program is meant to be extendable and I think generating LaTeX could be an excellent way of getting good-quality output.

Lua is surprisingly flexible at expressing concepts like modules and classes, so simple lexical code inference is probably not going to scale. Eventually we will have to use full static code analysis like David Manura's LuaInspect.

Thursday, 15 December 2011

MoonScript: A Lua-Based Language

Let a Thousand Flowers Bloom

There are literally hundreds of programming languages, most of which are historical curiosities, research tools or personal projects. You would think that most of the important innovations have happened, and existing mature tools available for all tastes. That is true, to a point: but we are still exploring the space of expressive notation for solving our programming problems. Here the practical programmer and the academic researcher diverge, of course; the latter sees most new languages as rehashes of things first done in the golden age of PL development.

The last few years has been a surprisingly fertile time for new programming languages. They range from Go (native code + new runtime), Scala (bytecode + JVM) to languages like CoffeeScript or Dart which translate to JavaScript. The trend is mostly towards static typing but with type inference helping to keep the syntax clear and uncluttered, with functional influences. There is some TIOBE and anecdotal evidence that the dynamic language usage has peaked, possibly because of the maintenance difficulties of large codebases (an explicit motivation for Dart's development, which has optional typing.)

MoonScript is inspired by CoffeeScript, but generates Lua code instead. JavaScript and Lua share many similarities like first-class functions, the m["name"] == equivalence, and no conventional OOP constructs. People like the convenience and familiarity of class, however, so CoffeeScript, Dart and MoonScript provide explicit support.

Trying it Out

When reviewing a new language, it's useful to remember that programming languages take time and exposure to become mature. MoonScript is now on its second release, three months after the first, and getting out of its proof-of-concept phase. 'Leaf' Cocoran is doing all the right things; he has provided syntax files for some common editors, a binary for the notoriously compiler-shy Windows community, and has a good looking website with comprehensive documentation. And he is building on a very solid base, the battle-tested Lua 5.1 release.

Also, reviews are not tutorials; I won't try to cover all the syntax, but will try to give the flavour of the language.

MoonScript could generate Lua bytecode directly, but would hit the basic issue that the "Lua VM" does not have standardized bytecode. In particular, this would exclude MoonScript from using the excellent LuaJIT implementation, which is a known problem with the otherwise powerful Metalua dialect. And having a readable high-level language as the compilation output makes it easier to develop and trouble-shoot.

Although not compiler-shy, I am lazy and just downloaded the Windows binary (it's hard to resist a language delivered as a 212KB zip file. Besides, the 'non-POSIX' world doesn't get enough attention these days.) If you want to avoid the 'moon script.moon' incantation, then it's straightforward to tell Windows how to handle MoonScripts:

 ftype moonscript=d:\utils\bin\moon.exe %1 %*
 assoc .moon=moonscript

(These make global permanent changes, so you need to run as Administrator). Then add .moon to the system PATHEXT variable. The incantation is now just 'script'.

 -- script.moon
 print "Hi Ma", "No Parentheses!"

Wish-list Driven Design

In many ways, MoonScript is like a collection of features from a Lua user's wishlist; list comprehensions, shorthand for lambdas, switch statements, explicit classes, statements-as-expressions, switch statement, local-as-default. But Lua is the C of the dynamic languages; it has a small, very powerful core and 'conventional' syntax which has made it popular as an extension language. It moves slowly and carefully, and does not pander to 'Python envy'. MoonScript, on the other hand, will appeal to people with Python exposure, since indentation is used in the same way to delimit blocks.

I would not recommend MoonScript to a beginner in the same way I could for Lua itself, since it has a very concise syntax that appeals more to seasoned professionals. There is a lot more to remember, and always a number of ways to do similar things. This example shows the conciseness of expression:

 sqr = (x) -> x*x
 print sqr 4, sqr 5

The last statement in a function becomes its value; there's still return available but usually it's only needed explicitly when leaving a function before its end. print is still the same function it is in Lua, but parentheses are usually optional.

Well, cool: we've saved a lot of typing. That translates to something like this:

 local sqr
 sqr = function(x) return x*x end
 print (sqr(4),sqr(5))

Now saving on typing is an ambiguous virtue; if we wanted the ultimate compression, we would all be using APL and playing Code Golf. Classic Lua programmers prefer keywords to sigils, and fear that their language would end up like Perl. They say that reading code is a much more common operation than writing it. And that is fine and dandy (I mostly agree), which is why we need projects like MoonScript and Metalua to get a feeling for new notation.

Personally I like how MoonScript makes the functional aspects of Lua stand out more clearly. Higher-order functions are less cluttered with keywords:

 compose = (f,g) ->
   (...) -> f g ...
 printf = compose print, string.format
 printf "the answer is %d, '%s'", 42, "Everything"

List comprehensions are another favourite functional idiom:

 range = [i for i = 1,10,0.1]
 squares = [sqr(x) for x in *range]

Note the * syntax, which satisfies the common need to iterate directly over the elements of an array. (You can still say for _,x in ipairs range) These constructs result in inlined Lua code, which is a good decision given the very fast translation times for Lua and its modest resource needs.

Multiple for clauses are allowed:

 points = [{x,y} for x in *x_coords for y in *y_coords]

That generates about 17 lines of Lua output; hand-written would be about 10, which would still slow readers down and make them have to guess the programmer's intention.

Statements can be used as expressions, which is useful in functions:

 choose = (c,a,b) -> if c then a else b

In fact, there's a clever feature that for statements return tables:

 show = (t) ->
     for k,v in pairs t
         print k, v
 res = for i = 1,10 do 2*i
 show res

This is sufficiently clever to cause unintended consequences, so the rule is that this does not happen in function expressions and cause needless table generation:

 loop = -> for i = 1,10 do i
 print loop()

It's probably too clever to be truly wise, given that list comprehensions do the job more explicitly. However, it can be used to go beyond the list comprehension syntax:

 values = {1,4,3,5,2}
 v = for v in *values do switch v
     when 1 then 'bad'
     when 4 then 'cool'
     when 5 then 'excellent'
     else "uh-huh"
 -- v is now {'bad','cool','uh-huh','excellent','uh-huh'}

New with this release are table comprehensions, which are a generalization to map-like tables. These use curly braces to distinguish themselves, and the expression must be double-valued, with the first value becoming the new key and the second the associated value. So this expression turns an array into a set, that is t["hello"] is true if ls contains the value "hello".

 t = {k,true for k in *ls}

At this point, it's good to remember that we are still in Lua land, and there is no automatic pretty-printing of tables - which is wise because tables are the ultimate flexible data structure. (You can choose the string representation of your own types easily enough.) There is a small library included which provides a generic table dumper:

 require 'moon'
 moon.p {
     one: 'hello',
     [1] = 10
     [2] = 20
     [one] = "hello"

(For those used to Lua, note the use of : for key-value assignments in table constructors; the key may also be a keyword. For those who are not, note that tables can both be 'arrays' and 'hashes' - they are general associative arrays which are particularly efficient at regular array indexing. Like electrons, they are both waves and particles.)

However, despite the perception, there are batteries for Lua and MoonScript can consume them happily.

Using Existing Lua Libraries

MoonScript already includes LuaFileSystem and LPeg as standard libraries.

 total = 0
 for f in lfs.dir "." if string.match f "%.lua$"
   size = lfs.attributes(f).size
   print f, size
   total += size
 print 'total',total

(It also comes with lua-getopt for argument parsing, but that's an awkward, no-documentation kind of library.)

My favourite command-line parser is Lapp, which is part of the Penlight suite of libraries, since it parses the actual flag names and types from the usage string:

 -- mhead.moon
 require 'pl'
 args = lapp [[
     A simple head utility
     -n,--lines (default 10) number of lines to print
     <file> (default stdin) file to print; can be stdin
 for i = 1,args.n
     line = args.file\read!
     if not line then break
     print line

On Windows, the directory for Lua packages is relative to the executable. So if I install moon.exe to d:\utils\bin, then Lua packages must go in d:\utils\bin\lua. To install Penlight simply involves copying its pl directory to this lua directory. For other systems, MoonScript shares the usual global Lua module path.

To read configuration files (both Unix and INI-style) we can use pl.config

 require 'pl'
 test = [[
 # test.config
 # Read timeout in seconds
 # Write timeout in seconds
 #acceptable ports
 ports = 1002,1003,1004
 t = test
 pretty.dump t
 [[ --output--
     ports = {
     write_timeout = 5,
     read_timeout = 10


By the time people get to Lua (or Go) they have been sufficiently indoctrinated to instantly miss classes. Lua prefers to define more general metaprogramming techniques, which allows a competent user to roll any kind of custom OOP framework. (Unfortunately, too many people do this, so interoperability becomes a problem.) But ultimately all that matters is that the duck quacks. Lua does require you to indicate that the quacking applies to a particular duck, or to all ducks, much as a C++ programmer uses -> and ::.

In Lua, a method call is:


And in Moonscript:


(Idiomatic MoonScript would actually be duck\quack! but let's change one variable at a time.)

I'm not 100% crazy about the backslash here, but : has now got a more conventional job defining value/key pairs in MoonScript (and there was no other punctionation available ;))

So here is a MoonScript class, bit by bit:

 import insert,concat,remove from table

the import statement is one of those nice-to-haves which would be very useful in Lua, since the equivalent is the noisy and redundant:

 local insert,concat,remove = table.insert, table.concat, table.remove

A class is an intented block containing methods

 class List
     new: (t) =>
         @ls = t or {}

Note the colon, instead of =; this reminds us that we are putting these functions into a table. The fat arrow => means that the function has an implicit self argument; here it's equivalent to (self,t) ->.

@ls is just sugar for, as in Ruby. (The new method will be indirectly called as a constructor).

Our chosen representation for a list is a table field ls. This can be manipulated by the usual Lua table functions, which we previously imported:

 add: (item) =>
     insert @ls,item
 insert: (idx,item) =>
     insert @ls,idx,item
 remove: (idx) => remove @ls,idx
 len: => #@ls
 __tostring: => '['..(concat @ls,',')..']'

table.insert is basically two functions wrapped up as one - it both appends and inserts. #@ls is one of those collisions of punctuation that one just has to get used to in any language - it helps to read it as (and you can use self explicitly if you like.)

Any Lua metamethod can be specified in a class. A MoonScript class is the metatable for all its instance objects, with __index pointing back at the metatable so the object can find its methods. This is the most common way of implementing 'classes' in Lua.

Unfortunately, # cannot be overloaded in Lua 5.1 for tables; __len only applies to userdata defined by C extensions. (This is one of the welcome Lua 5.2 improvements)

 find: (item) =>
     for i = 1,#@ls
         if @ls[i] == item then return i
 remove_value: (item) =>
     idx = self\find item
     self\remove idx if idx
 remove_values: (items) =>
     for item in *items do self\remove_value item

if statements come in two flavours: multiline, where the following block is indented, and single statements that use then. The same applies to for, where the single statement version has an explicit do.

Any statement/expression can also have an if clause, called a 'line decorator'.

Now list comprehensions get their turn:

 index_by: (indexes) =>
     List [@ls[idx] for idx in *indexes]
 copy: => List [v for v in *@ls]

And finally we define extension, iteration and concatenation:

 extend: (list) =>
     other = if list.__class == List then else list
     for v in *other do self\add v
 __concat: (l1,l2) -> l1\copy!\extend l2
 iter: =>
     i,t,n = 0,@ls,#@ls
         i += 1
         if i <= n then t[i]

Any instance of a MoonScript class will have a __class field; I make this check because it's useful to allow a list to be extended by a plain vanilla table.

The iter method needs some explanation: in Lua, the for in statement expects a function that it can repeatedly call to get each successive value of the sequence, until that function returns nil. This method returns a closure with upvalues i, t and n.

And, exercising List:

 ls = List!
 ls\add 10
 ls\add 20
 ls\remove_value 10
 ls\remove_value 15
 ls\add 30
 ls\remove 1
 for v in *{1,2,3} do ls\add v
 print ls, ls\len!
 print ls\index_by {2,3}
 for v in ls\iter!
     print v
 print (List{'a','b'})..(List{'A','B'})

No type is complete without a definition of equality. == in MoonScript, as in Lua, compares simple values, not referencews. Objects are equal if they are the same object; (string equality works as expected because all Lua strings are guaranteed to be internally unique.) Generally, this is a good idea (none of JavaScript's 'helpfullness' here!). The relavant metamethod is __eq:

 __eq: (l1,l2) ->
     return false if l1\len! != l2\len!
     l1, l2 =,
     for i = 1,#l1
         return false if l1[i] != l2[i]
     return true

As another example of a Lua wish-item, note that ls\add without arguments is a valid expression, called a 'function stub',

 ls = List {1,2}
 add = ls\add
 add 3
 add 4
 assert ls == List {1,2,3,4}

ls\add 3 is short for ls.add ls 3; ls\add is a partial function application that binds the self argument.

See here for the complete code listing.

use as a DSL

One of the things I look for in new languages is how amenable their syntax is to the creation of embedded domain-specific languages.

The MoonScript author has a static website generator called sitegen where you specify your configuration with a site.moon file:

 require "sitegen"
 site = sitegen.create_site =>
   deploy_to "", "www/sitegen"
 @title = "Sitegen"
 @url = ""
 add "index.html"
 add "doc/"

We've met the fat arrow (=>) before in methods, but it can be used anywhere. Here => is shorthand for (self) ->. Now the @title shortcut makes sense, since it's just self.title.

The Lua equivalent is certainly more verbose, although I'm not sure we're losing much on readability:

 require "sitegen"
 local site = sitegen.create_site(function(self)
 self.title = "sitegen"
 self.url = ""
 add "index.html"
 add "doc/"

sitegen is the best existing example of real working MoonScript. As the author says modestly, "apart from the compiler". (That's right, MoonScript is now self-hosting.)

Problems and Irritations

You'll notice the necessary parentheses when concatenating two List expressions. Although we are encouraged to leave them out unless they are necessary, the rules are not as straightforward as they could be. For instance, this equivalence contradicts the above rule; note how whitespace matters:

 L"frodo" + L"bilbo" == L("frodo") == L("bilbo")
 L "frodo" + L "bilbo" = L("frodo" + L("bilbo"))

This comes from a Lua convention that parens are not needed if the function has a single string argument, but results in a special case that is likely to bite more than one person. It's easy enough to learn to read indentation as syntax, but making a single space syntactically significant is going too far. I would see this as destructive interference from the base language.

There are other inconsistencies, like you cannot say @\f to call another method when inside a method body. (This would lead to serious punctuation clusters like @\f! so perhaps this is a feature ;))

Generally, the compiler is not so very helpful and gives non-specific parse errors, and the line it does complain about is often not the problem line anyway. So I found myself having to read the manual carefully to work out issues. It reminded me of my Seventies experience with mainframe Fortran, where the result of painstaking work at the card puncher would lead to the equivalent of 'Que?'. And run-time errors would usually end up as a six-digit error code which you would then have to look up in the IBM manuals, which were bolted down like Medieval manuscripts. We've come a long way, and it shows how important good diagnostics are to learning a system: the compiler as tutor.

But hey, it's a young language, these are all fixable faults. It's a must-play for anybody interested in the future evolution of Lua, and could find an audience among CoffeeScript enthusiasts who might appreciate a lighter, faster runtime, especially with more generally useful batteries.