Thursday, 11 August 2011

Pretty (and Lightweight) Code Highlighting with Lua

When I write articles, I like to write text in a plain editor, with no formatting options available. So cute on-line editors are not my cup of tea; they do not scale to more than a few paragraphs.

Blogger does allow you to bypass the cute but irritating editor and work directly with HTML, which can simply be pasted in. However, blank lines are treated as paragraph breaks, which is inconvenient if you are generating the HTML with some preprocessor.

Markdown is a good way to write text with hyperlinks, emphasis and maybe the occasional list; it has been designed to be a good publishing format and deliberately does not give you too many options. Any indented block is rendered as-is as <pre><code>. However, I am also a programmer and like source examples to look reasonably pretty. So feeding this blog presented a technical challenge.

One way to get syntax highlighting is using client-side JavaScript like Alex Gorbatchev's SyntaxHighlighter. You can include this in Blogger, and the result is indeed good looking. The approach works by marking up code samples so that the highlighter can scan the code and re-arrange the DOM on the fly.

The first downside is that this involves a fair amount of extra typing for each snippet:

 <script type='syntaxhighlighter' class='brush: c#; wrap-lines: false'>
 foreach (var str in new List<string> { "Hello", "World" })
    Console.Write(" ");
 </script>

The second is that pulling in all this extra JavaScript can lead to an appreciable lag; the Lua snippets site is one example. So a better solution is to generate the pretty HTML up front; a few hundred extra bytes of markup will render practically instantly.

So this felt like a job for Captain Scripting.

The difference between a script and a program has been endlessly debated; the classic position is often called Ousterhout's Dichotomy where the world is divided into system and scripting languages. In this view, the ideal script is just 'glue' that flexibly connects parts written in some other language, which is an extension of the classic Unix Way.

It's ultimately a matter of scale, not language: a script is a small program that scratches a single itch, often for a single person. By this measure you can do scripting in any language (although it can get tedious in C because it's too low-level). Dynamic languages do have a big advantage for shorter programs, which is one reason why any serious programmer should have one of them in their toolkit.

Lua is my favourite little language, and a fair amount of the open-source work I've done over the last few years has been a response to the often-quoted remark "Lua does not come with batteries". In many ways, Lua is the C of dynamic languages: compact and efficient. This is more than an analogy, since the authors of Lua base the core functionality of the language around the abstract platform provided by ISO C. They see it as the job of the community to provide libraries, and generally you can find them for most needs. If you prefer the batteries-included approach there is Lua for Windows; LuaRocks is a packaging tool, like Ruby's Gems; either can easily provide the prerequisites for this script: the markdown and penlight packages.

Niklas Frykholm's markdown.lua can be used as a command-line Markdown-to-HTML converter, but is most useful as a library. This is the basic engine needed by our script.

 require 'pl'
 require 'markdown'
 local text = utils.readfile('')
 text = markdown(text)

Here I'm leaning on Penlight to do some of the grunt work. You will notice one of the hallmarks of the scripting spirit: no error checking. That's fine for our purposes at first; error handling can always be added later.
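
For what it's worth, here is a minimal sketch of what "added later" could look like for the file-reading step. It uses plain Lua io functions rather than Penlight, and read_all and the file name are made-up names for illustration, not part of the script.

```lua
-- Read a whole file; on failure return nil plus a message instead of raising.
-- read_all is a hypothetical helper, not part of Penlight or markdown.lua.
local function read_all(path)
    local f, err = io.open(path, 'r')
    if not f then return nil, err end
    local text = f:read('*a')   -- '*a' reads the entire file
    f:close()
    return text
end

-- 'no-such-article.md' is a placeholder name chosen to trigger the error path
local text, err = read_all('no-such-article.md')
if not text then
    print('cannot read article: ' .. err)
end
```

Returning nil and a message follows the usual Lua convention, so the caller decides whether a missing file is fatal.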

The entertaining part is syntax highlighting. Penlight provides a lexical scanner which can read source and break it up into classified tokens like 'string', 'keyword', etc.

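 local spans = {keyword=true,number=true,string=true,comment=true}
 function prettify (code)
    local res = List()
    res:append '<pre>\n'
    -- empty filter and options tables: nothing is skipped or converted
    local tok = lexer.lua(code,{},{})
    local t,val = tok()
    if not t then return nil,"empty file" end
    while t do
       val = escape(val)
       if spans[t] then
          -- highlighted token types get wrapped in a span
          res:append(span(t,val))
       else
          res:append(val)
       end
       t,val = tok()
    end
    res:append '</pre>\n'
    return res:join ()
 end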

The scanner tok is a function which returns two things, the token type and its string value. We find the types to be highlighted by looking them up in the table spans. (This is an example of the 'pythonic' style that Penlight enables by providing a List class.)
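
To see the shape of that loop without installing anything, here is a small sketch where a hand-written token stream stands in for the Penlight scanner. fake_scanner is a made-up helper, and the 'space' and 'iden' type names are my assumption about what a scanner would report for whitespace and identifiers; square brackets stand in for the real HTML spans.

```lua
-- A stand-in for the scanner: each call returns the next (type, value) pair,
-- and nothing once the tokens run out. fake_scanner is hypothetical.
local function fake_scanner(tokens)
    local i = 0
    return function()
        i = i + 1
        local tk = tokens[i]
        if tk then return tk[1], tk[2] end
    end
end

local spans = {keyword=true, number=true, string=true, comment=true}

-- tokens for the source text: local x = 42
local tok = fake_scanner {
    {'keyword', 'local'}, {'space', ' '}, {'iden', 'x'}, {'space', ' '},
    {'=', '='}, {'space', ' '}, {'number', '42'},
}

local out = {}
local t, val = tok()
while t do
    -- only types listed in spans get marked; everything else passes through
    if spans[t] then
        out[#out + 1] = '[' .. t .. ':' .. val .. ']'
    else
        out[#out + 1] = val
    end
    t, val = tok()
end
print(table.concat(out))  --> [keyword:local] x = [number:42]
```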

This generates the HTML spans:

 local function span(t,val)
    return ('<span class="%s">%s</span>'):format(t,val)
 end

That is, we just assume that the CSS contains classes that give the token names a particular colour, etc.

 .keyword {font-weight: bold; color: #6666AA; }
 .number  { color: #AA6666; }
 .string  { color: #8888AA; }
 .comment { color: #666600; }

Finally we have to escape things like < as &lt;:

 local escaped_chars = {
    ['&'] = '&amp;',
    ['<'] = '&lt;',
    ['>'] = '&gt;',
 }
 local function escape(str)
    -- parentheses discard gsub's second result, the substitution count
    return (str:gsub('[&<>]',escaped_chars))
 end

Lua's string.gsub is a marvelous function, here mapping any special characters to their HTML representations using a lookup table.
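
A quick standalone check of the same idea (the function is repeated here only so the snippet runs on its own). Because everything happens in a single pass, an ampersand that is already part of an entity in the source gets escaped once and the ampersands produced by the replacement are never looked at again.

```lua
local escaped_chars = {
    ['&'] = '&amp;',
    ['<'] = '&lt;',
    ['>'] = '&gt;',
}

-- The extra parentheses drop gsub's second result (the substitution count).
local function escape(str)
    return (str:gsub('[&<>]', escaped_chars))
end

print(escape('if a < b and b > c then -- &lt; stays literal'))
--> if a &lt; b and b &gt; c then -- &amp;lt; stays literal
```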

All that remains to be done is to identify the indented blocks and convert them using prettify. That's straightforward but tedious.
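
For the curious, here is one way that step could be sketched. This is not the code from the final script: map_code_blocks is a made-up helper, and it takes a simplified view of Markdown (any run of lines starting with four spaces or a tab is a code block, with no special handling of list items).

```lua
-- Replace each run of indented lines in text with fn(code), where code is the
-- run with one level of indentation removed. Hypothetical, simplified helper.
local function map_code_blocks(text, fn)
    local out, block, blanks = {}, nil, 0

    -- emit the pending block (if any), then the blank lines that followed it
    local function flush()
        if block then
            out[#out + 1] = fn(table.concat(block, '\n'))
            block = nil
        end
        for _ = 1, blanks do out[#out + 1] = '' end
        blanks = 0
    end

    if text:sub(-1) ~= '\n' then text = text .. '\n' end
    for line in text:gmatch('(.-)\n') do
        local code = line:match('^    (.*)$') or line:match('^\t(.*)$')
        if line:find('^%s*$') then
            -- blank: may sit inside a block, so hold it until we know
            if block then blanks = blanks + 1 else out[#out + 1] = '' end
        elseif code then
            block = block or {}
            for _ = 1, blanks do block[#block + 1] = '' end
            blanks = 0
            block[#block + 1] = code
        else
            flush()
            out[#out + 1] = line
        end
    end
    flush()
    return table.concat(out, '\n')
end

local md = 'Some text.\n\n    local x = 1\n\n    print(x)\n\nMore text.\n'
local html = map_code_blocks(md, function(code)
    return '<pre>' .. code .. '</pre>'
end)
print(html)
```

In the real thing the callback would be prettify, and the result would then go through the Markdown converter.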

The final bit of massaging happens to the generated HTML after Markdown processing has taken place: normally blank lines mean nothing in HTML, but Blogger is making up its own rules here. So we scrub out extra lines after paragraphs and code blocks:

 function remove_spurious_lines (txt)
    -- parentheses keep only the string, dropping gsub's substitution count
    return (txt:gsub('</p>%s*','</p>\n'):gsub('</pre>%s*','</pre>\n'))
 end
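
A small standalone run shows the effect on a made-up fragment (the function is repeated so the snippet is self-contained):

```lua
local function remove_spurious_lines(txt)
    -- collapse any whitespace after a closing tag into a single newline
    return (txt:gsub('</p>%s*', '</p>\n'):gsub('</pre>%s*', '</pre>\n'))
end

-- sample input invented for this example
local html = '<p>Intro.</p>\n\n<pre>\nlocal x = 1\n</pre>\n\n\n<p>Outro.</p>\n\n'
local cleaned = remove_spurious_lines(html)
print(cleaned)
```

Only whitespace that directly follows a closing tag is touched, so blank lines inside a pre block survive.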

And that's basically it; prettify.lua is passed the article name (without the .md extension) and the language to use, and writes out a file name.html in a form that can be directly pasted into Blogger. If there is a third argument, then an HTML document with inlined style is generated, which is useful for previewing.
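
The preview mode amounts to wrapping the fragment in a page with the stylesheet inlined. A rough sketch of that idea follows; preview_page and its arguments are invented for illustration and are not taken from the final script.

```lua
-- Hypothetical helper: wrap an HTML fragment in a standalone page with the
-- CSS inlined, so it can be opened directly in a browser.
local function preview_page(title, css, body)
    return table.concat({
        '<html>', '<head>',
        '<title>' .. title .. '</title>',
        '<style type="text/css">', css, '</style>',
        '</head>', '<body>', body, '</body>', '</html>', '',
    }, '\n')
end

-- sample values, made up for this example
local css = '.keyword { font-weight: bold; color: #6666AA; }'
local fragment = '<pre>\n<span class="keyword">local</span> x\n</pre>\n'
local page = preview_page('example', css, fragment)
print(page)
```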

To see the colours, you have to modify your blog's CSS, but this is straightforward; just skip down until you see the CSS and paste in the above CSS snippet. (I found a pre { font-weight: bold } helped readability; the rest is a matter of taste.) This technique will of course work with other blog engines that allow direct HTML entry.

The moral of the story: batteries are important. Having the right tools around makes it easier to do the job without too much copy-and-pasting from the World Wide Scratchpad.

The final script is available here.
