Thursday, 2 October 2025

DS Hell: Nu as a powerful data DSL

HTML

HTML is the Domain Specific Language (DSL) that really has taken over the world. It is meant to be used as semantic markup, e.g <em> is for emphasis which can be then mapped to bold, or whatever. Indeed, without a good stylesheet it looks like utter crap. This form has been nicely styled with PicoCSS:

<form>
  <fieldset class="grid">
    <input 
      name="login"
      placeholder="Login"
      aria-label="Login"
      autocomplete="username"
    />
    <input
      type="password"
      name="password"
      placeholder="Password"
      aria-label="Password"
      autocomplete="current-password"
    />
    <input
        type="submit"
      value="Log in"
    />
  </fieldset>
</form>

HTML itself (and its data cousin XML) is a simplified form of SGML - which I present here as a example of a standards body having way too much fun.

HTML is a markup language, for presenting documents with structure. Since actually writing lots of HTML is tedious, there is a need to generate it from data.

One of the things buried in browsers is XSLT which was designed for the case of taking XML (such as an RSS feed) and converting it into HTML. It is not very pretty; here is a little taste:

…
<xsl:template match="myNS:Author">
    --   <xsl:value-of select="." />

  <xsl:if test="@company">
    ::   <b>  <xsl:value-of select="@company" />  </b>
  </xsl:if>

  <br />
</xsl:template>

XSLT involves an XML document converting one kind of XML document into another kind. But conceptual simplicity does not necessarily mean easier and more convenient

More common these days to see templates, like this Go template example:

<h1>{{.PageTitle}}</h1>
<ul>
    {{range .Todos}}
        {{if .Done}}
            <li class="done">{{.Title}}</li>
        {{else}}
            <li>{{.Title}}</li>
        {{end}}
    {{end}}
</ul>

This is certainly easier to read and write, but now we have two languages intermingling with each other - PHP is a good example of this style.

Let's see what a Nushell representation would look like. XML documents are records with tag (a string), attributes (a record) and content (a list of child elements). Defining a command tag to conveniently construct these elements, then the original form example looks like this:

form
    (fieldset -c grid
       (input -a {
            name: 'login'
            placeholder: 'Login'
            autocomplete: 'username'
            aria-label: 'Login'
        })
        (input -a {
            type: 'password'
            name: 'password'
            placeholder: 'Password'
            aria-label: 'Password'
            autocomplete: 'current-password'
        })  
        (input -a {
            type: 'submit'
            value: 'Log in'
        })
    )
)

Pop that through to xml -i 2 -s (some indenting, and closing empty tags) and ... HTML.

Not bad at all, but a little more fussy than the original. The power of this representation is that it is code; can now have fun defining more specialized and better self-documenting input constructors. Or instead of 'Login' have (gettext 'Login') and get localization.

Here's a more dynamic example, which wraps each string in a <li> tag:

tag ul ([
    'Here we go'
    'A whole list of us'
] | each {|v| tag li $v})

We define a helper list that does this. Together with a helper link for <a> elements we get this useful idiom:

(list ul li (
        [
            ['Home' '/index.html']
            ['Help' '/manual.html']
        ] | each { link }
    ))

The complete little library is here

YAML

YAML (Stood for 'Yet Another Markup Language' initially, until its creators got more serious) is a popular way to represent arbitrarily nested data. The attraction is that there is little punctuation needed (as in what makes a shell work ) - JSON involves a lot more quoting and fussy positioning of commas.

a-map:
  # with some key value pairs
  one: 1
  two: 2
  three: drei
  four: # and here's an array
   - 4
   - 40
   - 400
title: a apparently straightforward notation

Unlike JSON, comments are allowed and whether something is a string or not is worked out for you. However, those rules are not obvious and the takeaway from years of experience is "just use quotes, man".

(Also from experience: normal users will completely fuck up things like indentation unless given an opinionated editing environment)

How did something so straightforward in intention get so unwieldly? First proposed in 2001, and first version in 2004: it had three parents - (contrast with JSON, which was a strictly-inforced subset of Javascript data literals proposed by Douglas Crockford, also in 2001.) They worked hard on a specification, and (in the ultimate analysis) simply had too much fun, like with the SGML process. It would have been better to have someone knock out a prototype over the weekend because they needed a more expressive data format.

YAML is declarative, like HTML. So templating is common, see this example from the Salt Stack

# Declare Jinja list
{% set users = ['fred', 'bob', 'frank']%}

# Jinja `for` loop
{% for user in users%}
create_{{ user }}:
  user.present:
    - name: {{ user }}
{% endfor %}

I had to deal with this shit at one point, and I'm still triggered by Captain Yaml when he's teamed up with Kid Jinja.

Ewww, as the mean girls say.

I have done some experiments with YAML expansion, inspired by these bad experiences.

This is a semi-useful example. Here is the data:

# animals.yml
data:
  dog:
    version: 1.2
    ports: [2555]
  cat:
    version: 0.8
    ports: [1023]
    volumes: [/:/hostfs.ro]

And the template - the MAP special form takes applies all the values in the original data map and constructs a new object for each one. The LIST form does the same for each value in an array.

# services.yml
services:
   MAP-k,v-in-data:
    image: 'ourtech/((k)):((v.version))'
    ports:
      LIST-p-IN-v.ports: '((p)):((p))'
    IF-v.volumes?:
      volumes: ((v.volumes))

And the result is in a known format:

# docker-compose.yml
services:
  cat:
    image: ourtech/cat:0.8
    ports:
    - 1023:1023
    volumes:
    - /:/hostfs.ro
  dog:
    image: ourtech/dog:1.2
    ports:
    - 2555:2555

This experiment is in Go and I can clean it up and make it available if there's interest - I am not entirely convinced of the exact notation yet.

It's no surpise that Nu is a pleasant alternative to YAML, since it is also low-punctuation. This is an Ansible file, rendered in the Nu equivalent of JSON: Nuon (Nu Object Notation):

[{
  name: 'Write hostname'
  hosts: all
  tasks: [
      {
        name: 'write hostname using jinja2'
        ansible.builtin.template: {
          src: templates/test.j2
          dest: /tmp/hostname
        }
      }
    ]
 }]

This is a tasteful way of integrating Jinja templating, since the templating is in separate files and one doesn't get that mad feeling from seeing two completely different notations in the same file:

# templates/test.j2
My name is {{ ansible_facts['hostname'] }}

With Nu, we can lean into the idea that configuration is executable. This is the same docker-compose example as previous:

use expand.nu *

{
  services: (MAP $data.data {|v| NON-NULL { 
      image: $"ourtech/($v._KEY):($v.version)"
      ports: (LIST $v.ports "{_}:{_}")
      volumes: $v.volumes?
    } 
  })
} | to yaml

MAP operates on the values of a record (tho inserts the key as _KEY); LIST operates on a list, much like each except with a special case for scalar values.

NON-NULL lets us define a record literal that does not insert null values. So the volumes entry will not be filled if $v does not have a volumes key.

These helper commands are to be found here

'Executable Configuration' Are you Insane?

Nu is very shell-focused, so (a) can trivially shell out to external commands and (b) there are platform-independent ways to create and remove files. It isn't designed for sandboxing like Lua.

The safest bet would be running in a container with limited access to the host filesystem, CPU quotas, the works. What the Bomb Squad would call a controlled explosion.

A less defeatest option would be to use the view ir command to dump and traverse the IR looking for call opcodes and apply some sensible allow-list rules (or deny-list? I'm not sue yet. The command ir-scan in the above module returns the list of commands referenced.)

But this is not intended as a solutions article; document/data DSLs often need to be generated, and templating involves two entirely different layers of language co-existing with each other. This is particularly hard to read as a human if dealing with YAML. A more expressive language offers a way out - at the cost of allowing arbitrary computation.

Thursday, 28 August 2025

Two Kinds of Shells

Two Different Command-line Shells

The Unix shell

The first Unix shell was the Thompson Shell from 1973 which already looks very familiar (here's an example) The if command would jump to labels, like in Assembly language, so certainly not a great programming language at this point. But an excellent shell:

Indirection - sending the output of a command into a file:

$ prog > myfile.txt

Piping - sending output through filters:

$ prog | sort
$ cat big.txt | head -n 10

If a command has a return code of 0, then it is successful; commands can write to standard output, and standard error; by default redirection and piping works with standard output.

Note big.txt may be indeed ginormous in the last example, but only enough will be read to show the first 10 lines

Originally Unix was developed with electromechnical teletypes, which are noisy and slow. encouraging short names (like ls and cd etc). C was first written on teletypes with a line editor - vi only appears in the late 70s. So the terminal was first very physical, then running on a monochrome monitor over a serial link, and finally the multicoloured glory of modern terminal 'emulators' (Nice overview) So the Unix shell uses the minimum number of keystrokes for its functionality; consider the brilliant notation prog & for putting a process in the background.

The Bourne shell first appears in 1979. By this time people needed scripting, and the new shell was much better for this. Bear in mind that C, with all its beauty and sharp edges, is not a very approachable language for once-offs and little utilities, as it is low-level with a very basic standard library.

for i in `seq 1 10`
do
  if test $i -gt 5
  then
     echo "larger $i"
  fi
done

The weirdness of fi and esac is because Stephen Bourne was an Algol 68 fan, although he decided against using od to terminate do loops (or maybe they had a 'Really, Stephen?' conversation at Bell Labs)

Nearly everything is done with external commands - seq, test and echo. All the shell is doing here is expanding the variable i. String interpolation also happens in double-quoted strings.

The Unix principle involves composition of specialized commands, each doing their one job very well.

To port the Unix shell necessarily means porting the 'little languages' (the domain-specific languages) that made the shell so powerful: grep, sed, awk and so on. (Perl consolidated these tools into an alledgely coherent whole, but this happens almost a decade later)

sh has been re-implemented many times (like the GUN bash) and for other operating systems. It is not entirely a good fit for Windows (although until PowerShell arrived there was nothing better) mostly because it is more expensive to shell out a command than in Unix; Windows prefers its native threading model (which arrives pretty late in Unix/Linux history). The Unix tradition of 'everything is a file' does not cover Windows functionality like the Registry, etc.

Typing in a Shell, Writing a Script

What makes a language both a good shell and good for scripting? Even if the language has a good interactive prompt, it is not usually convenient as a shell because there are too many key presses involved (particularly shift-key):

# Python >>> exec("prog", "-f", os.environ['HOME']").out("temp.txt")
# Shell  $ prog -f $HOME > temp.txt

Even with library support the extra parentheses and commas are going to slow the shell user down; there are more keystrokes, and many of them are punctuation. Part of what makes shell work is implicit strings everywhere (so not having to 'quote' everything) and explicit $ meaning variable access.

sh is an excellent shell, but is it a good scripting language?

Brian Kernighan wrote a famous paper entitled "Why Pascal is not my favourite programming language" and it would not be difficult to write a companion piece for standard POSIX shell. In that paper, the main criticism is that the type system is too rigid (in particular, array types); in the case of the Bourne shell there is only one type, text. A string might contain a number, and then you would have to compare it in a different way. Lists are done in an ad-hoc way with space-separated words. It is easy to mess things up, and even easier to be judged - any attempt to write a shell script will bring forth rock-dwelling critics.

When dealing than anything except one-liners, error handling is crucial. Bash has a scary default mode where it just keeps going whether errors happen or not. So you have to code very defensively, as in Go, always checking return codes and explicitly deciding what to do.

A lot of non-trivial shell is converting one ad-hoc text format into another. The classic text mangling tools are good, but they have a learning curve and in fact most of the skills needed to be competent at shell are outside the shell itself; it is mostly an 'empty shell'.

There has been a move for newer commands to optionally produce JSON output to be parsed in a standard way by other commands; there remans a nice presentation for human users, but machine users don't have to bother parsing this. The jc project aims to convert the output of popular commmand-line tools into JSON, and jq provides a powerful DSL for processing JSON.

Nushell

So an idea emerged in Microsoft early this century: what if data passed through shell pipelines, not as a particular serialized text format like JSON, but as raw .NET objects? The data could be operated on with the methods and properties of these objects, and at the end of the pipeline, the objects would be converted into a default presentation for human users. Microsoft Powershell was first released in 2006 and becoming part of Windows with vs 2.0 in 2009.

It was a hit, because frankly the situation with Windows admin was a mess. Grown adults reduced to clicking on buttons, or forced to work with some of the most clunky command-line tools known to humanity, accessed with a uniquely brain-dead command shell.

I'm not really a fan, since administering Windows is not where I like to be, and I still think it's a rrevolutionary idea held back by a second-class implementation. It is the slowest shell to start, easily 500ms on a decent machine, since all those .NET assemblies have to be pulled in at startup.

The idea of a cross-platform shell organized around the data-pipe principle remained powerful, and Nushell started happening in 2019.

All values in Nushell have a type; the main types are:

numbers (int and float are distinct)
filesize, duration, datetime
strings
lists
records (corresponding to Javasript objects or Python dicts)
tables lists of records with the same keys

By default, you get a pretty view of tables (this is themeable, if you find the default a bit heavy) - can instead convert the data to YAML etc. In Nushell ls creates a table:

/work/dev/llib> ls
╭───┬─────────────────┬──────┬─────────┬──────────────╮
│ # │      name       │ type │  size   │   modified   │
├───┼─────────────────┼──────┼─────────┼──────────────┤
│ 0 │ LICENSE.txt     │ file │  1.4 kB │ 3 years ago  │
│ 1 │ build           │ file │    33 B │ 3 years ago  │
│ 2 │ build-mingw.bat │ file │    60 B │ 3 years ago  │
│ 3 │ examples        │ dir  │  4.0 kB │ 6 months ago │
│ 4 │ llib            │ dir  │  4.0 kB │ 3 months ago │
│ 5 │ llib-p          │ dir  │  4.0 kB │ 3 years ago  │
│ 6 │ readme.md       │ file │ 39.9 kB │ 3 years ago  │
│ 7 │ tests           │ dir  │  4.0 kB │ 3 months ago │
╰───┴─────────────────┴──────┴─────────┴──────────────╯
/work/dev/llib> # render the table in YAML
/work/dev/llib> ls | to yaml
- name: LICENSE.txt
  type: file
  size: 1486
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: build
  type: file
  size: 33
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: build-mingw.bat
  type: file
  size: 60
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: examples
  type: dir
  size: 4096
  modified: 2025-02-08 19:07:39.755974376 +00:00
- name: llib
  type: dir
  size: 4096
  modified: 2025-05-04 15:44:33.971085666 +00:00
- name: llib-p
  type: dir
  size: 4096
  modified: 2022-02-04 16:49:34.470300295 +00:00
- name: readme.md
  type: file
  size: 39915
  modified: 2022-02-04 16:49:34.470300295 +00:00
- name: tests
  type: dir
  size: 4096
  modified: 2025-05-08 17:10:16.706859509 +00:00

Piping the table into the describe command gives you the actual type of the data created by ls (the Powershell equivalent is Get-ChildItem | Get-Member -MemberType Property)

/work/dev/llib> ls | describe
table<name: string, type: string, size: filesize, modified: datetime> (stream)

get extracts a column as a list:

/work/dev/llib> ls | get name
╭───┬─────────────────╮
│ 0 │ LICENSE.txt     │
│ 1 │ build           │
│ 2 │ build-mingw.bat │
│ 3 │ examples        │
│ 4 │ llib            │
│ 5 │ llib-p          │
│ 6 │ readme.md       │
│ 7 │ tests           │
╰───┴─────────────────╯

help <cmd> gives help with examples and help commands gives the whole lot - 618 on my system! And these are builtins and plugins, not executables. You can of course call external commands, but this shell is very full-featured out of the box, which explains why its 22mb on my system. It has built-in sqlite, http is a built-in command, and with the polars plugin (part of the standard distribution) it can do dataframe manipulation, read Parquet files, etc.

/work/dev/llib> cat LICENSE.txt | lines | take 5
╭───┬──────────────────────────────────────────────────────────────────────╮
│ 0 │ -------------------------------------------------------------------- │
│ 1 │ Copyright (c) 2013 Steve Donovan                                     │
│ 2 │ All rights reserved.                                                 │
│ 3 │                                                                      │
│ 4 │ Redistribution and use in source and binary forms, with or without   │
╰───┴──────────────────────────────────────────────────────────────────────╯

The nushell language (called Nu) was developed in Rust by fans of Rust, so it looks like a scripting variant of Rust - this is the equivalent of the shell example earlier:

for i in 1..10 {
    if $i > 5 {
        print $"larger ($i)"
    }
}

Normal comparison operators are available, the range iterator is builtin. The strangest thing is the string interpolation syntax $"...."

Nushell is not available on any random machine you might ssh into, so using it as your shell requires justification. There is always an investment of time and energy needed.

First, it makes simple queries on data easy, and commands return data. There is an actual filesize type which can be written with the usual postfixes:

/work/dev/llib> ls | where size > 10kb
╭───┬───────────┬──────┬─────────┬─────────────╮
│ # │   name    │ type │  size   │  modified   │
├───┼───────────┼──────┼─────────┼─────────────┤
│ 0 │ readme.md │ file │ 39.9 kB │ 3 years ago │
╰───┴───────────┴──────┴─────────┴─────────────╯

There is a Unix find command for going over a directory tree, which I can never remember how to use. But this ls command can take a glob pattern meaning 'everything under this directory':

/work/dev/llib> ls **/* | where size > 50kb
╭───┬──────────────────────────────┬──────┬──────────┬──────────────╮
│ # │             name             │ type │   size   │   modified   │
├───┼──────────────────────────────┼──────┼──────────┼──────────────┤
│ 0 │ examples/example.db          │ file │  11.1 MB │ 7 months ago │
│ 1 │ examples/json.db             │ file │  11.3 MB │ 7 months ago │
│ 2 │ examples/pkgconfig/pkgconfig │ file │  51.7 kB │ 3 years ago  │
│ 3 │ examples/web/simple          │ file │  62.0 kB │ 3 years ago  │
│ 4 │ examples/web/use-select      │ file │  71.8 kB │ 3 years ago  │
│ 5 │ llib/libllib.a               │ file │ 327.7 kB │ 3 months ago │
│ 6 │ tests/test-json              │ file │  59.8 kB │ 5 months ago │
│ 7 │ tests/test-pool              │ file │  56.6 kB │ 5 months ago │
│ 8 │ tests/test-template          │ file │  64.0 kB │ 5 months ago │
╰───┴──────────────────────────────┴──────┴──────────┴──────────────╯

It is then easy to apply a command to each one of these files:

/work/dev/llib> ls **/* | where size > 50kb | 
    get name |  each { path parse  }
╭───┬────────────────────┬───────────────┬───────────╮
│ # │       parent       │     stem      │ extension │
├───┼────────────────────┼───────────────┼───────────┤
│ 0 │ examples           │ example       │ db        │
│ 1 │ examples           │ json          │ db        │
│ 2 │ examples/pkgconfig │ pkgconfig     │           │
│ 3 │ examples/web       │ simple        │           │
│ 4 │ examples/web       │ use-select    │           │
│ 5 │ llib               │ libllib       │ a         │
│ 6 │ tests              │ test-json     │           │
│ 7 │ tests              │ test-pool     │           │
│ 8 │ tests              │ test-template │           │
╰───┴────────────────────┴───────────────┴───────────╯

Second, the pipeline model makes function application go from left to right; the usual f(g(h(x))) is right to left from the argument. It is easier to successively refine the result by applying extra operations if we write this $x | h | g | f - easier to read and to edit in an interactive shell.

Why should you consider using it for shell scripting? Apart from the straightforward syntax and sensible try..catch error handling, for me it's how elegant it is to write self-documenting custom commands:

# Greet guests along with a VIP
#
# Use for birthdays, graduation parties,
# retirements, and any other event which
# celebrates an event # for a particular
# person.
def vip-greet [
  vip: string        # The special guest
   ...names: string  # The other guests
] {
  for $name in $names {
    print $"Hello, ($name)!"
  }

  print $"And a special welcome to our VIP today, ($vip)!"
}

And help vip-greet will work as expected.

That's pretty classy.

Thursday, 29 September 2016

A Modest Alternative II: Taming stdio's scanf

The Ugly Brother

My favourite Irish joke (and I'm Irish enough to tell it) concerns a posh gent who is lost in Dublin. He asks a bystander, "Tell me my good man, how can I get to the National Museum?". Who replies: "Well sir, I wouldn't go from here". This is the received opinion about scanf, the ugly brother of printf.

It tends to appear only in the kind of beginner tutorals where programs ask the user to provide two numbers and then add them up. Serious people avoid scanf because it's hard to use properly: easy to confuse, and tricky to get meaningful errors. Beginners read the tutorials and then ask questions about how to bullet-proof scanf and get told by serious people not to use it.

scanf's glamorous cousin std::cin is easier to use, because it matches against strongly-typed non-const references - reading a constant is a compile error. But error handling is not straightforward.

Besides, sometimes you don't have the resources nor the inclination to use iostreams, as I discussed in my last article

Line by Line

A common strategy is to read a text file, line by line. Here std::getline is superior to fgets; it allocates more buffer if needed and trims end-of-line characters.

The outstreams library provides a very similar interface:

string line;
stream::Reader in("myfile.txt");
while (in.getline(line)) {
   do_something(line);
}
if (! in) {
   stream::errs(in.error())('\n');
}
// if failed:
// --> No such file or directory

Always a good idea to check for errors; if the error state is set, then no further read operations will take place, so it's perfectly fine to check afterwards.

So far, very much like iostreams's istream interface, except you get a sensible error without having to call perror yourself (which belongs to stdio, the irony)

From now on I'll assume that you have broken down and said 'using namespace stream'. Here's a one-liner which populates a container:

vector<string> lines;
Reader in("myfile.txt");
in.getlines(lines);
// errors!?

The container can be anything which understands push_back - a std::list would work just as well here. There is an optional second argument to getlines which gives the maximum number of lines to grab. (It's not how these things are typically organized; you would pretend that Reader was a container - or write an adapter - and use std::copy and std::back_inserter. This is more general but significantly uglifies the common case.)

Then there is a readall method which grabs all of the file and puts it into a std::string. The weakness of the standard string - that it knows nothing about character encoding - becomes a strength when you treat it as sliceable, appendable bag of bytes.

Having captured the lines, those serious people will typically then use serious conversion functions like std::stoul etc to actually extract values and get errors.

A Comedy of Errors

Things become less than simple with both stdio and iostreams when reading items one by one. scanf returns the number of items successfully processed, or EOF if we run out of stream. If the number scanned is too big for the type, garbage results. (The man page does claim that it will set errno if a conversion resulted in a value out of range, but it appears to be lying.) So serious use involves lots of checking.

I've tried to tame these errors by putting a facade around the actual scanf calls, rather as outstreams wrapped printf:

double x;
string s;
short b;

StrReader is ("4.2 boo 42");
is (x) (s) (b);

You always should be aware of errors: files may be mistyped, mangled by the network, or maliciously altered.

is.set("x4.2 boo 42");
is (x) (x) (b);
errs(is.error())('\n');
// --> error reading double
// at 'x4.2'
is.set("x4.2 boo 344455555");
is (x) (x) (b);
errs(is.error())('\n');
// --> error converting int16 
// --> out of range
//  344455555
is.set("x4.2 boo x10");
is (x) (x) (b);
errs(is.error())('\n');
// --> error reading int64 at
// 'x10'

('int64' for reading a short? Because I cannot assume that the read value is in range.)

There's an even more thorough way to handle errors; there is an error struct which can be 'read' from a stream.

Reader::Error err;

if (! is (x) (x) (b) (err)) {
// err.errcode  EOF or errno
// err.msg as returned by error()
// err.pos position in file
}

// safe one-liner:
// capture the error state
// before reader object dies
if (Reader("tmp.txt")
    .getline(line)
    (err)
) {
   // cool
} else {
   // bummer! But at least
   // you know _where_
}

So, in summary so far, Reader (and its string-oriented cousin StrReader) provides a safer way to use stdio for input, with better error handling, without the baroque contortions of iostream errors.

Why no exceptions? It can be a matter of taste; it's often forbidden in embedded coding and it can be argued that paying attention to the error where it happens leads to better code. You are completely free to throw your own exception of course after checking for an error, but it then it won't be some generic 'file not found' exception which makes no sense several stack traces down.

Some useful tricks

It's possible to read a file of numbers directly:

outstreams$ cat numbers.txt
10 20 30 40
50 60 70 80
...
int i;
Reader in("numbers.txt");
while (in (i)) {
    outs(i);
}
outs(eol);
// -> 10 20 30 40 50 60 70 80

Of course, the read could be replaced by in >> i with istream and it would work in the same way, except that any errors would not give as much information.

What if we only wanted the first three numbers on each line?

in (i) (j) (k) ();
outs(i)(j)(k)(eol);
// -> 10 20 30    
in (i) (j) (k) ();
outs(i)(j)(k)(eol);
// -> 50 60 70

The no-argument overload of the call operator is equivalent to the skip method, which generally takes the number of lines to skip.

Binary files

The read method comes in two flavours. The first is given a buffer with its size, and returns the number of bytes actually read.

Reader self ("rx-read");
self.setpos(0,'$');
auto endp = self.getpos();
outs("length was")
    (endp)(eol);
// --> length was 42752
self.setpos(0,'^');
auto buff = 
    new uint8_t [endp];
auto res = 
    self.read(buff,endp);
outs("read")(res)
    ("bytes")(eol);
// --> read 42752 bytes

setpos is a little eccentric, but will make sense to people who use regular expressions - '^' means 'start of file' and '$' means 'end of file'; anything else means current position.

(The std::istream method of the same name and signature returns the stream, which is consistent, but awkward if you're interested in the bytes actually read.)

The other form of read is a template method that takes one argument and returns the stream. It's useful for objects that have fixed size at compile time, like structs and arrays

MyHeader st;
MyInfo info;
inf.read(st).read(info);
// error state will be EOF
// if we didn't read all

Commands

Executing and capturing the output of external problems is a pain point in C++ for me. So naturally I wanted to capture the output of that excellent old dog popen and make it easier to use.

CmdReader ls("ls *.cpp");
vector<string> cpp_files;
ls.getlines(cpp_files);

string uname =
    CmdReader("uname").line();
if (uname == "Linux") { 
    outs("that's a relief")(eol);
}

CmdReader derives from Reader - all it need do is override close_handle so that pclose is called instread of fclose. It has an extra convenience method line for grabbing first line of output. stdout and stdin are merged.

If you aren't particularly interested in the output and just success/failure, there are a few convenient patterns:

// quick cool/uncool check
string res = CmdReader(
    "true", 
    cmd_ok
    ).line();
if (res != "OK") {
    outs("very weird shell")
        (eol);
}

// actual return code
int retcode;
CmdReader("false",cmd_retcode)
    (retcode);

Wrapping Up

We have lost some scanf functionality by breaking up the format into little bits.

However, if you already have a means to split the input stream into parts, then Reader machinery can be re-used to do the actual conversion - see templ-read.cpp:

int n;
double x;
string s;
auto parts = {"42","5.2","hello"};
auto rdr = make_parts_reader(
    parts.begin(),
    parts.end()
);
rdr (n) (x) (s);

This pattern can be made to work with any source of strings, like regular expression matches, database queries, and so forth.

I tried to show that scanf can be tamed, and it then becomes more pleasant and reliable to use. This library is not a heavy dependency (instream and outstream do not depend on each other and are each a single source file) and provides a middle option when choosing between stdio and iostreams.

Highly structured input formats are best read with the correct tools - we do have JSON, CSV, config-file, etc parsers. I suspect the problem here is that there is no browsable and discoverable repository of useful C++ libraries like Cargo for Rust, etc.

Saturday, 9 April 2016

stdio or iostreams? - A Modest Alternative

An Awkward Choice

The standard C++ iostreams library is generally easier to use for output than stdio, if you don't particularly care about exact formatting output and have no resource constraints. No one can say that stdio is pretty.. Portable use of printf requires the use of ugly PRIu64 style macros from and noisy calls to c_str(). It's fundamentally weakly-typed and can only work with primitive types.

However, using iostreams gets ugly fast, with a host of manipulators and formatting specifiers, all tied together with <<. There are also some serious misfeatures - beginners learn to obsessively spell 'end of line' as std::endl but never learn that it also flushes the stream, until their program proves to be an utter dog at I/O. Generally standard C++ streams are slower and allocate far too much to be considered seriously for embedded work. In the immortal words of the readline(3) man page, under the Bugs section "It's too big and too slow".

Embedded systems which can benefit from the abstraction capabilities of C++ don't necessarily have space for the large sprawling monster that the standard iostreams has become. In this realm, printf (or a stripped-down variant) still rules.

It is true that we no longer depend on simple text i/o as much as the old-timers did, since many structured text output standards have emerged. We still lean on stdio/iostreams to generate strings, since string manipulation is still relatively poor compared to other languages. With some classes of problems, debug writes are heavily used. And all non-trivial applications need logging support.

The library I'm proposing - outstreams - is for people who want or need an alternative. It presents another style of interface, where overloading operator() leads to a fluent and efficient style for organizing output. It is still possible to use printf flags when more exact formatting is required. Building it on top of stdio gives us a solid well-tested base.

It is not, I must emphasize, a criticism of the standard C++ library. That would be the technical equivalent of complaining about the weather.[0]

Chaining with operator() for Fun and Profit

By default, the standard outstreams for stdout and stderr define space as the field separator which saves a lot of typing when making readable output with little fuss.

double x = 1.2;
string msg = "hello";
int i = 42;

outs (x) (msg) (i) ('\n');
// --> 1.2 hello 42
outs("my name is")(msg)('\n');
// --> my name is hello

Chaining calls like this has advantages - most code editors are aware of paired parentheses, and some actually fill in the closing bracket when you type '('. It's easy to add an extra argument to override the default formatting, in a much more constant-friendly way than with printf:

const char *myfmt = "%6.2f";
double y = 0;
outs(x,myfmt)(y,myfmt)
    (msg,quote_s)('\n');
// -->   1.20  0.00 'hello'

Another useful property of operator() is that standard algorithms understand callables:

vector<int> vi {10,20,30};
for_each(vi.begin(), vi.end(), outs);
outs('\n');
// -> 10 20 30

Containers

The for_each idiom is cute but outstreams provides something more robust where you can provide a custom format and/or a custom separator:

// signatures
//   Writer& operator() (It start, It finis,
//      const char *fmt=nullptr, char sepr=' ');
//  Writer& operator() (It start, It finis, 
//     char sepr);

outs(vi.begin(),vi.end(),',')('\n');
// --> 10,20,30

string s = "\xFE\xEE\xAA";
outs(s.begin(),s.end(),hex_u)('\n');
// --> FEEEAA

In the C++11 standard there is a marvelous class called std::intializer_list which is implicitly used in bracket initialization of containers. We overload it directly to support brace lists of objects of the same type; there is also an convenient overload for std::pair:

// signature: 
// Writer& operator() (
//    const std::initializer_list<T>& arr,
//    const char *fmt=nullptr, char sepr=' ');
outs({10,20,30})('\n');
// --> 10 20 30

// little burst of JSON
outs('{')({
    make_pair("hello",42),
    make_pair("dolly",99),
    make_pair("frodo",111)
},quote_d,',')('}')();
// --> { "hello":42,"dolly":99,"frodo":111 }

// which will also work when iterating
// over `std::map`

This works because the quote format only applies to strings - anything else ignores it.

Writing to Files and Strings: Handling Errors

The Writer class can open and manage a file stream:

Writer("tmp.txt")(msg)('\n');
// -> tmp.txt: hello\n

C++ manages the closing of the file automatically, as with the iostreams equivalent.

Of course, there's no general guarantee that 'tmp.txt' can be opened for writing: outstreams borrow a trick from iostreams and convert to a bool:

Writer wr("tmp.txt");
if (wr) {
   wr(msg)('\n');
} else {
   errs(wr.error())('\n');
}

It's straightforward to build up strings using this style of output. (Mucking around with sprintf can be awkward and error-prone. Clue: always spell it snprintf) Here is the ever useful generalized string concatenation pattern:

StrWriter sw;
int arr[] {10,20,30};
sw(arr,arr+3,',');
string vis = sw.str();
// -> vis == "10,20,30"

The Problem with "Hello World"

It is perfectly possible to construct a generic 'print' function in modern C++. With variadic templates it can be done elegantly in a type-safe way. The definition is recursive; printing out n items is defined as printing the first value, then printing the last n-1 values; printing 1 value uses outs.

template <typename T>
Writer& print(T v) {
   return outs(v);
}

template <typename T, typename... Args>
Writer& print(T first, Args... args) {
   print(first);
   return print(args...);
}

...
int answer = 42;
string who = "world";

print("hello",who,"answer",answer,'\n');
// -> hello world answer 42

This is cute, but although the implementation shows the flexibility of modern C++, the result shows the limitations of print as a utility. Fine for a quick display of values, but what if you need an integer in hex, or a floating-point number to a particular precision, and so forth? Where print exists in languages - Python 3, Java and Lua have this style - there is this problem of formating. One approach is to define little helper functions and generally lean on string handling; for instance it's easy with StrWriter to define to_str:

template <typename T>
string to_str(T v, 
    const char *fmt=nullptr)
{
   StrWriter sw;
   sw(v,fmt);
   return sw.str();
}
...
print
    ("answer in hex is 0x" + to_str(answer,"X"))
    ('\n');
--> answer in hex is 0x2A

You see this kind of code quite frequently with Java, and it sucks for high performance logging because of the cost of creating and disposing all the temporary strings. Java has since acquired an actual printf equivalent (probably provoked by its competitive younger sister C#) and both Python and Lua programmers use some kind of format to make nice strings. Not to say that to_str isn't a useful function - it's more flexible than std::to_string - but it will have a cost that you might not always want to pay.

Another approach is to create little wrappers, like a Hex class and so forth. So you get code like this: print("answer in hex",Hex(answer))();. The namespace becomes cluttered with these classes, like how std is full of things like hex,dec and so forth. A compromise is to add just another function which wraps a value and a format. This isn't bad for performance. since the wrapper type just transfers references to the values; you can see for yourself in print.cpp.

The other approach is the one taken by iostreams - define some special values which control the formatting of the next value, and so forth[1]. It can be done, but it's messy and makes the concept rather less appealing. It's a nice example of the Hello World Fallacy [2]where the easy stuff is attractively easy and the hard stuff is unnecessarily hard. And I maintain that print and iostreams fall into exactly that space.

This implementation of print does have the nice properly that it's easy to overload for new types, which is not the case for outstreams.

Quick Output And Logging

Use of the preprocessor in modern C++ is considered generally a Bad Thing, for good reasons. Macros stomp on everything, without respect for scope, and C++ provides alternative mechanisms for nearly everything (inlining, constant definition, etc) that C relies on the preprocessor to provide. But it isn't an absolute evil, if macros are always written as UPPER_CASE and so clearly distinct from scoped variables.

Here's a case where developer convenience outweighs ideological purity; dumping out variables. Sometimes symbolic debuggers are too intrusive or simply not available. It's a nice example of old macro magic combined with new operator overloading.

#define VA(var) (#var)(var,quote_d)
#define VX(var) (#var)(var,hex_u)
...
string full_name;
uint64_t  id_number;
char ch = ' ';
...
outs VA(full_name) VA(id_number)
    VX(ch) ('\n');
// --> full_name "bonzo the dog" id_number 666 ch 20

Here is a trick which allows you to completely switch off tracing, with little overhead. (Many loggers will suppress output if the level is too low, but any expressions will still be evaluated.)

// if the FILE* is NULL, then a Writer 
// instance converts to false
// can say logs.set(nullptr) to
// switch off tracing
#define TRACE if (logs) logs ("TRACE")
...
TRACE VA(obj.expensive_method()) ('\n');

I mentioned that logging was something that all serious programs need. It's tempting to write your own logger, but it's tricky to get right and this wheel has been invented before.

We use log4cpp where I work, but only its mother would consider it to be elegant and well-documented:

plogger->log(log4cpp::Priority::DEBUG,
    "answer is %d",42);

It does have an iostreams-like alternative interface but it's a bit clumsy and half-baked. However, it is very configurable and handles all the details of creating log files, rolling them over, writing them to a remote syslog, and so forth.

It is easy to wrap this in a Writer-derived class. In fact, it's easier to derive from StrWriter and override put_eoln, which is used by the 'end of line' empty operator() overload. Normally it just uses write_char to put out '\n', but here we use it to actually construct the call to log4cpp:

using PriorityLevel
    = log4cpp::Priority::PriorityLevel;
class LogWriter: public StrWriter {
    PriorityLevel level;
public:
    LogWriter(PriorityLevel level)
        : StrWriter(' '),level(level) {}

    virtual void put_eoln() {
        plogger->log(level,
            "%s",str().c_str());
        clear();
    }
};

By just exposing references to error, warn etc to Writer, the rest of your program does not have any dependencies on the log4cpp headers - just in case you do want to drop in your own replacement that directly uses syslog. Look at testlog.cpp and how it uses logger.cpp to encapsulate the details of logging.

Thereafter, use as before:

error("hello")(42)('\n');
warn("this is a warning")('\n');
debug("won't appear with defaults)('\n');
// -->
// 2016-04-10 09:32:11,552 [ERROR] hello 42
// 2016-04-10 09:32:11,553 [WARN] this is a warning

Costs and Limitations

In a simple test (writing a million records to a file with five comma-separated fields) outstreams seems to be about 10% more expensive, as we can expect from needing more calls to stdio. (Equivalent test for iostreams shows it seriously lagging behind). The library itself is small, so if your system has vfprintf (or equivalent) then it's a easy dependency. If the macro OLD_STD_CPP is defined it compiles fine for C++-03, without support for initializer lists.

There is a fundamental problem with operator() here - it can only be defined as a method of a type, unlike operator<<. So adding your own types requires overriding a special function and using a template version of operator() with a additional first const char* argument to avoid the inevitable Template Overloading Blues.

As a more traditional alternative, if a type implements the Writeable interface, then things work cleanly. (In any case, how a class wants to write itself out is its own private business.) Writeable provides a handy to_string using the implementation of the overriden write_to method.

Some may find this notation too compressed - << is nice & spacy by comparison. It is certainly less noisy than stdio, since format strings are optional. Sticky separators can be annoying (controlling them properly was probably the most tricky bit of the implementation) but for most applications, they seem appropriate - they can always be turned off.

UPDATE

Some did find the notation too compressed - in particular () seems to vanish. So now ('\n') is completely equivalent to ().

The cppformat also looks like a good alternative. It resolves the specification problem with variadic print by allowing for a format, either traditional or Python-style.

Some commenters felt that operator-chaining was completely old-hat and that variadic templates were obviously superior, which is a matter of taste.

Internationalization represents a problem, since the word order may change in translation - Python-style does solve this issue.

[0] Although that doesn't mean we have to like all of it. And using it is not compulsory.

[1] But don't make them sticky like with iostreams - there are Stackoverflow questions about how to print out in hex and then how to stop printing out in hex.

[2] This fallacy is an independent rediscovery of the exact phrase from an earlier article, but the feeling on Reddit was that the first guy's website was ugly and hence inherently inferior.

stdio or iostreams? - A Modest Alternative

An Awkward Choice

It is not, I must emphasize, a criticism of the standard C++ library. That would be the technical equivalent of complaining about the weather.[0]

Chaining with operator() for Fun and Profit

By default, the standard outstreams for stdout and stderr define space as the field separator which saves a lot of typing when making readable output with little fuss.

double x = 1.2;
string msg = "hello";
int i = 42;

outs (x) (msg) (i) ('\n');
// --> 1.2 hello 42
outs("my name is")(msg)('\n');
// --> my name is hello

const char *myfmt = "%6.2f";
double y = 0;
outs(x,myfmt)(y,myfmt)
    (msg,quote_s)('\n');
// -->   1.20  0.00 'hello'

Another useful property of operator() is that standard algorithms understand callables:

vector<int> vi {10,20,30};
for_each(vi.begin(), vi.end(), outs);
outs('\n');
// -> 10 20 30

Containers

The for_each idiom is cute but outstreams provides something more robust where you can provide a custom format and/or a custom separator:

// signatures
//   Writer& operator() (It start, It finis,
//      const char *fmt=nullptr, char sepr=' ');
//  Writer& operator() (It start, It finis, 
//     char sepr);

outs(vi.begin(),vi.end(),',')('\n');
// --> 10,20,30

string s = "\xFE\xEE\xAA";
outs(s.begin(),s.end(),hex_u)('\n');
// --> FEEEAA

// signature: 
// Writer& operator() (
//    const std::initializer_list<T>& arr,
//    const char *fmt=nullptr, char sepr=' ');
outs({10,20,30})('\n');
// --> 10 20 30

// little burst of JSON
outs('{')({
    make_pair("hello",42),
    make_pair("dolly",99),
    make_pair("frodo",111)
},quote_d,',')('}')();
// --> { "hello":42,"dolly":99,"frodo":111 }

// which will also work when iterating
// over `std::map`

This works because the quote format only applies to strings - anything else ignores it.

Writing to Files and Strings: Handling Errors

The Writer class can open and manage a file stream:

Writer("tmp.txt")(msg)('\n');
// -> tmp.txt: hello\n

C++ manages the closing of the file automatically, as with the iostreams equivalent.

Of course, there's no general guarantee that 'tmp.txt' can be opened for writing: outstreams borrow a trick from iostreams and convert to a bool:

Writer wr("tmp.txt");
if (wr) {
   wr(msg)('\n');
} else {
   errs(wr.error())('\n');
}

StrWriter sw;
int arr[] {10,20,30};
sw(arr,arr+3,',');
string vis = sw.str();
// -> vis == "10,20,30"

The Problem with "Hello World"

template <typename T>
Writer& print(T v) {
   return outs(v);
}

template <typename T, typename... Args>
Writer& print(T first, Args... args) {
   print(first);
   return print(args...);
}

...
int answer = 42;
string who = "world";

print("hello",who,"answer",answer,'\n');
// -> hello world answer 42

template <typename T>
string to_str(T v, 
    const char *fmt=nullptr)
{
   StrWriter sw;
   sw(v,fmt);
   return sw.str();
}
...
print
    ("answer in hex is 0x" + to_str(answer,"X"))
    ('\n');
--> answer in hex is 0x2A

This implementation of print does have the nice properly that it's easy to overload for new types, which is not the case for outstreams.

Quick Output And Logging

#define VA(var) (#var)(var,quote_d)
#define VX(var) (#var)(var,hex_u)
...
string full_name;
uint64_t  id_number;
char ch = ' ';
...
outs VA(full_name) VA(id_number)
    VX(ch) ('\n');
// --> full_name "bonzo the dog" id_number 666 ch 20

Here is a trick which allows you to completely switch off tracing, with little overhead. (Many loggers will suppress output if the level is too low, but any expressions will still be evaluated.)

// if the FILE* is NULL, then a Writer 
// instance converts to false
// can say logs.set(nullptr) to
// switch off tracing
#define TRACE if (logs) logs ("TRACE")
...
TRACE VA(obj.expensive_method()) ('\n');

I mentioned that logging was something that all serious programs need. It's tempting to write your own logger, but it's tricky to get right and this wheel has been invented before.

We use log4cpp where I work, but only its mother would consider it to be elegant and well-documented:

plogger->log(log4cpp::Priority::DEBUG,
    "answer is %d",42);

using PriorityLevel
    = log4cpp::Priority::PriorityLevel;
class LogWriter: public StrWriter {
    PriorityLevel level;
public:
    LogWriter(PriorityLevel level)
        : StrWriter(' '),level(level) {}

    virtual void put_eoln() {
        plogger->log(level,
            "%s",str().c_str());
        clear();
    }
};

Thereafter, use as before:

error("hello")(42)('\n');
warn("this is a warning")('\n');
debug("won't appear with defaults)('\n');
// -->
// 2016-04-10 09:32:11,552 [ERROR] hello 42
// 2016-04-10 09:32:11,553 [WARN] this is a warning

Costs and Limitations

UPDATE

Some did find the notation too compressed - in particular () seems to vanish. So now ('\n') is completely equivalent to ().

The cppformat also looks like a good alternative. It resolves the specification problem with variadic print by allowing for a format, either traditional or Python-style.

Some commenters felt that operator-chaining was completely old-hat and that variadic templates were obviously superior, which is a matter of taste.

Internationalization represents a problem, since the word order may change in translation - Python-style does solve this issue.

[0] Although that doesn't mean we have to like all of it. And using it is not compulsory.

[1] But don't make them sticky like with iostreams - there are Stackoverflow questions about how to print out in hex and then how to stop printing out in hex.

[2] This fallacy is an independent rediscovery of the exact phrase from an earlier article, but the feeling on Reddit was that the first guy's website was ugly and hence inherently inferior.

Sunday, 7 June 2015

The Flub Paradox

Blub and Flub

Paul Graham's classic essay Beating the Averages is well worth re-reading. It is the story of how, twenty years ago, Paul Graham and Robert Morris built an online store generator called Viaweb and out-manoeuvred their many competitors using their secret weapon, Lisp. But it is much more than a success story with a fairytale ending.

He identified what he called the Blub Paradox: why people persist in using so-so programming languages when they could be so much more productive using something more powerful. He works from the assumption that there is a hierarchy of power of programming languages, starting with Assembly and working up to your favourite language, the one you would love to use in your job. He calls the so-so, median language Blub [1] The paradox is that when Blub users looks down at 'inferior' languages, they can't imagine using such obsolete technology; but when they look up the power hierarchy they don't see greater power, but weirdness and attitude. The reason, Mr Graham thinks, is because they think in Blub.

For example, the Blubbist looks at Haskell, and first sees the different surface syntax, the dismaying lack of curly braces and the mathematical terseness of the function definitions. Then they see that data is immutable, and think: "These variables don't vary! How is it possible to write programs without mutable state? And why?" The point is that they don't recognize the power, and hence don't acknowledge the hierarchy.

Mr Graham's essays about programming tend to assume the natural superiority of Lisp over all other languages; you don't have to agree with him about this to appreciate his arguments. I'll call the 'obviously' superior language in any comparison Flub, for much the same reasons he calls the so-so language Blub. I can then isolate the common characteristics of ueber-languages and their proponents, and emphasize that both Blub and Flub are time-dependent variables. In other words, Lisp was the Flub of its day. A few years ago Flub was Haskell, and recently the new contender appears to be Rust, judging from the buzz. (Note: this is about tracking fashion, not a comment on the actual virtues of these languages.) There has been a lot of new language design happening in the last seven years or so, so I don't doubt that there will be a new value of the variable Flub in a few years. (These are just observations gleaned from reading Reddit and Hacker News, which is probably the closest I get to following sport.) [2]

Joe Hacker and The Continuum of Power

Mr Graham is of course careful to qualify his assertion that there is a natural continuum of power in programming languages. He would not himself use Lisp for everything, since Perl is a better tool for the text slicing-and-dicing of system administration - cooks have more than one pot. He was (after all) a guy getting things done, who wasn't going to let purity and aesthetics slow him down. Contrast with this quote about Edsgar Dijkstra in an interview with Donald Knuth: "It would make him physically ill to think of programming in C++". Which is appropriate; Dijkstra was a computer scientist [3]. But a bit ... weird to those of us who enjoy programming as a means of making vague exciting ideas come alive and making impact. After all, Mr Graham ended up selling Viaweb to Yahoo for $50 million in shares in 1998.

Any closer examination of 'power' in this context reveals it to be a slippery concept. Is BASIC more powerful than Assembly? I doubt it, you can do anything in assembly, and BASIC is deliberately limited [4]. Assembly is very long-winded however, and involves constantly playing with sharp knives; if a job can be done adequately in BASIC, then it should. Using assembly for a BASIC kind of problem is sheer premature optimization and requires a much higher skill level. So BASIC is more expressive than assembly. It will take you even less lines to do the job in Python, so Python is more expressive than BASIC. A better working definition of 'power' here is therefore 'expressiveness'. In the hands of a master, Lisp is more expressive still, because of its meta powers (code that writes code). So Paul Graham and Robert Morris could out-code their competitors; a small posse of smart hackers in tune with their tools can always run rings around larger gangs of programmers led by non-technical people. And that was their secret weapon, ultimately.

So the more expressive a language the better? Not necessarily; consider APL, famous for its one-liners. This represents 'too much of a good idea' at least for people who like typing with normal keyboards. Expressiveness has a lot to do with library support, not just ideas-per-line; Python is well-known for having libraries to do just about anything and finds a lot of use in the scientific community.

'Power' and 'Expressiveness' turn out to be separate concepts. In the end, there is no simple continuum that you can use to line programming languages up, nose to tail, feeble to awesome.

We prefer to reduce vector spaces to scalars, perhaps because of this psychological need to rank things and people. In particular, the multi-dimensional nature of programming space undermines the central Blub argument: "By induction" he writes, "the only programmers in a position to see all the differences in power between the various languages are those who understand the most powerful one".

Except there isn't such a beast.

Blub: Language or Community?

Blub is probably more an attribute of a language's community than the language itself.

Consider Java. In the "Enterprise", only the architect gets to have fun; the developers are not expected to take pleasure in their work, and the feelings of the users are irrelevant, since it is only their managers that matter.[5] This is the very opposite of the Flub spirit.

But Java is a pretty productive language, if used with an independent mind. The proverbial 'small posse' could do very disruptive things with it. To use it effectively, however, requires using an IDE, and Flubbists hate IDEs, partly because their finger muscle memory has been overspecialized from spending too much time in Eighties power editors, but also because using an IDE is too closely associated with Blub practices. (People who can't ride a bicycle without training wheels, basically.) Having recently made my peace with IDEs again, after a long sabbatical, I can attest that they are irritating - like a pedantic pair programming partner. But you learn to ignore the yellow ink, and other fussy ways, to get the productivity boost.

Now I don't doubt you can do better than Java the language, and still remain in Java the ecosystem; Scala shows signs of actual adoption, and Clojure is an attempt at re-Flubbing Lisp. But a underwhelming language is no obstacle to getting things done by determined talent, if the talent is pragmatic enough and does not get physically ill like Dijkstra. Knuth himself wrote TeX in 'literate' Pascal; his idea was to embed the code in a clear narrative, and incidentally get around the limitations of the language (it did not support separately compiled modules, for instance.)

The Flub Paradox

The Flub Paradox can be stated like this: although Flub is self-evidently more powerful and productive than Blub, relatively few people use it and its effect on the real landscape of innovation appears minimal. Where is Flub being used as a secret weapon, in the way described by Paul Graham? Why aren't its practitioners exploiting their natural advantage?

An example of the Flub paradox is the Haskell web framework Yesod; that is not an earth-shattering list of uses, when you take away the Haskell-related sites.

A disruptive startup is more likely to use Blub in creative ways, focusing on the idea, not the implementation. Facebook dominated its market - using PHP - which everyone agrees is crap. Google built up the world's biggest advertising agency using Java and C++; Android is a Java-like ecosystem that runs in most pockets these days.

(At this point, "The Unreasonable Effectiveness of Blub" seems like a good title for an article: Simon Newcomb's analysis of heavier-than-air flight confronted by the tinkerings of bicycle makers. Be my guest.)

Mr Graham says (and it's still true) that "when you're writing software that only has to run on your own servers, you can use any language you want". But the modern web is a dance between the server and the client, and that client is running JavaScript.

The vitriol surronding JavaScript is interesting and revealing. Granted, it is not well designed and was knocked together in an ridiculously short period of time; but it has good influences, like Scheme's closures[6]. It could have easily been something much worse, like Tcl. Yet the Flubbists hate it, it's untidy and used by yahoos (pun intended.) It is an obstacle to the march of Flub in a way Blub never is.

That's also perhaps why so many words are wasted on flaming Go; a deliberately simplified, anti-theoretical language that is gaining ground, particularly in the business of writing server software. Go is like a predatory shark appearing amongst dolphins; competition for resources that also eats your babies. It is that terrible thing, a rising Blub.

If you are a Flub practioner, you want there to be Flub job opportunities; it is thus very important to establish the idea that Flub is the next best thing. A veritable deluge of articles appears, explaining the beauties of Flub and giving crash courses on the mathematical background needed to understand it. For languages that arise in academia, this is necessary work to establish your reputation and secure a comfortable position from where you can continue to write more articles. If they remain in academia, it is necessarily and also sufficient.

That's partly why there is a Flub Paradox; people like Paul Graham who are sharp hackers and prepared to risk their livelihood to follow a startup dream aren't common. Plus, although they may ascribe their success to Flub, it is actually only one of many factors, and the most important of these are talent and motivation. You put a group of merely good Lisp programmers on a project, and embed them in a corporate environment, and the stellar results are likely to be not reproducible. (I don't doubt they will be more productive than Java programmers. But will the result blend?) Big companies understand this problem well, and prefer Blub programmers, who are easier to source and come with less attitude.

Part of the problem is over-selling Flub as a universal panacea (Fred Brooks' "No Silver Bullet" again.) For example, there has been a great deal of heat round the new language Rust from the Mozilla foundation; a strongly-typed systems language with better guarantees of safety.

To use Rust for writing programs that can afford the overhead of garbage collection would be premature optimization. An engineering analogy: titanium is a fantastic metal, stronger than steel yet half the weight - it does not rust; but it's a bitch to work with. and expensive. By Mr Graham's argument, we should prefer it over aluminium, all other things being equal, except that they ain't. Using a more advanced language could be significantly more expensive than just doing it in Java. I'm not just thinking of the developer cost, but the time cost. For instance, it is reported that a 2.4 kloc Rust program takes 20sec for a dev build; that's very long for such a dinky program. Java, Go and C would be practically instantaneous; for comparison, a 10 kloc C++ program takes under 10sec for a full rebuild, and average rebuild times are about a second. It is important that Rust works on its incremental build story. We hear the authentic voice of despair: "getting blocked for several minutes every time I need to recompile is a big productivity killer after a while."[7]

I don't doubt the world needs safety guarantees, given the messes created by cowboy firmware. C is famous for letting you shoot your own foot; in safety-critical applications, you are shooting someone else's foot. Rust is a sincere attempt at solving that kind of problem, as well as other systems software, like operating systems.

However, these are fairly niche areas of development; it is unlikely that major operating systems will emerge written entirely in Rust, and embedded engineers are a conservative bunch - they usually come from an EE background. Also remember that theoretical correctness and safety guarantees are old preoccupations of the industry, and Ada is well entrenched in some parts. I suspect that the idea of a magically safe language will turn out to be yet another silver bullet; it would be interesting to be proven wrong.

Edit: It was probably a mistake to mention actual concrete values for Flub, again see[1]. The point I wish to emphasize has nothing to do with the value of Flub or Blub, and everything to do with not seeing Flub used as a disruptive force of change, as Lisp was in Paul Graham's original essay. It is probably simply too early to tell.

[1] Trashing a developer's language of choice by name is like insulting a patriot's country or his mother's apple pie - the nerd equivalent of bumping the pool cue of a big man in a bar. It makes further rational discussion impossible.

[2] It would be interesting to do sentiment analysis on these august websites and selected social-media feeds, and track the buzz surrounding popular programming languages. Not exactly science, but numbers are good fuel for an argument. The Tiobe Index is approximate and not good at capturing changes in the 'long tail' where Flub lives.

[3] It's entirely possible that programming in C++ can make you physically ill, but the point is that you have to learn it and use it to know. Dijkstra arrived at this conclusion from first principles.

[4] BASIC was originally a teaching language from the University of Dartmouth, baby FORTRAN. Niklaus Wirth's Pascal is another example of a language designed to teach a very structured way of thinking. So criticisms of Pascal are a bit unfair, because it wasn't intended for large scale application. The dialects that appeared later that some of us remember with affection were much more capable.

[5] The Marxist concept of 'alienation of labour' is applicable here, since it always was as much a psychological as an economic concept. Like a craftsperson becoming a factory worker, enterprise programmers are both separated from the value they create and divorced from the pleasure of creating value. No fun, and no big bucks either.

[6] Closures are functions that carry around their surrounding state with them; they are probably the most powerful thing I experienced through Lua. A pity that JavaScript was standardized quickly in such an uncooked state, since the steady development of Lua produced a much cleaner, simpler design. In many ways it is like JavaScript without braces: prototype-based object system, field references are lookups (a.f == a['f']) but with proper lexical scoping. The Web would be a better place if it ran on Lua, but modern Lua arrived too late. As an example of how surface syntax has such an unreasonable hold on developer opinion, Lua indexes from 1, not 0. This is apparently a problem. Besides, it does not have curly braces, which are an obvious marker of good language design.

Fortunately, modern C++ does have closures ('lambdas') and this has made me a happier man.

[7] Personally, this would be a deal breaker, because it would mess with my flow. Everyone has different strengths and patience isn't one of mine. I spent too much energy railing against the situation in C++, but learned to accept given reality when I got better hardware. But I still miss Delphi.

Sunday, 31 May 2015

What can C++ Libraries Learn from Lua?

The Convenience of Text Pattern Matching

const char *text = "Some people, when confronted with a problem, think \"I know, I'll use regular expressions.\". Now they have two problems.";

(Example for std::regex from http://http://en.cppreference.com/w/cpp/regex)

Parsing text with Regular Expressions is a powerful technique traditionally available with so-called 'scripting' languages, from the grand-daddy Awk through Perl onwards. Regular expressions are first-class citizens of these languages and their users get adept at using them. Of course, there can be too much of a good thing, and sometimes those users forget that they have perfectly good if-statements for making complex matches easier to write and read. This is why I like Lua string patterns precisely because they are more limited in scope. Strictly speaking, although similar, they aren't regular expressions since they lack the 'alternation' operator |. But they are a good balance between simplicity and power, which is the global virtue of Lua, together with smallness. (Lua in total weighs less than the PCRE library.)

C++ and its older brother C are not very good at text manipulation, by the standards set by these dynamic languages. So this investigation is about how we can use regular expressions for text wrangling, first in C and then in C++, and how the standard ways can be improved. My method will simply be to use the Lua string functions as a design template, whenever appropriate.

The C Way

For some obvious reasons, regular expressions are not so easy to use in C, although part of the POSIX standard and thus always available on compliant systems. There is no equivalent to regexp literals (/..../) so you have to compile the expression explicitly before you use it. You must also create a buffer to receive the positions where the match and its submatches occured in the string.

 regex_t rx;
 regmatch_t matches[1];  // there's always one match
 regcomp(&rx,"[[:alpha:]]+",REG_EXTENDED);

 if (regexec(&rx,"*aba!",1,matches,0) == 0) {
     printf("regexec %d %d\n",matches[0].rm_so,matches[0].rm_eo);
 } else {
     printf("regexec no match\n");
 }
 ...
 // explicitly dispose of the compiled regex
 regfree(&rx);

Doing something more realistic that just matching is a more involved. Say we wish to go over all the word matches with length greater than six in the above useful quotation:

 const char *P = text;
 char buffer[256];  // words are usually smaller than this ;)
 while (regexec(&rx,P,1,matches,0) == 0) {
     int start = matches[0].rm_so, end = matches[0].rm_eo;
     int len = end-start;
     if (len > 6) {
         strncpy(buffer,P+start, len);
         buffer[len] = 0;
         printf("got %d '%s'\n",len,buffer);
     }
     P += end; // find the next match
 }

Contrast with the Lua solution - note how '%' is used instead of '\' in regular expressions, with the character class '%a' meaning any char in the alphabet:

 for word in text:gmatch('%a+') do
     if #word > 6 then
         print ("got",#word,word)
     end
 end

So clearly there is room for improvement!

A Higher-level C Wrapper

I am not alone in finding classical regexp notation painful, especially in C strings where the backslash is the escape character. E.g. if we wanted to use special characters literally: "\$$([[:alpha]]+$)". The Lua notation for this expression would simply be "%$%(%a+)%)" which is easier on the eyes and the fingers. So I've provided an option to rx_new to convert Lua notation into classical notation. It is a very simple-minded translation: '%' becomes '\', '%%' becomes '%', and the Lua character classes 's','d','x','a','w' become the POSIX character classes 'space', 'digit', 'xdigit','alpha' and 'alnum' within brackets like '[[:space:]]'. The semantics are not at all changed - these regexps only look like Lua string patterns, although mostly equivalent.

An optional POSIX-only part of llib (see rx.cin the lllib-p directory) provides a higher-level wrapper. rx_find is very much like Lua's string.find although we don't have the convenience of multiple return values.

 Regex *r = rx_new("[[:alpha:]]+",0);
 int i1=0, i2=0;
 bool res = rx_find(r,"*aba!",&i1,&i2);
 printf("%d from %d to %d\n",res,i1,i2);

 // Now find all the words!
 int start = 0,end;
 while (rx_find(r,text,&start,&end)) {
     int len = end - start;
     if (len > 6) {
         strncpy(buffer,text+start,len);
         buffer[len] = 0;
         printf("[%s]\n",buffer);
     }
     start = end;  // find the next match
 }
 ...
 // generic disposal for any llib object
 unref(r);

The need for an extra 'matches' array has disappeared and is now managed transparently by the Regex type. It's pretty much how a Lua programmer would loop over matches, if string.gmatch wasn't appropriate, except for the C string fiddlng - which is essential when you specially don't want to allocate a lot of little strings dynamically.

Here is a cleaner solution, with some extra cost.

 int j1=0,j2;
 while (rx_find(r,text,&j1,&j2)) {
     str_t word = rx_group(r,0);
     if (array_len(word) > 6) {
         printf("%s\n",word);
     }
     unref(word);
     j1 = j2;
 }

rx_group will return the indicated match as a llib-allocated C string; could have used our friend strlen but array_len is going to be faster, since the size is in the hidden header; I've put in an unref to indicate that we are allocating these strings dynamically and they won't go away by themselves.

So, some code for extracting 'NAME=VALUE' pairs in a string separated by newlines (the ever-flexible C for-loop helps with not having to do j1=j2 at the end of the loop body, where these things tend to get forgotten, leading to irritatingly endless loops.)

 Regex *pairs = rx_new("(%a+)%s*=%s*([^\n]+)",RX_LUA);
 str_t test_text = "bob=billy\njoe = Mr Bloggs";
 for (int j1=0,j2; rx_find(r,test_text,&j1,&j2), j1=j2) {
     str_t name = rx_group(r,1);
     str_t value = rx_group(r,2);
     printf("%s: '%s'\n",name,value);
     dispose(name,value);
 }

This is easier to write and read, I believe. Since the loop counters are not used in the body of the loop, and since this is C and not C++, you can write a macro:

 #define FOR_MATCH(R,text) (int i1_=0,i2_; rx_find(R,text,&i1_,&i2_; _i1=i2_)

There are also functions to do substitutions, but I'll leave that to the documentation. llib is linked statically, so using part of it incurs little cost - I'd estimate about 25Kb extra in this case. You will probably need to make normal copies of these strings - a general function to copy strings and other llib arrays to a malloc'd block is here:

 void *copy_llib_array (const void *P) {
     int n = array_len(P) + 1;   // llib always over-allocates
     int nelem = obj_elem_size(P);
     void *res = malloc(n*nelem);
     memcpy(res,P,n*nelem);
     return P;
 }

News from the C++ Standards Committee

Some may think I have anti-C++ tendencies, but a professional never hates anything useful, especially if it pays the bills. So I was interested to see that regular expression support has arrived in C++.

 #include <regex>
 using namespace std; // so sue me
 ...
 // 'text' is the Useful Quote above...
 regex self("REGULAR EXPRESSIONS",
         regex_constants::ECMAScript | regex_constants::icase);
 if (regex_search(text, self)) {
     cout << "Text contains the phrase 'regular expressions'\n";
 }

That's not too bad - the Standard supports a number of regexp dialects, and in fact ECMAScript is the default. How about looking for words longer than six characters?

 regex word_regex("(\\S+)");
 string s = text;
 auto words_begin = sregex_iterator(s.begin(), s.end(), word_regex);
 sregex_iterator words_end;

 const int N = 6;
 cout << "Words longer than " << N << " characters:\n";
 for (sregex_iterator i = words_begin; i != words_end; ++i) {
     smatch match = *i;
     string match_str = match.str();
     if (match_str.size() > N) {
         cout << "  " << match_str << '\n';
     }
 }

Perhaps not the clearest API ever approved by the Standards Committee! We can make such an iteration easier with a helper class:

class regex_matches {

 const regex& rx;
 const string& s;

public:

 regex_matches(const regex& rx, const string& s): rx(rx),s(s) {
 }
 sregex_iterator begin() { return sregex_iterator(s.begin(),s.end(),rx); }
 sregex_iterator end() { return sregex_iterator(); }

}; ....

 regex_matches m(word_regex,s);
 for (auto match : m) {
     string match_str = match.str();
     if (match_str.size() > N) {
         cout << "  " << match_str << '\n';
     }
 }

We're finally getting to the point where a straightforward intent can be expressed concisely and clearly - this isn't so far from the Lua example. And it is portable.

Substituting text according to a pattern is a powerful thing that is used all the time in languages that support it, and std::regex_replace does a classic global substitution where the replacement is a string with group references.

Alas, there are some downsides. First, this does not work with the GCC 4.8 installed on my Ubuntu machines, but does work with the GCC 4.9 I have on Windows. Second, it took seven seconds to compile simple examples on my I3 laptop, which is an order of magntitude more than I expect from programs of this size. So, in this case, the future has not arrived.

A C++ Wrapper for POSIX Regular Expressions.

Portability is currently not one of my preoccupations, so I set out to do a modern C++ wrapping of the POSIX API, in a style similar to llib-p's Regex type. (Fortunately, the GnuWin32 project has made binaries for the GNU regex implementation available - although they are only 32-bit. The straight zip downloads are what you want, otherwise you will probably have unwelcome visistors on your machine.)

When testing this library on large-ish data, I received a shock. My favourite corpus is The Adventures of Sherlock Holmes from the Gutenberg project; just over 100,000 words, and this regexp library (from glibc) performs miserably to do something that is practically instantaneous in Lua. (std::regex is much better in this department.) So I've taken the trouble to extract the actual Lua pattern machinery and make it available directly from C++.

Let me jump immediately to the words-larger-than-six example. It is deliberately desiged to look like the Lua example:

 Rxp words("%a+",Rx::lua);  // simple Lua-style regexps
 for (auto M:  words.gmatch(text)) {
     if (M[0].size() > 6)
         cout << M[0] << "\n";
 }

When modern C++ cooks, it cooks. auto is a short word that can alias complicated types - (here just Rx::match), but the range-based for-loop is the best thing since std::string. And I've got my order-of-magnitude smaller compile time back, which is not an insignificant consideration.

(if you want to use the Lua engine, then replace the first declaration with simply Rxl words("%a+").)

I resisted the temptation to provide a split method; heaven knows the idea is useful but doesn't have to be implemented that way. It would return some concrete type like vector<string> and it would introduce a standard library dependency other than std::string into the library. Rather, Rx::match has a template method append_to which will take the results of the above iteration and use push_back on the passed container:

 vector<string> strings;
 words.gmatch(text).append_to(strings);

If you wanted a list<string> instead, then it trivially happens. You can append to a container and so forth without awkward splicing.

What would it mean to populate a map with matches? I don't think there's one answer, since there is no clear mapping, but here is one way of interpreting it; the expresson must have at least two submatches,and the first will be the key, and the second will be the value:

 Rx word_pairs("(%a+)=([^;]+)",Rx::lua);
 map<string,string> config;
 string test = "dog=juno;owner=angela"
 word_pairs.gmatch(test).fill_map(config);

Again, this will work with any smart array, not just std::map, as long as it follows the standard and defines mapped_type. The user of the class only pays for this method if it is used. These are considerations dear to the C++ mindset.

I'm an admirer of Lua's string.gsub, where the replacement can be three things:

a string like '%1=%2', where the digits refer to the 'captured' group (0 is the whole match)
a lookup table for the match
a function that receives all the matches

The first case is the most useful. Here we have a match, where the submatch is the text we want to extract (called 'captures' in Lua.)

 Rx angles("<(%a+)>",Rx::lua);
 auto S = "hah <hello> you, hello <dolly> yes!";
 cout << angles.gsub(S,"[%1]") << endl;
 // --> hah [hello] you, hello [dolly] yes!

With the second case, I did not want to hardwire std::map but defined gsub as a template method applicable to any associative array.

 map<string,string> lookup = {
    {"bonzo","BONZO"},
    {"dog","DOG"}
 };
 Rx dollar("%$(%a+)",Rx::lua);
 string res = dollar.gsub("$bonzo is here, $dog! Look sharp!",lookup);
 // --> BONZO is here, DOG! Look sharp!

Forcing the third 'callable' case to overload correctly doesn't seem possible, so it has a distinct name:

 // we need a 'safe' getenv - use a lambda!
 res = dollar.gsub_fun("$HOME and $PATH",[](const Rx::match& m) {
     auto res = getenv(m[1].c_str());
     return res ? res : "";
 });
 // --> /home/user and /home/user/bin:/....

(One of the ways that GCC 4.9 makes life better is that generic lambdas can be written, and the argument type here just becomes auto&.)

In this way, string manipulation with regular expressions can be just as expressive in C++ as in Lua, or any other language with string pattern support.

The files (plus some 'tests') are available here, but remember this is very fresh stuff - currently in the 'working prototype' phase.

Transferable Design

The larger point here is that dynamically-typed languages can provide design lessons for statically-typed languages. C++ and C (particularly) are poor at string manipulation compared to these languages, but there is nothing special here about dynamic vs static: it is a question of designing libraries around known good practices in the 'scripting' universe and using them to make programming more fun and productive in C and C++.

There is a limit to how far one can go in C, but it's clear we can do better, even if llib itself might not feel like the solution. C is not good at 'internal' iterators since function pointers are awkward to use, and so 'external' iterators - using a loop - is better for custom subtitutions, etc.

C++ has already got a standard for regular expressions, but it will take a while for the world to catch up: we are being hammered here by heavy use of templates, all in headers. This is of course a classic example of too much of a good thing, because generic code is how C++ escapes the tyranny of type hierarchies, by compile-time static duck-typing. (The fancy word here is 'structural typing'.) For instance, Rxp::gsub can use anything that looks like a standard map.

Even there, I wanted to present an alternative design, not necessarily because I want you to use it, but because looking at alternatives is good mental exercise. The Standard is not Holy Scripture, and some parts of it aren't really ready, in a pragmatically useful way. In the comfort of our own company repository I can choose a solution that does not damage our productivity.