Thursday, 2 October 2025

DS Hell: Nu as a powerful data DSL

HTML

HTML is the Domain Specific Language (DSL) that really has taken over the world. It is meant to be used as semantic markup, e.g. <em> is for emphasis, which can then be mapped to bold or whatever. Indeed, without a good stylesheet it looks like utter crap. This form has been nicely styled with PicoCSS:

<form>
  <fieldset class="grid">
    <input 
      name="login"
      placeholder="Login"
      aria-label="Login"
      autocomplete="username"
    />
    <input
      type="password"
      name="password"
      placeholder="Password"
      aria-label="Password"
      autocomplete="current-password"
    />
    <input
        type="submit"
      value="Log in"
    />
  </fieldset>
</form>

HTML itself (and its data cousin XML) is a simplified form of SGML - which I present here as an example of a standards body having way too much fun.

HTML is a markup language, for presenting documents with structure. Since actually writing lots of HTML is tedious, there is a need to generate it from data.

One of the things buried in browsers is XSLT which was designed for the case of taking XML (such as an RSS feed) and converting it into HTML. It is not very pretty; here is a little taste:

…
<xsl:template match="myNS:Author">
    --   <xsl:value-of select="." />

  <xsl:if test="@company">
    ::   <b>  <xsl:value-of select="@company" />  </b>
  </xsl:if>

  <br />
</xsl:template>

XSLT involves an XML document converting one kind of XML document into another kind. But conceptual simplicity does not necessarily mean easier or more convenient.

It is more common these days to see templates, like this Go template example:

<h1>{{.PageTitle}}</h1>
<ul>
    {{range .Todos}}
        {{if .Done}}
            <li class="done">{{.Title}}</li>
        {{else}}
            <li>{{.Title}}</li>
        {{end}}
    {{end}}
</ul>

This is certainly easier to read and write, but now we have two languages intermingled with each other - PHP is a good example of this style.

Let's see what a Nushell representation would look like. XML documents are records with tag (a string), attributes (a record) and content (a list of child elements). If we define a command tag to conveniently construct these elements, the original form example looks like this:

form (fieldset -c grid
    (input -a {
        name: 'login'
        placeholder: 'Login'
        autocomplete: 'username'
        aria-label: 'Login'
    })
    (input -a {
        type: 'password'
        name: 'password'
        placeholder: 'Password'
        aria-label: 'Password'
        autocomplete: 'current-password'
    })
    (input -a {
        type: 'submit'
        value: 'Log in'
    })
)

Pop that through to xml -i 2 -s (some indenting, and self-closing empty tags) and ... HTML.
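To be clear about the underlying representation: each element is just a record of the shape described above (a sketch - the same shape that Nushell's from xml produces), so the submit button is simply:

```
{
    tag: 'input'
    attributes: { type: 'submit', value: 'Log in' }
    content: []
}
```

And because it is plain data, ordinary Nu code can build and transform it.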

Not bad at all, but a little more fussy than the original. The power of this representation is that it is code; we can now have fun defining more specialized and better self-documenting input constructors. Or, instead of 'Login', have (gettext 'Login') and get localization.

Here's a more dynamic example, which wraps each string in a <li> tag:

tag ul ([
    'Here we go'
    'A whole list of us'
] | each {|v| tag li $v})

We can define a helper command list that captures this pattern. Together with a helper link for <a> elements, we get this useful idiom:

(list ul li (
        [
            ['Home' '/index.html']
            ['Help' '/manual.html']
        ] | each { link }
    ))

The complete little library is here

YAML

YAML (which initially stood for 'Yet Another Markup Language', until its creators got more serious) is a popular way to represent arbitrarily nested data. The attraction is that little punctuation is needed (much as in what makes a shell work) - JSON involves a lot more quoting and fussy placement of commas.

a-map:
  # with some key value pairs
  one: 1
  two: 2
  three: drei
  four: # and here's an array
   - 4
   - 40
   - 400
title: an apparently straightforward notation

Unlike JSON, comments are allowed and whether something is a string or not is worked out for you. However, those rules are not obvious and the takeaway from years of experience is "just use quotes, man".
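Those non-obvious rules deserve a concrete illustration. Under YAML 1.1 resolution rules (this is the famous 'Norway problem'), every right-hand side below looks like a string but is parsed as something else:

```yaml
country: NO          # resolved as the boolean false
version: 1.20        # resolved as the float 1.2 - the trailing zero is gone
released: 2004-01-15 # resolved as a date, not a string
country2: 'NO'       # quoted: safely the string 'NO'
```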

(Also from experience: normal users will completely fuck up things like indentation unless given an opinionated editing environment)

How did something so straightforward in intention get so unwieldy? YAML was first proposed in 2001, with a first version in 2004, and it had three parents. (Contrast with JSON, a strictly-enforced subset of JavaScript data literals, proposed by Douglas Crockford, also in 2001.) They worked hard on a specification, and (in the ultimate analysis) simply had too much fun, as with the SGML process. It would have been better to have someone knock out a prototype over a weekend because they needed a more expressive data format.

YAML is declarative, like HTML, so templating is common; see this example from Salt Stack:

# Declare Jinja list
{% set users = ['fred', 'bob', 'frank']%}

# Jinja `for` loop
{% for user in users%}
create_{{ user }}:
  user.present:
    - name: {{ user }}
{% endfor %}

I had to deal with this shit at one point, and I'm still triggered by Captain Yaml when he's teamed up with Kid Jinja.

Ewww, as the mean girls say.

I have done some experiments with YAML expansion, inspired by these bad experiences.

This is a semi-useful example. Here is the data:

# animals.yml
data:
  dog:
    version: 1.2
    ports: [2555]
  cat:
    version: 0.8
    ports: [1023]
    volumes: [/:/hostfs.ro]

And the template - the MAP special form takes all the values in the original data map and constructs a new object for each one. The LIST form does the same for each value in an array.

# services.yml
services:
   MAP-k,v-in-data:
    image: 'ourtech/((k)):((v.version))'
    ports:
      LIST-p-in-v.ports: '((p)):((p))'
    IF-v.volumes?:
      volumes: ((v.volumes))

And the result is in a known format:

# docker-compose.yml
services:
  cat:
    image: ourtech/cat:0.8
    ports:
    - 1023:1023
    volumes:
    - /:/hostfs.ro
  dog:
    image: ourtech/dog:1.2
    ports:
    - 2555:2555

This experiment is in Go and I can clean it up and make it available if there's interest - I am not entirely convinced of the exact notation yet.

It's no surprise that Nu is a pleasant alternative to YAML, since it is also low-punctuation. This is an Ansible file, rendered in the Nu equivalent of JSON: Nuon (Nu Object Notation):

[{
  name: 'Write hostname'
  hosts: all
  tasks: [
      {
        name: 'write hostname using jinja2'
        ansible.builtin.template: {
          src: templates/test.j2
          dest: /tmp/hostname
        }
      }
    ]
 }]

This is a tasteful way of integrating Jinja templating, since the templating is in separate files and one doesn't get that mad feeling from seeing two completely different notations in the same file:

# templates/test.j2
My name is {{ ansible_facts['hostname'] }}

With Nu, we can lean into the idea that configuration is executable. This is the same docker-compose example as before:

use expand.nu *

{
  services: (MAP $data.data {|v| NON-NULL { 
      image: $"ourtech/($v._KEY):($v.version)"
      ports: (LIST $v.ports "{_}:{_}")
      volumes: $v.volumes?
    } 
  })
} | to yaml

MAP operates on the values of a record (though it inserts the key as _KEY); LIST operates on a list, much like each, except with a special case for scalar values.

NON-NULL lets us define a record literal that does not insert null values. So the volumes entry will not be filled if $v does not have a volumes key.

These helper commands are to be found here

'Executable Configuration'? Are You Insane?

Nu is very shell-focused, so it can (a) trivially shell out to external commands, and (b) create and remove files in platform-independent ways. It isn't designed for sandboxing, as Lua is.

The safest bet would be running in a container with limited access to the host filesystem, CPU quotas, the works. What the Bomb Squad would call a controlled explosion.

A less defeatist option would be to use the view ir command to dump and traverse the IR looking for call opcodes, and apply some sensible allow-list rules (or deny-list? I'm not sure yet. The command ir-scan in the above module returns the list of commands referenced.)

But this is not intended as a solutions article; document/data DSLs often need to be generated, and templating involves two entirely different layers of language co-existing with each other. This is particularly hard for a human to read when dealing with YAML. A more expressive language offers a way out - at the cost of allowing arbitrary computation.

Thursday, 28 August 2025

Two Kinds of Shells

Two Different Command-line Shells

The Unix shell

The first Unix shell was the Thompson shell from 1973, which already looks very familiar (here's an example). The if command would jump to labels, as in assembly language - so certainly not a great programming language at this point. But an excellent shell:

Redirection - sending the output of a command into a file:

$ prog > myfile.txt

Piping - sending output through filters:

$ prog | sort
$ cat big.txt | head -n 10

If a command has a return code of 0, then it is successful. Commands can write to standard output and standard error; by default, redirection and piping work with standard output.

Note that big.txt may indeed be ginormous in the last example, but only enough will be read to show the first 10 lines.
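These conventions are easy to demonstrate directly (a bash sketch; /no/such/dir stands in for any path that doesn't exist):

```shell
# Exit status: 0 means success, anything else is failure
true;  echo $?    # 0
false; echo $?    # 1

# Only standard output is redirected by default;
# the error message below still reaches the terminal
ls /no/such/dir > out.txt

# 2>&1 sends standard error to the same place as standard output
ls /no/such/dir > out.txt 2>&1
```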

Originally Unix was developed with electromechanical teletypes, which were noisy and slow, encouraging short names (like ls and cd etc). C was first written on teletypes with a line editor - vi only appears in the late 70s. So the terminal was first very physical, then running on a monochrome monitor over a serial link, and finally the multicoloured glory of modern terminal 'emulators' (nice overview). The Unix shell thus uses the minimum number of keystrokes for its functionality; consider the brilliant notation prog & for putting a process in the background.

The Bourne shell first appears in 1979. By this time people needed scripting, and the new shell was much better for it. Bear in mind that C, with all its beauty and sharp edges, is not a very approachable language for one-offs and little utilities, as it is low-level with a very basic standard library.

for i in `seq 1 10`
do
  if test $i -gt 5
  then
     echo "larger $i"
  fi
done

The weirdness of fi and esac is because Stephen Bourne was an Algol 68 fan, although he decided against using od to terminate do loops - od was already taken by the octal dump utility (or maybe they had a 'Really, Stephen?' conversation at Bell Labs).

Nearly everything is done with external commands - seq, test and echo. All the shell is doing here is expanding the variable i. String interpolation also happens in double-quoted strings.

The Unix principle involves composition of specialized commands, each doing their one job very well.

To port the Unix shell necessarily means porting the 'little languages' (the domain-specific languages) that made the shell so powerful: grep, sed, awk and so on. (Perl consolidated these tools into an allegedly coherent whole, but that happened almost a decade later.)

sh has been re-implemented many times (like the GNU bash) and for other operating systems. It is not entirely a good fit for Windows (although until PowerShell arrived there was nothing better), mostly because it is more expensive to spawn a new process there than on Unix; Windows prefers its native threading model (threads arrive pretty late in Unix/Linux history). And the Unix tradition of 'everything is a file' does not cover Windows functionality like the Registry, etc.

Typing in a Shell, Writing a Script

What makes a language both a good shell and good for scripting? Even if a language has a good interactive prompt, it is not usually convenient as a shell, because there are too many key presses involved (particularly with the shift key):

# Python >>> exec("prog", "-f", os.environ['HOME']).out("temp.txt")
# Shell  $ prog -f $HOME > temp.txt

Even with library support the extra parentheses and commas are going to slow the shell user down; there are more keystrokes, and many of them are punctuation. Part of what makes shell work is implicit strings everywhere (so not having to 'quote' everything) and explicit $ meaning variable access.

sh is an excellent shell, but is it a good scripting language?

Brian Kernighan wrote a famous paper entitled "Why Pascal is Not My Favorite Programming Language", and it would not be difficult to write a companion piece for the standard POSIX shell. In that paper, the main criticism is that the type system is too rigid (in particular, array types); in the case of the Bourne shell there is only one type, text. A string might contain a number, and then you would have to compare it in a different way. Lists are done in an ad-hoc way with space-separated words. It is easy to mess things up, and even easier to be judged - any attempt to write a shell script will bring forth rock-dwelling critics.
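That string/number distinction bites quickly; here test(1) compares the same pair of values both ways:

```shell
x=07
y=7
[ "$x" = "$y" ]   && echo "equal as strings"   # not printed: '07' != '7'
[ "$x" -eq "$y" ] && echo "equal as numbers"   # printed: numerically 7 == 7
```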

When dealing with anything beyond one-liners, error handling is crucial. Bash has a scary default mode where it just keeps going whether errors happen or not. So you have to code very defensively, as in Go, always checking return codes and explicitly deciding what to do.

A lot of non-trivial shell is converting one ad-hoc text format into another. The classic text-mangling tools are good, but they have a learning curve, and in fact most of the skills needed to be competent at shell are outside the shell itself; it is mostly an 'empty shell'.

There has been a move for newer commands to optionally produce JSON output, to be parsed in a standard way by other commands; there remains a nice presentation for human users, but machine users don't have to bother parsing it. The jc project aims to convert the output of popular command-line tools into JSON, and jq provides a powerful DSL for processing JSON.

Nushell

So an idea emerged in Microsoft early this century: what if data passed through shell pipelines not as a particular serialized text format like JSON, but as raw .NET objects? The data could be operated on with the methods and properties of these objects, and at the end of the pipeline the objects would be converted into a default presentation for human users. Microsoft PowerShell was first released in 2006, becoming part of Windows with version 2.0 in 2009.

It was a hit, because frankly the situation with Windows admin was a mess. Grown adults reduced to clicking on buttons, or forced to work with some of the most clunky command-line tools known to humanity, accessed with a uniquely brain-dead command shell.

I'm not really a fan, since administering Windows is not where I like to be, and I still think it's a revolutionary idea held back by a second-class implementation. It is the slowest shell to start, easily 500ms on a decent machine, since all those .NET assemblies have to be pulled in at startup.

The idea of a cross-platform shell organized around the data-pipe principle remained powerful, and Nushell started happening in 2019.

All values in Nushell have a type; the main types are:

  • numbers (int and float are distinct)
  • filesize, duration, datetime
  • strings
  • lists
  • records (corresponding to JavaScript objects or Python dicts)
  • tables - lists of records with the same keys

By default, you get a pretty view of tables (this is themeable, if you find the default a bit heavy) - you can instead convert the data to YAML etc. In Nushell, ls creates a table:

/work/dev/llib> ls
╭───┬─────────────────┬──────┬─────────┬──────────────╮
│ # │      name       │ type │  size   │   modified   │
├───┼─────────────────┼──────┼─────────┼──────────────┤
│ 0 │ LICENSE.txt     │ file │  1.4 kB │ 3 years ago  │
│ 1 │ build           │ file │    33 B │ 3 years ago  │
│ 2 │ build-mingw.bat │ file │    60 B │ 3 years ago  │
│ 3 │ examples        │ dir  │  4.0 kB │ 6 months ago │
│ 4 │ llib            │ dir  │  4.0 kB │ 3 months ago │
│ 5 │ llib-p          │ dir  │  4.0 kB │ 3 years ago  │
│ 6 │ readme.md       │ file │ 39.9 kB │ 3 years ago  │
│ 7 │ tests           │ dir  │  4.0 kB │ 3 months ago │
╰───┴─────────────────┴──────┴─────────┴──────────────╯
/work/dev/llib> # render the table in YAML
/work/dev/llib> ls | to yaml
- name: LICENSE.txt
  type: file
  size: 1486
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: build
  type: file
  size: 33
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: build-mingw.bat
  type: file
  size: 60
  modified: 2022-02-04 16:49:34.466300249 +00:00
- name: examples
  type: dir
  size: 4096
  modified: 2025-02-08 19:07:39.755974376 +00:00
- name: llib
  type: dir
  size: 4096
  modified: 2025-05-04 15:44:33.971085666 +00:00
- name: llib-p
  type: dir
  size: 4096
  modified: 2022-02-04 16:49:34.470300295 +00:00
- name: readme.md
  type: file
  size: 39915
  modified: 2022-02-04 16:49:34.470300295 +00:00
- name: tests
  type: dir
  size: 4096
  modified: 2025-05-08 17:10:16.706859509 +00:00

Piping the table into the describe command gives you the actual type of the data created by ls (the PowerShell equivalent is Get-ChildItem | Get-Member -MemberType Property):

/work/dev/llib> ls | describe
table<name: string, type: string, size: filesize, modified: datetime> (stream)

get extracts a column as a list:

/work/dev/llib> ls | get name
╭───┬─────────────────╮
│ 0 │ LICENSE.txt     │
│ 1 │ build           │
│ 2 │ build-mingw.bat │
│ 3 │ examples        │
│ 4 │ llib            │
│ 5 │ llib-p          │
│ 6 │ readme.md       │
│ 7 │ tests           │
╰───┴─────────────────╯

help <cmd> gives help with examples, and help commands gives the whole lot - 618 on my system! And these are builtins and plugins, not executables. You can of course call external commands, but this shell is very full-featured out of the box, which explains why it's 22 MB on my system. It has built-in SQLite support, http is a built-in command, and with the polars plugin (part of the standard distribution) it can do dataframe manipulation, read Parquet files, etc.

/work/dev/llib> cat LICENSE.txt | lines | take 5
╭───┬──────────────────────────────────────────────────────────────────────╮
│ 0 │ -------------------------------------------------------------------- │
│ 1 │ Copyright (c) 2013 Steve Donovan                                     │
│ 2 │ All rights reserved.                                                 │
│ 3 │                                                                      │
│ 4 │ Redistribution and use in source and binary forms, with or without   │
╰───┴──────────────────────────────────────────────────────────────────────╯

The Nushell language (called Nu) was developed in Rust by fans of Rust, so it looks like a scripting variant of Rust - this is the equivalent of the shell example earlier:

for i in 1..10 {
    if $i > 5 {
        print $"larger ($i)"
    }
}

Normal comparison operators are available, and the range iterator is built in. The strangest thing is the string interpolation syntax $"....".

Nushell is not available on any random machine you might ssh into, so using it as your shell requires justification. There is always an investment of time and energy needed.

First, it makes simple queries on data easy, and commands return data. There is an actual filesize type which can be written with the usual postfixes:

/work/dev/llib> ls | where size > 10kb
╭───┬───────────┬──────┬─────────┬─────────────╮
│ # │   name    │ type │  size   │  modified   │
├───┼───────────┼──────┼─────────┼─────────────┤
│ 0 │ readme.md │ file │ 39.9 kB │ 3 years ago │
╰───┴───────────┴──────┴─────────┴─────────────╯

There is a Unix find command for going over a directory tree, which I can never remember how to use. But this ls command can take a glob pattern meaning 'everything under this directory':

/work/dev/llib> ls **/* | where size > 50kb
╭───┬──────────────────────────────┬──────┬──────────┬──────────────╮
│ # │             name             │ type │   size   │   modified   │
├───┼──────────────────────────────┼──────┼──────────┼──────────────┤
│ 0 │ examples/example.db          │ file │  11.1 MB │ 7 months ago │
│ 1 │ examples/json.db             │ file │  11.3 MB │ 7 months ago │
│ 2 │ examples/pkgconfig/pkgconfig │ file │  51.7 kB │ 3 years ago  │
│ 3 │ examples/web/simple          │ file │  62.0 kB │ 3 years ago  │
│ 4 │ examples/web/use-select      │ file │  71.8 kB │ 3 years ago  │
│ 5 │ llib/libllib.a               │ file │ 327.7 kB │ 3 months ago │
│ 6 │ tests/test-json              │ file │  59.8 kB │ 5 months ago │
│ 7 │ tests/test-pool              │ file │  56.6 kB │ 5 months ago │
│ 8 │ tests/test-template          │ file │  64.0 kB │ 5 months ago │
╰───┴──────────────────────────────┴──────┴──────────┴──────────────╯

It is then easy to apply a command to each one of these files:

/work/dev/llib> ls **/* | where size > 50kb | 
    get name |  each { path parse  }
╭───┬────────────────────┬───────────────┬───────────╮
│ # │       parent       │     stem      │ extension │
├───┼────────────────────┼───────────────┼───────────┤
│ 0 │ examples           │ example       │ db        │
│ 1 │ examples           │ json          │ db        │
│ 2 │ examples/pkgconfig │ pkgconfig     │           │
│ 3 │ examples/web       │ simple        │           │
│ 4 │ examples/web       │ use-select    │           │
│ 5 │ llib               │ libllib       │ a         │
│ 6 │ tests              │ test-json     │           │
│ 7 │ tests              │ test-pool     │           │
│ 8 │ tests              │ test-template │           │
╰───┴────────────────────┴───────────────┴───────────╯

Second, the pipeline model makes function application go from left to right; the usual f(g(h(x))) reads right to left from the argument. It is easier to successively refine a result by applying extra operations if we write it as $x | h | g | f - easier to read, and easier to edit in an interactive shell.
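This left-to-right refinement is of course also the classic Unix idiom - each stage of the old word-frequency pipeline can be bolted on after inspecting the previous output:

```shell
# word frequency: split into words, sort, count duplicates, rank
printf 'the cat sat on the mat\n' |
    tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 1
# prints '2 the' (with some leading spaces from uniq -c)
```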

Why should you consider using it for shell scripting? Apart from the straightforward syntax and sensible try..catch error handling, for me it's how elegant it is to write self-documenting custom commands:

# Greet guests along with a VIP
#
# Use for birthdays, graduation parties,
# retirements, and any other event which
# celebrates a particular person.
def vip-greet [
  vip: string       # The special guest
  ...names: string  # The other guests
] {
  for name in $names {
    print $"Hello, ($name)!"
  }

  print $"And a special welcome to our VIP today, ($vip)!"
}

And help vip-greet will work as expected.

That's pretty classy.