Monday, 27 April 2015

llib - a Set of Useful C Libraries

C: Why still Bother?

"Programmers waste enormous amounts of time thinking about ... the speed of non-critical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Donald Knuth

I'll start with the classic Knuth quote in context: even when a program is CPU-intensive, it turns out that usually there's a small part of that program that actually needs to be fast and efficient. This has implications for how we use more low-level languages like C.

It's common to come across the postion that C has been superceded by C++. Since modern C is a kind-of subset of C++, the feeling is that one can write 'better C' and cherry-pick the convenient parts of C++, like strings and type-safe containers. Alas, but then you move instantly from writing good C to writing bad C++, at least according to the guardians of good C++ style. But other things being equal, C++ has definitely got the edge when it comes to easy-to-use containers and strings, and I've enjoyed that style enough to want it in C as well.

Personally, I enjoy C, I enjoy the fast compiles and small executables that don't have burdensome dynamic dependencies. Call me impatient, but waiting half-a-minute for a program to compile tends to clear out my active mental cache. After all, we have to work within constraints, including our own cognitive limitations. Arguing with a C++ compiler sometimes feels like wrestling with a pig, anyway. The lack of abstractive power in C becomes a strength when debugging code, while common C++ abstractions become hard to inspect.

However, C has a small standard library, so any non-trivial project tends to collect ad-hoc utility functions. Half-baked re-invented wheels litter the landscape of these systems. The clean narrative of C programs is often disturbed by outbreaks of pointer gymnastics, hard to read and maintain. This is because of not understanding Knuth's point - the 97% really doesn't have to be fast - and because higher-level coding in C always involves manual resource management. For instance, if I write a function to split a string using a delimiter into an array of strings, I will have to provide a function which cleans that array up. Any non-trivial C object will have a named destructor. For instance, in the mosquitto messaging API, we have mosquitto_sub_topic_tokenize and the corresponding mosquitto_sub_topc_tokens_free. POSIX regular expressions haveregcomp and regfree, and so forth.

Automatic storage (arrays allocated on the stack) have a powerful appeal since they're fast (no overhead!) and forgiving (they go away!). But how big is your stack? I don't know, and with recursion no-one knows. This is classic micro-optimization and is a source of joy for crackers who know how easy it is to smash the stack for fun and profit. Not to say that they are evil (I think we should reserve this term for mass murderers and the inventors of BASIC), just over-used without care. They are fast, but because they're in shoot-yourself-in-the-foot territory they should be used when code needs to be fast.

The llib project provides a set of BSD-licensed, useful libraries built around a simple object system. meant to be statically linked with your program. A good contrast would be Gnome's GLIB or Apache's APR, which are big awkward dynamic dependencies. Since this is now officially the 21st Century (many historians believe the 20th started in 1914) - it is modern C. Since we live in an imperfect world but still believe in cross-platform compatibilty, it can be compiled as C++, making it available to Visual Studio users. This will help to explain why the code is unnecessarily pedantic (like explicitly needing to cast from void* to T*. The source is also available as zipball.

Of course, deep embedded programmers, who will work with 4K RAM without obvious signs of distress, play another game altogether. This is not for them.

C with Objects

A llib 'object' is a pointer allocated with a header 'behind' the data - the block is allocated, header written, and the returned pointer is just past the header. There are no smart pointers in C, only dumb pointers with hidden secrets:

 typedef struct ObjHeader_ {  // 64 bits
     unsigned int type:14;
     unsigned int is_array:1;
     unsigned int is_ref_container:1;
     unsigned int _ref:16;
     unsigned int _len:32;
 } ObjHeader;

These objects are reference counted, so they may be easily shared by other objects - we replace the problem of ownership with the problem of sharing. For instance, llib strings are C strings which always have a terminating NUL-byte, but they have type information and must be freed using obj_unref:

 #include <llib/obj.h>
 char *s = str_new("hello dolly");
 printf("%d %d\n", strlen(s), array_len(s));  // -> 11 11
 printf("type %s is_array %d\n", obj_typename(s), obj_is_array(s));
 // -> type char is_array 1
 obj_unref(s);   // decrement the reference count

You may safely use these strings in any context expecting a C string, as long as you don't free them. Their header has type information, reference count and array size. This is useful if you have strings that contain embbedded NUL characters. The reference count starts as one, and when it becomes zero, the pointer is deallocated - this is called disposal.

 char *a = str_new("hello world");
 char **s = array_new(char*,2);
 a[0] = obj_ref(s);  // rc -> 2
 a[1] = obj_ref[s); // rc -> 3
 obj_unref(s);  // rc -> 2
 // string s is still very much alive

Stroustrup says "In C, arrays don't even know their size". These arrays do, however, which simplifies APIs that use them. For instance, str_split will return an array of strings, and array_len will tell you how big that array is. But how do we solve the pesky problem of freeing that array together with its strings?

 char **ra = array_new_ref(char*,2);
 char *t = str_new("frodo");
 ra[0] = ref(t);   // informal shortcut name for obj_ref
 ra[1] = ref(t);
 // all dead!
 char **strs = str_split("hello world"," ");
 assert(array_len(strs) == 2);
 // the array and its two strings are no more...

ra is a reference array which will unref all its elements when it is disposed. If str\_split was done in the old-fashioned way, I would also have to provide 'strings_free(for cleanup) andstrings_len` (perhaps depending on a NULL-terminated array convention). So a little more intelligence in objects means less API cognitive load.

Virtual Destructors

To llibify an object is straightforward, although more tedious than C++

 typedef struct {
     const char *name;
     int age;
 } Person;
 static Person_dispose(Person *p) {
 Person *person_new(const char *name, int age) {
     Person *res = obj_new(Person, Person_dispose);
     res->name = str_ref(name);
     res->age = age;
     return res
 Person *mother = person_new("Helga",52);
 char *name = str_new("Bob");
 Person *father = person_new(name,56);
 assert(obj_refcount(father->name) == 2);
 // name is still alive with rc == 1

So (a) objects share other objects explicitly using refcounting (b) they may have a disposal function which is called when they are disposed; they can then hand the objects back to the system. But obviously that disposal function can do much more, like freeing resources, so this is pretty much what C++ calls a virtual destructor.

Why str_ref? Because we have no guarantee that the string will be one of ours. If it's a plain C string, statically or dynamically allocated, then we'll have to make a refcounted copy.

You still have to manage the lifetime of your shared objects, although replacing free with unref gets you half way. The other half is remembering to increment the reference count when a function or object is 'borrowing' a pointer.

It would be most cool if the reference to an object would automatically be decremented when it went out of scope. This can't be done using standard C99 but GCC has a variable attribute cleanup which will call a function when that variable goes out of scope. Clang implements this as well (as does Intel C on Linux) so this is too good not to use. llib wraps this up as a macro:

    obj_scoped Person *john = person_new("john",23);
 } // john is gone!

Since objects may have custom disposal functions, this is effectively C++'s RAII pattern: people are obsessed with avoiding memory leaks, but resource leaks can be a bigger problem. It's also a way to avoid the single-return pattern that larger C functions adopt, since we now have a disposal guarantee for objects - multiple returns don't have to explicitly unref the objects falling out of scope.

This is not a new idea, of course; see "implementing smart pointers for the C programming language". I don't foreground it as much in llib because it is a non-standard feature which I cannot (yet) find a way to do in C++ (the fallback position) without resorting to actual smart pointers.

Resizable Arrays and Values

llib provides doubly-linked lists and binary-tree maps because, well, that's what we learned at school. These will aways be awkward in C; either they must store untyped pointers for you, or your types must have the necessary next/previous and left/right pointers in them. It turns out that resizable arrays are flexible and easy to make strongly-typed.

In llib, such arrays are called 'seqs' (inspired by my brief flirtation with Nim). They are essentially pointers to arrays, with a capacity, and they resize using the familar doubling method, like std::vector.

 int **ss = seq_new(int);
 FOR(i,100) {
     seq_add(ss, 10*(i+1));
 int *a = *ss;  // can always access the array
 assert(a[0] == 10 && a[1] == 20);
 // now take ownership of the array
 a = seq_array_ref(ss);
 // ss is now dead;  a has rc == 1

A common idiom is to build up an array using a seq, and then use seq_array_ref to provide you with the final array, sized to fit the actual number of elements. There are also functions to insert and remove elements, although of course these may be inefficient for large arrays. But as always, "when in doubt, use brute force", as Ken Thompson said.

Associative arrays are just one of those very cool things you absolutely need. But you don't necessarily need binary search. In llib 'simple maps' are arrays of pointers, where the even elements point to strings and the odd elements point to your data. Combining this with seqs results in the smap_* family of functions.

Now it turns out with these two concepts, you can do dynamic data, or as it is pronounced in Computer Science, Lisp. For instance, JSON data maps neatly into arrays of maps of arrays and so forth. To complete the picture we simply need to 'box' primitive values as llib objects, which is a well-known strategy in languages like Java and C# which are semi-dynamic. So value_float will make a typed object out of a double value, the predicate value_is_float can be used to test the type, and value_as_float will 'unbox' the double value. The representation is very simple, just a pointer to a double, plus llib type information. With this, one can construct, read and write arbitrary JSON-style data.

I suspect that many C programmers would regard this as the kind of thing that keeps them from Java . But, this is not an essential part of using llib, just a flexible way of looking at data. You do not (for instance) need to box primitive types in a normal homogenous array, only in arrays that may contain any kind of data. The JSON parser will unbox such an array for you, if it finds that all its values are numbers.

 double *xx = array_new(double,3);
 xx[0] = 1;
 xx[1] = 2;
 xx[2] = 3;
 // VAS is short for Value Array of Strings - tolerantly accepts any objects as well
 PValue* p = VAS("one","two","three",xx);
 // ==> ["one","two","three",[1,2,3]]
 // the other way
 PValue obj = json_parse_string("[10,20,30,40]");
 if (obj_is_instance(obj,"double")) { // check the type!
    double *aa = obj;
    FOR(i,array_len(aa)) printf("%f ",aa[i]);
 // ==> 10.000000 20.000000 30.000000 40.000000

Values provide another solution to an old C problem: how to return errors. Everyone agrees that errno is kind of icky, but we can only return one value directly.

 const char *str (int i) {
     if (i == 0)
         return value_errorf("number is bad %d",i);
     char buff[20];
     return str_new(buff);
 const char *res = str(0);
 if (value_is_error(res)){
     fprintf("error: %s\n",res);
 } else { // ok!
 obj_unref(res); // in both cases

str always returns a llib string, but an error is a string with a different dynamic type, so that value_is_error(res) is true.

What do Programs Need?

I don't want to give a full overview of the llib API, since there's The Documentation - (I've already been taken to task for writing artcles longer than the average attention span.)

Suffice to say that a major obsession of llib is string manipulation and common file format support, since the original C API is very basic. We start with an analysis of what non-trivial programs need in library support:

  • parsing command-line arguments arg
  • reading configuration files config
  • file and path operations file
  • flexible and more expressive string handling str
  • reading tabular data table
  • reading and writing XML and JSON xml, json
  • creating documents from templates template
  • parsing structured documents using a lexical scanner scan

It would be an unusual program that used all of this, but the beauty of static linking is "you don't pay for what you don't use". XML/JSON is part of the age we live in, but (particularly for the former) I don't claim these libraries are industrial - their niche is a program needing to read simple XML configuration files, or needing to speak JSON to external programs.

The other beauty of static linking is building executables with no unusual dependencies.

What's the Damage?

This all has a cost since each object must have a eight-byte hidden header. So the memory overhead for lots of small objects can be significant.

Most C programmers (incuding myself) don't like framework fascism, You may safely use parts of llib inside a C application but you must at least use obj_unref instead of free to clean up. It may be necessary to copy strings and arrays if they later might be freed.

The main discipline needed is to religiously balance every ref with an unref; be aware that reference arrays (and other reference containers) decrement the reference count of all their elements. But adding an object to such a container doesn't automatically increase its reference count, since that must be explicit. My experience is that it's hard to do this as an afterthought - write careful leak-free code from the start if the code is going to be part of a larger application. Apparently this was the story when folks tried to package Git as a library - Mr Torvalds had written the fastest code he could, and didn't bother to clean up memory since the process wouldn't last long enough for it to matter.

Currently, llib isn't thread-safe - the allocations (obj_new) access and modify static program state. unref can bite you if you're unlucky (and eventually everyone is unlucky with threading). Because of this static state, if both a program and a shared library use llib, then they must dynamically link to the core of llib. Fortunately, this is only a few Kb.

A tricky problem remains: of reliably detecting whether a pointer is 'one of ours'. This is principally needed when deciding to make a llib copy of a passed C string - otherwise it would make these libraries awkward to use, if you had to explicitly call str_new (or simply obj_ref) when passing strings. No solution seems entirely satisfactory; on Linux I keep track of high- and low- allocated memory pointers, which appears to work, but it would be easily confused by pointers allocated with just plain malloc. On Windows the multiple heaps used by the runtime make such a simple heuristic impossible, so I resort to keeping track of every allocation. Eventually, this becomes inefficient.

A Longer Example

The utility pkg-config is a descendent of the original gtk-config, which was needed to make building GTK+ applications less painful. As you can see from the end of the linked man page, this program has been re-written a few times. Unfortunately, it has picked up some awkward dependencies along the way, especially GLIB. I suspect its last author was having too much fun, so I decided to do another rewrite in the same spirit of indulgence.

pkg-config --flags gtk+-2.0 will give you all the C compile flags needed for compiling a GTK+ program, and --libs gives all the linker flags. It works by looking up files with extension .pc, which tells you what build flags that 'package' needs, plus what packages it further depends on, or requires. It's a classic file and text heavy program that any Python programmer would regard as unsuited to the old dog C. But (sorry lads) it has to be fast, because it's been shelled out a lot in makefiles all over the Gnomosphere.

It's a bit long to include in an article, so I invite you to read it here. I've tried to write it in as literate a style as possible, So it is deliberately kept as a single 563 line program.

Although a little larger (and it's hard to find 51K large), pkgconfig has no dynamic dependencies other than libc (I've excluded Linux-specific stuff here for clarity; it does no fancy stuff so easy to recompile for Windows and so forth as well.)

 pkgconfig$ ll pkgconfig
 -rwxrwxr-x 1 user user 51268 Apr  5 13:35 pkgconfig*
 pkgconfig$ ldd pkgconfig => /lib/i386-linux-gnu/ (0xb75f1000)
 pkgconfig$ ll $(which pkg-config)
 -rwxr-xr-x 1 root root 38592 Jul 14  2013 /usr/bin/pkg-config*
 pkgconfig$ ldd $(which pkg-config) => /lib/i386-linux-gnu/ (0xb7625000) => /lib/i386-linux-gnu/ (0xb7618000) => /lib/i386-linux-gnu/ (0xb7468000) => /lib/i386-linux-gnu/ (0xb742a000)