Sunday 31 May 2015

What can C++ Libraries Learn from Lua?

The Convenience of Text Pattern Matching

const char *text = "Some people, when confronted with a problem, think \"I know, I'll use regular expressions.\". Now they have two problems.";

(Example for std::regex from http://http://en.cppreference.com/w/cpp/regex)

Parsing text with Regular Expressions is a powerful technique traditionally available with so-called 'scripting' languages, from the grand-daddy Awk through Perl onwards. Regular expressions are first-class citizens of these languages and their users get adept at using them. Of course, there can be too much of a good thing, and sometimes those users forget that they have perfectly good if-statements for making complex matches easier to write and read. This is why I like Lua string patterns precisely because they are more limited in scope. Strictly speaking, although similar, they aren't regular expressions since they lack the 'alternation' operator |. But they are a good balance between simplicity and power, which is the global virtue of Lua, together with smallness. (Lua in total weighs less than the PCRE library.)

C++ and its older brother C are not very good at text manipulation, by the standards set by these dynamic languages. So this investigation is about how we can use regular expressions for text wrangling, first in C and then in C++, and how the standard ways can be improved. My method will simply be to use the Lua string functions as a design template, whenever appropriate.

The C Way

For some obvious reasons, regular expressions are not so easy to use in C, although part of the POSIX standard and thus always available on compliant systems. There is no equivalent to regexp literals (/..../) so you have to compile the expression explicitly before you use it. You must also create a buffer to receive the positions where the match and its submatches occured in the string.

 regex_t rx;
 regmatch_t matches[1];  // there's always one match
 regcomp(&rx,"[[:alpha:]]+",REG_EXTENDED);

 if (regexec(&rx,"*aba!",1,matches,0) == 0) {
     printf("regexec %d %d\n",matches[0].rm_so,matches[0].rm_eo);
 } else {
     printf("regexec no match\n");
 }
 ...
 // explicitly dispose of the compiled regex
 regfree(&rx);

Doing something more realistic that just matching is a more involved. Say we wish to go over all the word matches with length greater than six in the above useful quotation:

 const char *P = text;
 char buffer[256];  // words are usually smaller than this ;)
 while (regexec(&rx,P,1,matches,0) == 0) {
     int start = matches[0].rm_so, end = matches[0].rm_eo;
     int len = end-start;
     if (len > 6) {
         strncpy(buffer,P+start, len);
         buffer[len] = 0;
         printf("got %d '%s'\n",len,buffer);
     }
     P += end; // find the next match
 }

Contrast with the Lua solution - note how '%' is used instead of '\' in regular expressions, with the character class '%a' meaning any char in the alphabet:

 for word in text:gmatch('%a+') do
     if #word > 6 then
         print ("got",#word,word)
     end
 end

So clearly there is room for improvement!

A Higher-level C Wrapper

I am not alone in finding classical regexp notation painful, especially in C strings where the backslash is the escape character. E.g. if we wanted to use special characters literally: "\$\(([[:alpha]]+\))". The Lua notation for this expression would simply be "%$%(%a+)%)" which is easier on the eyes and the fingers. So I've provided an option to rx_new to convert Lua notation into classical notation. It is a very simple-minded translation: '%' becomes '\', '%%' becomes '%', and the Lua character classes 's','d','x','a','w' become the POSIX character classes 'space', 'digit', 'xdigit','alpha' and 'alnum' within brackets like '[[:space:]]'. The semantics are not at all changed - these regexps only look like Lua string patterns, although mostly equivalent.

An optional POSIX-only part of llib (see rx.cin the lllib-p directory) provides a higher-level wrapper. rx_find is very much like Lua's string.find although we don't have the convenience of multiple return values.

 Regex *r = rx_new("[[:alpha:]]+",0);
 int i1=0, i2=0;
 bool res = rx_find(r,"*aba!",&i1,&i2);
 printf("%d from %d to %d\n",res,i1,i2);

 // Now find all the words!
 int start = 0,end;
 while (rx_find(r,text,&start,&end)) {
     int len = end - start;
     if (len > 6) {
         strncpy(buffer,text+start,len);
         buffer[len] = 0;
         printf("[%s]\n",buffer);
     }
     start = end;  // find the next match
 }
 ...
 // generic disposal for any llib object
 unref(r);

The need for an extra 'matches' array has disappeared and is now managed transparently by the Regex type. It's pretty much how a Lua programmer would loop over matches, if string.gmatch wasn't appropriate, except for the C string fiddlng - which is essential when you specially don't want to allocate a lot of little strings dynamically.

Here is a cleaner solution, with some extra cost.

 int j1=0,j2;
 while (rx_find(r,text,&j1,&j2)) {
     str_t word = rx_group(r,0);
     if (array_len(word) > 6) {
         printf("%s\n",word);
     }
     unref(word);
     j1 = j2;
 }

rx_group will return the indicated match as a llib-allocated C string; could have used our friend strlen but array_len is going to be faster, since the size is in the hidden header; I've put in an unref to indicate that we are allocating these strings dynamically and they won't go away by themselves.

So, some code for extracting 'NAME=VALUE' pairs in a string separated by newlines (the ever-flexible C for-loop helps with not having to do j1=j2 at the end of the loop body, where these things tend to get forgotten, leading to irritatingly endless loops.)

 Regex *pairs = rx_new("(%a+)%s*=%s*([^\n]+)",RX_LUA);
 str_t test_text = "bob=billy\njoe = Mr Bloggs";
 for (int j1=0,j2; rx_find(r,test_text,&j1,&j2), j1=j2) {
     str_t name = rx_group(r,1);
     str_t value = rx_group(r,2);
     printf("%s: '%s'\n",name,value);
     dispose(name,value);
 }

This is easier to write and read, I believe. Since the loop counters are not used in the body of the loop, and since this is C and not C++, you can write a macro:

 #define FOR_MATCH(R,text) (int i1_=0,i2_; rx_find(R,text,&i1_,&i2_; _i1=i2_)

There are also functions to do substitutions, but I'll leave that to the documentation. llib is linked statically, so using part of it incurs little cost - I'd estimate about 25Kb extra in this case. You will probably need to make normal copies of these strings - a general function to copy strings and other llib arrays to a malloc'd block is here:

 void *copy_llib_array (const void *P) {
     int n = array_len(P) + 1;   // llib always over-allocates
     int nelem = obj_elem_size(P);
     void *res = malloc(n*nelem);
     memcpy(res,P,n*nelem);
     return P;
 }

News from the C++ Standards Committee

Some may think I have anti-C++ tendencies, but a professional never hates anything useful, especially if it pays the bills. So I was interested to see that regular expression support has arrived in C++.

 #include <regex>
 using namespace std; // so sue me
 ...
 // 'text' is the Useful Quote above...
 regex self("REGULAR EXPRESSIONS",
         regex_constants::ECMAScript | regex_constants::icase);
 if (regex_search(text, self)) {
     cout << "Text contains the phrase 'regular expressions'\n";
 }

That's not too bad - the Standard supports a number of regexp dialects, and in fact ECMAScript is the default. How about looking for words longer than six characters?

 regex word_regex("(\\S+)");
 string s = text;
 auto words_begin = sregex_iterator(s.begin(), s.end(), word_regex);
 sregex_iterator words_end;

 const int N = 6;
 cout << "Words longer than " << N << " characters:\n";
 for (sregex_iterator i = words_begin; i != words_end; ++i) {
     smatch match = *i;
     string match_str = match.str();
     if (match_str.size() > N) {
         cout << "  " << match_str << '\n';
     }
 }

Perhaps not the clearest API ever approved by the Standards Committee! We can make such an iteration easier with a helper class:

class regex_matches {

 const regex& rx;
 const string& s;

public:

 regex_matches(const regex& rx, const string& s): rx(rx),s(s) {
 }
 sregex_iterator begin() { return sregex_iterator(s.begin(),s.end(),rx); }
 sregex_iterator end() { return sregex_iterator(); }
}; ....
 regex_matches m(word_regex,s);
 for (auto match : m) {
     string match_str = match.str();
     if (match_str.size() > N) {
         cout << "  " << match_str << '\n';
     }
 }

We're finally getting to the point where a straightforward intent can be expressed concisely and clearly - this isn't so far from the Lua example. And it is portable.

Substituting text according to a pattern is a powerful thing that is used all the time in languages that support it, and std::regex_replace does a classic global substitution where the replacement is a string with group references.

Alas, there are some downsides. First, this does not work with the GCC 4.8 installed on my Ubuntu machines, but does work with the GCC 4.9 I have on Windows. Second, it took seven seconds to compile simple examples on my I3 laptop, which is an order of magntitude more than I expect from programs of this size. So, in this case, the future has not arrived.

A C++ Wrapper for POSIX Regular Expressions.

Portability is currently not one of my preoccupations, so I set out to do a modern C++ wrapping of the POSIX API, in a style similar to llib-p's Regex type. (Fortunately, the GnuWin32 project has made binaries for the GNU regex implementation available - although they are only 32-bit. The straight zip downloads are what you want, otherwise you will probably have unwelcome visistors on your machine.)

When testing this library on large-ish data, I received a shock. My favourite corpus is The Adventures of Sherlock Holmes from the Gutenberg project; just over 100,000 words, and this regexp library (from glibc) performs miserably to do something that is practically instantaneous in Lua. (std::regex is much better in this department.) So I've taken the trouble to extract the actual Lua pattern machinery and make it available directly from C++.

Let me jump immediately to the words-larger-than-six example. It is deliberately desiged to look like the Lua example:

 Rxp words("%a+",Rx::lua);  // simple Lua-style regexps
 for (auto M:  words.gmatch(text)) {
     if (M[0].size() > 6)
         cout << M[0] << "\n";
 }

When modern C++ cooks, it cooks. auto is a short word that can alias complicated types - (here just Rx::match), but the range-based for-loop is the best thing since std::string. And I've got my order-of-magnitude smaller compile time back, which is not an insignificant consideration.

(if you want to use the Lua engine, then replace the first declaration with simply Rxl words("%a+").)

I resisted the temptation to provide a split method; heaven knows the idea is useful but doesn't have to be implemented that way. It would return some concrete type like vector<string> and it would introduce a standard library dependency other than std::string into the library. Rather, Rx::match has a template method append_to which will take the results of the above iteration and use push_back on the passed container:

 vector<string> strings;
 words.gmatch(text).append_to(strings);

If you wanted a list<string> instead, then it trivially happens. You can append to a container and so forth without awkward splicing.

What would it mean to populate a map with matches? I don't think there's one answer, since there is no clear mapping, but here is one way of interpreting it; the expresson must have at least two submatches,and the first will be the key, and the second will be the value:

 Rx word_pairs("(%a+)=([^;]+)",Rx::lua);
 map<string,string> config;
 string test = "dog=juno;owner=angela"
 word_pairs.gmatch(test).fill_map(config);

Again, this will work with any smart array, not just std::map, as long as it follows the standard and defines mapped_type. The user of the class only pays for this method if it is used. These are considerations dear to the C++ mindset.

I'm an admirer of Lua's string.gsub, where the replacement can be three things:

  • a string like '%1=%2', where the digits refer to the 'captured' group (0 is the whole match)
  • a lookup table for the match
  • a function that receives all the matches

The first case is the most useful. Here we have a match, where the submatch is the text we want to extract (called 'captures' in Lua.)

 Rx angles("<(%a+)>",Rx::lua);
 auto S = "hah <hello> you, hello <dolly> yes!";
 cout << angles.gsub(S,"[%1]") << endl;
 // --> hah [hello] you, hello [dolly] yes!

With the second case, I did not want to hardwire std::map but defined gsub as a template method applicable to any associative array.

 map<string,string> lookup = {
    {"bonzo","BONZO"},
    {"dog","DOG"}
 };
 Rx dollar("%$(%a+)",Rx::lua);
 string res = dollar.gsub("$bonzo is here, $dog! Look sharp!",lookup);
 // --> BONZO is here, DOG! Look sharp!

Forcing the third 'callable' case to overload correctly doesn't seem possible, so it has a distinct name:

 // we need a 'safe' getenv - use a lambda!
 res = dollar.gsub_fun("$HOME and $PATH",[](const Rx::match& m) {
     auto res = getenv(m[1].c_str());
     return res ? res : "";
 });
 // --> /home/user and /home/user/bin:/....

(One of the ways that GCC 4.9 makes life better is that generic lambdas can be written, and the argument type here just becomes auto&.)

In this way, string manipulation with regular expressions can be just as expressive in C++ as in Lua, or any other language with string pattern support.

The files (plus some 'tests') are available here, but remember this is very fresh stuff - currently in the 'working prototype' phase.

Transferable Design

The larger point here is that dynamically-typed languages can provide design lessons for statically-typed languages. C++ and C (particularly) are poor at string manipulation compared to these languages, but there is nothing special here about dynamic vs static: it is a question of designing libraries around known good practices in the 'scripting' universe and using them to make programming more fun and productive in C and C++.

There is a limit to how far one can go in C, but it's clear we can do better, even if llib itself might not feel like the solution. C is not good at 'internal' iterators since function pointers are awkward to use, and so 'external' iterators - using a loop - is better for custom subtitutions, etc.

C++ has already got a standard for regular expressions, but it will take a while for the world to catch up: we are being hammered here by heavy use of templates, all in headers. This is of course a classic example of too much of a good thing, because generic code is how C++ escapes the tyranny of type hierarchies, by compile-time static duck-typing. (The fancy word here is 'structural typing'.) For instance, Rxp::gsub can use anything that looks like a standard map.

Even there, I wanted to present an alternative design, not necessarily because I want you to use it, but because looking at alternatives is good mental exercise. The Standard is not Holy Scripture, and some parts of it aren't really ready, in a pragmatically useful way. In the comfort of our own company repository I can choose a solution that does not damage our productivity.

3 comments:

  1. We found your weblog is very handy for us. When you keep going this perfect work I’ll come back to your site!
    Aba File Generator

    ReplyDelete
  2. Thanks man! I've get a new library in the works - I have just _having opinions_, since they're so damn cheap.

    ReplyDelete