Thursday 29 September 2016

A Modest Alternative II: Taming stdio's scanf

The Ugly Brother

My favourite Irish joke (and I'm Irish enough to tell it) concerns a posh gent who is lost in Dublin. He asks a bystander, "Tell me, my good man, how can I get to the National Museum?", who replies: "Well sir, I wouldn't go from here". This is the received opinion about scanf, the ugly brother of printf.

It tends to appear only in the kind of beginner tutorials where programs ask the user to provide two numbers and then add them up. Serious people avoid scanf because it's hard to use properly: it is easily confused, and getting meaningful errors out of it is tricky. Beginners read the tutorials, then ask questions about how to bullet-proof scanf, and get told by serious people not to use it.

scanf's glamorous cousin std::cin is easier to use, because it matches against strongly-typed non-const references - reading a constant is a compile error. But error handling is not straightforward.
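
For instance, recovering from a bad read with std::cin means clearing the fail state and discarding the offending input yourself; a minimal sketch:

int n;
if (! (std::cin >> n)) {
    std::cin.clear();              // reset the fail state
    std::cin.ignore(1000, '\n');   // throw away the rest of the bad line
}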

Besides, sometimes you have neither the resources nor the inclination to use iostreams, as I discussed in my last article.

Line by Line

A common strategy is to read a text file, line by line. Here std::getline is superior to fgets; it allocates more buffer if needed and trims end-of-line characters.

The outstreams library provides a very similar interface:

string line;
stream::Reader in("myfile.txt");
while (in.getline(line)) {
   do_something(line);
}
if (! in) {
   stream::errs(in.error())('\n');
}
// if failed:
// --> No such file or directory

It's always a good idea to check for errors; if the error state is set, no further read operations will take place, so it's perfectly fine to check afterwards.

So far, this is very much like the iostreams istream interface, except that you get a sensible error without having to call perror yourself (which belongs to stdio, ironically).

From now on I'll assume that you have broken down and said 'using namespace stream'. Here's a one-liner which populates a container:

vector<string> lines;
Reader in("myfile.txt");
in.getlines(lines);
// errors!?

The container can be anything which understands push_back - a std::list would work just as well here. There is an optional second argument to getlines which gives the maximum number of lines to grab. (It's not how these things are typically organized; you would pretend that Reader was a container - or write an adapter - and use std::copy and std::back_inserter. This is more general but significantly uglifies the common case.)
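
So, assuming the second argument really is just a maximum count, grabbing the first ten lines into a std::list looks like this:

list<string> first_ten;
Reader in("myfile.txt");
in.getlines(first_ten, 10);   // stop after ten lines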

Then there is a readall method which grabs all of the file and puts it into a std::string. The weakness of the standard string - that it knows nothing about character encoding - becomes a strength when you treat it as a sliceable, appendable bag of bytes.

Having captured the lines, those serious people will typically then use serious conversion functions like std::stoul etc to actually extract values and get errors.
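
That dance looks something like this - std::stoul throws on garbage and on overflow, so both have to be caught:

try {
    unsigned long n = std::stoul(lines[0]);   // first captured line
    outs(n)(eol);
} catch (const std::invalid_argument&) {
    errs("not a number")(eol);
} catch (const std::out_of_range&) {
    errs("out of range")(eol);
}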

A Comedy of Errors

Things become less than simple with both stdio and iostreams when reading items one by one. scanf returns the number of items successfully processed, or EOF if we run out of stream. If the number scanned is too big for the type, garbage results. (The man page does claim that it will set errno if a conversion resulted in a value out of range, but it appears to be lying.) So serious use involves lots of checking.
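
The usual defensive pattern (here with sscanf on a captured line) is to count the conversions yourself, and accept that you won't know exactly which field went wrong:

int day, year;
char month[16];
if (sscanf(line.c_str(), "%d %15s %d", &day, month, &year) != 3) {
    // a field was missing or malformed - but which one, and where?
}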

I've tried to tame these errors by putting a facade around the actual scanf calls, rather as outstreams wrapped printf:

double x;
string s;
short b;

StrReader is ("4.2 boo 42");
is (x) (s) (b);

You always should be aware of errors: files may be mistyped, mangled by the network, or maliciously altered.

is.set("x4.2 boo 42");
is (x) (x) (b);
errs(is.error())('\n');
// --> error reading double
// at 'x4.2'
is.set("x4.2 boo 344455555");
is (x) (x) (b);
errs(is.error())('\n');
// --> error converting int16 
// --> out of range
//  344455555
is.set("x4.2 boo x10");
is (x) (x) (b);
errs(is.error())('\n');
// --> error reading int64 at
// 'x10'

('int64' for reading a short? Because I cannot assume that the read value is in range.)

There's an even more thorough way to handle errors: an error struct which can be 'read' from a stream.

Reader::Error err;

if (! is (x) (s) (b) (err)) {
// err.errcode  EOF or errno
// err.msg as returned by error()
// err.pos position in file
}

// safe one-liner:
// capture the error state
// before reader object dies
if (Reader("tmp.txt")
    .getline(line)
    (err)
) {
   // cool
} else {
   // bummer! But at least
   // you know _where_
}

So, in summary so far, Reader (and its string-oriented cousin StrReader) provides a safer way to use stdio for input, with better error handling, without the baroque contortions of iostream errors.

Why no exceptions? It is partly a matter of taste; exceptions are often forbidden in embedded coding, and it can be argued that paying attention to the error where it happens leads to better code. You are completely free to throw your own exception after checking for an error, but then it won't be some generic 'file not found' exception which makes no sense several stack frames away.
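
For example, after checking you can throw something that still knows what went wrong (a sketch, assuming error() returns a string-like message as the earlier snippets suggest):

Reader in("myfile.txt");
if (! in.getline(line)) {
    // wrap the reader's own message in your exception of choice
    throw std::runtime_error(in.error());
}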

Some useful tricks

It's possible to read a file of numbers directly:

outstreams$ cat numbers.txt
10 20 30 40
50 60 70 80
...
int i;
Reader in("numbers.txt");
while (in (i)) {
    outs(i);
}
outs(eol);
// -> 10 20 30 40 50 60 70 80

Of course, the read could be replaced by in >> i with istream and it would work in the same way, except that any errors would not give as much information.
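
Here is the iostreams version for comparison - the loop is identical, but a failure is just a failbit:

int i;
std::ifstream fin("numbers.txt");
while (fin >> i) {
    outs(i);
}
outs(eol);
// -> 10 20 30 40 50 60 70 80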

What if we only wanted the first three numbers on each line?

int j, k;
Reader in("numbers.txt");   // start again at the top of the file
in (i) (j) (k) ();          // read three numbers, then skip the rest of the line
outs(i)(j)(k)(eol);
// -> 10 20 30    
in (i) (j) (k) ();
outs(i)(j)(k)(eol);
// -> 50 60 70

The no-argument overload of the call operator is equivalent to the skip method, which generally takes the number of lines to skip.
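
So skipping a header line before reading is just (assuming skip takes a line count, as described):

Reader in("numbers.txt");
in.skip(1);         // ignore the first line entirely
in (i) (j) (k);
// -> 50 60 70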

Binary files

The read method comes in two flavours. The first is given a buffer with its size, and returns the number of bytes actually read.

Reader self ("rx-read");
self.setpos(0,'$');               // seek to the end of the file
auto endp = self.getpos();
outs("length was")(endp)(eol);
// --> length was 42752
self.setpos(0,'^');               // back to the start
auto buff = new uint8_t [endp];
auto res = self.read(buff,endp);
outs("read")(res)("bytes")(eol);
// --> read 42752 bytes
delete [] buff;

setpos is a little eccentric, but will make sense to people who use regular expressions - '^' means 'start of file' and '$' means 'end of file'; anything else means current position.

(The std::istream method of the same name and signature returns the stream, which is consistent, but awkward if you're interested in the bytes actually read.)

The other form of read is a template method that takes one argument and returns the stream. It's useful for objects that have a fixed size at compile time, like structs and arrays.

MyHeader st;
MyInfo info;
// inf is a Reader opened on the binary file
inf.read(st).read(info);
// error state will be EOF if we didn't read it all

Commands

Executing external programs and capturing their output is a pain point in C++ for me. So naturally I wanted to wrap that excellent old dog popen and make it easier to use.

CmdReader ls("ls *.cpp");
vector<string> cpp_files;
ls.getlines(cpp_files);

string uname =
    CmdReader("uname").line();
if (uname == "Linux") { 
    outs("that's a relief")(eol);
}

CmdReader derives from Reader - all it need do is override close_handle so that pclose is called instead of fclose. It has an extra convenience method, line, for grabbing the first line of output. stdout and stderr are merged.

If you aren't particularly interested in the output, just in success or failure, there are a few convenient patterns:

// quick cool/uncool check
string res = CmdReader(
    "true", 
    cmd_ok
    ).line();
if (res != "OK") {
    outs("very weird shell")
        (eol);
}

// actual return code
int retcode;
CmdReader("false",cmd_retcode)
    (retcode);

Wrapping Up

We have lost some scanf functionality by breaking up the format into little bits.

However, if you already have a means to split the input stream into parts, then Reader machinery can be re-used to do the actual conversion - see templ-read.cpp:

int n;
double x;
string s;
auto parts = {"42","5.2","hello"};
auto rdr = make_parts_reader(
    parts.begin(),
    parts.end()
);
rdr (n) (x) (s);

This pattern can be made to work with any source of strings, like regular expression matches, database queries, and so forth.
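
For instance, matches pulled out with std::regex can be collected into strings and fed to the same machinery (a sketch; I'm assuming make_parts_reader is happy with any iterator range over strings):

// pick the numeric fields out of a line, then let the reader do the conversions
regex num("[0-9.]+");
string text = "width=42 scale=5.2";
vector<string> parts;
for (sregex_iterator it(text.begin(), text.end(), num), end; it != end; ++it) {
    parts.push_back(it->str());
}
int w;
double scale;
auto rdr = make_parts_reader(parts.begin(), parts.end());
rdr (w) (scale);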

I tried to show that scanf can be tamed, and it then becomes more pleasant and reliable to use. This library is not a heavy dependency (instream and outstream do not depend on each other and are each a single source file) and provides a middle option when choosing between stdio and iostreams.

Highly structured input formats are best read with the correct tools - we do have JSON, CSV, config-file, etc parsers. I suspect the problem here is that there is no browsable and discoverable repository of useful C++ libraries like Cargo for Rust, etc.
