Friday, 2 December 2011

Reading CSV Data in Go using Reflection

Filling in Structs

A common data format is CSV (Comma-Separated Values). Although simple in principle, there are some tricky quoting issues fully discussed in the relevant RFC. Fortunately, the Go csv package handles all of this decoding for you.

For instance, this reads CSV data from standard input:

 r := csv.NewReader(os.Stdin)
 row, err := r.Read()
 for err == nil {
     ... do something with row, which is a slice of strings ...
     row, err = r.Read()
 }
 if err != os.EOF {
     fmt.Println("error", err)
 }

The delimiter need not be a comma; for instance you can set r.Comma to '\t' for tab-separated fields before reading.
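For example, to read tab-separated data:

 r := csv.NewReader(os.Stdin)
 r.Comma = '\t' // fields are now split on tabs rather than commas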

What you then do with the row is up to you; this article shows how to map selected columns onto the fields of a struct, performing conversions as needed. This provides a declarative way of describing your data.

The assumption is that your data has an explicit header giving column names:

 First Name,Second Name,Age
 John,Smith,67
 Jill,Tailor,54

The corresponding struct definition looks like this:

 type Person struct {
    FirstName string `field:"First Name"`
    Second_Name string
    Age int
 }

We are going to use reflection to discover the names and types of the exported (public) fields of the struct, and use optional field tags to find the corresponding column name when it isn't the same as the field name. (As a further convenience, underscores in field names correspond to spaces in the column name.)

Tags are only accessible using reflection, but they provide a simple mechanism to annotate struct fields with key/value pairs inside Go raw quotes (backticks).

This is a common pattern in Go packages that decode text encodings into structs, such as xml and json.
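To make this concrete, here is a minimal illustration of reading a tag through the reflect package, using the Person struct above:

 t := reflect.TypeOf(Person{})
 f, _ := t.FieldByName("FirstName")
 fmt.Println(f.Tag.Get("field")) // prints: First Name
 g, _ := t.FieldByName("Second_Name")
 fmt.Println(g.Tag.Get("field")) // empty string: no tag, so fall back to the field name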

Reflection in Go is fairly straightforward once you get the concepts. In particular, if you are used to Java-style reflection, you have to distinguish between types and pointers to types. We are going to create an iterator that runs over all rows in the file, passing it a pointer to our struct, whose fields are overwritten on each iteration. Reusing one struct generates less garbage and allows for some optimizations. The pointer is passed as interface{} and we get the actual run-time type of the struct itself like so:

 st := reflect.TypeOf(ps).Elem()

(The Elem() is necessary because the type of ps is a pointer).

Once we have the type of the struct, then st.NumField() is the number of fields and st.Field(i) is the type of each field in the struct. The tag 'field' is accessed using st.Field(i).Tag.Get("field"); if this is an empty string, then we have to use the name of the field. The column index of the field in the data can then be looked up in the first row, which has to be the column names.
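Putting that together, the header-matching step might be sketched like this (the names rdr, columns and ps are my own for illustration, not necessarily those used in csvdata):

 st := reflect.TypeOf(ps).Elem()
 header, _ := rdr.Read() // the first row must be the column names
 columns := make([]int, st.NumField())
 for i := 0; i < st.NumField(); i++ {
     name := st.Field(i).Tag.Get("field")
     if name == "" { // no tag, so use the field name, underscores becoming spaces
         name = strings.Replace(st.Field(i).Name, "_", " ", -1)
     }
     columns[i] = -1 // -1 will mean 'no such column in the data'
     for j, col := range header {
         if col == name {
             columns[i] = j
             break
         }
     }
 }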

Reading and converting each line proceeds as follows: we know the 'kind' of the field type and switch on five possibilities: string, integer, unsigned integer, float, and value (custom types, discussed below under Arbitrary Types). We can then safely use the appropriate strconv function to convert the string and the appropriate reflect.Value method to set the field's value; for instance, all integer fields are set with the SetInt method, which takes the largest integer type (int64) for all integer widths.
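In sketch form, the conversion for one field looks something like this, reusing the columns index from the previous sketch. (I'm writing the strconv calls with the newer names ParseInt, ParseUint and ParseFloat; the release this article was written against spelled them Atoi64, Atoui64 and Atof64.)

 val := reflect.ValueOf(ps).Elem().Field(i)
 s := row[columns[i]] // the raw string for this field's column
 switch val.Kind() {
 case reflect.String:
     val.SetString(s)
 case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
     n, _ := strconv.ParseInt(s, 10, 64)
     val.SetInt(n) // SetInt takes int64, whatever the field's actual width
 case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
     n, _ := strconv.ParseUint(s, 10, 64)
     val.SetUint(n)
 case reflect.Float32, reflect.Float64:
     f, _ := strconv.ParseFloat(s, 64)
     val.SetFloat(f)
 }
 // (the fifth case, custom Value types, is covered under Arbitrary Types below)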

The use of this API is intended to be as simple as possible. As usual with Go code, there is very little explicit typing of variables needed.

 r := csv.NewReader(in)
 p := new(Person)
 rs, _ := NewReadIter(r, p)
 for rs.Get() {
     fmt.Println(p.FirstName, p.Second_Name, p.Age)
 }
 if rs.Error != nil {
     fmt.Println("error", rs.Error)
 }

The first argument to NewReadIter involves a particularly Go-like solution to the problem of over-specific types. When I was first refactoring test code into a general solution, my structure looked like this:

 type ReadIter struct {
     Reader *csv.Reader
     ...
 }

Well, that's a very C-like way of doing things. We actually don't need to have an explicit dependency on the csv package, since all we require of our reader is that it has a Read method like csv.Reader:

 type Reader interface {
     Read() ([]string, os.Error)
 }
 type ReadIter struct {
     Reader Reader
     ...
 }

This use of interfaces will be familiar to Java programmers, but there is a big difference. If we had a Java package called csv, we would have to use the same interface as csv.Reader; and if that class was not designed generally enough and did not implement an interface, we would have to write a bridging class that implemented our interface for csv.Reader. That leads either to close coupling or to lots of adapter code. Go interfaces, however, can be thought of as 'compile-time duck typing'. All our Reader interface requires of a type is that it implements particular methods. This csvdata package does not depend on the csv package in any way, and you could use any object that knows how to feed us string slices via a Read method. This is similar to the concept of protocols in Objective-C.
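For instance, a canned in-memory source for testing only needs the right Read method (sliceReader here is my own illustration, not part of the package):

 type sliceReader struct {
     rows [][]string
 }

 func (s *sliceReader) Read() ([]string, os.Error) {
     if len(s.rows) == 0 {
         return nil, os.EOF
     }
     row := s.rows[0]
     s.rows = s.rows[1:]
     return row, nil
 }

A *sliceReader can now be handed to NewReadIter anywhere a *csv.Reader could be.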

Arbitrary Types

So far, the reflection code handles basic types. It's not hard however to find examples of data containing dates and other custom types. You could keep dates as strings and parse them as needed, but it's better to keep a canonical internal form.

A solution is provided by the flag package for command-line parsing, where you can handle new types by making them implement the Value interface:

 type Value interface {
    String() string
    Set(string) bool
 }

Types with a String method are ubiquitous in Go (in particular, fmt.Println knows how to use them). The Set method is for custom parsing of strings into your type.

Parsing and displaying times is handled by the time package. It's a little eccentric but easy once you get the pattern. The date/time format is provided as a string that looks just like the format you want, but with standard values for the various parts. So, month is '01', day is '02', hour is '03', minute is '04', second is '05' and year is '06'. Hour may also be '15' for 24-hour format, and '06' may be '2006' for four-digit year format.

 const isodate = "2006-01-02"

This will format and parse dates in ISO order.
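For example (error handling elided):

 t, _ := time.Parse(isodate, "2011-12-02")
 fmt.Println(t.Format(isodate)) // prints: 2011-12-02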

A date type that implements Value is straightforward - the only constraint on Value is that its methods must be defined on a pointer, so that Set can modify the value. We can extend time.Time by embedding a pointer to it in our Date struct:

 type Date struct {
    *time.Time
 }

Note that this field has no name, just a type! Writing it like this means that all methods and fields of *time.Time are available to *Date.

 func (this *Date) String() string {
    return this.Format(isodate)
 }

To explicitly access the implicit Time field, use its type name:

 func (this *Date) Set(sval string) bool {
     t, e := time.Parse(isodate, sval)
     if e != nil { // leave the old value alone on a parse failure
         return false
     }
     this.Time = t
     return true
 }

This technique is discussed in Effective Go. Extending a type in this way is not inheritance in the classic sense. If I defined the method Format on Date then this.Format(isodate) would call the local method, and the embedded type's Format method would have to be explicitly called as this.Time.Format(isodate); the new method does not override the original method.
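For instance, this shadows the embedded method rather than overriding it:

 // Date values now get this method; the embedded version is still
 // reachable explicitly as this.Time.Format(layout)
 func (this *Date) Format(layout string) string {
     return "[" + this.Time.Format(layout) + "]"
 }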

So given data that looks like this:

 Joined,First Name,"Second Name",Age
 2009-10-03,John,Smith,67
 2010-03-15,Jill,Tailor,54

the struct will be

 type Person struct {
    Joined *Date
    FirstName string `field:"First Name"`
    Second_Name string
    Age int
 }

and reading code will look like this:

 p := new(Person)
 p.Joined = new(Date)
 rs, _ := NewReadIter(r, p)
 for rs.Get() {
     fmt.Println(p, p.Joined.Year)
 }

Again, there is no dependency introduced on the flag package. But we get bonus functionality if we do use flag, since command-line parameters can now be defined using the same Date type.

 var date Date = Date{time.LocalTime()}
 flag.Var(&date, "d", "date to filter records")

This is all straightforward when the underlying value is already a pointer. But sometimes you need to extend a primitive type to customize how it is parsed and presented. An example is the abomination known as the 'thousands separator', where a numerical field may look like "22,433".

 type Float float64

 func (this *Float) String() string {
     return strconv.Ftoa64(float64(*this), 'f', 10)
 }

 func (this *Float) Set(sval string) bool {
     sval = strings.Replace(sval, ",", "", -1) // strip the separators first
     v, e := strconv.Atof64(sval)
     *this = Float(v)
     return e == nil
 }

(I'll leave the job of printing the values back out in the abominated form as an exercise.)

The problem is that these values are not pointers. The reflection trick is to do the type assertion val.Addr().Interface().(Value) on each value: if it succeeds, then we know that the pointer to the value (from Addr()) satisfies the Value interface. Just checking the kind of the type is not sufficient, because this Float type still has kind reflect.Float64.
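A sketch of that check, slotted in ahead of the kind switch shown earlier (again reusing my columns index from above):

 val := reflect.ValueOf(ps).Elem().Field(i)
 if v, ok := val.Addr().Interface().(Value); ok {
     v.Set(row[columns[i]]) // the custom Set does the parsing
 } else {
     // otherwise fall through to the plain kind switch
 }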

Issues and Possibilities

Using reflection is always going to be slower than direct manipulation, but in practice this does not make much difference. (It takes less than a second to process a data set of 23,000 records with 20 columns on this modest machine, and direct conversion is not significantly faster.)

Keeping up with a rapidly evolving target like Go is always a challenge, and as I was writing this article a new weekly release came out that changed the meaning of time.Time - in short, it's no longer a pointer to a structure (so lose the stars, basically).

An orthogonal feature that I would like is the ability to read and convert columns of data from CSV sources. It's common enough to be useful, and not difficult to do, but would not need any reflection techniques.

An obvious extension is writing structures out in CSV format. This should be straightforward, and there are lots of opportunities for code reuse, but this code is intended primarily as an example. Excluding doc comments, it's only about 190 lines of code, which is within the magic number limit defining 'example'.

If it's useful to people and they need these features, then I'll promote csvdata to a proper goinstallable package.

For now, the package and some examples are here.

3 comments:

  1. I was planning on writing something like this myself, thanks for the thorough explanation and for saving me some time. One question, instead of `field:"first name"`, wouldn't `csv:"first name"` be more fitting?

  2. I agree with you, it's more specific. I should update this and make it go-gettable as well!

  3. Agreed - this was a great post, and would love to see it updated and go-gettable!
