Why C?

Why not C?

C vs. Go, D, Rust, etc.; C is fast

I love Go (https://golang.org): I think it’s one of the best things ever to happen to our craft, and I use it often. The D language (http://dlang.org) is an exciting and elegant successor to C++ (more about which below); D has many of Go’s strengths, with a tighter stylistic similarity to C. And initial experiments with Rust are intriguing. Yet with none of them could I obtain the throughput I get in C.

Specifically, I did simple experiments in several languages: Ruby, Python, Lua, Rust, Go, D, and Nim. In one I just read lines and printed them back out: a line-oriented cat. In another I consumed input lines like x=1,y=2,z=3 one at a time, split them on commas and equals signs to populate hash maps, transformed them (e.g. removed the y field), and emitted them. This is essentially mlr cut -x -f y with DKVP format. I didn’t do anything fancy: I just used each language’s getline, string-split, hashmap-put, etc. Nothing was as fast as C, so I used C. Here are the experiments I kept (I failed to keep the Lua code, for example): C cat, another C cat, D cat, Go cat, another Go cat, Rust cat, Nim cat, D cut, Go cut, Nim cut.
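
To make the cut experiment concrete, here is a minimal sketch of the field-dropping step in C. It is not the code from the experiments listed above: the function name dkvp_cut_except and its signature are hypothetical, and for brevity it filters the pairs in a single pass rather than populating a hash map. The caller is assumed to have already stripped the trailing newline.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not part of Miller: copies the DKVP line `line` into
// `out`, omitting every key=value pair whose key equals `drop_key`.
// Returns 0 on success, or -1 if `out` (of size `outlen`) is too small.
static int dkvp_cut_except(const char* line, const char* drop_key,
                           char* out, size_t outlen) {
    size_t droplen = strlen(drop_key);
    size_t used = 0;
    if (outlen == 0)
        return -1;
    out[0] = '\0';
    const char* p = line;
    while (*p != '\0') {
        // A pair runs up to the next comma or the end of the line.
        const char* comma = strchr(p, ',');
        size_t pairlen = comma ? (size_t)(comma - p) : strlen(p);
        // The key runs up to the first '=' within the pair, if any.
        const char* eq = memchr(p, '=', pairlen);
        size_t keylen = eq ? (size_t)(eq - p) : pairlen;
        int drop = (keylen == droplen && memcmp(p, drop_key, keylen) == 0);
        if (!drop && pairlen > 0) {
            // Room for an optional separator, the pair, and the terminator.
            size_t need = pairlen + (used > 0 ? 1 : 0);
            if (used + need + 1 > outlen)
                return -1;
            if (used > 0)
                out[used++] = ',';
            memcpy(out + used, p, pairlen);
            used += pairlen;
            out[used] = '\0';
        }
        p += pairlen;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

With the input x=1,y=2,z=3 and drop_key set to y, the output buffer holds x=1,z=3.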

One of Go’s most powerful features is the ease with which it allows quick-to-code, error-free concurrency. Yet Miller, like most high-volume text-processing tools, spends most of its time obtaining and parsing input strings and negligible time doing all subsequent processing. Thus the absence of in-process multiprocessing is only a slight penalty in this particular application domain — parallelism here is more easily achieved by running multiple single-threaded processes, each handling its own input files, either on a single host or split across multiple hosts.

C is ubiquitous

Every Unix-like system has a C compiler (or is an apt-get or yum install away from it). This, I hope, bodes well for uptake of Miller.

C is old-school

This alone is not enough reason to program in C, but since I find myself coding in C due to the other reasons on this page, it’s happy enough to use a throwback language for a throwback tool (see Why call it Miller?). That said, Miller is coded in GNU C99, it uses getopt-style command-line parsing, and for development work I make use of modern tools such as valgrind. K&R was a long, long time ago. (I’m writing plain C with // comments; enough said.)

C vs. C++

I have a strong personal distaste for C++: its syntax is an ugly layer over the simplicity of C; templates and STL are even more awkward and even less elegant. (Meanwhile I find Java, Go, and D to be both elegant and modern; I ruled them out not for aesthetics but for performance as described above.) Moreover, all the positive features I would want from C++ are easily implementable in C as follows:

this pointers and attributes

The C++ compiler implicitly inserts this pointers into method calls. For example,
  class MyClass {
    private:
      char* a;
    public:
    MyClass(char* a) {
      this->a = strdup(a);
    }
    ~MyClass() {
      free(a);
    }
    int myMethod(char* b) {
      return strlen(a) + strlen(b);
    }
  };
  ...
  MyClass* myObj = new MyClass("hello");
  int x = myObj->myMethod("world");
results in something like
  void MyClass$constructorcharptr(MyClass* this, char* a) {
    this->a = strdup(a);
  }
  void MyClass$destructor(MyClass* this) {
    free(this->a);
  }
  int MyClass$myMethod(MyClass* this, char* b) {
    return strlen(this->a) + strlen(b);
  }
  MyClass* myObj = malloc(sizeof(MyClass));
  MyClass$constructorcharptr(myObj, "hello");
  int x = MyClass$myMethod(myObj, "world");
It’s easy enough to imitate this: simply use the coding convention of prepending the class name to all methods, and placing this-pointers as the first arguments to methods. Miller uses precisely this approach. For example:
typedef struct _lrec_t {
  ...
} lrec_t;
// Constructors
lrec_t* lrec_csv_alloc(...) {
  lrec_t* prec = malloc(sizeof(lrec_t));
  ...
  prec->attribute = ...;
  return prec;
}
lrec_t* lrec_dkvp_alloc(...) {
  ...
}
// Destructor
void lrec_free(lrec_t* prec) {
  ...
  free(prec->attribute);
  ...
  free(prec);
}
// Methods
int lrec_foo(lrec_t* prec, ...) {
  return prec->...;
}
void lrec_bar(lrec_t* prec, ...) {
  prec->...;
}

This implements the object-oriented principle of encapsulation.
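
For a compilable illustration of the same convention, here is a minimal sketch that ports the MyClass example from above to plain C. The names myclass_t, myclass_alloc, myclass_free, and myclass_my_method are hypothetical and do not appear in Miller; the sketch also assumes callers want NULL back on allocation failure rather than an abort.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical class for illustration; not taken from Miller.
typedef struct _myclass_t {
    char* a; // owned copy of the constructor argument
} myclass_t;

// Constructor: allocates the object and copies the argument.
// Returns NULL on allocation failure.
myclass_t* myclass_alloc(const char* a) {
    myclass_t* pthis = malloc(sizeof(myclass_t));
    if (pthis == NULL)
        return NULL;
    size_t n = strlen(a) + 1;
    pthis->a = malloc(n);
    if (pthis->a == NULL) {
        free(pthis);
        return NULL;
    }
    memcpy(pthis->a, a, n);
    return pthis;
}

// Destructor: releases the attribute, then the object itself.
void myclass_free(myclass_t* pthis) {
    if (pthis == NULL)
        return;
    free(pthis->a);
    free(pthis);
}

// Method: the this-pointer is passed explicitly as the first argument.
size_t myclass_my_method(const myclass_t* pthis, const char* b) {
    return strlen(pthis->a) + strlen(b);
}
```

A caller would write myclass_t* pobj = myclass_alloc("hello"); then call myclass_my_method(pobj, "world"), and finally myclass_free(pobj).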

Interfaces and virtual-function pointers

Coding conventions again do most of the work, here accompanied by typedef’d function pointers. For example, here is Miller’s record-reader interface:
#include <stdio.h>
#include <containers/lrec.h>
typedef lrec_t* reader_func_t(FILE* fp, void* pvstate, context_t* pctx);
typedef void    reset_func_t(void* pvstate);
typedef void    reader_free_func_t(void* pvstate);

typedef struct _reader_t {
    void*               pvstate;
    reader_func_t*      preader_func; // Interface method
    reset_func_t*       preset_func;  // Interface method
    reader_free_func_t* pfree_func;   // Interface method
} reader_t;

A class implementing this interface might look like

// Attributes are private to this file
typedef struct _reader_csv_state_t {
  ...
} reader_csv_state_t;

// Implementation of interface methods. Marked static (file-scope) to not
// pollute the global namespace; exposed only via function pointers.
static lrec_t* reader_csv_func(FILE* input_stream, void* pvstate, context_t* pctx) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}
static void reset_csv_func(void* pvstate) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}
static void reader_csv_free(void* pvstate) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}

// Constructor
reader_t* reader_csv_alloc(...) {
  reader_t* preader = mlr_malloc_or_die(sizeof(reader_t));

  reader_csv_state_t* pstate = mlr_malloc_or_die(sizeof(reader_csv_state_t));
  ... set various pstate->attributes ...

  preader->pvstate      = (void*)pstate;
  preader->preader_func = &reader_csv_func;
  preader->preset_func  = &reset_csv_func;
  preader->pfree_func   = &reader_csv_free;

  return preader;
}

// Factory method
  ...
  reader_t* preader = reader_csv_alloc(...);
  ...
// Method call
  ...
  lrec_t* pinrec = preader->preader_func(input_stream, preader->pvstate, pctx);
  ...

This implements the object-oriented principles of polymorphism and runtime binding.
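
Since the listings above elide their bodies, here is a small compilable sketch of the same pattern: one interface struct of function pointers, two implementations, and a caller that sees only the interface. Everything in it (source_t, the counter and constant implementations, source_sum) is hypothetical and chosen for brevity; it is not Miller code, and it returns NULL on allocation failure where Miller’s mlr_malloc_or_die would abort.

```c
#include <stdlib.h>

// Interface: a source of integers. All names here are hypothetical.
typedef int  next_func_t(void* pvstate);
typedef void source_free_func_t(void* pvstate);

typedef struct _source_t {
    void*               pvstate;
    next_func_t*        pnext_func; // Interface method
    source_free_func_t* pfree_func; // Interface method
} source_t;

// Implementation 1: counts up from a starting value.
typedef struct _counter_state_t { int next; } counter_state_t;
static int counter_next(void* pvstate) {
    counter_state_t* pstate = pvstate;
    return pstate->next++;
}

// Implementation 2: always returns the same value.
typedef struct _constant_state_t { int value; } constant_state_t;
static int constant_next(void* pvstate) {
    constant_state_t* pstate = pvstate;
    return pstate->value;
}

// Both implementations hold only malloc'ed state, so they share a destructor.
static void state_free(void* pvstate) {
    free(pvstate);
}

// Shared constructor helper; returns NULL on allocation failure.
static source_t* source_alloc(void* pvstate, next_func_t* pnext_func) {
    source_t* psource = malloc(sizeof(source_t));
    if (psource == NULL || pvstate == NULL) {
        free(psource);
        free(pvstate);
        return NULL;
    }
    psource->pvstate    = pvstate;
    psource->pnext_func = pnext_func;
    psource->pfree_func = state_free;
    return psource;
}

// Constructors
source_t* source_counter_alloc(int start) {
    counter_state_t* pstate = malloc(sizeof(counter_state_t));
    if (pstate != NULL)
        pstate->next = start;
    return source_alloc(pstate, counter_next);
}
source_t* source_constant_alloc(int value) {
    constant_state_t* pstate = malloc(sizeof(constant_state_t));
    if (pstate != NULL)
        pstate->value = value;
    return source_alloc(pstate, constant_next);
}

// Destructor: dispatches through the interface, then frees the wrapper.
void source_free(source_t* psource) {
    if (psource == NULL)
        return;
    psource->pfree_func(psource->pvstate);
    free(psource);
}

// Caller code: the implementation is chosen at runtime via the pointers.
int source_sum(source_t* psource, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += psource->pnext_func(psource->pvstate);
    return sum;
}
```

The function source_sum never names either implementation, so a third one can be added without touching it.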

More details are at https://github.com/johnkerl/miller/tree/master/c/containers.