Miller 5.6.1

Why C?
• Why not C?
• C vs. Go, D, Rust, etc.; C is fast
• C is ubiquitous
• C is old-school
• C vs. C++
    • this pointers and attributes
    • Interfaces and virtual-function pointers

Why not C?

C lacks many of the features found in modern, high-level languages such as Java or Go: garbage collection, collections libraries, generics/near-generics, hash-map/linked-list literals built into the language (e.g. mymap={"a"=>1,"b"=>2} or mylist=[3,4,5]), autodoc (e.g. Javadoc), and so on. Yet, while memory management is indeed Miller’s trickiest aspect, its garbage-collection needs are well-delineated and so the absence of GC is no great loss. Miller’s performance relies on the principles of touching each byte as few times as possible, and copying bytes only when necessary. This results in a baton-passing, free-on-last-use memory-management pattern which works well enough.

Miller doesn’t require a complex collections library: it needs mostly simple hash maps, hash sets, and linked lists, which aren’t difficult to code. Moreover, Miller’s primary data structure, the lrec_t, is hand-tuned to Miller’s use case and would have required hand-coding in any case.
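
The baton-passing, free-on-last-use pattern described above can be sketched roughly as follows. This is only an illustration, not Miller’s code: rec_t, rec_alloc, rec_free, stage_drop_if_empty, stage_write, run_pipeline, and the live_records counter are all hypothetical names. The idea is that each stage receives ownership of a record and either hands it on or frees it, so every record is freed exactly once, by whichever stage uses it last.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical record type (not Miller's lrec_t): owns a copy of one line.
typedef struct rec {
    char* line;
} rec_t;

int live_records = 0; // bookkeeping for this demonstration only

// Producer: allocates a record; the caller now holds the baton.
rec_t* rec_alloc(const char* line) {
    rec_t* prec = malloc(sizeof(rec_t));
    if (prec == NULL) abort();
    size_t n = strlen(line) + 1;
    prec->line = malloc(n);
    if (prec->line == NULL) abort();
    memcpy(prec->line, line, n);
    live_records++;
    return prec;
}

void rec_free(rec_t* prec) {
    if (prec == NULL) return;
    free(prec->line);
    free(prec);
    live_records--;
}

// Filter stage: takes ownership of the record. It either passes the baton on
// by returning the record, or is the last user and frees it, returning NULL.
rec_t* stage_drop_if_empty(rec_t* prec) {
    if (prec->line[0] == '\0') {
        rec_free(prec);
        return NULL;
    }
    return prec;
}

// Final stage: always the last user, so it frees after writing.
void stage_write(rec_t* prec, FILE* out) {
    fprintf(out, "%s\n", prec->line);
    rec_free(prec);
}

// Runs one line through the chain. Returns 1 if emitted, 0 if dropped.
int run_pipeline(const char* line, FILE* out) {
    rec_t* prec = stage_drop_if_empty(rec_alloc(line));
    if (prec == NULL) return 0;
    stage_write(prec, out);
    return 1;
}
```

No stage copies the record, and no central collector is needed: whoever holds the record last is responsible for freeing it.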

C vs. Go, D, Rust, etc.; C is fast

I love Go: I think it’s one of the best things ever to happen to our craft, and I use it often. The D language is an exciting and elegant successor to C++ (more about which below); D has many of Go’s strengths, with a tighter stylistic similarity to C. And initial experiments with Rust are intriguing. Yet with none of them could I obtain the throughput I get in C.

Specifically, I did simple experiments in several languages — Ruby, Python, Lua, Rust, Go, D. In one I just read lines and printed them back out — a line-oriented cat. In another I consumed input lines like x=1,y=2,z=3 one at a time, split them on commas and equals signs to populate hash maps, transformed them (e.g. remove the y field), and emitted them. Basically mlr cut -x -f y with DKVP format. I didn’t do anything fancy — just using each language’s getline, string-split, hashmap-put, etc. And nothing was as fast as C, so I used C. Here are the experiments I kept (I failed to keep the Lua code, for example): C cat, another C cat, D cat, Go cat, another Go cat, Rust cat, Nim cat, D cut, Go cut, Nim cut.
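
For a concrete picture of the cut experiment, here is a minimal C sketch of removing one field from a DKVP line. It is not the experiment code listed above, and the names dkvp_cut_x and dkvp_cut_x_stream are hypothetical. As a simplification it scans the key=value pairs directly rather than populating a hash map, and it assumes lines fit in a fixed-size buffer.

```c
#include <stdio.h>
#include <string.h>

// Copies the DKVP line `in` (e.g. "x=1,y=2,z=3") to `out`, omitting every
// key=value pair whose key equals `key`. `out` must have room for at least
// strlen(in)+1 bytes. Returns `out`.
char* dkvp_cut_x(const char* in, const char* key, char* out) {
    size_t keylen = strlen(key);
    const char* p = in;
    char* q = out;
    while (*p != '\0') {
        // The current pair runs from p up to the next comma or end of line.
        size_t pairlen = strcspn(p, ",");
        // The key is the part of the pair before its first '='.
        const char* eq = memchr(p, '=', pairlen);
        size_t klen = (eq != NULL) ? (size_t)(eq - p) : pairlen;
        int drop = (klen == keylen && memcmp(p, key, keylen) == 0);
        if (!drop) {
            if (q != out) *q++ = ','; // separator between retained pairs
            memcpy(q, p, pairlen);
            q += pairlen;
        }
        p += pairlen;
        if (*p == ',') p++; // step over the separator
    }
    *q = '\0';
    return out;
}

// Applies dkvp_cut_x to each line of `in`, writing the results to `out`.
// Lines longer than the fixed buffer are not handled in this sketch.
void dkvp_cut_x_stream(FILE* in, FILE* out, const char* key) {
    char line[4096], result[4096];
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0'; // strip the trailing newline
        fputs(dkvp_cut_x(line, key, result), out);
        fputc('\n', out);
    }
}
```

Calling dkvp_cut_x_stream(stdin, stdout, "y") would then approximate mlr cut -x -f y on DKVP input, under the assumptions stated above.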

One of Go’s most powerful features is the ease with which it allows quick-to-code, error-free concurrency. Yet Miller, like most high-volume text-processing tools, spends most of its time obtaining and parsing input strings and negligible time doing all subsequent processing. Thus the absence of in-process multiprocessing is only a slight penalty in this particular application domain — parallelism here is more easily achieved by running multiple single-threaded processes, each handling its own input files, either on a single host or split across multiple hosts.

C is ubiquitous

Every Unix-like system has a C compiler (or is an apt-get or yum install away from it). This, I hope, bodes well for uptake of Miller.

C is old-school

This alone is not enough reason to program in C, but since I find myself coding in C due to the other reasons on this page, it’s happy enough to use a throwback language for a throwback tool (see Why call it Miller?). That said, Miller is coded in GNU C99, it uses getopt-style command-line parsing, and for development work I make use of modern tools such as valgrind. K&R was a long, long time ago. (I’m writing plain C with // comments; enough said.)

C vs. C++

I have a strong personal distaste for C++: its syntax is an ugly layer over the simplicity of C; templates and STL are even more awkward and even less elegant. (By contrast, I find Java, Go, and D to be both elegant and modern; I ruled them out not for aesthetics but for performance as described above.) Moreover, all the positive features I would want from C++ are easily implementable in C as follows:

this pointers and attributes

The C++ compiler implicitly inserts this pointers into method calls: for example
  class MyClass {
    private:
      char* a;
    public:
      MyClass(const char* a) {
        this->a = strdup(a);
      }
      ~MyClass() {
        free(a);
      }
      int myMethod(const char* b) {
        return strlen(a) + strlen(b);
      }
  };
  MyClass* myObj = new MyClass("hello");
  int x = myObj->myMethod("world");
results in something like
  void MyClass$constructorcharptr(MyClass* this, const char* a) {
    this->a = strdup(a);
  }
  void MyClass$destructor(MyClass* this) {
    free(this->a);
  }
  int MyClass$myMethod(MyClass* this, const char* b) {
    return strlen(this->a) + strlen(b);
  }
  MyClass* myObj = malloc(sizeof(MyClass));
  MyClass$constructorcharptr(myObj, "hello");
  int x = MyClass$myMethod(myObj, "world");
It’s easy enough to imitate this: simply use the coding convention of prepending the class name to all methods, and placing this-pointers as the first arguments to methods. Miller uses precisely this approach. For example:
typedef struct _lrec_t {
  ...
} lrec_t;
// Constructors
lrec_t* lrec_csv_alloc(...) {
  lrec_t* prec = malloc(sizeof(lrec_t));
  prec->attribute = ...;
  return prec;
}
lrec_t* lrec_dkvp_alloc(...) {
  ...
}
// Destructor
void lrec_free(lrec_t* prec) {
  ...
}
// Methods
int lrec_foo(lrec_t* prec, ...) {
  return prec->...;
}
void lrec_bar(lrec_t* prec, ...) {
  ...
}

This implements the object-oriented principle of encapsulation.
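
As a compilable illustration of this convention, here is the MyClass example from above rewritten in plain C. The names myclass_t, myclass_alloc, myclass_free, and myclass_my_method are hypothetical and chosen only for this sketch; they are not part of Miller.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical class mirroring the C++ MyClass example.
typedef struct myclass {
    char* a; // attribute; by convention touched only by myclass_* functions
} myclass_t;

// Constructor: allocates the object and copies the string it will own.
myclass_t* myclass_alloc(const char* a) {
    myclass_t* pobj = malloc(sizeof(myclass_t));
    if (pobj == NULL) abort();
    size_t n = strlen(a) + 1;
    pobj->a = malloc(n);
    if (pobj->a == NULL) abort();
    memcpy(pobj->a, a, n);
    return pobj;
}

// Destructor: frees the owned string, then the object itself.
void myclass_free(myclass_t* pobj) {
    if (pobj == NULL) return;
    free(pobj->a);
    free(pobj);
}

// Method: the explicit this-pointer comes first, as in the convention above.
int myclass_my_method(const myclass_t* pobj, const char* b) {
    return (int)(strlen(pobj->a) + strlen(b));
}
```

A caller writes myclass_t* pobj = myclass_alloc("hello"); then int x = myclass_my_method(pobj, "world"); and finally myclass_free(pobj);. The class-name prefix plays the role that name mangling plays in C++.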

Interfaces and virtual-function pointers

Coding conventions again do most of the work, here accompanied by typedef’d function pointers. For example, here is Miller’s record-reader interface:
#include <stdio.h>
#include <containers/lrec.h>
typedef lrec_t* reader_func_t(FILE* fp, void* pvstate, context_t* pctx);
typedef void    reset_func_t(void* pvstate);
typedef void    reader_free_func_t(void* pvstate);

typedef struct _reader_t {
    void*               pvstate;
    reader_func_t*      preader_func; // Interface method
    reset_func_t*       preset_func;  // Interface method
    reader_free_func_t* pfree_func;   // Interface method
} reader_t;

A class implementing this interface might look like

// Attributes are private to this file
typedef struct _reader_csv_state_t {
  ...
} reader_csv_state_t;

// Implementation of interface methods. Marked static (file-scope) to not
// pollute the global namespace; exposed only via function pointers.
static lrec_t* reader_csv_func(FILE* input_stream, void* pvstate, context_t* pctx) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}
static void reset_csv_func(void* pvstate) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}
static void reader_csv_free(void* pvstate) {
  reader_csv_state_t* pstate = pvstate;
  ...  use various pstate->attributes ...
}

// Constructor
reader_t* reader_csv_alloc(...) {
  reader_t* preader = mlr_malloc_or_die(sizeof(reader_t));

  reader_csv_state_t* pstate = mlr_malloc_or_die(sizeof(reader_csv_state_t));
  ... set various pstate->attributes ...

  preader->pvstate      = (void*)pstate;
  preader->preader_func = &reader_csv_func;
  preader->preset_func  = &reset_csv_func;
  preader->pfree_func   = &reader_csv_free;

  return preader;
}

// Factory method
  reader_t* preader = reader_csv_alloc(...);
// Method call
  lrec_t* pinrec = preader->preader_func(input_stream, preader->pvstate, pctx);

This implements the object-oriented principles of polymorphism and runtime binding.
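
To make the pattern concrete, here is a small self-contained sketch with two implementations of a single interface. It is modeled on the reader interface above but is not Miller code: writer_t, writer_pair_alloc, writer_value_alloc, writer_free, and the other names are hypothetical, and plain malloc stands in for mlr_malloc_or_die.

```c
#include <stdio.h>
#include <stdlib.h>

// Interface: typedef'd function types, plus a struct carrying pointers to
// them and the implementation's opaque state.
typedef int  writer_func_t(void* pvstate, const char* key, const char* value,
                           char* buf, size_t buflen);
typedef void writer_free_func_t(void* pvstate);

typedef struct writer {
    void*               pvstate;
    writer_func_t*      pwriter_func; // Interface method
    writer_free_func_t* pfree_func;   // Interface method
} writer_t;

// ---- Implementation 1: key, separator, value (DKVP-like) ----
typedef struct writer_pair_state {
    char sep;
} writer_pair_state_t;

static int writer_pair_func(void* pvstate, const char* key, const char* value,
                            char* buf, size_t buflen) {
    writer_pair_state_t* pstate = pvstate;
    return snprintf(buf, buflen, "%s%c%s", key, pstate->sep, value);
}
static void writer_pair_free(void* pvstate) {
    free(pvstate);
}
writer_t* writer_pair_alloc(char sep) {
    writer_t* pwriter = malloc(sizeof(writer_t));
    writer_pair_state_t* pstate = malloc(sizeof(writer_pair_state_t));
    if (pwriter == NULL || pstate == NULL) abort();
    pstate->sep = sep;
    pwriter->pvstate      = pstate;
    pwriter->pwriter_func = writer_pair_func;
    pwriter->pfree_func   = writer_pair_free;
    return pwriter;
}

// ---- Implementation 2: value only, with no private state ----
static int writer_value_func(void* pvstate, const char* key, const char* value,
                             char* buf, size_t buflen) {
    (void)pvstate; // unused
    (void)key;     // unused
    return snprintf(buf, buflen, "%s", value);
}
static void writer_value_free(void* pvstate) {
    (void)pvstate; // nothing to free
}
writer_t* writer_value_alloc(void) {
    writer_t* pwriter = malloc(sizeof(writer_t));
    if (pwriter == NULL) abort();
    pwriter->pvstate      = NULL;
    pwriter->pwriter_func = writer_value_func;
    pwriter->pfree_func   = writer_value_free;
    return pwriter;
}

// Destructor for any writer: dispatches to the implementation's free method.
void writer_free(writer_t* pwriter) {
    if (pwriter == NULL) return;
    pwriter->pfree_func(pwriter->pvstate);
    free(pwriter);
}
```

Code holding a writer_t* calls pwriter->pwriter_func(pwriter->pvstate, "x", "1", buf, sizeof buf) without knowing which implementation it has; which function runs is decided at runtime by whichever constructor built the object.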

More details are at