CLOS circa 1987, Dynamic Dispatch in C++/C#

As reported by programming musings,

http://www.archive.org/details/DanielGB1987

Common Lisp Object System presentation, by Daniel G. Bobrow.

Some reflections:

Fast Multiple Dispatch

"[for dynamic method resolution] A cache of 1000 entries is hit 98% of the time".  Assuming this is accurate, it explains how you could get by with the cached virtual dispatch lookup that the desktop Microsoft CLR runtime does – while not exactly what Bobrow means, 1000s of cache locations spread through the program can get a hit rate fast enough to compensate for the slow resolution path that is taken.

Fastpath Optimization

"What we learned from smalltalk is … something like 60%-70% of generic functions have exactly one method implemented".  I’ve been seeing discussions lately about fastpath optimization in dynamic languages.  The technique is to take advantage of the fact that not all dynamic call sites are very dynamic.  You can sometimes statically dispatch calls (assume various parameters are fixed-width-integers, assume others strings, etc) as long as you have recovery code that can detect the difference before any results are visible.

Tech talks haven’t really changed that much in 20 years; apparently we long since found the local optimum.  Code samples that fit on slides are always contrived and don’t solve real problems.  Diagrams are always awkward, but we keep trying anyway.  Oh, and we didn’t compile the code before putting it in!  But don’t get me wrong – I think this talk was well done, and to be fair, he knew of the error (I assume back then, editing slides was a bit harder than in PowerPoint).

Some CLR tie-ins

I think both of these features play a significant role in the CLR’s performance characteristics.

For those not familiar, if you step through a virtual call in MSIL on x86 or x64, you’ll see that each call site caches the address of the function that was last called from that call site.  Consider the pseudocode:

void some_object::do_something(int x)
{
    //...
}
int main()
{
   some_interface* p = get_an_object();
   p->do_something(42);
}

In C++, typically some_object has a vtable for each interface, and each member function has an entry in this table.  Actual execution looks something like:

// ----  C++ style dispatch ----
struct some_interface
{
    struct vtable_layout
    {
        void (*do_something)(some_interface*, int);
    };
    vtable_layout* vtbl;
};
some_object::some_object()
{
    static vtable_layout vtbl = {
        &some_object::do_something
    };
    this->vtbl = &vtbl;
}
void some_object::do_something(int x)
{
    //...
}
int main()
{
   some_interface* p = get_an_object();
   p->vtbl->do_something(p, 42);
}

Note that because the dispatch mechanism is tied to the vtbl, only one argument (by convention, the first – the receiver) can affect dispatch.

The CLR dispatches calls differently, more like:

// ---- CLR style dispatch ----
struct object
{
    void* type_identifier;
};
void default_do_something(object* p, int x)
{
    void (**memfun_do_something)(object*, int);

    // find address of "last_function_called" variable
    memfun_do_something = find_callsite();

    // fill in value using dynamic type of 'p'
    *memfun_do_something = 
        global_function_table
        .find_type(p->type_identifier)
        .find_member_function("do_something");

    // finally, call the actual member function
    return (*memfun_do_something)(p, x);
}
some_object::some_object()
{
    this->type_identifier = some_object_id;
}
void some_object::do_something(int x)
{
    // thunking code -- in case wrong function called
    if (this->type_identifier != some_object_id)
        __tailcall(default_do_something(this, x));
    
    // ...
}
int main()
{
   some_interface* p = get_an_object();

   static void (*last_function_called)(object*, int) 
    = default_do_something;
   last_function_called(p, 42);
}

This is much more complicated – though keep in mind that the function ‘default_do_something’ can be shared across all defaults, using the call site as a hint.  Here, if the default is called, or if the ‘last_function_called’ cache is wrong, we pay the cost of searching the type system for the method to call.  However, if the call is on an object of the same dynamic type as the last call from that site, the cache lets us get away with a single indirection.  And because the scheme uses a single type id rather than dispatch tables, we can in general get away with one type id per object instead of one vtbl pointer per interface implemented, saving space as well.

There are still two indirections going on, but only one of them will force a stall in the instruction pipeline – the type check can take advantage of out-of-order execution and branch prediction.

Related to CLOS, this scheme’s performance is no longer tied to single dispatch – any lookup scheme can be implemented in default_do_something, provided a good cache hit rate can be achieved.

As well, when only one implementation of a virtual function exists, the cache hit rate will be 100% after the initial call.  For fastpath optimization, this means you could potentially recompile main() to always call some_object::do_something directly, relying on the type check to dispatch correctly if the situation changes.