How Roc Compiles Closures
At RWX, we’re investing some of our engineering resources towards development of the Roc language project. Roc is a new, pre-alpha purely-functional language that offers a trove of new developer productivity features. We won’t get into all of them today, but we’d like to tell you about how Roc compiles closures with excellent runtime performance characteristics. One of Roc’s goals is to produce performant binaries when they’re needed, and how closures are represented at runtime is an important part of the design space.
You might know closures as "lambdas" or "function objects" from other languages. Roc takes an approach to closure compilation that differs from most other language compilers by applying a technique known as defunctionalization at the type-system level. We’ll discuss what all that means below!
This is the first part of a long story about compiling closures; in the future, we’ll describe a novel mechanism for compiling Roc-style closures in the presence of ad-hoc polymorphism like Rust’s traits, C++ concepts, or Go’s interfaces1.
Defining Closures
There are many different names for the concept we’ll be talking about today. To be precise, I’d like to talk about a language feature that allows you to write functions that can
- be constructed on-the-fly
- capture values from the lexical scope they’re constructed in
- be used as values in the language, e.g. passed around to other functions
- conditionally be used in place of other functions of the same shape2
Here’s a concrete example of a closure satisfying the above criteria, in Python. The snippet defines a function that takes an integer n and conditionally returns a closure that takes an integer m and increases m by n via some operation - either addition or multiplication.
def increase_with(n, increase_op):
    if increase_op == "add":
        return lambda m: m + n
    else:
        assert increase_op == "mul"
        return lambda m: m * n
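To see the four properties in action, here’s how calling these closures plays out (the increase_with definition is repeated from above):

```python
def increase_with(n, increase_op):
    if increase_op == "add":
        return lambda m: m + n
    else:
        assert increase_op == "mul"
        return lambda m: m * n

# Constructed on-the-fly, capturing n = 5 from the enclosing scope.
add_5 = increase_with(5, "add")
# Same shape, chosen conditionally, usable interchangeably with add_5.
mul_5 = increase_with(5, "mul")

print(add_5(10))  # 15
print(mul_5(10))  # 50
```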
Closure Conversion
The Idea
Compiling closures doesn’t have to be tricky, but making their runtime representation efficient may take some care. If we think about a non-capturing function (think of a top-level function exported by a library in any language), calling it is determined by two things:
- The code of the function
- The parameters passed to the function at the call site
For a caller of a function, only the code of the function is outside information, since the caller will provide the parameters. So, to represent a non-capturing function, only the function’s address needs to be relayed. Want to conditionally call one of two functions? Hide the address of the function behind a variable, and have the caller dereference that variable before calling.
Of course, the only difference between closures and non-capturing functions is that a closure captures variables! But that makes all the difference - those captured variables are now also outside information, and must be stored somewhere.
Captured variables are determined at runtime, not statically determined like the code for a function, so they can’t be stored alongside the code of the function. Captured variables can’t be determined by the caller of the function either, since they were defined at the time the closure was defined. If the caller always knew what they were, there would be no point to capturing those variables - they could be passed as regular parameters!
That means closures must be identified by both their function code, and their captured variables (the latter is sometimes called an "environment"). And there is no way around this - we are forced to provide our callers both pieces of information. In fact, as we’ll see below, the most common way to compile closures is as a tuple consisting of an address to the function, and an address to the captured environment!
Okay, I just gave you a lot of words, so let me give you a concrete example. Suppose we wanted to implement closures in C using this technique (C has no language-level concept equivalent to the kind of closures I described). Let’s do so for our increase_with example from above:
#include <stdlib.h>

struct any_closure {
    void* closure_addr; // Address to the function constituting the closure
    void* closure_env;  // Captured environment of the closure
};

struct add_closure_env {
    int n;
};

// The implementation of the adding closure.
int add_closure(int m, void* env) {
    int n = ((struct add_closure_env*) env)->n;
    return m + n;
}

struct mul_closure_env {
    int n;
};

// The implementation of the multiplying closure.
int mul_closure(int m, void* env) {
    int n = ((struct mul_closure_env*) env)->n;
    return m * n;
}

struct any_closure increase_with(int n, int increase_op) {
    if (increase_op == 0) { // add
        // Allocate the environment `add_closure` needs on the heap, and return a pointer to it.
        struct add_closure_env* env = (struct add_closure_env*) malloc(sizeof(struct add_closure_env));
        env->n = n;
        struct any_closure clos = {
            .closure_addr=(void*) &add_closure,
            .closure_env=env,
        };
        return clos;
    } else {
        // Allocate the environment `mul_closure` needs on the heap, and return a pointer to it.
        struct mul_closure_env* env = (struct mul_closure_env*) malloc(sizeof(struct mul_closure_env));
        env->n = n;
        struct any_closure clos = {
            .closure_addr=(void*) &mul_closure,
            .closure_env=env,
        };
        return clos;
    }
}
As a quick gut check, how would you call a value of any_closure to produce the result of a closure?
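In the C version, the answer is to cast closure_addr back to a function pointer of the expected signature and call it with closure_env as the extra argument. Here’s a sketch of that same dispatch in Python, modeling any_closure as a (function, environment) pair - the names here are my own, purely illustrative:

```python
# A minimal Python model of the C `any_closure` strategy: a closure is a
# pair of (function, environment), and each implementation function takes
# the environment as an explicit extra argument.
def add_closure(m, env):
    return m + env["n"]

def mul_closure(m, env):
    return m * env["n"]

def increase_with(n, increase_op):
    if increase_op == "add":
        return (add_closure, {"n": n})
    else:
        return (mul_closure, {"n": n})

def call_closure(closure, m):
    func, env = closure   # unpack the "code pointer" and the environment
    return func(m, env)   # the caller supplies m; env supplies the captured n

print(call_closure(increase_with(5, "add"), 10))  # 15
```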
Let’s walk over a few finer points about this implementation strategy:
- Note that the add_closure and mul_closure definitions now take an additional argument, the explicit closure environment. As mentioned before, it’s totally necessary.
- any_closure is defined as an opaque pointer to a function, and an opaque pointer to a closure environment. This is mostly for convenience; we can cast them to the types we expect at a usage site, with no runtime cost. Indeed, for our example, we could be a lot more specific than using void pointers, since increase_with returns closures whose functions have the same signature, and moreover their environments are defined the same way!
You might have noticed that, “hey, actually, we don’t need the closure environment to be a pointer unless it needs to live on the heap - in other cases, it can be placed on the stack, and the type of closure_env here can be a union.” And that would be true, if we always knew what type of closure_env the function at closure_addr expected3. Let’s hang on to that thought going forward.
This strategy of adding an environment parameter to capturing function definitions and carrying the closure environment with the function pointer is generally known as closure conversion, and is pretty much what every compiler that supports closures must do to an extent.4
Before we get to what Roc does differently, I’d like to show you concretely that this strategy of packing a function pointer and environment into a value aligns with what Python and C++ do.5
What Python does
There are a lot of Python implementations, so I’ll just talk about CPython and its bytecode VM.
So, dumping bytecode - CPython includes in its standard library a nice module, dis, to disassemble Python bytecode. Well, for us it’s nice because it’s an easy way to examine bytecode; for the CPython maintainers, I don’t know… something something Hyrum’s law, right? Anyway, let’s disassemble increase_with.
Look at how nice that is to read! Thank you, CPython developers. And thanks for telling us how closures are constructed - MAKE_FUNCTION builds a function object. Python’s function objects have more than just the code to the function and its captured variables, but they do have at least that!
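You can reproduce this yourself with the snippet below (my own, not from the original example). The exact bytecode varies across CPython versions, but you should spot MAKE_FUNCTION in the output, and find the captured n carried along in the returned function object’s __closure__:

```python
import dis

def increase_with(n, increase_op):
    if increase_op == "add":
        return lambda m: m + n
    else:
        assert increase_op == "mul"
        return lambda m: m * n

# Disassemble: look for MAKE_FUNCTION, which builds the function object.
dis.dis(increase_with)

# The returned function object carries its captured environment with it:
# each captured variable lives in a cell in __closure__.
add_5 = increase_with(5, "add")
print([cell.cell_contents for cell in add_5.__closure__])  # [5]
```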
What C++ does
In C++, a lot of values are objects, and closures (called lambdas in C++) are no different. C++ lambdas are de-sugared to function objects (called functors) that define what it means to call an instance of such an object via a call operator6. That is, the following code
auto increment_from(int num) {
return [num]() { return num + 1; };
}
desugars to an anonymous class instance, roughly in the form presented below. In C++, auto directs the compiler to deduce the type of an item.
auto increment_from(int num) {
    class increment_from_functor {
    public:
        // Constructor
        increment_from_functor(int num): num{num} {}

        // Defines a way to call an instance of `increment_from_functor`
        // with no arguments, returning an int.
        int operator()() const {
            return num + 1;
        }
    private:
        int num;
    } increment_from_functor_inst {num};
    return increment_from_functor_inst;
}
This is a pretty efficient way to represent a closure, and is the way you or I would represent one by hand. You can decide how the captured values should be stored (stack or heap-allocated). Compare this to a hypothetical unconditional version of the C code we wrote before - it’s mostly the same, sans some heap allocations.
Unfortunately, this efficient representation doesn’t immediately scale to conditional closures (property [4] of our definition for what it means for something to be a closure). That’s because C++ is nominally-typed, and nominal typing rules don’t change for functors, which as we saw desugar to unique classes. That means the following code doesn’t compile:
auto increase_with(int m, IncreaseOp increase_op) {
    if (increase_op == IncreaseOp::add) {
        return [m](int n) { return m + n; };
    } else {
        return [m](int n) { return m * n; };
    }
}
C++ provides a nice abstraction to deal with this in the std::function container, which allows you to store functions polymorphically at the expense of additional pointer indirection at runtime, and possible heap-allocation. It has a constructor that takes any functor with a call operator of its parameterized type, which is why the following version does compile:
#include <functional>

std::function<int (int)> increase_with(int m, IncreaseOp increase_op) {
    if (increase_op == IncreaseOp::add) {
        return [m](int n) { return m + n; };
    } else {
        return [m](int n) { return m * n; };
    }
}
That’s a nice ergonomics feature, but it incurs a runtime performance cost you might need to consider! If only we could do better…
Roc uses Lambda Sets
Roc implements closure conversion (as mentioned, we have to get the captured environment to a closure’s function definition, and that environment must be passed alongside the function pointer no matter what). But Roc makes closure conversion really efficient, by eliminating function pointer dereferences in favor of switch statements on integers, and always optimally deciding at compile-time whether captured environments for an arbitrary closure should be stack or heap-allocated (even when closure environments are decided conditionally).
I know that’s a lot of words, and I promise I’ll walk through all of it. But first, credit where it’s due - this technique is known as defunctionalization, and has been around for a while. It is usually employed as an optimization in the backend of compilers, and as such, is not always guaranteed. Roc ensures defunctionalization is always applied by representing it at the type-system level. The earliest work regarding type-based defunctionalization that I am aware of comes from Bell, Bellegarde, and Hook (1997), but the particular flavor of type-based defunctionalization Roc uses, known as lambda sets, is due to the work of Brandon et al. (unpublished). A related line of work is Cejtin, Jagannathan, and Weeks (2000), which describes an auxiliary flow-based technique used in MLton. The implementation of specialization of lambda sets in Roc is greatly thanks to Folkert de Vries.
Defunctionalization
Recall that with our “function pointer and captured environment” implementation strategy for closures, we have to dereference a function pointer in order to figure out what function we need to call. But why should we have to do that? We know that increase_with only ever returns one of two closures. What if we just embed that information in a flag, and conditionally call the function we need based on the value of the flag? Doing so, we could produce a version of our C increase_with with the following diff:
diff --git a/increase_with.c b/increase_with.c
index 1bcc54c..315ea19 100644
--- a/increase_with.c
+++ b/increase_with.c
@@ -1,7 +1,9 @@
#include <stdlib.h>
-struct any_closure {
- void* closure_addr; // Address to the function constituting the closure
+enum increase_with_closure_flag { add, mul };
+
+struct increase_with_closure {
+ enum increase_with_closure_flag flag; // The function to dispatch to
void* closure_env; // Captured environment of the closure
};
@@ -25,13 +27,13 @@ int mul_closure(int m, void* env) {
return m * n;
}
-struct any_closure increase_with(int n, int increase_op) {
+struct increase_with_closure increase_with(int n, int increase_op) {
if (increase_op == 0) { // add
// Allocate the environment `add_closure` needs on the heap, and return a pointer to it.
struct add_closure_env* env = (struct add_closure_env*) malloc(sizeof(struct add_closure_env));
env->n = n;
- struct any_closure clos = {
- .closure_addr=(void*) &add_closure,
+ struct increase_with_closure clos = {
+ .flag=add,
.closure_env=env,
};
return clos;
@@ -39,8 +41,8 @@ struct any_closure increase_with(int n, int increase_op) {
// Allocate the environment `mul_closure` needs on the heap, and return a pointer to it.
struct mul_closure_env* env = (struct mul_closure_env*) malloc(sizeof(struct mul_closure_env));
env->n = n;
- struct any_closure clos = {
- .closure_addr=(void*) &mul_closure,
+ struct increase_with_closure clos = {
+ .flag=mul,
.closure_env=env,
};
return clos;
Besides eliminating at least one load from memory, this strategy might have other advantages. Since the flag takes less memory than closure_addr did, a compiler may be able to produce a smaller memory representation for structs composed of increase_with_closures.
Let’s take this a step higher. From here-on we’ll deal mostly with Roc syntax; I’ll explain basic concepts as they come up.
The typical example of defunctionalization has to do with higher-order functions. For example, suppose we have a function map : List a, (a -> b) -> List b that takes a list of any element type a, a function that maps the element type to any other type b, and returns a list of the mapped elements.
Now suppose we define a function
greetNames : List Str, Str -> List Str
greetNames = \names, greeting ->
    mapper = \name -> "\(greeting), \(name)!" # Roc’s string interpolation
    List.map names mapper
that takes a list of names, a greeting, and builds a list of greeting strings corresponding to the given names. Notice that with the function-and-environment-pointer implementation strategy for compiling closures, the implementation of List.map would have to dereference a function pointer and environment pointer to determine the mapping function. But in the particular case of greetNames, we know that the call to List.map can only ever take one closure - namely mapper!
That means we could define a particular version of List.map for this usage, one that only ever calls mapper directly for each element, rather than needing to follow a dereference. And, in cases where List.map takes a closure that might point to multiple function addresses, we’d only need to switch on an integer, like we did with our increase_with example. This technique is known as defunctionalization78.
Most usages of defunctionalization I’m aware of apply it as an optimization in the middle or backend of a compiler, over an intermediate representation. Moreover, such optimizations are typically global - in our example, a compiler would search for all usages of List.map, modify each call site to pass a unique integer instead of a function pointer, and modify the implementation of List.map to switch over the passed integer to dispatch to the appropriate function. This avoids creating specialized definitions of List.map for each different closure it’s called with, but makes incremental compilation harder - you have to be careful when a new usage of List.map is added, or one is deleted, in determining what parts of the compiled program can be re-used.
Lambda Sets
Critically, a compiler cannot guarantee defunctionalization can be applied unless it knows all values that are functions, and moreover knows the set of functions each such value might actually dispatch to. So, why don’t we attach that set of functions at the point we figure out whether a value is a function or not?
That’s the key insight of lambda sets - by bookkeeping the set of functions a closure might dispatch to in the function type, you have all the information you need to implement defunctionalization during compilation, and can hence guarantee that it always happens. Moreover, all the information is available right at your feet; if the compiler knows a value has a function type, it also knows exactly what functions it can be.
Let me show you what this looks like. Roc’s compiler generates a unique name for each function that a program uses, and this name is stored in the type inferred for a function value. Below, I’ve annotated the inferred type for values in our increaseWith example; the symbols #clos1 and #clos2 correspond to unique names generated for the two anonymous closures used in the function.
increaseWith = \n, increaseOp ->
    when increaseOp is
        Add -> \m -> m + n
        #      ^^^^^^^^^^^ {} -[[#clos1 I64]]-> I64
        Mul -> \m -> m * n
        #      ^^^^^^^^^^^ {} -[[#clos2 I64]]-> I64
The type for increaseWith itself would be inferred as9

increaseWith : I64, [Add, Mul] -[[increaseWith]]-> (I64 -[[#clos1 I64, #clos2 I64]]-> I64)
[[#clos1 I64]] describes a lambda set that consists of exactly one function, #clos1, that captures one variable, a 64-bit signed integer. (The name of the captured variable isn’t stored in the type; it could be, but Roc’s implementation stores that name elsewhere.) So what is the inferred type of increaseWith saying? Well, it says it’s a function value that resolves to one possible function, increaseWith, that captures nothing and takes as parameters a 64-bit signed integer and an [Add, Mul] tag union value. It returns a function value that resolves to either #clos1, which captures an I64, or #clos2, which captures an I64.
[#clos1 I64, #clos2 I64] is a tag union - a data type in Roc that allows you to specify a value that can take on one of a disjoint number of types. Tag unions are Roc’s sum types, also known as enums, disjoint unions, and variants in other languages. For lambda sets, the tag names consist of the possible functions in the lambda set, and the payloads of the tags are the types of the captured variables. This syntax is not just for convenience - lambda sets are compiled just like tag unions! Roc effectively10 desugars our increaseWith example into
clos1 = \m, n -> m + n
clos2 = \m, n -> m * n

increaseWith = \num, increaseOp ->
    when increaseOp is
        Add -> Clos1 num
        Mul -> Clos2 num
and calls like
n = 20
(increaseWith 10 Add) n
are effectively10 desugared to
n = 20
when increaseWith 10 Add is
    Clos1 m -> clos1 m n
    Clos2 m -> clos2 m n
And that’s how Roc compiles closures - with no pointer chasing to figure out what function to call, explicitly enumerating the possible calling options, in the same way a Roc developer would do by-hand.
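The same desugaring can be mimicked in any language. Here is a sketch in Python (the names are hypothetical, and Roc’s actual generated code is lower-level): each closure becomes a plain tagged value, and the call site becomes a switch on the tag.

```python
# Defunctionalized increaseWith, sketched in Python. Each lambda becomes a
# top-level function taking its captured value as an ordinary parameter,
# and a "closure" is just a (tag, captured_value) pair - no function pointers.
def clos1(m, n):
    return m + n  # the Add lambda

def clos2(m, n):
    return m * n  # the Mul lambda

def increase_with(num, increase_op):
    if increase_op == "Add":
        return ("Clos1", num)
    else:
        return ("Clos2", num)

# Calling (increaseWith 10 Add) n: nothing is dereferenced; we switch on
# the tag and dispatch directly to the known function.
def call(closure, n):
    tag, m = closure
    if tag == "Clos1":
        return clos1(m, n)
    else:
        return clos2(m, n)

print(call(increase_with(10, "Add"), 20))  # 30
```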
Moreover, because Roc monomorphizes type-polymorphic functions, and lambda sets are part of function types, Roc will produce unique instantiations of functions that use or produce unique lambda sets. Taking our List.map example from before
greetNames : List Str, Str -> List Str
greetNames = \names, greeting ->
    mapper = \name -> "\(greeting), \(name)!"
    List.map names mapper
a unique specialization of List.map whose body only ever calls the function mapper will be created by Roc’s compiler, because the usage of List.map’s inferred type (with lambda sets) is List Str, (Str -[[mapper Str]]-> Str) -> List Str. That’s the optimal way we’d define List.map for this usage if we could by-hand, but we don’t have to - the compiler will do it for us!
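To make the specialization concrete, here’s a sketch in Python of the difference (hypothetical names, my own illustration): a generic map must call through whatever function value it is handed, while the specialized version the compiler emits is hard-wired to mapper, with the captured greeting passed as a plain argument.

```python
# Generic, closure-taking map: every element requires an indirect call
# through the function value `f`.
def map_generic(names, f):
    return [f(name) for name in names]

# What the specialization amounts to: the lambda set [[mapper Str]] has
# exactly one entry, so the call is direct, and the captured `greeting`
# becomes an ordinary parameter.
def mapper(name, greeting):
    return f"{greeting}, {name}!"

def map_specialized_for_mapper(names, greeting):
    return [mapper(name, greeting) for name in names]

def greet_names(names, greeting):
    return map_specialized_for_mapper(names, greeting)

print(greet_names(["Alice", "Bob"], "Hello"))
# ['Hello, Alice!', 'Hello, Bob!']
```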
This kind of defunctionalization-by-default can be very powerful - not only can you save a dereference or two, but the compiler can determine intelligently when captured environments can be passed around on the stack versus the heap, regardless of how conditional the closure environment is. In fact, for the vast majority of programs, Roc will always place captured environments on the stack (they must be heap-allocated when lambda sets are recursive - that can happen, but it’s rare!).
Moreover, since this technique explicitly enumerates where function calls dispatch to, it provides other optimizations, like inlining, exactly the information they need to do their job well. Functions hidden behind a pointer can rarely be inlined at call sites!
Tradeoffs
Like every choice, lambda sets come with their own challenges and disadvantages. We’re optimistic that their runtime performance characteristics and capacity to unblock other, powerful compiler optimizations over regions of a program make them the right choice for Roc. Nevertheless, here are a few considerations that need to be taken into account:
- The time to monomorphize a program increases proportionally with the number of unique lambda sets present in a program. An alternative to this is to globally update lambda sets and their dispatch sites with all lambdas of a type in the program, but this has other problems as described previously, in particular with regards to incremental compilation.
- At the limit, a monomorphized lambda set can describe the entire tree of calls through a program, from its entry point until its exit point. This happens when a function captures a calling function that captures another calling function, and so on. This situation does arise in practice, and we’ve found that care is needed to maintain compiler performance when it happens. Ensuring good compiler performance in such cases is an active area of Roc’s development.
- A key assumption in assessing the runtime performance of the lambda set compilation technique is that loading and switching on an integer (e.g. implemented as a jump table) is faster than loading and dereferencing a function pointer. This is a reasonable assumption in general (especially when there is nothing to switch over, that is, the switch is constant); however, whether it holds for a specific use case on a specific operating system and architecture is impossible to determine without benchmarks. If you’re familiar with computer architecture, you may recognize that the fundamental tradeoff in the defunctionalization approach boils down to the efficacy of a CPU’s branch predictor (for the conditional-dispatch case) relative to its branch target predictor (for the pointer-chasing case). I am not aware of any existing studies that measure the relative efficacies of these predictors across CPUs; if you are aware of any, please let us know!
- Type-based defunctionalization, and lambda sets in particular, are a relatively new idea with little prior art. That means there are few, if any, references to turn to when questions arise, and there are open questions about the full power and limitations of lambda sets.
So, that’s a whirlwind tour of how to compile closures, and how Roc uses a unique and highly-performant technique. We haven’t covered the nuances, but I hope this has given you a sense for the tradeoffs and decisions compiler authors make in support of language features like closures.
Before we leave, you may have been wondering why the lambda set syntax includes two opening and closing braces -[[ ... ]]-> rather than just one. That’s due to an extension of lambda sets we’ll discuss in the future! As a teaser (and challenge to the interested), I’ll say that it’s not clear how type-based defunctionalization should interact with typeclasses, or ad-hoc polymorphism in general. If you’d like to think about this problem beforehand, consider what happens when a caller dispatches to a function hash that has no single implementation, but is instead determined conditionally based on the type fed into it (that is, it is implemented as part of an interface, or typeclass). We’ll share Roc’s answer to this challenge soon.
Footnotes
1. Roc supports ad-hoc polymorphism via a feature called “abilities”. They are comparable to typeclasses, traits, or interfaces in other languages. ↩
2. Let’s say two functions have the same shape if the language permits you to call them and use their result in the same way; for statically-typed languages that might be enforced during type checking. ↩
3. If at this point you’re saying “wait a minute…,” just hold on, I think I’m with you. We’ll get there in no time. ↩
4. You may sometimes hear of a related technique known as lambda lifting. Lambda lifting is a way of eliminating variable captures by re-writing a program with capturing functions lifted to top-level functions, where the captured values become formal parameters to the function. This might sound a lot like closure conversion, and it is - the main difference is that you can no longer pass around capturing functions as values conditionally, at least not without also doing closure conversion, because the whole idea of captures has been eliminated in such a program! ↩
5. To be precise, C++ implements closures as function objects. In my opinion, this distinction is only nominal; at the end of the day, the compiled code still assembles to a function pointer and a bag of captured variables. ↩
6. Rust uses roughly the same model to compile closures, though it is a little more complicated for the developer due to Rust’s ownership semantics. ↩
7. Defunctionalization has interesting applications besides efficient higher-order-function compilation - see this blog post for a discussion. ↩
8. In languages with objects and dynamic dispatch, defunctionalization of function objects might fall out of the common optimization of devirtualization. ↩
9. Importantly, the lambda sets typed between the arrow types are never exposed to the user - that is, you cannot explicitly tell the compiler what lambda sets a function type consists of. They are entirely an implementation detail known only during compilation. ↩
10. For brevity I’ve elided some important details; this is not exactly the transformation, but the main idea is the same. ↩ ↩2