Rob Everest
University of New South Wales

Embedding foreign code
----------------------

Special-purpose embedded languages facilitate generating high-performance code
from purely functional high-level code; for example, we want to program highly
parallel GPUs without the usual high barrier to entry and the time-consuming
development process. To enable this we have Accelerate, a Haskell EDSL that
takes a skeleton-based, generative approach to producing low-level CUDA code
for execution on Nvidia GPUs. In this talk, I will describe work by myself
(Robert Clifton-Everest), Trevor L. McDonell, Manuel M. T. Chakravarty, and
Gabriele Keller, on which our paper was recently accepted to PADL.

I will describe our solution to some of the practical problems with
skeleton-based code generation and introduce an approach to enabling
interoperability with native code. In particular, I will describe how template
metaprogramming simplifies code generation and optimisation. Furthermore, I
will present a design for a foreign function interface for an embedded
language.

Earlier versions of Accelerate implemented CUDA C code templates and template
instantiation with a mixture of C++ templates and C preprocessor macros. This
solution, while workable, was not ideal. Not only was it fragile, but it
provided no guarantees on the validity of the generated code. By leveraging
the quasiquotation extension to Template Haskell we are able to define CUDA
skeletons as quoted CUDA C templates. Defined this way, each skeleton has its
syntactic validity checked at compile time. This ensures that if we
can compile the backend, the code generated will always be syntactically
valid. In addition, we can implement consumer-producer fusion by specific
skeleton instantiation.
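
The idea of fusing a producer into a consumer by instantiating a skeleton
can be sketched in a few lines of Haskell. This is an illustration only, not
Accelerate's code: the names `Producer`, `mapP`, `mapSkeleton`, and
`fusedKernel` are hypothetical, and plain strings stand in for quasiquoted
CUDA C, so this sketch has none of the compile-time syntax checking described
above.

```haskell
-- Hypothetical sketch: a map skeleton as a plain-text CUDA C template.
-- Strings stand in for quasiquoted CUDA C, so nothing here is checked
-- for syntactic validity the way a real quasiquote would be.

-- A scalar function, represented by how it rewrites a C expression.
type ScalarFun = String -> String

-- A producer yields the C expression for the element at a given index.
type Producer = String -> String

-- Read an element straight from a manifest input array.
fromArray :: String -> Producer
fromArray arr ix = arr ++ "[" ++ ix ++ "]"

-- Fuse a map into a producer: no intermediate array is materialised.
mapP :: ScalarFun -> Producer -> Producer
mapP f p ix = f (p ix)

-- Instantiate the skeleton: the producer's expression is inlined at the
-- point where the consumer writes its output element.
mapSkeleton :: String -> Producer -> String
mapSkeleton name producer = unlines
  [ "extern \"C\" __global__ void " ++ name
      ++ "(const float *xs, float *ys, int n)"
  , "{"
  , "    int i = blockIdx.x * blockDim.x + threadIdx.x;"
  , "    if (i < n) {"
  , "        ys[i] = " ++ producer "i" ++ ";"
  , "    }"
  , "}"
  ]

-- Two maps over xs, (*2) then (+1), collapse into a single kernel.
fusedKernel :: String
fusedKernel =
  mapSkeleton "fused_map"
    (mapP (\e -> "(" ++ e ++ " + 1.0f)")
      (mapP (\e -> "(" ++ e ++ " * 2.0f)")
        (fromArray "xs")))
```

Printing `fusedKernel` with `putStr` shows one kernel whose body computes
both maps in a single pass over `xs`, which is the effect fusion is after.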

In the latter part of the talk I will present what is, to my knowledge, the
first foreign function interface for an embedded language. By leveraging
Haskell's own FFI and Template Haskell we are able to build an FFI into
Accelerate that allows both calling foreign functions from within an
Accelerate program and embedding an Accelerate program into an existing
CUDA C/C++ program.
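
To make the first direction concrete, here is a minimal sketch of how an
embedded language might wrap a foreign function obtained through Haskell's
own FFI. It uses a toy expression language, not Accelerate, and every name
in it (`Exp`, `Backend`, `eval`, `foreignCos`, `taylorCos`) is hypothetical.
Pairing the foreign function with a pure fallback for backends that cannot
call it is an assumption of this sketch.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble (..))

-- Haskell's own FFI: bind cos from the C maths library.
foreign import ccall unsafe "math.h cos"
  c_cos :: CDouble -> CDouble

-- A toy embedded expression language (hypothetical, not Accelerate's AST).
data Exp
  = Const Double
  | Add Exp Exp
  | Mul Exp Exp
    -- name, native implementation, pure fallback, argument
  | Foreign String (Double -> Double) (Exp -> Exp) Exp

-- Two hypothetical backends: one can call native code, one cannot.
data Backend = Native | Portable

eval :: Backend -> Exp -> Double
eval _ (Const x) = x
eval b (Add x y) = eval b x + eval b y
eval b (Mul x y) = eval b x * eval b y
-- The native backend calls the foreign function directly.
eval Native (Foreign _ native _ x) = native (eval Native x)
-- Any other backend evaluates the fallback written in the language itself.
eval Portable (Foreign _ _ fallback x) = eval Portable (fallback x)

-- Wrap the C function as an embedded-language operation.
foreignCos :: Exp -> Exp
foreignCos = Foreign "cos" (realToFrac . c_cos . realToFrac) taylorCos

-- Fallback: truncated Taylor series 1 - x^2/2 + x^4/24.
taylorCos :: Exp -> Exp
taylorCos x =
  Add (Const 1)
      (Add (Mul (Const (-0.5)) x2)
           (Mul (Const (1 / 24)) (Mul x2 x2)))
  where
    x2 = Mul x x
```

With these definitions, `eval Native (foreignCos (Const 0))` goes through
the C library, while `eval Portable (foreignCos (Const 0))` evaluates the
series; the two agree exactly at 0 and only approximately elsewhere.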