Overview
Haskell, as implemented in GHC, has a very rich language for expressing computations in types. Thanks to the DataKinds extension, any inductively defined data type can be used not only at the term level, but also at the type level. A notable exception are strings, which provide the main theme for today’s blog post.
The String
type in Haskell is defined as a list of Char
s. However,
the type-level equivalent, Symbol
, is defined as a primitive in GHC,
presumably for efficiency. After all, the type checker passes these
types around, and the simpler their structure, the less potential work
the constraint solver needs to do.
The problem is this: since Symbol
is defined as a primitive, there
is no way to pattern match on its structure, and the only way to
interact with them are by using the built-in primitive operations,
namely appending and (efficient, constant-time) comparison.
In this blog post, I will show how these primitives can be used to recover the ability to do arbitrary introspection of these type-level string literals, thereby enabling a whole range of applications where statically known information can be exploited.
The technique presented here was inspired by Daniel Winograd-Cort’s pull request for the generic-lens library.
All of this is packaged into the symbols library.
Motivation
I have written about type-level symbol
parsing in PureScript to implement a type-safe printf
function. (There, I achieved symbol decomposition by patching the
compiler, but no such thing is required here.)
Reusing that example, we will be able to write
The implementation of the printf example using the technique described in this blog post can be found on github.
Primitives
First, let’s have a look at the primitives GHC provides for
manipulating type of kind Symbol
, namely AppendSymbol
and
CmpSymbol
.
These functions are implemented in the compiler, and exported from the GHC.TypeLits module:
Note that there is no Uncons
primitive that returns the head (first
character) and the tail of the symbol. It turns out that we can
implement Uncons
using the two primitives above.
AppendSymbol
The fact that AppendSymbol
is a type family suggests a rather
straightforward semantics. It appends two symbols together resulting
in a third one:
That is to say, it should only go in one way, so to speak.
However, if we have a look at the
implementation
in GHC, we can see that there’s more going on. There are special rules
for the interaction of AppendSymbol
constraints with equality
constraints. In concrete terms, GHC will solve the following
constraint:
That is, if we know a prefix of a symbol, we can decompose it to get
the matching suffix. Morally, the actual signature of
AppendSymbol
would be closer to
But this can’t be expressed today in GHC (type family dependencies
only allow the inputs to be decided solely by the result, and no
such combination of inputs and outputs are allowed), so AppendSymbol
really is a lot more powerful than what the type system would like to admit!
Even with the ability to decompose symbols, there is a problem, however. This decomposition only works if we know what the prefix is. And in general, we need to know two out of the three symbols involved in the constraint to get the third.
As a result, the following won’t work:
that is, suffix
is unsolved.
We might think that we can just try all possible characters as potential prefixes until one matches, but that would require backtracking in the constraint solver, and GHC’s constraint solver doesn’t backtrack.
That is, trying a prefix that doesn’t match results in an unsolvable constraint:
But since we can’t backtrack, there is no way to try a different character once we’ve committed to a particular prefix.
If we knew what the first character was, we could strip it off and get the remaining symbol this way, which would allow us to treat Symbols as a list of characters essentially.
CmpSymbol
It turns out that we can simply use alphabetical ordering to find out
what the first character of a string is. CmpSymbol
compares two symbols,
and returns one of LT
, EQ
, or GT
as a result.
Observe that for any string longer than one, it’s always true that the
string follows its first character alphabetically, and precedes any
character after its first one. As an example, consider the string
"hello world"
, whose first character is h
, and the letter after
h
is i
. Then we have
For strings of length one, they will simply return EQ
when compared
with their first character (themselves).
Decomposition
We now put the pieces together to implement an uncons function for
symbols. First, we need Head
, a function that returns the first
character of a symbol. Second, we will use Head
to interact with
AppendSymbol
to retrieve the tail of the symbol. Doing this
repeatedly will allow us to turn a symbol into a list of characters,
which in turn can be consumed by ordinary type families.
Head
So, to find out what the first character of a symbol is, we just need to find the last character in the ASCII table that precedes our symbol. To do this reasonably efficiently, we use binary search. Since indexing into a type-level list takes linear time, we use a balanced binary search tree instead. Recall that symbol comparisons are constant-time, so the whole operation is constant time (as we’re working with a fixed size alphabet), so this optimisation simply improves the constant factor by an order of magnitude.
The printable subset of the ASCII character set can be encoded as the following tree:
(I generated this structure with the help of other type families, but found that inlining the result into the source file results in much faster lookups.)
Note that each node contains two consecutive characters: this is so that we can easily decide when to stop: when the first element is less than, and the second element is greater than our input string.
The Lookup
type family (and Lookup2
, to make up for a lack of local
declarations in type families) implements a standard binary search.
Finally, Head
is just a lookup in the binary tree.
Uncons
Next, we need to interact the AppendSymbol
constraint with
Head
. We now turn to a type class, Uncons
:
sym
is our symbol, h
is the head, and t
is the tail. It would
be nice to have a functional dependency sym -> h t
, but
unfortunately we can’t make that pass, as recall that the backwards
dependencies of AppendSymbol
are essentially hidden from the type
system.
We write a single instance, which sets up the right constraints:
First, we write h ~ Head sym
, which unifies h
with the first
element of the symbol using the binary lookup defined
previously. Then, the AppendSymbol h t ~ sym
constraint will trigger
the solution of t
, due to the now known prefix h
.
The uncons
member is not necessary for things to work out, but it
helps illustrate the working of the type class in the REPL:
Finally, we can write the Listify
class to recursively break down a
symbol into a list of characters:
And with this, we can parse anything we’d like.
Conclusion
Of course all of the above could be done a lot more efficiently with compiler support, and there’s no reason for that not to happen at some point in the future. This post is just a proof of concept that something like this is already possible today, and the presented technique is suitable for some lightweight applications. For anything larger scale, Template Haskell is probably much better suited for the job today.