Overview
One of the classic examples that keeps coming up when talking about dependently
typed programming languages is the “safe” printf
function – one that ensures
that the number and type of arguments match the requirement in the format
specification.
In languages like Idris, this is just a function that takes a format string, and returns the type of arguments required for constructing the formatted output string.
Other languages, like rust, solve this by various means of metaprogramming: writing a program (macro) that runs at compile-time, generating the program to be executed at runtime.
What these two approaches have in common is that they both operate on strings that are statically available to the compiler. The aim of this post is to show another way of achieving the same result, with tools that are available in PureScript – a strongly-typed functional language, with no dependent types.
The problem
We want to write a program that takes a format string, some number of arguments, and returns the result of inserting the arguments at their specified places in the format string, and does all this in a type-safe way.
The @
symbol before the string is the proxy syntax introduced in 0.12
which provides a concise way of passing types around. The format strings
are actually type-level literals – but more on this later.
Crucially, we need to compute a type from some input, but because PureScript has no dependent types, values and functions in the traditional sense are not available for evaluation at compile-time. However, there is a way to interact with the compiler: via the type-checker.
The solution therefore is to encode this computation in the types, and have the
type-checker evaluate it for us as part of type-checking. Luckily, PureScript
allows string literals in types (these are types whose kind is Symbol
).
Thus, constructing our printf
function comprises two steps:
- parse the input
Symbol
into a list of format tokens - generate the function from the format list that will then assemble the output string
Type-level parsing
For the sake of simplicity, we’re going to focus on two types of format
specifiers: decimals (%d
) and strings (%s
).
We represent these cases with a custom kind, which is like a regular algebraic datatype, but lifted to the type-level. This means that these constructors can be used in types.
Of course, apart from the format specifiers %d
and %s
, everything else is a
literal, which we account for by wrapping them in the Lit
type constructor.
The foreign import
bit means that we’re introducing types here that have no
constructors. That is to say, it’s impossible to construct a value of type D
and S
. We’ll see later how it is still possible to carry these types around
in terms (hint: proxies).
Furthermore, we need a way of representing a sequence of these specifiers, for which we introduce another kind:
With this, we can now write types like FCons D (FCons (Lit " foo") FNil)
,
corresponding to the string %d foo
.
Kind-polymorphism is not supported by the current version (0.12) of PureScript,
so we can’t define a parametric type-level list once and for all – we need a
new one for each type we want to store in lists. With this, and some syntactic
sugar, we would be able to write (as we can in Haskell today) [D, "foo"]
.
This limitation is likely to be removed in a future version of the compiler.
With these building blocks defined, now we have a vocabulary for talking about
the parser itself: it is a function that takes a Symbol
as an input, and
returns a FList
. We encode the computation in the following type class:
The functional dependency string -> format
states that the input string
determines the ouput format
. This bit is crucial, as this is what tells the
compiler that knowing string
is sufficient in determining what the value of
format
is. It is then our task to ensure that this dependency indeed holds,
when writing out the instances.
To deconstruct the input symbol, we use the following type class available in 0.12:
The interesting functional dependency here is the sym -> head tail
, which,
given some symbol, deconstructs it into its head
(the first character) and
its tail
– the rest.
The parser is like a state machine, with the following legal states:
- State 1: found a non-
%
character - State 2: found a
%
character
One possible way of representing these states is by having a separate type class to deal with each.
Since in our simplified example, we know that the specifier symbols can only be single characters, we can define the second state as:
That is, it takes a symbol, and returns the matching specifier. The implementation is straightforward:
This is a partial function, which means that format strings that contain unsupported specifier tokens will simply fail to compile.
The first state is more complicated, as it can consume an arbitrary number of
characters, so we pass it the remaining string (tail
) as well.
Parse1
represents the parsing state where we have the current character
head
, the rest of the input string tail
, and we know that the previous
character was not a %
.
The first case is when the tail is empty. In this case, we just return the current character as the literal in a singleton list:
The second case is more interesting. This is when we find a %
, so we need to
invoke the other function, Parse2
, which handles parsing the specifier
itself. To do that, we use ConsSymbol
to split our current tail s
into
its head h
and tail t
. h
contains the format specifier, which we pass on
to Parse2
. Then, recursively invoke Parse
on t
to parse the rest of
the input. In addition to returning spec
consed to rest
, we also put
an empty string literal at the head of the output list: this is to maintain
the invariant that the head of the output list always contains a string literal.
This invariant will be useful for the last case…
…when we match any other character, i.e. other than %
. Since we’re in
Parse1
, that means that the current character needs to be in a string
literal. For this, we first recursively parse the tail s
into
FCons (Lit acc) r
. The reason we want to know that at the head of parsing
the remaining string is a Lit
is so that we can prepend the current character
to that literal – we need to rebuild long string literals
character-by-character after all. This is where the invariant from the previous
two cases is useful: we don’t have to handle the cases where the head is not
a Lit
, because the recursive calls guarantee that it is. acc
is thus the
tail of the string literal we’re currently parsing, so we put it together
with the current character by ConsSymbol o acc rest
(recall that this type
class can both construct and deconstruct symbols via its functional
dependencies). Then we simply return Lit rest
along with r
.
Notice how these instances actually overlap. In the third case, we can
easily imagine a particular instantiation of o
and r
such that it
matches the instance head in the second case. In other words, when the current
character is %
, both parse1Pc
and parse1Other
match (because
parse1Other
is more general).
To make sure that the instances are selected in the order we want them to be,
we use instance chains. That is, by writing instance A else instance B
we
tell the compiler to try to match instance A
first, and if it fails, then try
B
. This is a new feature in PureScript 0.12, and a very powerful one – it
allows us to avoid the overlapping instance problem for good.
Finally, we need to actually kick off the parser. We do this by invoking it in the first state.
How the sausage gets made: computing the output type
But how do we know how many arguments we need to pass to the formatter? It depends on the format string! No surprises here: just like all the previous type-level computations, this one will also be encoded in a type class with a functional dependency.
The @
symbol is special syntax, and in this case, it means that the formatF
function takes an FList
(format
) as an input. But because FList
is a
custom kind, it has no value-level inhabitants. So, how can we still get
something whose type mentions format
? This is what @
does – it’s a proxy
for a type. Its value is isomorphic to Unit
, and carries no information,
other than its type. Notice that it works for any kind – indeed, proxies are
currently a special-cased type in PureScript, in that they are kind-polymorphic.
Thus formatF
takes a format list, and an accumulator string, and returns some
fun
– this type depends on the actual format list.
Starting with the base case, when there’s nothing to print, simply just return the accumulated formatted string.
When the head of the list is D
, we know that we will need an Int
argument,
and the rest of the function’s type can be computed by recursing on the tail of
the list. As for the implementation, since the return type is now refined to
be of the form Int -> fun
, we are allowed to construct a lambda that takes
the Int
, and appends it to the end of the accumulator, then recurses on the
rest. The implementation of S
is identical, and is omitted for brevity.
Handling literals (Lit
) is left as an exercise for the reader.
Conclusion
Finally, as a matter of convenience, we can wrap the above type classes into one, that serves as a bridge between the parser and the formatter, as such:
And that’s it! It might be instructional to try and work out FormatF
’s
instance resolution for a few simple examples by hand, to get a better idea why
this works. A fully working implementation of the code in this post can be
found on github.