The sufficiently okay compiler
digikar writes that SBCL cannot specialise the generic addition in
sf-vdot
as defined in:
(declaim (inline cl-vdot)) (defun cl-vdot (veca vecb) (declare (type (simple-array * 1) veca vecb)) (let ((len (length veca))) (loop for i below len summing (* (aref veca i) (aref vecb i))))) (defun sf-vdot (x y) (declare (type (simple-array single-float 1) x y) (optimize speed)) (cl-vdot x y))
SBCL knows that the elements of veca
and vecb
in the
inlined version of cl-vdot
in sf-vdot
are of type
single-float
, and thus can generate specialised code for *
.
But it cannot generate specialised code for the implicit +
in
the summing
clause of the loop
, as the sum is initialised
to 0
and thus the type of the sum is
(or (eql 0) single-float)
.1
How might we optimise this further? For brevity I'll just work on this simpler code with the same issue:
(defun sum (vector) (declare ((simple-array single-float 1) vector)) (loop for e across vector sum e))
The first step is to macroexpand loop
:
(defun sum (vector) (declare ((simple-array single-float 1) vector)) (let ((sum 0) (index 0)) (loop (when (= index (length vector)) (return sum)) (setq sum (+ sum (aref vector index))) (setq index (+ index 1)))))
(Expanding extended loop
to simple loop
is good enough for me.)
At this point we can infer types, to see if we can optimise anything.
We have two assignments to sum
: it is first initialised to the integer
0
, and then due to float contagion it is re-assigned with the
single-float result of adding sum
and a single-float element from
vector
. The type of sum
is the union of all assignments i.e.
(or (eql 0) single-float)
. We could generate inline code which
handles both cases2 instead of calling generic-+
which must check
all possible types, but we can still do better.
Converting this code into static single assignment form yields:
sum₀ = 0 index₀ = 0 loop: sum₁ = φ(sum₀, sum₂) index₁ = φ(index₀, index₂) length = length(vector) if index = length then goto exit else goto loop.2 loop.2: element = single-float-aref(vector, index₁) sum₂ = generic-+(sum₁, element) index₂ = fixnum-+(index₁, 1) goto loop exit: return sum₁
We don't strictly need to convert to SSA form to perform the optimisations I'll describe, but using SSA makes presenting the transformations somewhat neater. If you don't know SSA, think of this as an imperative program where the φ function returns the value of whichever variable that the program most recently assigned to.
Inlining generic-+
with the known types yields a new definition of sum₂
:
loop.2: ... sum₂ = single-float-+(to-single-float(sum₁), element) ...
which has the addition inlined, but to-single-float
still requires
dispatching on the type of sum₁
. However, we have the property that
φ is distributive - for some pure function f
, f(φ(x, y)) = φ(f(x), f(y))
.
Generally we would not optimise using this property, because we
turn one call to f
into two, but currently sum₁
has an or
type
which we could simplify by splitting. Distributing to-single-float
yields:
sum₀ = 0 sum₃ = to-single-float(sum₀) index₀ = 0 loop: sum₄ = φ(sum₃, sum₅) index₁ = φ(index₀, index₂) length = length(vector) if index = length then goto exit else goto loop.2 loop.2: element = single-float-aref(vector, index₁) sum₂ = single-float-+(sum₄, element) sum₅ = to-single-float(sum₂) index₂ = fixnum-+(index₁, 1) goto loop exit: sum₁ = φ(sum₀, sum₂) return sum₁
(I also took the liberty to move sum₁
out of the loop, which comes
"for free" with the sea of nodes intermediate representation. Keeping
sum₁
inside of the loop might not be the worst thing in the world
with out-of-order execution; recall that each loop iteration depends
on the last value of sum₄
, but sum₁
is only used when returning.)
We may simplify away to-single-float
in sum₃
due to having
a constant argument and in sum₂
due to having a single-float
argument already:
sum₀ = 0 sum₃ = 0.0 index₀ = 0 loop: sum₄ = φ(sum₃, sum₂) index₁ = φ(index₀, index₂) length = length(vector) if index = length then goto exit else goto loop.2 loop.2: element = single-float-aref(vector, index₁) sum₂ = single-float-+(sum₁, element) index₂ = fixnum-+(index₁, 1) goto loop exit: sum₁ = φ(sum₀, sum₂) return sum₁
Now there is no type dispatch inside the loop. The careful reader might have spotted that I didn't consider type tagging at all; tagging and untagging operations can also be eliminated with simple rewrite rules. Nonetheless, the optimisation we walked through only requires type inference and simple rewrite rules (also known as peephole optimisation).
It's understandable why SBCL doesn't do any of this: SBCL uses the compiler from CMUCL, which predates both single static assignment and the sea of nodes, and implementing either would require large changes to the compiler. But I don't think the techniques on display here are too hard or esoteric; there's many techniques unused in Common Lisp implementations which are in "sufficiently okay compiler" territory. We don't need new languages for ease of optimisation, as it is not hard to optimise the current ones with the right approaches. Nor do we need compilers specialised to any domain - the combination of peepholes and type inference works well in general.