The sufficiently okay compiler

digikar writes that SBCL cannot specialise the generic addition in sf-vdot as defined in:

(declaim (inline cl-vdot))
(defun cl-vdot (veca vecb)
  (declare (type (simple-array * 1) veca vecb))
  (let ((len (length veca)))
    (loop for i below len
          summing (* (aref veca i)
                     (aref vecb i)))))

(defun sf-vdot (x y)
  (declare (type (simple-array single-float 1) x y)
           (optimize speed))
  (cl-vdot x y))

SBCL knows that the elements of veca and vecb in the inlined version of cl-vdot in sf-vdot are of type single-float, and thus can generate specialised code for *. But it cannot generate specialised code for the implicit + in the summing clause of the loop, as the sum is initialised to 0 and thus the type of the sum is (or (eql 0) single-float).¹

How might we optimise this further? For brevity I'll just work on this simpler code with the same issue:

(defun sum (vector)
  (declare ((simple-array single-float 1) vector))
  (loop for e across vector sum e))

The first step is to macroexpand loop:

(defun sum (vector)
  (declare ((simple-array single-float 1) vector))
  (let ((sum 0)
        (index 0))
    (loop
      (when (= index (length vector)) (return sum))
      (setq sum (+ sum (aref vector index)))
      (setq index (+ index 1)))))

(Expanding extended loop to simple loop is good enough for me.)

At this point we can infer types, to see if we can optimise anything. We have two assignments to sum: it is first initialised to the integer 0, and then due to float contagion it is re-assigned with the single-float result of adding sum and a single-float element from vector. The type of sum is the union of all assignments i.e. (or (eql 0) single-float). We could generate inline code which handles both cases² instead of calling generic-+ which must check all possible types, but we can still do better.

Converting this code into static single assignment form yields:

        sum₀ = 0
        index₀ = 0
loop:
        sum₁ = φ(sum₀, sum₂)
        index₁ = φ(index₀, index₂)
        length = length(vector)
        if index = length then goto exit else goto loop.2
loop.2:
        element = single-float-aref(vector, index₁)
        sum₂ = generic-+(sum₁, element)
        index₂ = fixnum-+(index₁, 1)
        goto loop
exit:
        return sum₁

We don't strictly need to convert to SSA form to perform the optimisations I'll describe, but using SSA makes presenting the transformations somewhat neater. If you don't know SSA, think of this as an imperative program where the φ function returns the value of whichever variable that the program most recently assigned to.

Inlining generic-+ with the known types yields a new definition of sum₂:

loop.2:
        ...
        sum₂ = single-float-+(to-single-float(sum₁), element)
        ...

which has the addition inlined, but to-single-float still requires dispatching on the type of sum₁. However, we have the property that φ is distributive - for some pure function f, f(φ(x, y)) = φ(f(x), f(y)). Generally we would not optimise using this property, because we turn one call to f into two, but currently sum₁ has an or type which we could simplify by splitting. Distributing to-single-float yields:

        sum₀ = 0
        sum₃ = to-single-float(sum₀)
        index₀ = 0
loop:
        sum₄ = φ(sum₃, sum₅)
        index₁ = φ(index₀, index₂)
        length = length(vector)
        if index = length then goto exit else goto loop.2
loop.2:
        element = single-float-aref(vector, index₁)
        sum₂ = single-float-+(sum₄, element)
        sum₅ = to-single-float(sum₂)
        index₂ = fixnum-+(index₁, 1)
        goto loop
exit:
        sum₁ = φ(sum₀, sum₂)
        return sum₁

(I also took the liberty to move sum₁ out of the loop, which comes "for free" with the sea of nodes intermediate representation. Keeping sum₁ inside of the loop might not be the worst thing in the world with out-of-order execution; recall that each loop iteration depends on the last value of sum₄, but sum₁ is only used when returning.)

We may simplify away to-single-float in sum₃ due to having a constant argument and in sum₂ due to having a single-float argument already:

        sum₀ = 0
        sum₃ = 0.0
        index₀ = 0
loop:
        sum₄ = φ(sum₃, sum₂)
        index₁ = φ(index₀, index₂)
        length = length(vector)
        if index = length then goto exit else goto loop.2
loop.2:
        element = single-float-aref(vector, index₁)
        sum₂ = single-float-+(sum₁, element)
        index₂ = fixnum-+(index₁, 1)
        goto loop
exit:
        sum₁ = φ(sum₀, sum₂)
        return sum₁

Now there is no type dispatch inside the loop. The careful reader might have spotted that I didn't consider type tagging at all; tagging and untagging operations can also be eliminated with simple rewrite rules. Nonetheless, the optimisation we walked through only requires type inference and simple rewrite rules (also known as peephole optimisation).

It's understandable why SBCL doesn't do any of this: SBCL uses the compiler from CMUCL, which predates both single static assignment and the sea of nodes, and implementing either would require large changes to the compiler. But I don't think the techniques on display here are too hard or esoteric; there's many techniques unused in Common Lisp implementations which are in "sufficiently okay compiler" territory. We don't need new languages for ease of optimisation, as it is not hard to optimise the current ones with the right approaches. Nor do we need compilers specialised to any domain - the combination of peepholes and type inference works well in general.

Footnotes:

Actually, SBCL infers the wider type (or (eql 0) float (complex single-float) (complex double-float)).

Clozure always generates inline code for addition on fixnums without type information, but does not generate specialised code for or types.