Wednesday, June 24, 2009

A quick note about aliasing

Two things quickly:

First: someone (pkhoung) pointed out to me that I had misremembered the atomicity of incf. We can get atomicity using sb-ext:atomic-incf, but the usual one won't lock. I hadn't worried about this much anyway when threading across rows because a) the worst thing a race condition could do was recompute a row and b) this was highly unlikely. Modifying the code to either use a mutex there or to use the atomic version is easy enough, but not really needed. But I should have double checked before adding the "atomic, so no mutex" comment....


Secondly. As I'm sure you're all happy to hear, I'm done ranting about why I think the gcls benchmark actually does a poor job of it's stated goals --- but I haven't finished talking about why approaches like that do a poor job of the computation itself. I won't have anything else to say about the particulars of the benchmark, but while last post talked a bit about addressing the very poor estimates that 50 iterations gives (and there is more to say about estimating the set-membership), there is another big problem hanging over our heads, that I've only obliquely referenced so far.

The fundamental issue is that the set boundary is the interesting part, but the Whittaker (also known as Shannon and/or Nyquist) sampling theorem tells us that we won't be able to reproduce the boundary accurately. More on that later, but here's a couple of pictures.

Consider these 3 panels:

This is a detail near the complex point (-0.74453892,0.12172418) at roughly comparable resolution to that discussed previously (e.g. the 16000x16000 case). I've double the size of all pixels though, to make it easier to see. Hopefully at the posted resolutions this is fairly visible, but you may need to click through to the image itself to see much.

The leftmost panel is what you get from a million iterations deciding set membership. The middle panel is showing, in grayscale rather than black and white, the averaged membership of 100 samples inside the pixel, and the leftmost panel is just thresholding the middle one back to a binary image so you can compare it with the 1st panel.

So this shows how aliasing is affecting the boundary, at least. But last day I mentioned how the Mandelbrot set is actually simply connected .... so we know that the above images are in some sense impossible, because this black region is isolated in the white (out of the set) background. So what's going on here?

I'll discuss this all next time, but for now, here is a larger detail of the same area. It is inverted relative to the above (whit is in set, black out of it). Range of gray values relates to how long it took points within the pixel to escape the disc of radius 2. I've messed with the image a bit though, to make things easier to see, so the densities don't correspond exactly to anything. Solid red shows the (same) central region that is considered in the set, but red circles also show points were at least one random sample in the pixel is also considered in the set.

In this case sampled a few hundred times from each one, and as you can see this wasn't terribly effective at finding the set inside of pixels. However, you can see how the brighter (but not white) pixels contain some information about where it might be.... something I'll take up later.

Monday, June 22, 2009

On the limitations of dumb but fast...

Suppose you ask your friend what time it is, and they say "Right *now*, it's 12:43 and 32 seconds, 323 microseconds .... er, plus or minus about twenty minutes. You'd think they were a bit wonky, wouldn't you?

This is roughly the same reaction I have to someone describing the boundary of the Mandelbrot set on a lattice of 16000x16000 positions, with an iteration count of 50. The spatial resolution is very high, but the accuracy is very low. In fact it's only one of two (what I would consider fatal) deep flaws with this approach, but the other one will have to wait until next time.

Now one (overly) simple way to improve the estimate is to increase the iteration count. So, for example, what if we use 50 million iterations instead of 50? Back of the envelope computations tell you that (knowing that the set is about 3/8 of the total area) the fastest benchmarked code takes 6s to do 50 iterations on this grid, so it will take about 26 days to do 50 million. I think I originally used 10 million as my proxy for "lots and lots", so that's still a bit over 5 days....

Of course there is no real analysis there (clearly more than 50 is better, but is 50 million actually appreciably better than 10 million? Better than 10,000? It's actually deeper than that, but I'll discuss this more later). But that wasn't the point. The point was that if you have scale things up to get more accuracy (which you will), you'll quickly run into long runtimes. This is entirely consistent with many "real world" computational situations. Far more so that the ones that are actually cheap. For that matter, 5 days is often fine. It's when your first back-of-the-envelope computation tells you need 5 centuries that you really have to go back to the drawing board.

Bear in mind, too, that I originally had only a single core to use, so things were much, much worse. Even just looking at a small piece of the lattice was going to take hours and hours, time I didn't want to wait for.

So what to do? Giving up on the tactic of just computing as fast as we can (i.e. "dumb, but fast"), we have to step back and thing about what's happening here. You have a lattice where roughly 1/2 the area is cheap to compute, roughly 1/2 very expensive (at these high iteration counts). This sort of thing cries out for a new approach. Examples: some way to avoid blindly iterating (e.g. period detection, convergence tests), some way to do successive refinement and hence avoid spending time where it isn't needed. Some way to avoid doing some computations at all.

All of these techniques could work well, but as I was in a hurry I looked to the latter one. I remembered another theorem about the Mandelbrot set. Recall that the original iteration-based computation is all based on a theorem that if an orbit ever leaves the radius 2 disk about the origin, it will diverge. Well how about this one: Doudy and Hubbard showed in the original modern work on the set (mid 1980s) that the set is actually (simply) connected, something not at all obvious. So one practical implication is that if you draw an unbroken loop anywhere on the complex plane, if all of the points on that contour are in the set, then its interior is well, and the same for points outside the set, nearly. There are no isolated points in the set. You have to be a little bit careful here, because while the set itself is simply connected, the points outside the set includes all of the complex plane outside the disc of radius 2 at the origin. You don't want to draw a line around the entire set and conclude it is empty! (i.e the exterior is connected, the interior simply connected, the boundary, on the other hand, is weird) [update I lost part of this about the distinction originally, thanks to anon for catching it]

This immediately leads to an algorithm for spatial subdivision. Take the region you are interested and look at its boundary. If all points are in (out) of the set, record the interior points as in (out) of the set. If not, subdivide the region into a partition (e.g. four subregions) and apply the same approach to each subregion. Continue recursively until all points are covered.

Now of course in a discrete setting you stop at the point of one pixel, or in practise larger than that. Beyond that though, there is a as source of error. Because we aren't drawing continuous lines, the sampling may miss a small piece of the set, and thus erroneously conclude that the interior is solidly inside (outside) the set. There is a very practical way to mitigate this problem, but not remove entirely as I'll demonstrate.

Spatial subdivision FTW

It's pretty clear why this algorithm will win on computations if it encounters a subregion that is only withing the set. Rather than having to compute all of the interior points, it computes only the perimeter. The bigger this region is, the more you win, as the perimeter grows linearly while the area grows quadratically. The obvious way to keep a bunch of threads busy on this is to have them test a region, and if it fails, put the subregions on a stack somewhere. Then each thread just goes to the "job stack" for a new region when it needs new work to do.

However, there is still the overhead of specifying regions, and farming out these jobs to threads. The former can involve a lot of memory too. Think about our 16000x16000 case, we can subdivided it into 4 about 12 times, and each time you take 4 numbers specifying the regions, and replace it with 16 specifying 4 subregions. Take 64bit fixnums and unless I've screwed up that's about 64 Mb of potential state to be sloshed around just specifying where you are in the lattice, or about what twice the state being used to store the set values themselves (obviously this could be improved a bit, but that's a back-of-the-envelope bound on it)

Of course this isn't a terrible lot of state, but look what happened last time when I only added a function call and made it 4x slower! So while I knew that this approach would be much faster at large iterations, I assumed the overhead of managing regions would mean that with a tiny number of iterations like 50, "dumb but fast" would probably still win.

While writing up the previous post though, I had a thought. All I really had to do was partition the set enough to spread cheap and expensive subregions around enough to keep all the threads busy. Once in a region, the thread could just beaver away there by itself. As long as you don't interact spatially with other threads, you don't need any locking either. So this way I could keep all the cores busy but pay very little cost in organization. It's not as good at spreading the work around, but it will be very cheap.

So I tried this approach, and it's fast. Quite a bit faster than the current champion, actually, even at the 50 iteration case. Once you get into bigger iterations though, it really mops the floor with the "dumb but fast" approach. Here are the gcc (modified only to vary the number of iterations) case and two partitioning implementations:

This plot shows the expense (NB: in user cpu time) as you increase the number of iterations spent at each site. These are 16000x16000 lattices, loglog plot of seconds against the maximum number of iterations. So you can see that while the gcc code starts off at roughly 1/2 the cost of the lisp code, by the time you hit 1000 iterations the gap has widened to an order of magnitude, and stays there. I only ran this out to a million iterations, but that gives the final user time for the gcc code about 20x the user time of the gcls partition lisp code, and 40x the faster qtree2 (more on that to come) and wall clocks of about 12 hours, 40 minutes, 20 minutes, respectively. Wall clocks being a little skewed by other work on the machine over these longish periods. Of course my lisp code implementing a similar algorithm to the c effort would suffer the same fate, it would just be even a bit (roughly 1.75x) slower. This isn't about implementations, after all. Reporting in user time done to avoid problems with other jobs on these machines.

How fast is it on the benchmark case? Let's compare again with the c implementation, same machine and setup as last time, same declaim statement as previously. Again these are typical runs, and again we throw away the output to avoid NFS timing issues.

-bash-3.2$ /usr/bin/time -v ./mandelbrot.gcc-6.gcc_run 16000 > /dev/null
User time (seconds): 19.26
System time (seconds): 0.08
Percent of CPU this job got: 795%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.43
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8266
Voluntary context switches: 22
Involuntary context switches: 1001
-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.27
System time (seconds): 1.08
Percent of CPU this job got: 650%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.44
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 271688
Voluntary context switches: 206
Involuntary context switches: 488

So you can see that the lisp code does still have a bit of a core utilization problem, only getting about 88% of the 8 cores vs. the c codes 99%. And we have orders of magnitude more page faults. But it isn't enough to catch up to the better algorithm, the c runtime is about 1.7x the lisp runtime. Actually we can do better than that. The system time is heavy because of the big cache of uint32s that we don't actually need if we're only counting to fifty. Make those chars (so max iterations is now 255) and you get:

-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.09
System time (seconds): 0.42
Percent of CPU this job got: 657%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.29
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 101670
Voluntary context switches: 105
Involuntary context switches: 449

So were getting pretty close to a 2x multiple of the c code. We could improve things a bit more, especially since the system time is still 1/3 of wall time. I also bet there are tricks there I've missed. But I think the points been made; I don't really care on the low iteration count, though it is a bit amusing that it actually wins there as well. But as we move forward a bit trying to generate a good approximation to the set, this stuff will be useful.

Before anyone complains, I did mention the possible source of error, right? Now while the output for the 200x200 at 50 iterations case is identical (which means, unless I've misunderstood, this code would be a valid entry, not that it's important), with 16000x16000 at 50 iterations and this partitioning scheme there are 3 pixels out of 256000000, or about one in one hundred million. To put this in perspective, changing the number of iterations by one, either to 49 or 51, will change on the order of 100,000 pixel values. Note also that this doesn't mean that either algorithm is right about those 6 pixels, only that they differ on a poor estimation of them. The divergence is bigger, and more interesting, with better estimates, but even at 1000000 iterations they only differ by about 100 pixels at this resolution (note results can depend a bit on partitioning, obviously). This error is very small compared to other errors being made here (and still an order of magnitude smaller than the difference between 100000 iterations and 1000000).

So does it do exactly the same thing in less time? No, it doesn't. But it does a comparable job in much less time, which in effect means it can do a much better job in the same time. Which, after all is far more often what you're actually interested in this sort of work.

Of course, the down side is that even at 10,000,000 iterations the output of either method is still kind of crap. Next time, I'll explain why. It may be a while though, since I'm traveling soon.

Anyway, here is some code. "gcls partitioning" is a fairly clean implementation of the algorithm discussed above. "qtree2" is tweaked for a gain of about 2x. Since they are very similar, I'll only show the latter with all its warts. Bear in mind a lot has been sacrificed for speed in this case, and it hurts readability and generalizability. I'm not going to clean it up, sorry.

The first thing is the function count-iterations. I won't include it here because it is identical to iterate/df except that it returns the iteration count, [b]not[/b] zero or one. Make sure it is inlined.

(defun gcls/opt/qtree2 (width height &key (max-iters 51) (nthreads 8) (partitions 10) (min-size 4)
(real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(declare (type fixnum width height max-iters min-size) (double-float real-min real-max imag-min imag-max))
(let* ((array (make-array (list height (ceiling width 8)) :element-type '(unsigned-byte 8)))
(done (make-array (list height width) :element-type 'bit :initial-element 0))
(cache (make-array (list height width) :element-type '(unsigned-byte 32) :initial-element 0))
(seq (make-array (array-total-size array) :element-type '(unsigned-byte 8) :displaced-to array))
(regions)
(region-mutex (sb-thread:make-mutex))
(real-step (/ (- real-max real-min) width)) ;;with fencepost error in spec
(imag-step (/ (- imag-max imag-min) height)))
(declare (type double-float real-step imag-step))
(labels ((set-bit (i j v)
(multiple-value-bind (byte bit) (truncate j 8)
(declare (fixnum byte bit))
(if (zerop v)
(setf (aref array i byte) (logandc2 (aref array i byte) (ash 128 (- bit))))
(setf (aref array i byte) (logior (aref array i byte) (ash 128 (- bit)))))
(setf (aref done i j) 1)))
(result-at (i j)
(when (zerop (aref done i j))
(iterate i j))
(multiple-value-bind (byte bit) (truncate j 8)
(if (logand (aref array i byte) (ash 128 (- bit))) 1 0)))
(iterate (i j)
(if (zerop (aref done i j))
(let ((res (count-iterations (+ real-min (* j real-step)) (- imag-max (* i imag-step)) max-iters)))
(declare (type (unsigned-byte 8) res))
(setf (aref cache i j) res)
(set-bit i j (if (= res max-iters) 1 0))
res)
(aref cache i j)))
(compute-region (i-min i-max j-min j-max)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (result-at i j))))
(fill-region (i-min i-max j-min j-max val)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (set-bit i j val))))
(quadtree (i-min i-max j-min j-max)
(declare (type fixnum i-min i-max j-min j-max))
(if (or (<= (- i-max i-min) min-size) (<= (- j-max j-min) min-size))
(compute-region i-min i-max j-min j-max)
(let ((proto (iterate i-min j-min)))
(if (and (loop for i from i-min upto i-max always (= proto (iterate i j-min)))
(loop for i from i-min upto i-max always (= proto (iterate i j-max)))
(loop for j from j-min upto j-max always (= proto (iterate i-min j)))
(loop for j from j-min upto j-max always (= proto (iterate i-max j))))
(fill-region i-min i-max j-min j-max (if (= proto max-iters) 1 0))
(let ((i-mid (+ i-min (truncate (- i-max i-min) 2)))
(j-mid (+ j-min (truncate (- j-max j-min) 2))))
(quadtree i-min i-mid j-min j-mid)
(quadtree i-min i-mid j-mid j-max)
(quadtree i-mid i-max j-min j-mid)
(quadtree i-mid i-max j-mid j-max))))))
(worker ()
(loop for p = (sb-thread:with-mutex (region-mutex) (pop regions)) until (null p)
do (apply #'quadtree p))))
(setf regions (part2d 0 height 0 width partitions))
(mapc #'sb-thread:join-thread (loop repeat nthreads collecting (sb-thread:make-thread #'worker)))
(with-open-stream (stream (sb-sys:make-fd-stream (sb-sys:fd-stream-fd sb-sys:*stdout*)
:element-type '(unsigned-byte 8) :buffering :full :output t :input nil))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) stream)
(write-sequence seq stream)))))

So you can see in addition to the spatial partitioning, I'm remembering if I've already computed a value and not recomputing if I can avoid it. The whole thing is pretty hacky, and I'm not going to go over all the decisions that went into tweaking this. I iteratively improved three times, getting about a factor of 2 each time, but just looking at the code I realize it may not be obvious why things were done this way. Not also the highest possible iteration count here is 2^32 -1, but that's probably enough!

The other thing missing is part2d, a function that takes a region (a b c d) and breaks it into subregions. Here's an implementation:


(defun part2d (i-min i-max j-min j-max steps)
(declare (type fixnum i-min i-max j-min j-max steps))
(labels ((partition (a b steps)
(declare (fixnum a b st eps))
(loop for n of-type fixnum downfrom (1+ steps) above 0
for i of-type fixnum = a then (+ i (truncate (- b i) n))
collecting i)))
(loop repeat steps
for a on (partition i-min i-max steps) nconc
(loop repeat steps
for b on (partition j-min j-max steps)
collect (list (car a) (cadr a) (car b) (cadr b))))))


Obviously in this case the speed of the function isn't really important as it only is used in setup.

Now it is important that I am comparing iteration counts, not in/out of the set judgements here. This is a more conservative approach, and should reduce the chance of chopping off what seems to be an isolated section of the set (obviously it can't really be isolated). More on that in another post.

In a cleaner version of this sort of approach, though, this would also replace the quadtree decomposition, and you really are probably better off using bounded regions, i.e. [a,b)x[c,d) instead of what I've used here [a,b]x[c,d] ... but this way generated faster code.


So there you go, a fast-and-hacky version of something that can even overcome the formidible problems outlined in the previous post. My core utilization here is lower than the benchmark's c code, but I'm still running about twice as fast there, and once we go up to more reasonable iteration counts (i.e. > 1000) it's really no contest. In fact, this can do more than twice the iterations in the same time -- resulting in a vastly improved approximation of the set for the same computation time. And we're still easily in the short and simple program range. Above is about 150 lines or so, which could be shorter, and also a bit simpler without sacrificing much speed.

But is set approximation of 16000x16000 even at 1000000 iterations a good one? No, it isn't. But we're getting better. Next time I'll talk about what's (still) being done wrong, and why. And a little bit about approaches to do better.

Thursday, June 18, 2009

Blather about FPU optimization & threading with SBCL

Still riffing on this Mandelbrot set stuff (bear with me), I find myself in the strange position of having to demonstrate a decently performing version of an implementation for the benchmark I don't think is very good, to avoid claims of unfair comparisons. Ok, fair enough. I'll do this and use it as an excuse to describe a bit at least in an SBCL-specific way, what might be done. I'll have a few comments about what it reveals about the benchmark, too.

Last time I gave a perfectly straightforward implementation of the main loop:

(defun iterate-point (c max-iterations)
(loop for n below max-iterations
for z = (complex 0 0) then (+ (* z z) c)
when (> (abs z) 2) do (loop-finish)
finally (return n)))

This is the bit of code doing most of the work.

In order to get a reasonably fast version, we need to do two things. First is to give the compiler enough information to do what optimizing it can (in SBCL's case, this is quite a decent job). The second is to make an observation that your compiler almost certainly isn't clever/devious enough to see --- the fact that the complex product (* z z) and the computation of (abs z) share common sub computations. I'm also, just for the sake of the benchmark, going to stop returning the number of iterations and return 0 for not in the set, 1 if it's in the set.
Here it is:

(defun iterate/df (re0 im0 bound)
(declare (type double-float re0 im0) (type fixnum bound))
(do ((n 1 (1+ n)) ;the first iterate always takes z->c, so start there
(re re0) (im im0)
(re^2 (* re0 re0) (* re re))
(im^2 (* im0 im0) (* im im)))
((>= n bound) 1)
(declare (double-float re im re^2 im^2) (fixnum n))
(when (> (+ re^2 im^2) 4d0) (return 0))
(psetf re (+ re0 (- re^2 im^2))
im (+ im0 (let ((a (* re im))) (+ a a))))));faster than (* 2 (* re im))

Loop syntax is nice for some things but can make it hard to see exactly what order things are happening in, etc., so I often prefer do for things like this. Note that the loop is starting from n=1 because properly the first iteration is always 0. So I've declared types and unrolled the computation so the commonality of (* re re) and (* im im) is exposed and not recomputed. This shows what the computation boils down to. You need a counter, a bound for it, and six double floats (re,im, re^2, im^2, the constant 4.0 and the temporary a). This last one I've added just because I knew on my platform SBCL would probably generate better code for a product and sum, rather than (* 2 re im). It might not, on your implementation. SBCL could be smarter about things like this though, so it's often worth checking. At the beginning of the loop we have two integer-float products and two float-float adds to set up re0 and im0, which costs next to nothing.

This computation can live on the chip on most systems you might try this with, which is getting towards the holy grail of FPU speed. Anything we can get in-chip like this will be damn fast. To do the absolute best, we have to start looking at things with pipeline utilization and stalls, loop parallelization or unrolling (if we've got some registers free), tricky, tedious, and often machine specific stuff. But we've already come a long way, and what we have here is basically the simplest case you can possible hope for. And in "real world" numerical code, you rarely get it.

Someone pointed out that I hadn't originally talked about threading, and I commented that this stuff is trivially threadable too. This is because we can keep good memory localization etc. by putting a thread on computing one row, but keep the individual thread busy without any possibility of interacting, because they live on separate rows. It literally could not be any simpler to thread than this, we don't even need a mutex.

So lets do both; you've got the central loop, to thread it we just have to make a minor change. I'm also going to amortize the cost of packing the bits over the whole computation, and output the .pbm file to std out. This is a bit unnatural for lisp, but allows a side by side comparison as per the benchmark so there is no question we are measuring the same things. I found that SBCL specific std-output stream approach from the benchmark itself, as there was an SBCL entry (thanks, guys!)

(defun gcls/opt (width height &key (bound 51) (nthreads 8) (real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(declare (type fixnum width height bound) (double-float real-min real-max imag-min imag-max))
(let* ((array (make-array (list height (ceiling width 8)) :element-type '(unsigned-byte 8)))
(seq (make-array (array-total-size array) :element-type '(unsigned-byte 8) :displaced-to array))
(rows -1)
(real-step (/ (- real-max real-min) width)) ;; with shifted sampling as spec image
(imag-step (/ (- imag-max imag-min) height)))
(declare (type fixnum rows) (type double-float real-step imag-step))
(labels ((worker ()
(loop for i = (incf rows) while (< i height) ;incf atomic, no mutex needed
for im0 of-type double-float = (- imag-max (* i imag-step)) do
(loop with byte of-type fixnum = 0
with code of-type (unsigned-byte 8) = 0
with b of-type bit = 0
for bit-number of-type fixnum upfrom 0
for j below width
for re0 of-type double-float = (+ real-min (* j real-step)) do
(setf b (iterate/df re0 im0 bound))
(if (= bit-number 8)
(setf (aref array i byte) code code b byte (1+ byte) bit-number 0)
(setf code (logior (ash code 1) b)))
finally (unless (= bit-number 1) (setf (aref array i byte) code))))))
(mapc #'sb-thread:join-thread (loop repeat nthreads collecting (sb-thread:make-thread #'worker))))
(with-open-stream (stream (sb-sys:make-fd-stream (sb-sys:fd-stream-fd sb-sys:*stdout*)
:element-type '(unsigned-byte 8) :buffering :full :output t :input nil))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) stream)
(write-sequence seq stream))))

Again with appropriate type declarations (this is starting to look a bit like C code, how surprising). If you throw those two functions in a file with a "(declaim (inline iterate/df) (optimize speed (safety 0) (compilation-speed 0) (space 0) (debug 0)))" at the top, you'll have the exact code I'm timing. It's actually a little bit faster (and a shorter, obv.) if you expand the iteration inside the worker function and get rid of iterate/df entirely, but only slightly and that will obscure a point I want to make.

Now lets see where we are. I don't have any machines identical to the quad-core the benchmark uses so this won't be exactly the same, but I do have a bunch of 8 core Xeons available. I'm going to ignore the file writing here, both because it's not the key area but more because the Xeons are all on NFS storage only. Writing the file locally on my slower machine takes less than 1/10th of a second, which is in the noise over repeated runs so this won't shift anything much. [Note: in all cases here timings shown are typical over many runs, but I'm not averaging them and looking at the variance, it's not important to the discussion here as all of the timings were very stable. Of course to compare over a range of inputs or whatever we'd average a number of runs...]

-bash-3.2$ SBCL --eval '(load (compile-file "gclsopt.lisp"))' --eval '(save-lisp-and-die "inline.core" :purify t)'
-bash-3.2$ /usr/bin/time -v SBCL --core inline.core --noinform --eval "(gcls/opt 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 35.38
System time (seconds): 0.11
Percent of CPU this job got: 782%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.53
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 22260
Voluntary context switches: 57
Involuntary context switches: 2817

So we are getting a respectable 98% or so utilization, and a cumulative time of a little under 5s on a basically idle machine. Looks pretty decent.

It really is easy to underestimate how cheap 50 iterations of a small loop like the above is on modern hardware. If you can stay out of memory, and keep out of trouble in the pipelines and context switches, you're golden. Which isn't to say that my code couldn't be faster (it obviously could!) but if there is another factor of 2 or 4 to be had it is necessarily through keeping the pipes all busy and/or being clever about looping over multiple iterations at the same time, that sort of thing. So lets try a little experiment; I've made only one small change. The call to iterate/df is now not inlined within the main loop, so every pixel has a function call before the 50 iterations:

-bash-3.2$ /usr/bin/time -v SBCL --core notinline.core --noinform --eval "(gcls/opt 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 65.12
System time (seconds): 8.92
Percent of CPU this job got: 412%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.97
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 164098
Voluntary context switches: 522128
Involuntary context switches: 12716


Ouch! User time nearly doubled, our respectable 98% core utilization plummeted to below 50%, wall time is 4x slower. Page faults and context switches tell the story, I think. SBCL doesn't have particularly expensive function calls or anything but this all really adds up. Fundamentally, the FPU's are starving here. This has nothing to do with SBCL, you'll see similar affects if you reproduce it in other languages. The fact that it's a function call isn't so important, the fact that it's got to come out of the chip is.

Now accepting for the moment that a) the output set is a reasonable goal as stated and b) that the reason we went from 200x200 by 50 iterations up to 16000x16000 at 50 iterations was simply to have enough work to do, consider this: Why not go to 1600x1600 at 5000 iterations? Same nominal change in back-of-the-envelope terms, right? In fact, since the area of the set is something like 1.5 (out of the total of 4) we have to adjust that a little bit. So lets go 2000x2000 and perform the same comparison:

-bash-3.2$ /usr/bin/time -v sbcl --core inline.core --noinform --eval "(gcls/opt 2000 2000 :bound 5001)" --eval "(quit)" > /dev/null
User time (seconds): 41.81
System time (seconds): 0.11
Percent of CPU this job got: 787%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.32
Minor (reclaiming a frame) page faults: 14060
Voluntary context switches: 58
Involuntary context switches: 3305
-bash-3.2$ /usr/bin/time -v sbcl --core notinline.core --noinform --eval "(gcls/opt 2000 2000 :bound 5001)" --eval "(quit)" > /dev/null
User time (seconds): 42.04
System time (seconds): 0.13
Percent of CPU this job got: 784%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.37
Minor (reclaiming a frame) page faults: 19336
Voluntary context switches: 2178
Involuntary context switches: 3346

Of course results will vary with different hardware a bit but fundamentally the issue is the same. The function calls here act as a proxy for almost anything else you might want to do that will touch memory, etc. Once you can get the computation on the chip like this for such a small number of iterations, lots of other things may not look worth trying at all. Not that there isn't any hope of doing better (I'll discuss later, or next time) but it sure can look that way.

So even if you have no interest in my discussion of this particular benchmark, I hope this illustrates some issues to consider when threading numerical code. Of course, once we've reached the simplicity of the localized loop, there really isn't much new to say. If you know anything at all about fast numerical code of this type, you would have predicted something like this, and you already have a good idea of at least some of the languages that can do a decent job at it. On the other hand if you are inexperienced in the area, I don't think a good way to learn about it is inside a computation that is only nominally doing something useful --- but doing a lousy job of it. Particularly since this sort of extremely simple and localized loop is exactly the sort you usually can't get in practice, or have to fight quite a bit to get. This Mandelbrot computation gets there trivially, but it does it under false pretenses.

So this is in my opinion another real weakness of the benchmark. It could have had a much more interesting setup though, and another post will explore how you might go about making a decent approximation to the Mandelbrot set boundary at high resolution, and what the real challenges are, and why all of this is a lot more interesting than "if I survive fifty iterations of this simple loop, I'm in".

So how good in my above lisp code? Let's compare with the current champion of that benchmark. Again throwing away the files to avoid the network file system skewing things:

-bash-3.2$ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=native -D_ISOC9X_SOURCE -mfpmath=sse -msse2 -lm -lpthread mandelbrot.gcc-6.c -o mandelbrot.gcc-6.gcc_run
-bash-3.2$ /usr/bin/time -v ./mandelbrot.gcc-6.gcc_run 16000 > /dev/null
User time (seconds): 19.38
System time (seconds): 0.08
Percent of CPU this job got: 795%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.44
Minor (reclaiming a frame) page faults: 8173
Voluntary context switches: 16
Involuntary context switches: 1605

I'm just using the flags from the benchmark run, perhaps it could do with some tweaking. So this puts my lisp code at roughly 1.85x the fast c code, which is about where I expected to be (within a factor of 2). My utilization is lower than it could be, and I expect it could make up a bit of that with tweaking the code, maybe even reach 1.5 or so if you cared to put the time in. Targets here were 64 bit, gcc-4.3.0 and SBCL 1.0.29, running on 8 by 3Ghz E5450's.

So there you go, pretty straight forward approach to get reasonably fast lisp code to do a bad job of the Mandelbrot set. More importantly, I hope I've made clear why the specification makes the overwhelming priority of this benchmark getting your compiler to generate decent floating point code on that tight inner loop, which I think is fairly uninformative on a cross language comparison, even if it were doing something more useful. Perhaps you differ on that.

A comment about optimizing numerical code in lisp: This is fairly typical of the way I approach it. First a simple, clean and clear implementation I can understand and test for correctness. If it's too slow, after profiling some tweaked functions (like the iterate/df) to take care of the worst of it. Once I've settled on an approach, I'll sometimes use the approach like above of pulling the details into labels functions inside a main function. Especially if I've got problem specific tweaks that aren't very nice, but this also make the job easy for the compiler without too many declarations. The down side is if you are debugging at this point, because it may be hard (due to inlining) to see where the problem actually happened. So about 30 minutes of coding time and way more than that to write this up. Sigh.

Hopefully the optimization tweaks and/or SBCL specific threading stuff is useful to someone. I'll probably discuss a bit the issues of threading when it isn't this trivial, at some point. On the tweaking numerical code front, one perhaps not obvious thing you can do to this is to interleave two iterations. This often helps keep the pipelines full. In this case, we get about a 5% improvement using something like this:

(defun iterate/df/double (a-re0 a-im0 b-re0 b-im0 max-iter)
(declare (type double-float a-re0 a-im0 b-re0 b-im0) (fixnum max-iter))
(do ((n 1 (1+ n ))
(a-re a-re0) (a-im a-im0) (a-re^2 0d0) (a-im^2 0d0)
(b-re b-re0) (b-im b-im0) (b-re^2 0d0) (b-im^2 0d0)
(mask 0) (test 3 (logxor mask 3)))
((or (>= n max-iter) (zerop test)) test)
(declare (double-float a-re a-im a-re^2 a-im^2 b-re b-im b-re^2 b-im^2)
(fixnum n) (type (integer 0 3) mask))
(unless (zerop (logxor mask 2))
(setf a-re^2 (* a-re a-re) a-im^2 (* a-im a-im))
(when (> (+ a-re^2 a-im^2) 4d0) (setf mask (logior mask 2))))
(unless (zerop (logxor mask 1))
(setf b-re^2 (* b-re b-re) b-im^2 (* b-im b-im))
(when (> (+ b-re^2 b-im^2) 4d0) (setf mask (logior mask 1))))
(psetf a-re (+ a-re0 (- a-re^2 a-im^2)) a-im (+ a-im0 (let ((a (* a-re a-im))) (+ a a)))
b-re (+ b-re0 (- b-re^2 b-im^2)) b-im (+ b-im0 (let ((a (* b-re b-im))) (+ a a))))))

And the appropriate modifications to the "worker" function to operate on two consecutive points in a row, will give you something like:

-bash-3.2$ /usr/bin/time -v sbcl --core double.core --noinform --eval "(gcls/opt/double 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 33.81
System time (seconds): 0.11
Percent of CPU this job got: 782%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.33
Minor (reclaiming a frame) page faults: 22190
Voluntary context switches: 59
Involuntary context switches: 3005

So at this point we're sitting at 1.75 of the current champion c code. This is nearly a 30% improvement on the existing SBCL entry which gives times of about 6.05s on my machine:

-bash-3.2$ /usr/bin/time -v SBCL --core SBCL.core --noinform --eval "(main)" --eval "(quit)" 16000 8 > /dev/null
User time (seconds): 40.55
System time (seconds): 0.12
Percent of CPU this job got: 671%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.05
Minor (reclaiming a frame) page faults: 25138
Voluntary context switches: 66
Involuntary context switches: 2682

But it may do a bit better on fewer cores, as you can see utilization is a bit of a problem for it here at 84% or so. This just illustrates how much difference you may see in superficially similar code, if you are trying to keep a bunch of cores happy! I'll also note that I wrote my original code (this is a little tweaked relative to that) without looking at any of the implementations, but the narrow logic of the spec. led us to very similar implementations even line by line, which I suppose is also evidence for one of my complaints --- artificial narrowness isn't a good thing in these.

That's nearly the last I'll have to say directly about the "Great Computer Language Shootout" (and there was much rejoicing, I'm sure). Next time though, I'll show you how a different algorithm can walk all over this one, and how that relates to more interesting computations.

Monday, June 15, 2009

Once more, from the top (and a bit about sampling)

The benchmark goal is described as:
Each program should plot the Mandelbrot set [-1.5-i,0.5+i] on an N-by-N bitmap. Write output byte-by-byte in portable bitmap format.
cmp program output N = 200 with this 5KB output file to check your program is correct before contributing.

I'm going to assume everyone understands the basic properties of the Mandelbrot set. If not, MathWorld and/or Wikipedia will fill in the gaps. I'll stick to the straightforward (if a bit sloppy) notation used previously, a quadratic recurrence generates by repeated application.

It can be shown that the set lives in the closed disc around the origin of radius 2 and also that a point c is in the set if and only if for all iterates n. Now we can't test all n greater than 0, but we can certainly test the first few, say . This doesn't prove that a point is in the set, but it does prove if a point is not (also useful!) and as you increase the iterations you at least intuitively hope there is some connection between the number of iterations a point survives, and the probability it is in the set. So now we have something we can translate directly to code (and hope blogger doesn't mess up the indentation too much this time):

(defun iterate-point (c max-iterations)
(loop for n below max-iterations
for z = (complex 0 0) then (+ (* z z) c)
when (> (abs z) 2) do (loop-finish)
finally (return n)))

This code returns an integer in [1, max-iterations]. If the result is equal to max-iterations, the point is at least still a candidate for being in the set, whereas if it is lower than that, we know that the point has escaped and c is not in the Mandelbrot set. Let's agree to ignore the issue of floating point representation of real numbers, although it's relevant to computing this set membership.

Ok, so in order to render an image of the set, now all we need to do is form a lattice of points in the complex plane, and compute the result of the iteration. We'll call a point "in the set" if it survives max-iterations:

(defun mb-set (width height &key (max-iterations 51)
(real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(let ((array (make-array (list height width))))
(with-square-grid ((im i :steps height :first imag-max :last imag-min)
(re j :steps width :first real-min :last real-max))
(dotimes (i height array)
(dotimes (j width)
(setf (aref array i j)
(if (= (iterate-point (complex re im) max-iterations)
max-iterations)
1
0)))))))

Here I make use of a macro to clarify how the sampling (more on this below) is to be done. I know some people dislike this sort of capturing, but I think it helps in situations like the above. A simple version might look something like this:

(defmacro with-square-grid (((var index &key steps first last) &rest others) &body body)
(let ((var-step (gensym)))
`(let ((,var-step (/ (- ,last ,first) (1- ,steps))))
(symbol-macrolet ((,var (+ ,first (* ,index ,var-step))))
,@(if others
`((with-square-grid ,others ,@body))
body)))))

The result of mb-set is an an array of values, 0 for out of the set, 1 for in the set. For the benchmark, we output a raw PBM file; so we need to put these results into a packed representation of the rows and add a header. I'll do this in two steps. First pack the array into the right data format:

(defun pack-array/byte-aligned (array)
(let ((seq (make-array (ceiling (array-total-size array) 8) :element-type '(unsigned-byte 8))))
(dotimes (n (length seq) seq)
(loop for b from (* 8 n) below (+ (* 8 n) 8)
for code of-type (unsigned-byte 8) = (row-major-aref array b)
then (logior (ash code 1) (row-major-aref array b))
finally (setf (aref seq n) code)))))

and then write it out to a file, with the PBM header, using something like this:

(defun write-pbm (filename width height packed-array)
(with-open-file (s filename :direction :output :element-type '(unsigned-byte 8))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) s)
(write-sequence packed-array s)))

You'll note I've cheated a little bit and assumed that the width of the array is evenly divisible by 8, which matches the test case but isn't general (but suffices for now).

Here is the output of the above code for N=200.

And the validation image given on the benchmark page:


So they look roughly similar but look at the little "spike" leading off to the left. This is the real line (i.e. the imaginary component is equal to 0). But hang on a moment, the specification called for the range [-i,+i] on the imaginary axis, and N=200... but no even regular sampling of a symmetric range like this can contain the point zero. Looks to me like the test file was generated with a fencepost error, and the other implementations have presumably just followed suit.

If the above isn't obvious to you, consider a very coarse sampling: Suppose we want to sample [-1,1] in four places, equally spaced. There are two natural ways to do this, with different properties. We can do it including the points [-1,1]:

CL-USER> (setf x-min -1 x-max 1 dx (- x-max x-min) samples 4)
CL-USER> (loop for n below samples collecting (+ x-min (* n (/ dx (1- samples)))))
(-1 -1/3 1/3 1)

Note zero lies between the central two samples, both endpoints are present, and the steps are each 2/3. Alternatively, we can pick the points as the center points of an equipartition of [-1,1] into sub-intervals:

CL-USER> (loop for n below samples collecting (+ x-min (* (+ n 1/2) (/ dx samples))))
(-3/4 -1/4 1/4 3/4)

Here the step size is 1/2, not 2/3, and of course we still miss the point 0. In order to include the point 0, we'd have to use an odd number of samples.

The code generating the benchmark image does something like this, though:

CL-USER> (loop for n below samples collecting (+ x-min (* n (/ dx samples))))
(-1 -1/2 0 1/2)

So now you will (for any even number of samples) get the point 0 in your sampling, but your points are asymmetrically placed in the interval you've specified, and in fact always cut off the end.

If you change the above code to generate the points on the complex plane with this asymmetric approach, you'll get the identical result to the test image, so I suspect that this is what was done originally. You can interpret it as a simple fencepost error when dividing by samples, or as an attempt to do the second, more natural, approach and failing to shift the samples by 1/2 of the delta in each direction. Or, I suppose, as an unclear specification of the expected sampling.

This sort of thing is very common to run into, and can lead to weird errors/results because the amount it is shifted by depends on the number of samples, not generally a good property to have. Ok, hopefully that's not too tedious about the sampling, but it's important to get right (and often isn't) and it will come up again in other forms the next post or two.

In the above code, there has been not attempt at optimization, just to set up the problem an separate it into three parts. I realize the separation of the packing problem is somewhat artificial because while it is unavoidable, it is better to amortize this over the computations, particularly if threading.

Now after asking the compiler to optimize as best it can, consider the following timings:

CL-USER> (time (mb-set 1000 1000))
Evaluation took:
15.593140 seconds of total run time (14.772507 user, 0.820633 system)
[ Run times consist of 3.651 seconds GC time, and 11.943 seconds non-GC time. ]
39,699,456,062 processor cycles
8,436,938,592 bytes consed
#<(SIMPLE-ARRAY T (1000 1000)) {121AF447}>
CL-USER> (time (pack-array/byte-aligned *))
Evaluation took:
0.025524 seconds of total run time (0.025439 user, 0.000085 system)
64,063,813 processor cycles
125,008 bytes consed
#<(SIMPLE-ARRAY (UNSIGNED-BYTE 8) (125000)) {12479007}>
CL-USER> (time (write-pbm "test1000.pbm" 1000 1000 *))
Evaluation took:
0.002482 seconds of total run time (0.000926 user, 0.001556 system)
6,306,962 processor cycles
7,552 bytes consed

This is the first thing I did for the original post. So what have we learned? First off, and no surprise, this implementation has pretty poor performance compared to what can be done (but it hasn't been tuned at all). Packing the results into an array for the PBM file is orders of magnitude cheaper than computing the values. Writing out the file another order cheaper; these things practically free in this context.

However, we know the iteration code is suboptimal. Can we ignore the cost of packing the array? Keep in mind this is work we have to do somewhere, even if it isn't as a separate step. The above code doesn't tell us much, really, as the array access indirection will kill it. Taking a similar version (not shown) of pack-array that has been notated for types (very much faster) and looking at an array the same size as the benchmark.

CL-USER> (defparameter *array* (make-array (list 16000 16000) :element-type 'fixnum))
CL-USER> (dotimes (n (array-total-size *array*)) (setf (row-major-aref *array* n) (random 2)))
CL-USER> (time (pack-array/byte-aligned/fast *array*))
Evaluation took:
1.518081 seconds of total run time (1.486409 user, 0.031672 system)
[ Run times consist of 0.013 seconds GC time, and 1.506 seconds non-GC time. ]
3,818,926,037 processor cycles
32,000,576 bytes consed
#<(SIMPLE-ARRAY (UNSIGNED-BYTE 8) (32000000)) {4FD44007}>

Ok, we are into territory where we could quibble about details, and I certainly could have optimized this a bit more. I'm assuming a factor of two is there for the asking. Also waving our hands wildly about the rough equivalence of machines if we want to compare this (run on a core2 duo 2.5ghz) with the posted benchmark results .... however, just from an order of magnitude point of view it should be clear that, as was my original contention, this second order of cost might come into play. It's obviously going to be an issue if your language implementation can't pack bits like this effectively. However, if we're talking about threaded run times on the order of 10 seconds, it may well be a factor there as well, and we have to keep an eye on it. So while this is hardly a complete analysis it at least raises the specter of a benchmark that confounds two quite different things, breaking my third characteristic.

So there we go, I've laid out a rather naive solution of the benchmark, and already we've run into trouble with respect to my criteria. The specification of the problem seems to be buggy, and the final goal isn't as concrete as it could be if the expected final output was actually specified.

Now I realize I've opened myself up a bit to a complaint of "apples and oranges" comparisons, so next post I'll try and put things on a firmer footing.

We've yet to talk further about why I believe this benchmark is problematic even if you accept the output as meaningful, and quite separably about why I think the output is fairly pointless, and how that undermines it as a benchmark. So this will be two posts, at least.

Somewhere in all that I'll also talk a bit about what is interesting about the set, and computations like this, and some ideas about how to do a better job (of computing the set, not benchmarking). I'm pretty busy at the moment but I'll try and get them out without too much delay. I'm sure my vast set of reader will be holding your breath.

I may well regret this....

The other day I quickly tossed off a few comments about a benchmark: Lies, Damn Lies, and Benchmarks

I've had a little pushback in comments and email which makes me wish I had taken the time to be clearer. I realized that maybe I could take a post or two and expand on it, hopefully include some stuff more interesting than mere chasing of benchmark numbers, so I'm going to have another go. I'll try not to be too boring about it all, and tie in some commentary on lisp, optimization, and the Mandelbrot set (which is a much more interesting object than many people realize, even many people who have played with and/or written generators).

To start off with, a quick note; Greg Buchholz (who I believe came up with it in the first place) comments on the previous post:
When we created the Mandelbrot benchmark in December of 2004, the machines that were used were much slower ... the image limits were something like 200x200, and so an iteration limit of 50 seemed reasonable... Another interesting thing to note is that I seem to recall that with this benchmark we smoked out compiler bugs in O'Caml and MLton. FWIW.

This sounds plausible, and matches my earlier surmise that the image size had just grown to match faster processors and threading, without much further thought. Greg's latter point is a good one. I think things at least similar to this (the size is a bit overkill) make pretty good regression tests, and even "smoke tests", and it doesn't surprise me at all that you could find compiler issues with this. However, what is interesting to compiler hackers doesn't necessarily make much of a generally useful benchmark... and I'll try and make my reservations about this one in particular more clear.

I'll also note that from a numerical analysis point of view, 50 iterations of the maps on a 200x200 grid looks like something that might be a reasonable coarse approximation; 50 iterations on a 16000x16000 grid just looks perverse. Somewhere in my previous post I noted that you could make an argument that this doesn't really matter. Quite contrary to the misreading that this and admission that it doesn't matter, I think that while the argument could be made it's a bad argument....and I'll try to explain why later.

Before I do that though, I'll note that I'm not really trying to pick on "The Great Computer Language Shootout". They have a pretty reasonable statement about the inherent flaws in any such undertaking, and of course it can be fun to try, anyway. I'm not even really picking on this one benchmark so much as using it as an example. There are plenty of lousy benchmarks out there, after all.

I think what got a (bit of a knee jerk) reaction out of me on this particular one when it fell in my lap was the way that it trivializes something really very interesting, while focusing elsewhere (somewhere I'd argue is not very informative, which isn't to say completely uninteresting.)

Now full disclosure: I don't think benchmarks are all that informative as a way of comparing languages. I'm also not particularly a language booster. This is a lisp-ish blog because I occasionally want to throw bits and pieces up here about a language that is not so well known, and one that I find nearly optimal (or at least, the best of a bad lot if you'd rather) for certain types of research code. At least, that's the theory behind this blog --- obviously I mostly just don't post. So while I'm going to use lisp to illustrate things, I'm not trying to argue it's superiority or anything, and I'm not interested in these posts in why language X might have done something better/faster/more elegantly.

So, given that, if there is any value in a benchmark of this sort, what is it? If you can't learn anything new from it, there's not much point. To this end, I can think of several useful characteristics.

  1. It should be well specified, with a concrete goal.
  2. The task should be broad enough to allow for different approaches, yet narrow enough that it is solvable and implementations manageable (and understandable!)
  3. If it confounds two or more issues, it is less useful. Avoiding this as much as possible is a good idea.
  4. If it only tells you what is already known, it's not very useful (but very common).
  5. If it boils down to a particular algorithm, that should be the stated goal.
  6. It shouldn't reduce to toy code --- realistic and reasonably common coding tasks are far more interesting for this than that which is pointless, obscure, or only esoterically useful.
I'm sure there are other good criteria, but I'd argue (and will, at least partially) that the particular benchmark in question doesn't do so well on any of these points.

In response to Isaac in comments, I'll note that I never meant suggest that nobody could learn anything from even a flawed benchmark, somebody can probably learn something from almost any bit of code. This is especially true if you are naive or inexperienced in the particular area. However, it's equally true that you could learn the same things from a better designed task, with less risk of picking up bad ideas/approaches along the way. And just because you didn't know something doesn't mean it isn't well known.

Ok, on to the benchmark, next post, including a (very) little coding content.

Monday, June 8, 2009

Lies, Damn Lies, and Benchmarks

[apologies for the length, don't have time to tighten it up and edit]

So I don't have much time for lisping these days (more's the pity) but I try and keep up with articles going by on planet lisp, etc. Recently there's been a surprising mount of discussion of Mandelbrot sets, which got me to following links.

Nikodemus Siivola discussed it in terms of a post giving a discussion (a bit muddle-headed, in my opinion) about abstraction, as reddit chimed in. Alex Fukunaga has been discussing it in the context of MPI parallelization. Once you chase links, the associated discussion often leads towards a well known benchmark ... so I had a little poke around. And now I'm twiddling my thumbs waiting for something to build, so I might as well write up my thoughts here.

The benchmark is pretty terrible. To be fair, I don't think all that much of benchmarks outside of pretty narrow constraints anyways, but this one is quite egregiously bad. Let's have a look at why.


Mandelbrot sets. First off, a little background on what a Mandelbrot set actually is. There are a few (equivalent) definitions, but the most straightforward one is probably this:

For any point in the complex plane, we can define the quadratic recurrence equation: , with . The Mandelbrot set is the set of all points for which this recurrence does not tend to infinity.

That's it. Nothing complicated, and if we only had a computer capable of computing a continuum amount of infinite sequences, we'd be able to generate the whole thing.

Lacking such a magical device, we have to make estimates. Typically you might do something like this; generate an "image" of a number of regular sample points (i.e. a lattice) on the complex plain. At each point you decide if you think the point is in the set or not, and color it black or white accordingly.

As implemented in the "great computer language shoot out" the core part of the computation is a tight loop iterating over the Mandelbrot recurrence, and testing if it has a magnitude larger than 2. It can be shown that points that venture this far from the origin will wander off to infinity, so that's a good test that a point isn't in the set, but less helpful for ones you think might be in the set (maybe they just haven't wandered that far yet). This process is repeated 50 times (why 50? We'll come back to that). Now if you unroll this computation all you need is about half a dozen floats and an integer to keep track of your iteration number. This is the sort of thing that even a fairly register-poor system can do on chip, with a little luck. So as you'd expect, anything generating decent assembly for this test can manage to make the inner loop pretty damn quick.

Which I surmise is why the benchmarks designers decided to make the image really big, make these tight loops do enough work to measure something. The lattice they've decided on is 16000x16000: a fairly big image by most standards. It is, however, only a binary image so you can pack it into a little over 30Mb (without compression) so it isn't too unwieldy. Even coloring it only 8bit grayscale would take more like 250Mb, so 750Mb for 24bpp rgb, and we're starting to talk about real memory usage. So fair enough to stick with 1bpp, although as a strictly second order effect, this means your benchmark is going to be hampered by any language that can't pack bits into bytes relatively quickly and stream them to disk. Someone can always opt to save the file in an unpacked pbm format, but it would be on the order of 500Mb of disk...

Anyway, that's not particularly relevant, but it does mean the benchmark breaks down into essentially two parts: First, a very tight floating point computation loop, and second packing bits into a buffer somewhere and dumping them onto disk.

That's really all there is to it. The benchmark could have been essentially replace with a nested loop. Consider this:

(loop repeat (expt 16000 2)
do (loop repeat 50
for z = (complex 0d0 0d0)
then (+ (* z z) (complex 1d0 1d0))))

(write-sequence
(make-array (ceiling (expt 16000 2) 8)
:element-type (unsigned-byte 8))
*standard-output*)

This snippit captures the essence of the benchmark a little bit conservatively (assuming a sufficiently stupid compiler that would actually perform the loops).

So if this is essentially what the benchmark boils down to, are we interested? Will it come as a surprise to anyone that assembly and near-assembly languages can make quick work of it? Or that some others have trouble generating really optimal code? In the latter case, should we care?

I maintain it's pretty much uninteresting. Particularly for the case of tight loops like this, if you really, really need to do it, you'll tweak some c or bash some assembly around if it needs optimal efficiency. I don't think this is news to anyone, and I don't think it tells you anything useful about the languages in question.

For example, the fastest c implementation is about 2x the speed of a more straightforward c implementation (which is much closer to the sbcl effort, unsurprisingly). It's also hard to follow, relies on type munging through pointers and, oh yeah, will just bus error on my mac. Not the sort of code you really ever want to maintain or port, if you can reasonably avoid it. It has it's place, sure, it just doesn't have anything to do with 99.99bunchof9s% of the time people are writing code.

I suspect that if the benchmark was just the above (suitably tweaked) nobody would think much of it. But the issue is confused because there is the impression of doing something "useful", namely "computing the Mandelbrot set". So how well does it do that?

Doing a bad job really quickly:

Here is the output of the current speed champion, a c implementation. I've taken the liberty of downsizing it just a little bit, so your browsers don't choke on a 16000x16000 image:



Particularly if you've never looked at this set much, that probably looks pretty good. You can zoom int all sorts of detail in a 16000x16000 image, too. But note the little red square. Below is a 1:1 resolution zoom of that section:


Looks neat, right? Here's roughly what it should look like though:


And to make the comparison more obvious, here in color. Green is in the set in both, blue is out of the set in both, and red shows the error.


Now my second image isn't perfect either. It was made using 10000000 iterations, rather than 50. The one from the benchmark code makes a complete hash of all of the interesting part of the set!

Pretty weak sauce then, if that's the only thing that makes this benchmark interesting. You can argue that being accurate isn't the point, but then again, there's lots of fast ways to make a neat looking large image.

The point is, by making a very large image, the benchmark exposes the weakness in these estimates. Far better to have a much smaller image, with a much higher standard for set membership. The more particular you are about that, the more algorithmically interesting the approaches get, and the less you are going to rely on boiling down to a few machine instructions, repeated really fast.

As it is, the benchmark might as well have been designed for c. As Nikodemus notes, common lisp can do fairly well on that sort of benchmark, merely because you can write c-like code pretty effectively in common lisp.

Above I said we'd get back to the 50 iterations thing. The problem with this is that you are trying to predict the behaviour of the tail of a sequence from the behaviour of it's head. In this particular context, 50 iterates at the head just isn't that many, and the measurement "did this get bigger than 2 in the first 50 iterates" isn't even using the available information particularly well. As far as I can tell, the reason this number is used in the benchmark is because it is the smallest number of iterates allowing you to correctly generate the test image of 200x200 lattice sites (this is much easier at low resolution). I'll leave it to you to decide the value you place on that empirical measurement.


[Edit: A confused commenter has convinced me it was a big mistake to include any discussion of the lisp code that generated the second image. I only did this because nominally that's the purpose of this blog, but it was a bad idea.

I may take all of this up a bit in a later post... but I'm cutting it here now because it detracts from the actual point. ]

Sunday, October 12, 2008

half-assed chord diagrams revisited

After my first post on these I had several requests for the code. As I mentioned it was pretty rough I hadn't posted it at first. There are a large number of things that should be done to this before it can be more generally useful. However, I realized I really don't have the time right now to write this properly, so it will probably be a while.

I decided to post it as is. At least until I was derailed by the discovery that for some reason my cl-pdf based code is generating .pdf files that read fine some places, and throw errors elsewhere. So on thing than apparently needs to be done is to figure out what's going on with that. I gave up the idea of a second post, not wanting to propagate the error.

Until I remembered that Zach mentioning to me that his Vecto package has a very similar API. Ten minutes later I had it working, including time to download and install all the packages Vecto needed. Nice!

So this makes a perfect excuse to also point out Zach's useful package. Although I really do want vector output for my original problem, this way is handy for generating images for a web page, like this one:



This shows a few A minor voicings and inversions, as well as showing how the quick cheap and nasty typesetting approach of my code manages. On the lower row, I've put two bounding boxes to show how it fits thing into the bounds of the supplied arguments, adjusting to the number of frets specified or needed.

Anyway, this version is slightly improved from the original. It's still a messy hack, and really needs to be broken into parts before extending it to do more useful things. The logic for root and non root scale degrees is already pretty ugly; things have to be abstracted before doing more along that line. But since people asked, here it is:

(defun draw-chord (xx yy ww hh frets degrees name font  &key (line-width 2) (min-frets 5) (num-strings 6) 
(tuning '(7 0 5 10 2 7))); tuning in semitones modulo 12, from A = 0 to G# = 11
"draws a chord diagram in the bounding box [xx,yy,xx+ww,yy+hh] with frets ands scale degrees in list"
(declare (optimize debug ))
(let* ((only-fretted-strings (remove-if #'null frets))
(min-fret (max 0 (1- (reduce #'min only-fretted-strings))))
(max-fret (reduce #'max only-fretted-strings))
(num-frets (max min-frets (- max-fret min-fret -1)))
;; adjust everything to fit in bounding box
(width (* ww (/ (1- num-strings) (+ num-strings 2 -1))))
(height (* hh (/ (1- num-frets) (+ num-frets 3 -1))))
(fret-width (/ width (- num-strings 1))) ; most things works
(fret-height (/ height (1- num-frets))) ; in units of these
(x0 (+ xx fret-width))
(y0 (+ yy fret-height))
(font-size (/ fret-width 2)))
(loop repeat (floor (/ (- num-frets (- max-fret min-fret -1)) 2))
never (zerop min-fret) do (decf min-fret)) ;; center figure in the allotted frets vertically
(vecto:set-line-width line-width)
(vecto:set-font font font-size)
(when (zerop min-fret) ; open position, so draw heavy line at top
(vecto:set-line-width (* 2 line-width))
(vecto:move-to (- x0 (/ line-width 2)) (+ y0 height (* 3 line-width)))
(vecto:line-to (+ x0 width (/ line-width 2)) (+ y0 height (* 3 line-width)))
(vecto:stroke)
(vecto:set-line-width line-width))
(vecto:draw-centered-string ; always label the top fret to left of diagram
(- x0 (* 3/4 fret-width)) ; offset out of the way of fingering
(+ y0 height (- (/ fret-height 2)) ;center in the fret
(- (/ font-size 2)))
(format nil "~D" (1+ min-fret))); font font-size)
(loop ;draw fret lines
for y from y0 upto (+ y0 height) by fret-height do
(vecto:move-to (- x0 (/ line-width 2)) y)
(vecto:line-to (+ x0 width (/ line-width 2)) y)
(vecto:stroke))
(loop ;draw string lines
for x from x0 upto (+ x0 width) by fret-width do
(vecto:move-to x (- y0 (* 2 line-width)))
(vecto:line-to x (+ y0 height (* 2 line-width)))
(vecto:stroke))
(loop for string from 0 below num-strings
for fret in frets ;; list of fret number, nil means not played
for degree in degrees ;; list of scale degrees from 1 (root) up. nil when not played
for x = (+ x0 (* string fret-width)) ;x offset on lattice
for y = (when fret (+ y0 height (/ fret-height 2) (- (* (- fret min-fret) fret-height)))) ;y offset on lattice
if (null fret) do ;this string not played, put an "x" above it
(vecto:draw-centered-string x (+ y0 height (* 5 line-width)) "X")
else do ; it's played so label the note
(vecto:draw-centered-string x (- y0 font-size line-width)
(index->string (+ (elt tuning string) fret)))
(unless (zerop fret) ;unless it's an open string, put a marker on
(if (eq degree 1) ; a root gets a square
(vecto:rectangle (- x (/ fret-height 4)) (- y (/ fret-height 4))
(/ fret-height 2) (/ fret-height 2))
(vecto:centered-circle-path x y (/ fret-height 4))) ;everything else gets circles
(vecto:fill-path)))
(vecto:set-font font (* 1.5 font-size)) ;name of font on top with larger font
(vecto:draw-centered-string
(+ x0 (/ width 2))
(+ y0 height (* 5 line-width) font-size)
name)))


So there you go. Don't blame me if it's brittle, it was a quick hack! Doing this sort of thing as a single function isn't such a great idea, and it's also more mechanistic and loop-y than I'd like.

I will probably revisit this problem when I've got a bit of time, and make something useful out of it. It surprised me that a few hours messing about one afternoon could make much better output than most of what I was seeing on-line, but it's still got a ways to go. Oh, you'll need a little function also, to map indices to note names.

(defun index->string (index)
(elt #("A" "A#" "B" "C" "C#" "D" "D#" "E" "F" "F#" "G" "G#") (mod index 12)))

Obviously these needs to be sorted out properly too.

Oh yeah, you need some vecto scaffold to actually generate a .png too, something like:

(defun test1 (filename)
(vecto:with-canvas (:width 600 :height 700)
(draw-chord 0 0 600 700 '(nil 0 2 2 1 0) '(nil 1 3b 5 1 5) "Am" (vecto:get-font "./MyriadWebPro.ttf"))
(vecto:save-png filename)))