Suppose you ask your friend what time it is, and they say "Right *now*, it's 12:43 and 32 seconds, 323 microseconds .... er, plus or minus about twenty minutes. You'd think they were a bit wonky, wouldn't you?
This is roughly the same reaction I have to someone describing the boundary of the Mandelbrot set on a lattice of 16000x16000 positions, with an iteration count of 50. The spatial resolution is very high, but the accuracy is very low. In fact it's only one of two (what I would consider fatal) deep flaws with this approach, but the other one will have to wait until next time.
Now one (overly) simple way to improve the estimate is to increase the iteration count. So, for example, what if we use 50 million iterations instead of 50? Back of the envelope computations tell you that (knowing that the set is about 3/8 of the total area) the fastest benchmarked code takes 6s to do 50 iterations on this grid, so it will take about 26 days to do 50 million. I think I originally used 10 million as my proxy for "lots and lots", so that's still a bit over 5 days....
Of course there is no real analysis there (clearly more than 50 is better, but is 50 million actually appreciably better than 10 million? Better than 10,000? It's actually deeper than that, but I'll discuss this more later). But that wasn't the point. The point was that if you have scale things up to get more accuracy (which you will), you'll quickly run into long runtimes. This is entirely consistent with many "real world" computational situations. Far more so that the ones that are actually cheap. For that matter, 5 days is often fine. It's when your first back-of-the-envelope computation tells you need 5 centuries that you really have to go back to the drawing board.
Bear in mind, too, that I originally had only a single core to use, so things were much, much worse. Even just looking at a small piece of the lattice was going to take hours and hours, time I didn't want to wait for.
So what to do? Giving up on the tactic of just computing as fast as we can (i.e. "dumb, but fast"), we have to step back and thing about what's happening here. You have a lattice where roughly 1/2 the area is cheap to compute, roughly 1/2 very expensive (at these high iteration counts). This sort of thing cries out for a new approach. Examples: some way to avoid blindly iterating (e.g. period detection, convergence tests), some way to do successive refinement and hence avoid spending time where it isn't needed. Some way to avoid doing some computations at all.
All of these techniques could work well, but as I was in a hurry I looked to the latter one. I remembered another theorem about the Mandelbrot set. Recall that the original iteration-based computation is all based on a theorem that if an orbit ever leaves the radius 2 disk about the origin, it will diverge. Well how about this one: Doudy and Hubbard showed in the original modern work on the set (mid 1980s) that the set is actually (simply) connected, something not at all obvious. So one practical implication is that if you draw an unbroken loop anywhere on the complex plane, if all of the points on that contour are in the set, then its interior is well, and the same for points outside the set, nearly. There are no isolated points in the set. You have to be a little bit careful here, because while the set itself is simply connected, the points outside the set includes all of the complex plane outside the disc of radius 2 at the origin. You don't want to draw a line around the entire set and conclude it is empty! (i.e the exterior is connected, the interior simply connected, the boundary, on the other hand, is weird) [update I lost part of this about the distinction originally, thanks to anon for catching it]
This immediately leads to an algorithm for spatial subdivision. Take the region you are interested and look at its boundary. If all points are in (out) of the set, record the interior points as in (out) of the set. If not, subdivide the region into a partition (e.g. four subregions) and apply the same approach to each subregion. Continue recursively until all points are covered.
Now of course in a discrete setting you stop at the point of one pixel, or in practise larger than that. Beyond that though, there is a as source of error. Because we aren't drawing continuous lines, the sampling may miss a small piece of the set, and thus erroneously conclude that the interior is solidly inside (outside) the set. There is a very practical way to mitigate this problem, but not remove entirely as I'll demonstrate.
Spatial subdivision FTWIt's pretty clear why this algorithm will win on computations if it encounters a subregion that is only withing the set. Rather than having to compute all of the interior points, it computes only the perimeter. The bigger this region is, the more you win, as the perimeter grows linearly while the area grows quadratically. The obvious way to keep a bunch of threads busy on this is to have them test a region, and if it fails, put the subregions on a stack somewhere. Then each thread just goes to the "job stack" for a new region when it needs new work to do.
However, there is still the overhead of specifying regions, and farming out these jobs to threads. The former can involve a lot of memory too. Think about our 16000x16000 case, we can subdivided it into 4 about 12 times, and each time you take 4 numbers specifying the regions, and replace it with 16 specifying 4 subregions. Take 64bit fixnums and unless I've screwed up that's about 64 Mb of potential state to be sloshed around just specifying where you are in the lattice, or about what twice the state being used to store the set values themselves (obviously this could be improved a bit, but that's a back-of-the-envelope bound on it)
Of course this isn't a terrible lot of state, but look what happened last time when I only added a function call and made it 4x slower! So while I knew that this approach would be much faster at large iterations, I assumed the overhead of managing regions would mean that with a tiny number of iterations like 50, "dumb but fast" would probably still win.
While writing up the previous post though, I had a thought. All I really had to do was partition the set enough to spread cheap and expensive subregions around enough to keep all the threads busy. Once in a region, the thread could just beaver away there by itself. As long as you don't interact spatially with other threads, you don't need any locking either. So this way I could keep all the cores busy but pay very little cost in organization. It's not as good at spreading the work around, but it will be very cheap.
So I tried this approach, and it's fast. Quite a bit faster than the current champion, actually, even at the 50 iteration case. Once you get into bigger iterations though, it really mops the floor with the "dumb but fast" approach. Here are the gcc (modified only to vary the number of iterations) case and two partitioning implementations:

This plot shows the expense (NB: in user cpu time) as you increase the number of iterations spent at each site. These are 16000x16000 lattices, loglog plot of seconds against the maximum number of iterations. So you can see that while the gcc code starts off at roughly 1/2 the cost of the lisp code, by the time you hit 1000 iterations the gap has widened to an order of magnitude, and stays there. I only ran this out to a million iterations, but that gives the final user time for the gcc code about 20x the user time of the gcls partition lisp code, and 40x the faster qtree2 (more on that to come) and wall clocks of about 12 hours, 40 minutes, 20 minutes, respectively. Wall clocks being a little skewed by other work on the machine over these longish periods. Of course my lisp code implementing a similar algorithm to the c effort would suffer the same fate, it would just be even a bit (roughly 1.75x) slower. This isn't about implementations, after all. Reporting in user time done to avoid problems with other jobs on these machines.
How fast is it on the benchmark case? Let's compare again with the c implementation, same machine and setup as last time, same declaim statement as previously. Again these are typical runs, and again we throw away the output to avoid NFS timing issues.
-bash-3.2$ /usr/bin/time -v ./mandelbrot.gcc-6.gcc_run 16000 > /dev/null
User time (seconds): 19.26
System time (seconds): 0.08
Percent of CPU this job got: 795%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.43
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8266
Voluntary context switches: 22
Involuntary context switches: 1001
-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.27
System time (seconds): 1.08
Percent of CPU this job got: 650%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.44
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 271688
Voluntary context switches: 206
Involuntary context switches: 488
So you can see that the lisp code does still have a bit of a core utilization problem, only getting about 88% of the 8 cores vs. the c codes 99%. And we have orders of magnitude more page faults. But it isn't enough to catch up to the better algorithm, the c runtime is about 1.7x the lisp runtime. Actually we can do better than that. The system time is heavy because of the big cache of uint32s that we don't actually need if we're only counting to fifty. Make those chars (so max iterations is now 255) and you get:
-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.09
System time (seconds): 0.42
Percent of CPU this job got: 657%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.29
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 101670
Voluntary context switches: 105
Involuntary context switches: 449
So were getting pretty close to a 2x multiple of the c code. We could improve things a bit more, especially since the system time is still 1/3 of wall time. I also bet there are tricks there I've missed. But I think the points been made; I don't really care on the low iteration count, though it is a bit amusing that it actually wins there as well. But as we move forward a bit trying to generate a good approximation to the set, this stuff will be useful.
Before anyone complains, I did mention the possible source of error, right? Now while the output for the 200x200 at 50 iterations case is identical (which means, unless I've misunderstood, this code would be a valid entry, not that it's important), with 16000x16000 at 50 iterations and this partitioning scheme there are 3 pixels out of 256000000, or about one in one hundred million. To put this in perspective, changing the number of iterations by one, either to 49 or 51, will change on the order of 100,000 pixel values. Note also that this doesn't mean that either algorithm is right about those 6 pixels, only that they differ on a poor estimation of them. The divergence is bigger, and more interesting, with better estimates, but even at 1000000 iterations they only differ by about 100 pixels at this resolution (note results can depend a bit on partitioning, obviously). This error is very small compared to other errors being made here (and still an order of magnitude smaller than the difference between 100000 iterations and 1000000).
So does it do exactly the same thing in less time? No, it doesn't. But it does a comparable job in much less time, which in effect means it can do a much better job in the same time. Which, after all is far more often what you're actually interested in this sort of work.
Of course, the down side is that even at 10,000,000 iterations the output of either method is still kind of crap. Next time, I'll explain why. It may be a while though, since I'm traveling soon.
Anyway, here is some code. "gcls partitioning" is a fairly clean implementation of the algorithm discussed above. "qtree2" is tweaked for a gain of about 2x. Since they are very similar, I'll only show the latter with all its warts. Bear in mind a lot has been sacrificed for speed in this case, and it hurts readability and generalizability. I'm not going to clean it up, sorry.
The first thing is the function count-iterations. I won't include it here because it is identical to iterate/df except that it returns the iteration count, [b]not[/b] zero or one. Make sure it is inlined.
(defun gcls/opt/qtree2 (width height &key (max-iters 51) (nthreads 8) (partitions 10) (min-size 4)
(real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(declare (type fixnum width height max-iters min-size) (double-float real-min real-max imag-min imag-max))
(let* ((array (make-array (list height (ceiling width 8)) :element-type '(unsigned-byte 8)))
(done (make-array (list height width) :element-type 'bit :initial-element 0))
(cache (make-array (list height width) :element-type '(unsigned-byte 32) :initial-element 0))
(seq (make-array (array-total-size array) :element-type '(unsigned-byte 8) :displaced-to array))
(regions)
(region-mutex (sb-thread:make-mutex))
(real-step (/ (- real-max real-min) width)) ;;with fencepost error in spec
(imag-step (/ (- imag-max imag-min) height)))
(declare (type double-float real-step imag-step))
(labels ((set-bit (i j v)
(multiple-value-bind (byte bit) (truncate j 8)
(declare (fixnum byte bit))
(if (zerop v)
(setf (aref array i byte) (logandc2 (aref array i byte) (ash 128 (- bit))))
(setf (aref array i byte) (logior (aref array i byte) (ash 128 (- bit)))))
(setf (aref done i j) 1)))
(result-at (i j)
(when (zerop (aref done i j))
(iterate i j))
(multiple-value-bind (byte bit) (truncate j 8)
(if (logand (aref array i byte) (ash 128 (- bit))) 1 0)))
(iterate (i j)
(if (zerop (aref done i j))
(let ((res (count-iterations (+ real-min (* j real-step)) (- imag-max (* i imag-step)) max-iters)))
(declare (type (unsigned-byte 8) res))
(setf (aref cache i j) res)
(set-bit i j (if (= res max-iters) 1 0))
res)
(aref cache i j)))
(compute-region (i-min i-max j-min j-max)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (result-at i j))))
(fill-region (i-min i-max j-min j-max val)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (set-bit i j val))))
(quadtree (i-min i-max j-min j-max)
(declare (type fixnum i-min i-max j-min j-max))
(if (or (<= (- i-max i-min) min-size) (<= (- j-max j-min) min-size))
(compute-region i-min i-max j-min j-max)
(let ((proto (iterate i-min j-min)))
(if (and (loop for i from i-min upto i-max always (= proto (iterate i j-min)))
(loop for i from i-min upto i-max always (= proto (iterate i j-max)))
(loop for j from j-min upto j-max always (= proto (iterate i-min j)))
(loop for j from j-min upto j-max always (= proto (iterate i-max j))))
(fill-region i-min i-max j-min j-max (if (= proto max-iters) 1 0))
(let ((i-mid (+ i-min (truncate (- i-max i-min) 2)))
(j-mid (+ j-min (truncate (- j-max j-min) 2))))
(quadtree i-min i-mid j-min j-mid)
(quadtree i-min i-mid j-mid j-max)
(quadtree i-mid i-max j-min j-mid)
(quadtree i-mid i-max j-mid j-max))))))
(worker ()
(loop for p = (sb-thread:with-mutex (region-mutex) (pop regions)) until (null p)
do (apply #'quadtree p))))
(setf regions (part2d 0 height 0 width partitions))
(mapc #'sb-thread:join-thread (loop repeat nthreads collecting (sb-thread:make-thread #'worker)))
(with-open-stream (stream (sb-sys:make-fd-stream (sb-sys:fd-stream-fd sb-sys:*stdout*)
:element-type '(unsigned-byte 8) :buffering :full :output t :input nil))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) stream)
(write-sequence seq stream)))))
So you can see in addition to the spatial partitioning, I'm remembering if I've already computed a value and not recomputing if I can avoid it. The whole thing is pretty hacky, and I'm not going to go over all the decisions that went into tweaking this. I iteratively improved three times, getting about a factor of 2 each time, but just looking at the code I realize it may not be obvious why things were done this way. Not also the highest possible iteration count here is 2^32 -1, but that's probably enough!
The other thing missing is part2d, a function that takes a region (a b c d) and breaks it into subregions. Here's an implementation:
(defun part2d (i-min i-max j-min j-max steps)
(declare (type fixnum i-min i-max j-min j-max steps))
(labels ((partition (a b steps)
(declare (fixnum a b st eps))
(loop for n of-type fixnum downfrom (1+ steps) above 0
for i of-type fixnum = a then (+ i (truncate (- b i) n))
collecting i)))
(loop repeat steps
for a on (partition i-min i-max steps) nconc
(loop repeat steps
for b on (partition j-min j-max steps)
collect (list (car a) (cadr a) (car b) (cadr b))))))
Obviously in this case the speed of the function isn't really important as it only is used in setup.
Now it is important that I am comparing iteration counts, not in/out of the set judgements here. This is a more conservative approach, and should reduce the chance of chopping off what seems to be an isolated section of the set (obviously it can't really be isolated). More on that in another post.
In a cleaner version of this sort of approach, though, this would also replace the quadtree decomposition, and you really are probably better off using bounded regions, i.e. [a,b)x[c,d) instead of what I've used here [a,b]x[c,d] ... but this way generated faster code.
So there you go, a fast-and-hacky version of something that can even overcome the formidible problems outlined in the previous post. My core utilization here is lower than the benchmark's c code, but I'm still running about twice as fast there, and once we go up to more reasonable iteration counts (i.e. > 1000) it's really no contest. In fact, this can do more than twice the iterations in the same time -- resulting in a vastly improved approximation of the set for the same computation time. And we're still easily in the short and simple program range. Above is about 150 lines or so, which could be shorter, and also a bit simpler without sacrificing much speed.
But is set approximation of 16000x16000 even at 1000000 iterations a good one? No, it isn't. But we're getting better. Next time I'll talk about what's (still) being done wrong, and why. And a little bit about approaches to do better.