Sunday, July 22, 2007

end of the (sudoku) road

Apologies .... I thought I'd posted this entry last week but it turns out it was saved as a draft.

Anyway, this is the last thing I'll say about sudoku, at least for a while. But I did promise something a little more useful out of all this discussion, so lets talk a bit about generating puzzles. Be warned, though, this is all pretty hackish.

There are all sorts of issues here, particularly the issue that to have a reliable idea of how difficult these things are for a person to solve, we probably want a heuristic solver, no a backtracking solver like the one discussed here. So as far as `difficulty' goes, we'll just have to rely on the number of clues as a rough proxy.

We want to generate puzzles, and one way to think about this is to first figure out how to make a valid solution (i.e. a fully solved puzzle) and then keep taking away positions until we have the right number of clues. We may not be able to take away as many as we want, though, and still keep the original position the unique solution. For my purposes I'm only going to consider unique solutions valid, anything else is cheating.

First off, realize that because of the way the solver works, we already know how to generate a solved puzzle, we just start of with an empty (all zeros) one:

BLOG> (find-first-solution (make-array 81 :initial-element 0))
#(2 7 5 1 4 3 8 6 9 1 3 6 7 9 8 2 4 5 8 4 9 5 6 2 7 1 3 7 1 2 8 3 5 4 9 6 4 6 3
2 1 9 5 7 8 5 9 8 4 7 6 1 3 2 6 5 4 3 2 1 9 8 7 3 2 1 9 8 7 6 5 4 9 8 7 6 5 4
3 2 1)
(notice that if we had used find-all-solutions instead of find-first-solution, we'd have to wait a long time --- the code would happily go along generating every possible solved sudoku board).

So that's great --- except it will always generate the same board. We'll have to introduce some randomness. It's important to note that I have done no analysis of this, and while it will generate a really large number of different boards, I'm not going to claim that it is uniformly distributed or anything like that. What we do is `seed' a board with a certain number of random positions:
(defun seed-sudoku  (num-sites)
(let ((soln (make-array 81 :initial-element 0))
(sites (shuffle! (loop for n below 81 collecting n))))
(labels ((digit-at (n)
(aref soln n))
(possible-digits (n)
(set-difference
`(1 2 3 4 5 6 7 8 9)
(mapcar #'digit-at (neighbors-of n))))
(nth-value-set-p (n)
(/= 0 (digit-at n)))
(random-element (list)
(nth (random (length list)) list))
(update (n)
(when (or (null sites) (zerop num-sites))
(return-from seed-sudoku soln))
(let ((possible-digits (possible-digits n)))
(when possible-digits
(setf (aref soln n) (random-element possible-digits))
(decf num-sites)))
(update (pop sites))))
(update (pop sites)))))
It is possible that for larger num-sites this will not work, and for smaller num-sites there will be a large number of solutions. So we will just repeat until we find a valid board:
(defun generate-filled-soduku ()
(loop for x = (seed-sudoku 20)
unless (null x) do
(let ((res (find-first-solution x)))
(when res (return-from generate-filled-soduku res)))))
We allow for the possibility that a solution isn't possible (it usually is) and just repeat until we get one. I picked 20 for the random seeds as an empirical balance. Too low (e.g. 10) and occasionally the solver takes a very long time to find a solution (on the order of minutes). Too high (e.g. 30) and often seed-sudoku will fail to find enough places to place a clue (i.e., they all become constrained)

So now we can reliably generate solved boards, we'll add a little bit of code to try and remove clues until a desired number is achieved. We start of with 81 (everything known) and randomly select ones to remove so long as doing that doesn't make the solution non-unique.
(defun generate-sudoku (target-clues)
"returns a valid sudoku puzzle, tries to get it down to target-clues number of non-zero entries"
(loop with res = (generate-filled-soduku)
with sites = (shuffle! (loop for n below 81 collect n))
with clues = 81
until (= clues target-clues)
while sites
for n = (pop sites)
for val = (aref res n) do
(setf (aref res n) 0)
(if (multiple-solutions-p res)
(setf (aref res n) val)
(decf clues))
finally
(return (values res clues))))
Now for one last thing, just to give more control. I might want to know that I actually have a 25 clue puzzle, not that I tried for 25 and only made it down to 30. We now do pure rejection sampling on the number of clues. In other words, we keep running generate-sudoku with an argument of n until it is successful:
(defun sudoku-of-n-clues (num-clues)
(loop for (res clues) = (multiple-value-list (generate-sudoku num-clues))
until (= clues num-clues)
finally (return res)))
This becomes slow (apparently exponentially growing in the time to find a solution) at about 20 clues. For my purposes this is fine. I don't do the puzzles myself, but have been told the 20 solution ones that I was generating were pretty hard to do. If we stay above 20, things are really not very expensive:
BLOG> (time (loop repeat 10000 collecting (sudoku-of-n-clues (+ 22 (random 20)))))
Evaluation took: 195.26 seconds of real time
Combining this with the pdf-output I had before would generate a book of 10,000 puzzles, which should take a while to work through.

Of course, this really isn't the end of the road, there are a large number of improvements that could be made, and this algorithm would need some rethinking if we wanted to seriously explore the space for whatever reasons, particularly for low clue numbers. Anyway, it was just an example to show a few CL idioms etc., but I added this as a starting-off point, perhaps, for a decent puzzle generator.

Monday, July 16, 2007

pdf output



A short post about cl-pdf, to get the above output. I've found it handy to hack up a number of outputs for figures and the odd bits and pieces. So, since I had a puzzle generator for sudoku puzzle, I thought I make something to print them (I'll still do a post later on generating puzzles).

This is a hack and I haven't bothered to clean it up but might be useful as an example so I'll post it.

First off we loop over a list of puzzles, printing them four to a page and allowing for a short page left over. This handles opening a document and writing it out:


(defun pdf-sudoku (sudokus file)
"print a list, sudokus, of puzzles to file name file. Print 4 to a page"
(pdf:with-document ()
(loop until (< (length sudokus) 4) do
(loop repeat 4 collecting (pop sudokus) into list
finally (pdf-sudoku-4up-page list))
finally
(unless (null sudokus)
(pdf-sudoku-4up-page sudokus)))
(pdf:write-document file)))

So that looks ok so long as the 4up function actually prints puzzles somehow. This handles the logic of translating around to 4 positions on the page. We're hard coding in 8.5" by 11" paper (72 points per inch) here, and doing a couple of calculations to get the offsets so everything is centered.

(defun pdf-sudoku-4up-page (sudokus)
"print between 1 and 4 puzzles on a single page"
(pdf:with-page ()
(let* ((helvetica (pdf:get-font "Helvetica" :win-ansi-encoding))
(spacer (/ 72 4)) ; 1/4 inch
(width (- (/ (* 8.5 72) 2) (* 2 spacer)))
(voffset (/ (- (* 11 72) (* 2 width) (* 4 spacer)) 2)))

(pdf:translate spacer voffset)
(when (first sudokus)
(fill-in-grid (first sudokus) helvetica width))

(pdf:translate 0 (+ width spacer))
(when (second sudokus)
(fill-in-grid (second sudokus) helvetica width))

(pdf:translate (+ width spacer) 0)
(when (third sudokus)
(fill-in-grid (third sudokus) helvetica width))

(pdf:translate 0 (- (+ width spacer)))
(when (fourth sudokus)
(fill-in-grid (fourth sudokus) helvetica width)))))

Ok, assuming that works we still need to draw a grid and fill in the numbers. I separated out a function to draw the lines, draw-grid.

(defun draw-grid (x0 y0 width height cellsx cellsy)
(pdf:move-to x0 y0)
(loop for x from x0 upto (+ x0 width) by (/ width cellsx) do
(pdf:move-to x y0)
(pdf:line-to x (+ y0 height))
(pdf:stroke))
(loop for y from y0 upto (+ y0 height) by (/ height cellsy) do
(pdf:move-to x0 y)
(pdf:line-to (+ x0 width) y)
(pdf:stroke)))


(defun fill-in-grid (array font &optional (scale (* 8.5 72)))
"fill in numbers from the sudoku in array if they are set (non zero)"
(let ((font-size (/ scale 10))
(step-size (/ scale 9)))
(pdf:set-line-width 1.0)
(draw-grid 0 0 scale scale 9 9)
(pdf:set-line-width 3.0)
(draw-grid 0 0 scale scale 3 3)
(pdf:in-text-mode
(pdf:move-text 0 (- scale step-size))
(pdf:set-font font font-size)
(multiple-value-bind (w h x)
(pdf:get-char-size #\9 font font-size)
(pdf:move-text (/ (- step-size w) 2) (/ (- step-size h) 2)))
(loop with n = -1
repeat 9 do
(loop repeat 9
for x = (aref array (incf n))
unless (= 0 x) do
(pdf:draw-text (format nil "~D" x))
do
(pdf:move-text (/ scale 9) 0)
finally (pdf:move-text (- scale) (- (/ scale 9))))))))

That's it then. There is all sorts of hard coded nonsense, and it should probably handle 2up and 1up as well as different paper sizes. Oh, and let us change the font. But like I said, it's just hacked together and I've tried to isolate the bits that have the hardcoding of sizes, etc. So perhaps it will be useful as an example.

I owe you a short post on generating puzzles, and then will be done with these sudokus ... probably.

Thursday, July 12, 2007

optimizing the solver: wrap-up

I thought I'd add a few closing remarks to the optimization comments. This post won't make much sense unless you've read the previous few solver posts.

1) Particularly if you are new to lisp and coming from a c-family language, you may have a bad feel for how fast your lisp implementation can walk across a list. I made a version that stored all the neighborhood information in an array instead of a vector of lists -- it was slower.

2) You might also question the speed of using recursion. I rewrote the recursive version of solve12 as a loop. It was both much more difficult to understand, and slower. This was because to use the same algorithm, I had to maintain lists of both the unassigned sites, and the ones I'd visited --- the call stack is doing the latter for me, and it does it quickly. With solve5, I made a slightly faster non-recursive version because I could compute where the last site I tried was, but in the more sophisticated solver you can't predict the order that way.

3) There are any number of other tweaks like this you might try, I tried a few, but by no means exhausted the possibilities. Both the approaches in 1) and 2) could probably be finessed into at least catching up with our solver12, if not exceeding it.


I think I've hit (passed, really) diminishing returns with this code. You could spend a lot more time on it, but keep in mind the following:

- any big gain i more likely to come from a better algorithm, if at all

- at this point, we are pretty much hitting implementation dependent improvements. For example, I had one version using a do* loop that was a bit slower, I'm not sure why.

- finding the reasons for questions like the previous will probably involve a combination of profiling an examining the actual code generation (via disassemble) in order to find out what is really going on.

- the painstaking process described in the previous point takes a lot of time, is bug-prone, and tends to end up with code that is hard to follow and maintain. Your carefully measured `optimal' is unlikely to be the same for different implementations. It might not hold for the next upgrade of your compiler.

- In my experience, making fast code say, twice as fast as it was is often a lot more work than making slow code thousands of times faster


All the above taken together means I pretty much never tweak code this far, let alone further if I can avoid it. Unless there is a compelling reason. Turning six months of runtime into a couple of weeks is worth spending a few days on. Quadrupling the speed of some one-off thing that will only run for a day would be a waste of time even if you just consider the time spent coding --- when you add in the much higher probability of some subtle bug that invalidates your results, it's really counterproductive....

All that said, sometimes you need efficient code in research, and I hope this little series has left you with the idea that this is really quite possible with lisp, so that won't scare you off using the expressive power of a language like this.

I'll do one more post in this series, making a simple little puzzle generator and some pdf output for it.

Sunday, July 8, 2007

optimizing the solver part IV: now we're cooking with gas

Last time we established that while solve5 was pretty fast at what it does, it's not a great algorithm. As mentioned, I found a files somewhere that I believe originated here, which we can used for testing. If I add a small change to solve5 to just count how many times it calls update, look at what we get for the first 100 puzzles in this file:


BLOG> (loop repeat 100 for x in *top2365* summing (solve5 x))
77405348

Thats roughly 750,000 calls per puzzle, on average. Lets see if we can't do something about that. I'm going to make a simple and fairly obvious change, but it will require rethinking things. What we want to do is to visit the sites in order of most-constrained to least-constrained. As it is now, we just go and look at each neighbor of the current site once, to see what digits are constrained. If we stay with vector of bits to describe constrained/unconstrained, we have a problem. This is because when you set a new constraint all you do is set the bit to one, but if you want to clear a constraint, you'll have to check all the other sites that might affect that position, to see if it is really cleared (this because a site can be constrained from having a particular digit by multiple other sites).

All that running around and checking seems like a lot of work, so I'm going to use an array that counts the number of contraints a particular position has, and also stores the total number. So we need to keep this array updated, and we'll re-introduce set-digit and clear-digit to wrap that logic up. Other than that, we do something very similar to solve5, except we sort the candidate sites by how constrained they are. Here's a first cut:

(defun solve7 (initialstate)
(let ((soln (make-array `(81) :element-type '(integer 0 9)))
(constraints (make-array `(81 10) :element-type '(integer 0 20) :initial-element 0))
(unassigned `())
(calls 0))
(labels ((sort-unassigned ()
(setf unassigned (sort unassigned #'(lambda (n m) (> (aref constraints n 0) (aref constraints m 0))))))
(set-digit (n d)
(setf (aref soln n) d)
(over-neighbors (i n)
(when (= 1 (incf (aref constraints i d)))
(incf (aref constraints i 0)))))
(clear-digit (n d)
(over-neighbors (i n)
(when (= 0 (decf (aref constraints i d)))
(decf (aref constraints i 0))))
(setf (aref soln n) 0))
(solve (n)
(incf calls)
(loop for digit from 1 to 9
when (zerop (aref constraints n digit)) do
(set-digit n digit)
(sort-unassigned)
(if unassigned
(solve (pop unassigned))
(return-from solve7 (values calls soln)))
(clear-digit n digit)
(sort-unassigned)
finally
(push n unassigned))))
;;setup
(loop for n below 81
for x across initialstate
if (= 0 x) do
(push n unassigned)
else do
(set-digit n x))
(sort-unassigned)
(solve (pop unassigned)))))

And the result:

BLOG> (loop repeat 100 for x in *top2365* summing (solve7 x))
107568

From 77 million calls down to a hundred thousand? Sure, each step is a little more expensive, but there are a lot fewer steps.

The only thing about this approach is that the sort is really heavyweight (go ahead and measure this....). I played around with a couple of options, and what I ended up with was code to do a a search through the list of possible options, and remove (one of) the most constrained one and return it. This code looks hairier than anything we've done so far, but it is fast. The complexity is there to manage unlinking the list. Here it is:

(defun solve8 (initialstate)
(declare (optimize speed (debug 0) (safety 0)))
(let ((soln (make-array `(81) :element-type '(integer 0 9)))
(constraints (make-array `(81 10) :element-type '(integer 0 20) :initial-element 0))
(unassigned `()))
(labels ((find-next-site ()
(loop with maxval = (aref constraints (car unassigned) 0)
with head = nil
for a on unassigned
for b = (cadr a)
while b
for val = (aref constraints b 0)
when (> val maxval) do
(setf head a maxval val)
finally
(return (if head
(prog1 (cadr head)
(setf (cdr head) (cddr head)))
(pop unassigned)))))
(set-digit (n d)
(setf (aref soln n) d)
(over-neighbors (i n)
(when (= 1 (incf (aref constraints i d)))
(incf (aref constraints i 0)))))
(clear-digit (n d)
(over-neighbors (i n)
(when (= 0 (decf (aref constraints i d)))
(decf (aref constraints i 0))))
(setf (aref soln n) 0))
(solve (n)
(loop for digit from 1 to 9
when (zerop (aref constraints n digit)) do
(set-digit n digit)
(if unassigned
(solve (find-next-site))
(return-from solve8 soln))
(clear-digit n digit)
finally
(push n unassigned))))
;;setup
(loop for n below 81
for x across initialstate
if (= 0 x) do
(push n unassigned)
else do
(set-digit n x))
(solve (find-next-site)))))

You'll notice that on this final version I've gone ahead and used maximally fast (and minimally safe) optimization declares. I did the same for solve5, and it makes a small difference. The algorithmic change though, is anything but a small difference:

BLOG> (time (loop for x in *top2365* do (solve5 x)))
Evaluation took:
1650.157 seconds of real time
1586.752 seconds of user run time
23.794111 seconds of system run time
[Run times include 112.596 seconds GC run time.]
0 calls to %EVAL
0 page faults and
185,974,228,944 bytes consed.
BLOG> (time (loop for x in *top2365* do (solve7 x)))
Evaluation took:
46.938 seconds of real time
46.745686 seconds of user run time
0.144156 seconds of system run time
[Run times include 0.2 seconds GC run time.]
0 calls to %EVAL
0 page faults and
226,594,424 bytes consed.
NIL
BLOG> (time (loop for x in *top2365* do (solve8 x)))
Evaluation took:
3.55 seconds of real time
3.533416 seconds of user run time
0.012559 seconds of system run time
[Run times include 0.037 seconds GC run time.]
0 calls to %EVAL
0 page faults and
40,511,344 bytes consed.
NIL
BLOG> (/ 3.55d0 (length *top2365*))
0.0015010570824524313d0

An average time of 0.0015s/puzzle looks pretty good, and it's 400 times faster than solve5, which was already 10,000 times faster than our first attempt, at least on a particular puzzle. I'm not going to run that one on 2000+ puzzles for you, but if you've got a few days of cpu to put to it, go ahead....

This post is pretty long, so I'll leave it like that for now and recap the optimization stuff next time. Then I'm going to go ahead and build a simple puzzle generator (with pdf output) out of solve8 (at least, solve8 with the restart put back in). For those who are interests, putting the restart back in will raise th total runtime from about 3.5s to about 4.0s. Not nothing, but hardly the end of the world.

Oh, and one last thing. It's always a good idea to sanity-check this sort of thing as you go along:

BLOG> (notany #'null (mapcar #'validate (loop for x in *top2365* collecting (solve8 x))))
T

Here validate is a simple routine that checks that every digit is between 1..9, and no digit is the same value as any neighbor it has in row,column, or cell.

Saturday, July 7, 2007

optimizing the solver part III: a bit more about measurement

The code called `solve5' is reasonably fast. I suspect we could make it a little faster, but at this point we've gone about as far with it as I'd like. After all, we're just tuning a particular algorithm, and eventually we have to hit diminishing returns --- as it is we have a clean, easy to follow implementation that is many orders of magnitude faster than to naive code we started off with.

We haven't really asked if the algorithm is good. It isn't, particularly. It's not terrible, but it really isn't that good. So it doesn't make a lot of sense to keep tuning it. Which brings me to the second issue I mentioned a few posts ago; in optimizing code we have to balance effort spent on better algorithms vs. effort spent on tuning.

I talked a bit about profilers. If you're doing this sort of thing, you should acquaint yourself with yours. If you're using a free lisp, you may not have a lot of options (for example, sbcl on darwin/ppc doesn't seem to have a working statistical profiler). There's lots of information out there, but most of it is implementation dependent and I won't spend time on it now. There are other things you can do; what follows is an example.

If you think about the kind of search problem we are doing here, you're bound to start wondering about the order of search. As it is, we just look at the sites in order, lowest to highest, and try and fill in the digits. It should be clear that if someone knew how you were doing this, they could construct a puzzle that was very difficult for you to solve (but might be quite easy if you searched highest-to-lowest, say). This comes back to the issue I earlier raised -- that performance on this problem is controlled by two things: how long it takes you to try a single vertex (i.e. position in the puzzle), and how many times you have to try. This latter part is strongly affected by the order in which we look.

If we're worried about this sort of thing, it doesn't make a lot of sense to look at only one or two puzzles, does it? That tells us a lot about the first part of the issues (how fast can one vertex be tested) but nothing much about how many things I have to try on average.

Before we do something about that, here is a little experiment. I made a new solver, solve6, with a small change:


...
else do
(setf (aref soln n) x))
(setf unassigned (shuffle! unassigned))
...

At the end of the loop setting things up, I randomly shuffle the list of unassigned sites. So every time I run solve6, I look at the sites in a different (random) order. This should have only a tiny effect on execution speed, because it a) only happens once, and b) is done with fast code. Look what happens:

BLOG> (avg-runtime 30 (solve5 *easy*))
1.3333333333333334d-4
1.1954022988505748d-7
BLOG> (avg-runtime 30 (solve6 *easy*))
0.47433333333333333d0
0.8706354712643679d0

My macro (avg-runtime n body) executes body n times and returns the mean and variance of the runtime in seconds (The details probably aren't important, but I'll note it is computing seconds based on get-internal-runtime. This is implementation dependent, but a handy userspace tool to measure things with. I'll describe it more at the end)

I used a puzzle called *easy* this time. Why? Because solve6 takes too long to solve the *slow* puzzle, even once. I got tired of waiting for just one solution after 5 minutes or so (compared to solve5's 1.3e-4s, remember). This little change made a huge difference. Why? Because the order we visit sites really, really matters. Which suggests that rather than fiddling around with tuning what we have, we should really think a bit about how to visit sites in a smart way.

There's another clue here. solve6 execution speed has high variance, which means picking any old order of visitation can get you in all kinds of trouble. But also note, solve5 is comparatively very, very fast. Why? We're doing something clever, without thinking about it. What we're doing is walking along in row-major order through the array, basically. But remember that rows are one of the constraints for this puzzle. So as we move along, we are constraining the next choice (and the one after that, etc.) each time. Looks like this works for us pretty well!

We can do a lot better, and I'll talk about that next time. And we're going to have to talk about average performance across a larger number of examples. I poked around on the web and found some here, that are labeled as `hard to solve'. I'll be using the file `top 2635' if you want to play along next time.

Before I end though, ss promised, a bit more on the macro I used. First off, I used a handy little function that computes the mean and variance of any sequence, something like this:

(defun mean-and-variance (seq)
(let ((n 0)
(mean 0)
(sum^2 0))
(flet ((update (x)
(let ((dev (- x mean)))
(incf n)
(incf mean (/ dev n))
(incf sum^2 (* dev (- x mean))))))
(map 'nil #'update seq)
(values mean (/ sum^2 (1- n))))))

I mention this particularly, because when I first started lisping I would far too often special case different sequence types in a loop (a c programmers approach, I guess) and forgo code like this that is clean, reasonably quick, and doesn't care if you hand it a list or a vector, etc. Common lisp is full of nice function like map, and if you haven't found them yet you should look.

Here is avg-runtime, and a macro it depends on:

(defmacro tictoc (&body body)
(with-unique-names (t0 t1 res)
`(let (,t0 ,t1 ,res)
(setf ,t0 (get-internal-run-time)
res ,@body
,t1 (get-internal-run-time))
(values (float (/ (- ,t1 ,t0) internal-time-units-per-second) 1.0d0) res))))

(defmacro avg-runtime (n &body body)
(with-unique-names (res)
`(loop repeat ,n
collecting (tictoc ,@body) into ,res
finally (return (mean-and-variance ,res)))))


Of course there are loads of ways to do this, and if I was worried about long lists I wouldn't collect one up to pass to mean-and-variance, but it's fine for our purposes. I'm sure there are any number of weaknesses in the above, too. It's something I whipped off both to measure the solve6 run, and more as an example of how easy it is to build this sort of thing to help you.

Friday, July 6, 2007

optimizing the solver, part II: Know your profiler

Let me start off with a comment: I'm very new at this and having trouble keeping the entries a manageable size. Because of this, some of you have noticed I pulled a bit of a bait and switch on you last time. I talked about the need to measure things, but then I didn't show you any measurements. I'm going to start talking about that now, but there's lots to say so I'll just start with a small piece (and take it up again in future posts).

The sudoku example I've show you is simple, and we can reason some things out about the performance. Also, of course, some things come from experience. In general though, you really can't hope to optimize what you can't measure. The first thing you can do this (assuming of course, you've all ready decided you actually *have* to optimize this bit of code) is get to know what sort of profiling abilities you have. This is by nature implementation dependent: take some time to familiarize yourself with your particular system.

Here are some results from the statistical profiler in sbcl (somewhat trimmed). This showed me that my intuition was correct, and neighbors-of had a lot of unneeded overhead in the original implementation:


BLOG> (sb-sprof:with-profiling (:max-samples 50000 :report :flat :loop nil) (find-first-solution *slow*))
Self Total Cumul
Nr Count % Count % Count % Calls Function
------------------------------------------------------------------------
1 3285 30.2 3285 30.2 3285 30.2 - (FLET #:CLEANUP-FUN-1165)
2 2277 20.9 2277 20.9 5562 51.1 - (FLET #:CLEANUP-FUN-1030)
3 763 7.0 4053 37.2 6325 58.1 - SB-KERNEL:%PUTHASH
4 611 5.6 8555 78.6 6936 63.7 - SB-IMPL::LIST-REMOVE-DUPLICATES*
5 595 5.5 595 5.5 7531 69.2 - (FLET #:CLEANUP-FUN-77)
6 551 5.1 2828 26.0 8082 74.3 - SB-IMPL::GETHASH3
7 387 3.6 575 5.3 8469 77.8 - SET-DIFFERENCE
8 354 3.3 617 5.7 8823 81.1 - MAKE-HASH-TABLE
9 318 2.9 393 3.6 9141 84.0 - REMOVE
10 277 2.5 9425 86.6 9418 86.5 - NEIGHBORS-OF
11 232 2.1 232 2.1 9650 88.7 - SB-VM::MOVE-FROM-SIGNED
12 209 1.9 210 1.9 9859 90.6 - EQL
13 195 1.8 10880 100.0 10054 92.4 - (LABELS UPDATE)
14 119 1.1 174 1.6 10173 93.5 - SB-IMPL::%MAKE-HASH-TABLE
15 113 1.0 113 1.0 10286 94.5 - SB-VM::GENERIC-+
16 91 0.8 95 0.9 10377 95.3 - LENGTH
17 88 0.8 88 0.8 10465 96.2 - SUBS->INDEX
18 73 0.7 73 0.7 10538 96.8 - SB-IMPL::GETHASH2
19 50 0.5 50 0.5 10588 97.3 - TRUNCATE
20 39 0.4 88 0.8 10627 97.6 - FLOOR

Look at line 10, particularly.
After replacing neighbors-of with the lookup version we get this:

BLOG> (sb-sprof:with-profiling (:max-samples 50000 :report :flat :loop nil) (find-first-solution *slow*))
Self Total Cumul
Nr Count % Count % Count % Calls Function
------------------------------------------------------------------------
1 323 41.5 477 61.2 323 41.5 - SET-DIFFERENCE
2 183 23.5 779 100.0 506 65.0 - (LABELS UPDATE)
3 137 17.6 137 17.6 643 82.5 - EQL
4 62 8.0 62 8.0 705 90.5 - (FLET #:CLEANUP-FUN-77)
5 35 4.5 35 4.5 740 95.0 - (LABELS DIGIT-AT)
6 23 3.0 23 3.0 763 97.9 - SB-KERNEL:%COERCE-CALLABLE-TO-FUN
7 11 1.4 11 1.4 774 99.4 - SB-VM::MOVE-FROM-SIGNED
8 3 0.4 3 0.4 777 99.7 - SB-VM::GENERIC-+
9 1 0.1 1 0.1 778 99.9 - NEIGHBORS-OF
10 1 0.1 1 0.1 779 100.0 - (FLET #:CLEANUP-FUN-4)
11 0 0.0 779 100.0 779 100.0 - SOLVE2
12 0 0.0 779 100.0 779 100.0 - FIND-FIRST-SOLUTION
13 0 0.0 779 100.0 779 100.0 - NIL
14 0 0.0 779 100.0 779 100.0 - SB-INT:SIMPLE-EVAL-IN-LEXENV
15 0 0.0 779 100.0 779 100.0 - SWANK::EVAL-REGION
16 0 0.0 779 100.0 779 100.0 - "Unknown component: #xA81A978"
17 0 0.0 779 100.0 779 100.0 - (LAMBDA (SWANK-BACKEND::FN))
18 0 0.0 779 100.0 779 100.0 - SWANK::CALL-WITH-BUFFER-SYNTAX
19 0 0.0 779 100.0 779 100.0 - SWANK:LISTENER-EVAL
20 0 0.0 779 100.0 779 100.0 - "Unknown component: #xA80DB98"

So you see, not only have we reduced the overall runtime by 10x, but neighbors-of not accounts for a tiny percentage of the total runtime (so we can safely ignore it, at least for now). This result also showed me that to proceed I needed to concentrated on reducing the cost of the set-difference. Here I knew from experience I could do the same thing with a bit vector quickly, but even if I didn't know that, I'd know this is where I wanted to concentrate my efforts.

There is one other thing I should note. There is a lot of cruft in these reports from flet, labels etc, which brings up a bit of an issue. I've sort of implicitly advocated wrapping things up as local functions to keep the local detail local, and not fill your namespace with all sorts of specialized functions.

There are a couple of downsides to this. The more important one is that it can be much easier to debug separate functions. Additionally, when (if!) we get to the point of profiling and optimizing, it can be harder to see what is going on. On the other hand, a tiny function like the set-digit we had (which just wraps an array reference) could have more calling overhead than actual overhead, in this case we can distort the result by not allowing the compiler to easily elide the call. It can be a bit of a pain to move from one form to the other, but in practice I often start off will a bunch of little functions and then may wrap them all up in a flet or labels later.

Next time I'll talk about more measurement and optimization issues, and begin discussing algorithmic changes we can make (and why).

Thursday, July 5, 2007

optimizing the solver, part I

Optimization, particularly when `premature' has a deservedly bad reputation. In the quest for `fast' code, it is very easy to end up with something that is buggy and hard to read. Many coders fall into the trap of thinking about this far too early in the development cycle .. and the code suffers for it.

That being said, sometimes we have no choice. Over the next couple of posts I'm going to discuss what we can do with this `sudoku' example. I'm going to make some specific points along the way, but before that, lets talk a bit about a philosophy of sorts for optimization.

You hear a lot of pithy sayings about code optimization (e.g. `premature optimization is the root of all evil'), some much more useful than others. If I had to sum it up my feelings about it in a single sentence, I'd probably say (almost certainly borrowed this, but no idea where from): "Rules for optimization: 1) Don't 2) Don't yet 3) If you must, analyse & measure everything and work from there".

This works for me, particularly because coming from a background (asm/c) that lends itself to blazingly fast code nobody can read, I can easily fall into the trap of premature optimization. The 3rd point is crucial though; you can waste an astonishing amount of time polishing code that you thought was slowing you down, but it wasn't. A corollary to this statement is `Know your implementation'. On that note, I'll try and stay as general as possible, but some of the things that follow will be tailored to the sbcl compiler. Really high performance code will often necessarily be implementation dependent (and cpu dependent, and cache/memory dependent ... and if you get to that level may be quite counter-intuitive).

We'll be talking about common lisp here, and one of the really nice things about CL is that you get an extremely expressive high level language, but without resorting to libraries etc. you can still have very efficient compiled code. The language is unique in that respect, in my experience. For what it's worth I've found that with a bit of work, my numerical CL code tends to be within a factor of 2 or so of code (and yes, sometimes faster although usually not). The advantages in development time and correctness/debugging relative to c have been very large in my experience. It's a very research-friendly language. Of course we can't hope to meet carefully tuned asm coding --- but who can?

So now we've dispensed with the obligatory nod to reason and good sense, let's hammer on this poor little sudoku solver (the previous few entries, if you haven't been following along) a bit. If you'll recall, we had a problem:


BLOG> (time (find-first-solution *slow*))
Evaluation took: 113.817 seconds of real time

(I'm on a different, slightly slower machine this time). I'll note that everything you see here has been declaimed to have (optimize speed). Now broadly speaking, algorithmic changes will beat `tuning', but both are important. There are always trade-offs, too; a very fast algorithm may be difficult to read and implement correctly, for example.

We have here a simple algorithm that tries a number of possible configurations until it finds a solution. This means we have two ways to make it faster. 1) make each individual attempted entry faster, and 2) look at less possibilities.

#2 involves changing the algorithm to be smarter about where we look. I'll take that up in a second post. For now, let's look at #1, how to make each step faster.

There is one thing about the current code that immediately jumps out at me. There are only 81 possible positions we examine, and each of these has 20 neighbors (same row, same column, or same cell). Every time we want to look at a position, we compute every neighbor and cons them up in a list. Obviously we'll benefit from not repeating these computations:

BLOG> (memoize:memoize-function 'neighbors-of)
NEIGHBORS-OF
BLOG> (time (find-first-solution *slow*))
Evaluation took: 9.531 seconds of real time

Pretty easy way to gain an order of magnitude, no? And you know what they say: and order of magnitude here, an order of magnitude there --- pretty soon you're talking about real performance improvements.

The above relied on memoization code. Since this is a simple problem of looking up a known (small) set of lists, we could also just minorly modify the definition of neighbors-of too, by precomputing everything and just closing over a vector of the lists:

(loop with array = (make-array 81)
for idx below 81 do
(loop with (i0 j0) = (index->subs idx)
for n below 9
collecting (subs->index i0 n) into res
collecting (subs->index n j0) into res
finally
(loop repeat 3 for i upfrom (* 3 (floor i0 3)) do
(loop repeat 3 for j upfrom (* 3 (floor j0 3)) do
(push (subs->index i j) res)))
(setf (aref array idx) (sort (remove idx (remove-duplicates res)) #'<)))
finally
(defun neighbors-of (idx)
(aref array idx)))

BLOG> (time (find-first-solution *slow*))
Evaluation took: 8.081 seconds of real time


This is less general, but also a little faster. With that obvious fix taken out of the way, let's move on the specific changes. For the sake of vertical space, just for the moment, I'm going to dispense with both the signal & restart stuff described previously, and the local functions set-digit and clear-digit. We'll bring this all back soon.

I'm going to introduce several changes. If you're interested, I encourages separating out the changes and measuring their relative effect (I'm going to talk a bit more about measurement in another post soon). Here is the first variant:

(defun solve4 (initialstate)
(declare (optimize speed))
(let ((soln (make-array `(81) :element-type '(integer 0 9) :initial-contents initialstate)))
(labels ((possible-digits (idx)
(let ((set (make-array 10 :element-type 'bit :initial-element 0)))
(over-neighbors (n idx)
(setf (sbit set (aref soln n)) 1))
set))
(solve (n)
(when (= n 81) (return-from solve4 soln))
(if (/= 0 (aref soln n))
(solve (1+ n))
(loop with possible = (possible-digits n)
for digit from 1 to 9
when (zerop (sbit possible digit)) do
(setf (aref soln n) digit)
(solve (1+ n))
(setf (aref soln n) 0)))))
(solve 0))))
BLOG> (time (solve4 *slow*))
Evaluation took: 0.043 seconds of real time

There are two significant changes here. I've added declarations to help the compiler (sbcl in this case) generate efficient code. I've also replaced the list-based sets of possible digits with a bit array. This may be implementation dependent, but in this case the gain is large.

One little algorithmic tweak can be made too. As it is, we walk over every position, and just ignore it if it is already set. Rather than do this, we can walk a list and only look at the positions that we actually need to:

(defun solve5 (initialstate)
(declare (optimize speed))
(let ((soln (make-array `(81) :element-type '(integer 0 9)))
(unassigned `()))
(labels ((possible-digits (idx)
(let ((set (make-array 10 :element-type 'bit :initial-element 0)))
(over-neighbors (n idx)
(setf (sbit set (aref soln n)) 1))
set))
(solve (n)
(loop with possible = (possible-digits n)
for digit from 1 to 9
when (zerop (sbit possible digit)) do
(setf (aref soln n) digit)
(if unassigned
(solve (pop unassigned))
(return-from solve5 soln))
(setf (aref soln n) 0)
finally
(push n unassigned))))
;;setup
(loop for n below 81
for x across initialstate
if (= 0 x) do
(push n unassigned)
else do
(setf (aref soln n) x))
(solve (pop unassigned)))))
BLOG> (time (solve5 *slow*))
Evaluation took:0.01 seconds of real time


So with very little effort, we've made it from about 100s to about .01s ... I wasn't kidding about the 10,000 times speedup. We're not finished though.. I'll add a post about the limitations of this sort of measurement, and some algorithmic improvements too. You'll probably notice I haven't really backed up my `measure everything' statement either, because I've just shown that the changes I made worked (on this example, anyway) not why I was driven to try them. This is a long enough entry, though, and I'll take that up next time.