Monday, July 18, 2005

 

Optimization conundrums

I've been programming professionally for 20 years, and programming Java professionally for about five. Something I think I have in Java is a good feel for optimization and performance, at least on the micro level. What code is going to cause a lot of object creation, what choice of libraries is the best for a certain set of circumstances, etc.

Let's say I don't have that in Lisp yet. I worked on a benchmark for the Computer Language Shootout, and my best optimizations with declarations weren't doing so well. I asked for help on Usenet, Nicolas Neuss had a look, and, well, I'll just show you.

Here's my code (testrrc package) and Nicolas' (testnn package) combined for one of the routines.

This is compiled and run in CMUCL on my 1 Ghz iBook.

(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))

(defpackage :testrrc)
(in-package :testrrc)

(defconstant IM 139968)
(defconstant IA 3877)
(defconstant IC 29573)

(defparameter THE_LAST 42)
(declaim (inline gen_random))

(declaim (inline gen_random))
(defun gen_random (max)
(declare (type (signed-byte 32) IM IA IC THE_LAST))
(declare (double-float max))
(setq THE_LAST (mod (+ (* THE_LAST IA) IC) IM))
(/ (* max THE_LAST) IM))

(defpackage :testnn)
(in-package :testnn)

(defconstant IM 139968)
(defconstant IA 3877)
(defconstant IC 29573)

(defparameter THE_LAST 42)
(declaim (inline gen_random))

(defun gen_random (max)
(declare (type (unsigned-byte 30) IM IA IC THE_LAST))
(declare (double-float max))
(setq THE_LAST (mod
(+ (the (unsigned-byte 31) (* THE_LAST IA)) IC) IM))
(/ (* max THE_LAST) IM))

(in-package :common-lisp-user)

(defun main ()
(time (dotimes (i 7500000) (testrrc::gen_random 1.0d0)))
(Time (dotimes (i 7500000) (testnn::gen_random 1.0d0))))

And here's the results.


; Evaluation took:
; 1.45 seconds of real time
; 1.142364 seconds of user run time
; 0.025509 seconds of system run time
; 1,529,475,015 CPU cycles
; 0 page faults and
; 5,127,264 bytes consed.
;

; Evaluation took:
; 0.28 seconds of real time
; 0.133931 seconds of user run time
; 6.7e-4 seconds of system run time
; 294,265,464 CPU cycles
; 0 page faults and
; 0 bytes consed.

Not only does that small change make it 5 times faster, but I have no idea why my version needs to cons 5 MEGS of bytes, and Nicolas' conses ZERO bytes. Anyone care to explain that one to me?

Comments:
Hi,

I tried your test using CMU CL ("CVS pre1-19a 19b-pre1-20050606 + minimal debian patches (19B)") on x86 and got these results:

; Evaluation took:
; 0.5 seconds of real time
; 0.484926 seconds of user run time
; 0.006 seconds of system run time
; 1,214,548,720 CPU cycles
; 0 page faults and
; 4,487,848 bytes consed.
;

; Evaluation took:
; 0.46 seconds of real time
; 0.447932 seconds of user run time
; 9.99e-4 seconds of system run time
; 1,087,657,860 CPU cycles
; 0 page faults and
; 639,416 bytes consed.
 
I find the code you posted difficult to read, but the main difference seems to be that the faster code uses shorter integers that would fit in a 32bit int whereas you use 32 bits which the compiler seems unwilling to use unboxed, i.e. you constantly box new integers.

You might want to try a 64bit lisp. :-)
 
Mark,
I guess you can see my frustration - the same code, running CMUCL on Linux/x86, and the times are about the same. On OS X/PPC, the times are drastically different. How does one cross-platform optimize? At least the same routines are faster in both places; there may be scenarios where a set of optimizations may be faster on one architecture and slower on another.

H,
To keep the benchmarks fair the shootout guys want the algorithms implemented as closely as possible across the languages.
Thanks for insight on the boxing. I'll try to get good at using smaller-than-32-bits integers since they give such better performance.
 
It looks like the (*) function normally returns a larger integer (that's not uncommon since CPUs also tend to give you double space for mul results). Notice that the testnn package explicitly does (the (unsigned-byte 31)) on the multiplication result.
If you leave your other declarations alone and just add in (the (unsigned-byte 31) (* ...)) the conses should disappear (they did on my cmucl19a).
 
Post a Comment



<< Home

This page is powered by Blogger. Isn't yours?