[Ecls-list] Unstable changes ahead

Juan Jose Garcia-Ripoll juanjose.garciaripoll at googlemail.com
Thu Jun 19 17:23:37 UTC 2008


I have created an informal new release tag, ECL_0_9k. People wishing
to remain "stable" might prefer either not to update from CVS or to
check out that tag using "cvs checkout -r ECL_0_9k ecl".

The reason for tagging this version is that I have redesigned the
interpreter using threaded code and new inline operators. The outcome
is way faster, for several reasons:

- It uses indirect threading. That means the interpreter keeps a table
of memory addresses for the sections of code that implement the
different bytecodes, and uses GCC's computed gotos for very fast and
efficient dispatch (see the sketch after this list).

- Functions in the Common Lisp language which take 1 or 2 arguments
can be called directly.

- The code for calling all other functions has been simplified and
inlined in the interpreter.

- Most of the structures and data needed by the interpreter are kept
as local variables in the interpreter loop. In the multithreaded case
this means a significant gain, and it is an optimization that will be
extended to other parts of ECL.

- There are new bytecodes for frequently used operations: CONS, ENDP,
CAR, CDR, etc.

- Many other bytecodes have been eliminated or rewritten using common ones.

- Single-byte bytecodes (--enable-opcode8) work again.
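
For those who are curious, here is a minimal sketch of what indirect
threading with GCC's computed gotos looks like. The opcodes and the
toy stack machine below are invented for illustration; they are not
ECL's actual bytecode set or data structures, only the dispatch
technique is the same.

/* Minimal sketch of indirect-threaded dispatch with GCC's computed
   gotos.  The opcodes and the toy stack machine are hypothetical. */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *pc)
{
    /* One entry per opcode: the address of the code implementing it. */
    static void *dispatch[] = {
        &&op_push, &&op_add, &&op_print, &&op_halt
    };
    long stack[16];
    long *sp = stack;

/* Instead of looping over a switch, every handler ends by jumping
   directly to the code for the next opcode. */
#define NEXT() goto *dispatch[*pc++]

    NEXT();

op_push:                      /* push the next byte as a small integer */
    *sp++ = *pc++;
    NEXT();
op_add:                       /* pop two numbers, push their sum */
    --sp;
    sp[-1] += sp[0];
    NEXT();
op_print:                     /* pop the top of the stack and print it */
    printf("%ld\n", *--sp);
    NEXT();
op_halt:
    return;
}

int main(void)
{
    const unsigned char code[] = {
        OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT
    };
    run(code);                /* prints 5 */
    return 0;
}

The point is that every handler jumps straight to the handler of the
next opcode instead of going back through a switch statement, which
avoids the extra jump and range check of a switch and gives the CPU a
better chance to predict each dispatch branch.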

I must also admit that the performance improvements are very much
machine dependent. I have a rather modest processor, a dual-core
Pentium IV, and on it the gain is significant. On another box, a
quad-core with a much bigger cache and more memory, running 64-bit,
the gains are not so big.

A significant comparison is Maxima's test suite. I must warn that the
following times were obtained with a version of Maxima that implements
interpreter optimizations for COERCE and TYPEP, which are not yet
available in the CVS version. The times of the test suite with the
0.9k release are:

real time : 380.454 secs
run time  : 361.130 secs
gc count  : 41 times
consed    : 397399637984 bytes

To be compared with the optimized interpreter:

real time : 307.241 secs
run time  : 287.700 secs
gc count  : 43 times
consed    : 429018839736 bytes

The difference is much bigger on the dual-core machine.

This version of the interpreter is only an interim stage. The biggest
performance bottleneck right now is the amount of consing generated in
the interpreter by the binding of local variables. My goal is to
rewrite this using a kind of "register" array, but this is tricky
because of closures: one has to determine not only the positions of
variables, but also when the register array has to be preserved, etc.
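
To make the idea more concrete, here is a very rough sketch in C. The
structures are invented for illustration and much simpler than ECL's
real environments; they only show why a flat "register" frame conses
less than a chain of heap-allocated cells, and where closures
complicate things.

/* Rough sketch contrasting two environment layouts.  None of these
   types or functions are ECL's; they are invented to illustrate where
   the consing comes from and how a flat frame avoids it. */
#include <stdlib.h>

typedef void *lispobj;

/* Current style, simplified: each local binding allocates one heap
   cell and links it onto the environment, so entering a LET with N
   variables conses N cells. */
struct env_cell {
    lispobj value;
    struct env_cell *rest;
};

static struct env_cell *bind(lispobj value, struct env_cell *env)
{
    struct env_cell *cell = malloc(sizeof(*cell)); /* GC-allocated in reality */
    cell->value = value;
    cell->rest  = env;
    return cell;
}

/* Proposed style, simplified: the compiler assigns every variable a
   fixed index in a flat frame of "registers".  The frame can live on
   the interpreter stack and costs no heap allocation at all... */
struct reg_frame {
    size_t   size;
    lispobj *regs;            /* regs[i] holds the value of variable i */
};

/* ...unless a closure captures some of its variables.  In that case
   the frame (or at least the captured part) must be copied to the
   heap so that it survives the function's exit.  Deciding when and
   what to preserve is the tricky part mentioned above. */
static lispobj *promote_to_heap(const struct reg_frame *frame)
{
    lispobj *heap = malloc(frame->size * sizeof(lispobj));
    for (size_t i = 0; i < frame->size; i++)
        heap[i] = frame->regs[i];
    return heap;
}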

So basically the goals are:

- Simplify the interpreter loop so that more data is kept in registers
and less in memory. Currently GCC is not doing too well in this
respect.

- Implement the new lexical environment structure as an array of
registers which is accessed directly and which can be allocated either
dynamically on the interpreter stack or permanently in memory.

- In cases in which no debugging information is required, call
FLET/LABELS functions directly, without creating closures. This saves
both space and time (see the sketch after this list).

- Study which bytecodes are most often used and look for new
combinations that might speed up frequent operations.

- Optimize the bytecode compiler as well, ensuring that there are no
performance bottlenecks there either.
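
Regarding the FLET/LABELS point, the following hypothetical sketch
shows the kind of distinction I have in mind. The types and names are
invented; the idea is simply that a local function which does not
escape can be invoked through a bare code pointer, passing it the
current frame, instead of first allocating a closure object.

/* Hypothetical sketch of the FLET/LABELS optimization.  The types and
   names are invented for illustration only. */
#include <stdlib.h>

typedef void *lispobj;
typedef lispobj (*code_fn)(lispobj *frame, int narg, lispobj *args);

/* General path: the local function becomes a heap-allocated closure
   object so that it can be passed around as a value. */
struct closure {
    code_fn  code;            /* interpreted or compiled body */
    lispobj *frame;           /* captured lexical environment */
};

static struct closure *make_closure(code_fn code, lispobj *frame)
{
    struct closure *c = malloc(sizeof(*c));
    c->code  = code;
    c->frame = frame;
    return c;
}

/* Fast path: when the local function never escapes and no debugging
   information is needed, the interpreter can skip the allocation and
   invoke the code pointer directly with the current frame. */
static lispobj call_local(code_fn code, lispobj *current_frame,
                          int narg, lispobj *args)
{
    return code(current_frame, narg, args);
}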

Help is welcome with any of these things, either in the form of code
or just useful ideas, suggestions, etc.

Juanjo

-- 
Facultad de Fisicas, Universidad Complutense,
Ciudad Universitaria s/n Madrid 28040 (Spain)
http://juanjose.garciaripoll.googlepages.com



