ACE, optimisation and an old friend

In late November I received an email from an old friend: Sean Miller. Sean was a member of the wonderful community that built up around ACE BASIC, a compiler for the Amiga I developed as a labour of love between 1991 and 1996. I’ve written about ACE in this blog before. Sean told me how his use of ACE influenced him over the years. It has been great to get back in contact with him.

I felt honoured and humbled when, on Christmas Eve, Sean released an episode about ACE on the Raising Awesome YouTube channel he and his son have created. In this episode (Retro Amiga Computing – ACE BASIC and Questions2.1 Development):

Sean shows how to use ACE on an Amiga emulator to compile and run a program he wrote more than 20 years ago (Questions).

Retro Computing with ACE

I’ve expressed this in email to Sean, but let me say it publicly: thank you Sean! It means more to me than I can say.

During the video, Sean comments on the progress of the compilation of Questions, notes that there were around 4000 peephole optimisations (see screenshot from video above) and wonders whether I might explain what a peephole optimisation is. I’d be happy to of course. Now you’ve got me started! 🙂

ACE generates assembly code for the beautiful Motorola 68000 microprocessor. Compilation of some ACE language constructs generates sub-optimal assembly code instructions on the first pass. Assembly code is emitted as ACE parses the input source code without any knowledge of the broader context of the program.

Here’s a trivial ACE program:

x%=42
y%=x%*3

This simply stores 42 in the short integer variable x% (the type is denoted by %), multiplies x% by 3 and stores the product in the variable y%. I chose integer over floating point for this example because the generated assembly is more complex and would distract from the explanation. Speaking of distractions…

As an aside, unlike modern Intel, ARM and other processors, the 68000 didn’t have a floating point unit (FPU), so floating point operations were carried out by library code instead of hardware, such as a Motorola Fast Floating Point or IEEE 754 library. As an aside to my aside, the Amiga 500 had a 68000 processor whereas the Amiga 1200 (I owned both eventually) had a 68020. The 68020 could offload floating point instructions (which it did not know how to handle) to a co-processor. The 68040 was the first 68k processor with an on-board FPU. This is a whole topic by itself.

Back to the trivial example ACE program above…

Here’s the 68000 assembly ACE generates for the two line program without any optimisation (i.e. without the -O option):

    move.w  #42,-(sp)
    move.w  (sp)+,-2(a4)
    move.w  -2(a4),-(sp)
    move.w  #3,-(sp)
    move.w  (sp)+,d0
    move.w  (sp)+,d1
    muls    d1,d0
    move.l  d0,-(sp)
    move.l  (sp)+,d0
    move.w  d0,-(sp)
    move.w  (sp)+,-4(a4)

With optimisation we have 6 assembly instructions instead of 11:

    move.w  #42,-2(a4)
    move.w  -2(a4),-(sp)
    move.w  #3,d0
    move.w  (sp)+,d1
    muls    d1,d0
    move.w  d0,-4(a4)

Looking at the first two lines of the 11 unoptimised sequence:

    move.w  #42,-(sp)
    move.w  (sp)+,-2(a4)
lifo_stack1
Example stack operations (source: goo.gl/5EuhjG)

ACE examines this pair in a sliding window, or so-called peephole, onto the emitted instructions and notices that 42 is being pushed to the first-in, last-out stack then immediately popped from the stack and stored into the variable x%’s address, represented by an offset of two from an address stored in the register a4. The peephole optimiser reduces this push-pop pair to a single instruction:

    move.w  #42,-2(a4)

ACE stops short of taking the newly optimised pair:

    move.w  #42,-2(a4)
    move.w  -2(a4),-(sp)

then peephole optimising it and emitting this:

    move.w  #42,-(sp)

The reason is that the programmer has asked for 42 to be stored in the variable x%.

More ideally would have been this sequence:

move.w  #42,-2(a4)
move.w  -2(a4),d0
muls    #3,d0
move.w  d0,-4(a4)

which literally says:

  • move 42 into variable x%’s memory location
  • move the value stored at x%’s memory location into the 68k register d0
  • carry out a signed multiplication of 3 with the contents of register d0, storing the result in d0
  • move the contents of register d0 into variable y%’s memory location

If the constraints relating to use of x% and y% did not exist, the following would be sufficient to yield the product of 42 and 3 in 68k assembly:

move.w  #42,d0
muls    #3,d0

Notice that the 4 instructions after the multiplication (muls) in the unoptimised sequence are optimised during more than one pass over the assembly code to a single instruction that stores the product into y%, from this:

    move.l  d0,-(sp)
    move.l  (sp)+,d0
    move.w  d0,-(sp)
    move.w  (sp)+,-4(a4)

to this:

    move.w  d0,-4(a4)

So, ACE does better with this than the instruction sequence before the multiplication.

There are other simple optimisations carried out when the -O option is used, relating to numeric negation, but this example illustrates the key aspects.

Bernd Brandes later wrote a more powerful optimiser for ACE, the SuperOptimiser, that built upon this simple peephole optimisation approach.

Every instruction the processor doesn’t have to execute means fewer CPU cycles, so a run-time speed up. This matters a lot for example, when such instructions are part of a loop that iterates many times.

To revisit ACE’s code generation and optimisation implementation, I downloaded and Vidar Hokstad’s improvements to the ACE source (on GitHub) for compilation under Linux. I compiled that on my Mac OS X laptop and used it to generate 68k assembly code. Vidar contacted me several years ago to say that he was engaging in “software archaeology” (that made me feel a bit old, even then) with the ACE source code. I appreciate Vidar’s efforts. He sorted out some compilation problems under the GNU C compiler (gcc) that I would have had to otherwise.

It’s interesting to look at the Intel assembly generated by gcc for a similar C code fragment. The following would have to be embedded in a function:

int x,y;
x=42;
y=x*3;

The gcc compiler generates this sequence:

    movl    $0, -4(%rbp)
    movl    $42, -8(%rbp)
    imull   $3, -8(%rbp), %ecx
    movl    %ecx, -12(%rbp)

As with the ACE generated 68k assembly, only the relevant part is shown. There’s additional code generated just to start up and shut down a program (by gcc, ACE or any other compiler). The Intel assembly generated here is a bit better than the optimised 68k code ACE generated (4 vs 6 lines) although surprisingly, not very much better.

When I wrote ACE in the 90s, all components were written either in C or 68000 assembly and I went straight from an implicit parse tree to assembly code generation. These days I tend to use ANTLR or similar tools for lexical analysis (converting character streams to tokens) and parsing (checking against language grammar). I have yet to use The LLVM Compiler Infrastructure for language development, but that’s on my list too.

Creating an intermediate representation (such as abstract syntax trees) before code generation, provides additional opportunities for optimisation, something I’ve been exploring in recent times. I’ll write more about that in another post.

To be honest, the more I think and write about this topic again, the more I want to.

Thanks again Sean.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s


%d bloggers like this: