← Performance Profiling
  Optimizing Buggy Boy (1)
Thu 15th November 2018   
On the weekend of 2-3 November 2018 the Demosplash party was held at Carnegie Mellon University in Pittsburgh, PA, USA.

There are not many demo parties in the USA, and even less of these showing Oric demos running on the actual Oric hardware specially with a Cumulus device!

For some reason I can't fathom, it appears that Buggy Boy is the host's favorite Oric demo, so it's the one they decided to show.

You can watch the 9 hour long video on the Scene Sat archives (the Oric part starts at 1:30).

The issues

I did enjoy seeing my demo on the big screen, but it kind of hurt to hear the (well deserved) "World Slowest Fractals" (or was it second slowest?) comment, as well as seeing the final slowwwwwwww scroll.

The thing is, there's no real reason why these parts are slow, other than the fact that I finished the demo on the party place with a huge deficit of sleep.

For sure, the most annoying elements in the demo as it stands now are the following:
  • Each time the music loops, whatever is shown of the screen stutters or pauses for a fraction of a second
  • The fractal rendering is way too slow, even taking the opportunity to display the greetings only makes it barely acceptable
  • The final credits scroll is atrociously slow
If I can improve on these problems, I can totally see myself release a version 1.1 of the demo with all these annoying things fixed.

Since the previous article was about profiling, let's dig in the code and see if we can improve things.

If you want to know more about the demo, please refer to the Defence-Force Buggy Boy demo page, it has screenshots, videos, and links to the complete source code1.

The final scroll

This article is not going to tackle all of these issues, we are only going to concentrate on one element at a time, and see how we can approach it in a professional and rigorous way!

One of the first things to do is to isolate elements to make the iteration loop as fast as possible, and also to eliminate anything that can impact the performance (a good example would be the music player, or heavy duty interrupt driven code running in the background).

An easy way to do that, is to modify "main.c" to directly jump to the right location after the initialization are done:

void main()
{
System_Initialize();
TablesInit();

goto EndScroller;

(... lots of stuff ...)

EndScroller:
SwitchToHires();
ClearHiresScreen();
Fx_DrawTitle();
(...)
}

At this point, when launching the demo we arrive directly on the end scroller:


As you can see, it is slow, taking about 2 minutes from start to finish.

So, how does this code work?

How It's Made

I'm not particularly proud of the code, but it's about all I came up with at the time.

If you look at the screen, you'll see that the scroll is not centered:

There is more "black" at the bottom than usual!

The reason is that I reused the text routine used during the fractal to display each line of text, it's hidden behind black ink attributes so it is invisible until the scrolling routine makes it appear.

Obviously the routine has to make sure it does not scroll up the black ink attributes else it will scroll with the text itself.

_Fx_DrawTitle
jsr _VScroll_GenerateScrollMapping
jsr _FxTitle_InitBorderColors

lda #<_FxTitle_NameList
sta greetings_pointer+0
lda #>_FxTitle_NameList
sta greetings_pointer+1

lda #30
sta tmp3
loop
jsr _Mandel_DisplayGreetings
jsr _FxTitle_Scroll
dec tmp3
bne loop

rts

The text of the scroller is just defined as a simple array:

#define FX_TEXT_OFFSET(size) 19-((size)*3-1)/2

_FxTitle_NameList
.byt FX_TEXT_OFFSET(8),0,"YOU HAVE",0
.byt FX_TEXT_OFFSET(4),0,"BEEN",0
.byt FX_TEXT_OFFSET(8),0,"WATCHING",0
.byt 2,0," ",0
.byt FX_TEXT_OFFSET(11),28,".BUGGY BOY.",0
.byt 2,0," ",0
.byt FX_TEXT_OFFSET(7),0,"A 36415",0
.byt FX_TEXT_OFFSET(10),0,"BYTES LONG",0
.byt FX_TEXT_OFFSET(10),0,"ORIC INTRO",0
.byt FX_TEXT_OFFSET(2) ,0,"BY",0
.byt FX_TEXT_OFFSET(12),28,"DEFENCEFORCE",0
.byt FX_TEXT_OFFSET(11),28,"AND FRIENDS",0
.byt FX_TEXT_OFFSET(7),0,"FOR THE",0
.byt FX_TEXT_OFFSET(8),56,"LCP 2004",0
.byt FX_TEXT_OFFSET(2),0,"IN",0
.byt FX_TEXT_OFFSET(9),0,"LINKOPING",0
.byt FX_TEXT_OFFSET(6),0,"SWEDEN",0
.byt 2,0," ",0
.byt FX_TEXT_OFFSET(6),28,"THANKS",0
.byt 2,0," ",0
.byt 2,0," ",0
.byt 2,0," ",0
.byt FX_TEXT_OFFSET(9) ,56,"LAST NOTE",0
.byt FX_TEXT_OFFSET(10),84,"WORSHIPING",0
.byt FX_TEXT_OFFSET(11),84,"RABBITS MAY",0
.byt FX_TEXT_OFFSET(9) ,84,"ALSO CAUSE",0
.byt FX_TEXT_OFFSET(8) ,84,"TROUBLES",0
.byt 2,0," ",0
.byt 0

Just from observing the scroll we can see that there are no particular additional slowdowns when new texts are generated, so we can probably exclude the _Mandel_DisplayGreetings routine from the list of suspects:

It may happen to be taking some time, but in regard to how slow the scrolling is, this one is not our priority at the moment.

Which brings us to _FxTitle_Scroll:

_FxTitle_Scroll
lda #28
sta tmp2
big_loop
lda #<$a000
sta tmp0+0
clc
adc #80
sta tmp1+0
lda #>$a000
sta tmp0+1
adc #0
sta tmp1+1

ldx #198
loop_y
ldy #0
lda (tmp1),y
sta (tmp0),y
iny
// Skip the "black ink" area
iny
loop_x
lda (tmp1),y
sta (tmp0),y
iny
cpy #40
bne loop_x

clc
lda tmp1+0
sta tmp0+0
adc #40
sta tmp1+0
lda tmp1+1
sta tmp0+1
adc #0
sta tmp1+1

dex
bne loop_y

dec tmp2
bne big_loop

rts

Just by the length of the routine, we know that it's not going to be a fast one: It's basically two simple loops using indirect addressing to copy one byte at a time from one pointed location to another one.

Generally the key to writing fast code on the 6502, at least when moving memory, is to unroll the code to avoid paying the code of comparing and looping.

Note that the code does not actually scroll the entire screen: It skips over the second column of the screen in order to preserve the black ink attributes used to hide the new text being introduced.

Cycle Counting

So the big questions is: How many cycles, how long does it take to run this block of code?

If you go to the OSDK documentation page regarding the instruction set you can learn things such as:
  • lda #28 takes 2 cycles and 2 bytes (immediate accumulator load)
  • sta tmp2 takes 3 cycles and 2 bytes (zero page accumulator store)
  • clc takes 2 cycles and 1 byte (clear carry flag)
  • lda (tmp1),y takes 5 cycles and 2 bytes (indirect indexed load)
  • sta (tmp0),y takes 6 cycles and 2 bytes (indirect indexed store)
  • iny takes 2 cycles and 1 byte (increment Y)
  • cpy #40 takes 2 cycles and 2 bytes (immediate value compare)
  • bne loop_x takes 2, 3 or 4 cycles2 and 2 bytes (branch if not equal)
So basically you have all the information here, just a matter of summing all that:
  • the #28 is just the external loop that counts the number of frames before introducting a new text
  • we scroll 198 lines by 39 columns
  • each byte takes 5+6+2+2+3
  • that gives us roughly ((5+6+2+2+3)*39+2+++++++....)*198...
  • whatever.... blargghghghghhh
soooo.... why not use Oricutron to just tell us the exact answer without melting a brain fuse?

The easiest way to do that is to just add a forever loop in the code just before the call to _FxTitle_Scroll:

#define BREAK(color) lda #16+color:sta $BFDF:jmp *
BREAK(1)
jsr _FxTitle_Scroll

Then recompile the code, run the program, and when you see the red square appear on the screen, press F2 to enter the debugger:

4 video modes

As you can see, the program is stuck on the JMP $4110 instruction.

Press F12 to skip the instruction, you should now be on the JSR _FxTitle_S... line.

At this point it may be a good idea to press F9 to reset the cycle counter (CY), and now all you need is to press F11 to have Oricutron run the entire function for you and automatically stop when it's back to the next instruction you have on screen, and you should see the following:

4 video modes

The cycle counter now displays 4072004 cycles, which is a huge amount, but as you can see on the screen, we also have the "YOU HAVE" text displayed, which means we did scroll the screen 28 times, so we need to divide to know the time spent on one frame.
  • 4072004/28 = 145428 per screen scrolling
  • 145428/20000 = 7.2 frames to scroll up the screen once
  • 145428/(198*39) = 18 cycles per byte moved
With 7.2 frames, we are far from anything remotely acceptable.

And the 18 cycles per byte shows that there is a huge overhead, since the LDA+STA pair "only" took 11 cycles.

We can probably do better than that :)

Optimizing

All this code do is basically to scroll the entire screen area from $a000+40 to $a000, it's just a big memcpy... that avoids one byte every 40 bytes (the attribute).

How can we optimize this code efficiently?

As I said earlier, the easiest way to do fast code is to unroll the code, and in this particular case, the optimization is relatively obvious:

We just need to have code that can efficiently scroll up ONE column, and then just call it 39 times.

The code would look like this:

; Copy second line to first line
lda $a000+40*1
sta $a000+40*0
; Copy third line to second line
lda $a000+40*2
sta $a000+40*1
(...)
; Copy 199th line to 198th line
lda $a000+40*199
sta $a000+40*198

In this code, we use absolute addressing mode instead of indirect, which means that each LDA instruction takes 4 cycles (and three bytes) instead of the 5 cycles (and two bytes) used by the indirect indexed version, and the STA similarly takes 4 cycles and 3 bytes instead of 6 cycles and 2 bytes.

Since there are no increments and no loop, each of these copied byte takes a grand total of 9 cycles, which happens to be exactly half the amount of cycles per byte of the existing routine.

The drawback is obviously the room taken: Instead of the relatively compact original code, we now use 198*(3+3)=1188 bytes for the routine, plus one for the final RTS.

And obviously we need to be able to scroll more than one column, so we need 39 more like these, one for each column, which pushes the memory usage to 46332 bytes, which is a problem since we do not have this amount of available memory!

Well, let see if we can use X or Y indexed mode, what if we use that instead:

; Copy second line to first line
lda $a000+40*1,x
sta $a000+40*0,x
; Copy third line to second line
lda $a000+40*2,x
sta $a000+40*1,x
(...)
; Copy 199th line to 198th line
lda $a000+40*199,x
sta $a000+40*198,x

Now we can just set the value of X to anything between 0 and 39 and the routine will be able to scroll just this particular column. The drawback is obviously that now we have some increments and compare to do, but it is not as bad as in the original code because this is done on the OUTER LOOP instead of per byte, so we only need 39 of these instead of 39*198.

Clock wise, this solution is slightly worse, because the STA absolute,x takes 1 additional clock cycle, fortunately the LDA absolute,x takes the same amount of time, so we are using 10 cycles per byte, which is still much better than the 18 of the original routine.

Assuming we added this column routine to the source code and call it "FastColumnScroll", here is now our new scroll routine:

_FxTitle_Scroll
lda #28
sta tmp2
big_loop

ldx #0
jsr FastColumnScroll

ldx #2
loop_x
jsr FastColumnScroll
inx
cpx #40
bne loop_x

dec tmp2
bne big_loop

rts

Here is the result:


As you can see, the scroller now runs in one minute instead of two, so about twice as fast as the original code.

Reducing Memory Usage

An obvious issue with this type of optimization is that it makes the code much larger than it could be, which is specially handicapping in a single load demo where all the code and data is present in memory at the same time: If you apply this method everywhere, there's not much room for many effects.

But there is a solution: Generated code!

The beauty of assembler is that code and data is really some purely arbitrary distinctions, you can totally run data as code or display code as data, as long as you know what you are doing.

You can also generate code on the fly, or modify it.

In this particular case, we only need the fast vertical scroller at the end of the show, so why not just generate it when required?

From our earlier calculation we know that this code occupies 1189 bytes, and as it happens, we have plenty of tables that are only used for specific effects and will not be reused at this point in the demo, so why not use the area to store our code.

In table.s we have the following:

//
// A serie of buffers that will contain the one pixel shifted version of scrolls
//
_RotateTableRight //.dsb 256*6
_Scroll_Circular_Buffer
_Scroll_Circular_Buffer_0 .dsb 256
_Scroll_Circular_Buffer_1 .dsb 256
_Scroll_Circular_Buffer_2 .dsb 256
_Scroll_Circular_Buffer_3 .dsb 256
_Scroll_Circular_Buffer_4 .dsb 256
_Scroll_Circular_Buffer_5 .dsb 256
_Scroll_Circular_Buffer_6 .dsb 256
_Scroll_Circular_Buffer_7 .dsb 256

These tables are used in the previous screens, and the whole area is largerly sufficient to store our generated code, so we can just add one more label and be done with it:

//
// A serie of buffers that will contain the one pixel shifted version of scrolls
//
_Final_Scroll_Buffer <-- our new label
_RotateTableRight //.dsb 256*6
_Scroll_Circular_Buffer
_Scroll_Circular_Buffer_0 .dsb 256
_Scroll_Circular_Buffer_1 .dsb 256
_Scroll_Circular_Buffer_2 .dsb 256
_Scroll_Circular_Buffer_3 .dsb 256
_Scroll_Circular_Buffer_4 .dsb 256
_Scroll_Circular_Buffer_5 .dsb 256
_Scroll_Circular_Buffer_6 .dsb 256
_Scroll_Circular_Buffer_7 .dsb 256

and of course we need the code generator:

_FxTitleScroll_GenerateCopyBuffer
lda #<$a000
sta tmp0+0
lda #>$a000
sta tmp0+1

lda #<_Final_Scroll_Buffer
sta tmp1+0
lda #>_Final_Scroll_Buffer
sta tmp1+1

ldx #198
loop
ldy #3
lda #$9D ; STA abs,x
sta (tmp1),y
iny
lda tmp0+0
sta (tmp1),y
iny
lda tmp0+1
sta (tmp1),y
iny

clc
lda tmp0+0
adc #40
sta tmp0+0
lda tmp0+1
adc #0
sta tmp0+1

ldy #0
lda #$BD ; LDA abs,x
sta (tmp1),y
iny
lda tmp0+0
sta (tmp1),y
iny
lda tmp0+1
sta (tmp1),y
iny


clc
lda tmp1+0
adc #6
sta tmp1+0
lda tmp1+1
adc #0
sta tmp1+1

dex
bne loop

ldy #0
lda #$60 ; RTS
sta (tmp1),y

rts

And that's it!

We just need to call this before we start the scroll, it will build the routine, and we can replace it by something else later if needed.




1. The demo should build as-is with the latest OSDK
2. 2 cycles for the base branch instruction, plus 1 if the branch is taken, plus 1 if doing so crosses a page boundary
comments powered by Disqus

Coverity Scan Build Status