← Overlay MemoryDungeon Master intro →
  Optimizing Buggy Boy (2)
Wed 13th May 2020   
Optimizing Buggy Boy
 Part 1 Part 2 

Now that the final scroller is finally running at an acceptable speed, the second most pressing issue is the Mandelbrot fractal part.

Optimizing a scroller is just a matter of finding a fast way to copy things around, but optimizing a fractal is much more complicated due to the limitations of the 6502 processor regarding doing mathematics.

Theoric Magazine

The idea of having a fractal came from an old Oric Magazine called Theoric.

One of the issues contained a small BASIC program which could render arbitrary parts of the Mandelbrot set, in full color.

It was obviously terribly slow, but you could just change the parameters and let it run during the night, and on the next morning you would have your beautiful fractal on the screen ready to be saved to tape :)

4 video modes

Just for the sake of completeness here is the complete original1 source code:

0 PAPER 0:INK 4
10 HIRES:DOKE #306,#FFFF
11 PAPER 0:INK 3

15 PR=19:D1=64:D2=32:D3=512:D4=15

20 FOR I=1 TO 199 STEP 2
30 CURSET 0,I,3:FILL 1,1,16+4
40 CURSET 6,I-1,3:FILL 1,1,1
50 NEXT

60 FOR X=11 TO 237 STEP 2
70 FOR Y=0 TO 199 STEP 2
79 N=PR:X1=X-100:Y1=Y-100:XN=X1:YN=Y1
80 REPEAT
81 D=XN+YN:B=XN-YN:C=XN*YN
82 XN=D*B/D1-X1:YN=C/D2-Y1:N=N-1
83 UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3)
90 IF N AND 1 THEN CURSET X,Y,1
91 IF N AND 2 THEN CURSET X+1,Y,1
92 IF N AND 4 THEN CURSET X,Y+1,1
93 IF N AND 8 THEN CURSET X+1,Y+1,1
120 NEXT
125 NEXT
130 DOKE #306,10000:END

The program is not very complicated:
  • Line 15 defines the Mandelbrot set to compute
  • Lines 20-50 sets some attributes on the screen to achieve the color effects
  • Line 60 is the outer loop iterating horizontally on the screen
  • Line 70 is the second loop iterating vertically on the screen
  • Lines 79-83 compute the actual color of the point at this location
  • Lines 90-93 draw some points to achieve the dithering effect based on the point color
One doe not need a PhD to figure out that the core of the performance is spent in the REPEAT-UNTIL section2!

An important element to notice is the STEP 2 on each of the FOR loops: We are actually computing only a quarter of the total number of points, because each of the point is then rendered on screen using a 2 by 2 pixel matrix to achieve the nice colored dithering effect.

Something else to notice is that in this particular case the image is symmetrical along the horizontal axis, so a simple symmetry could have achieved a near doubling in performance... but that's definitely cheating:

Let's keep that as an option for later if we really need to.

If you are interested by the topic, you can read the very long wikipedia page about the Mandelbrot set (in this particular case we are particularly interested in the section called
Computer drawings), and there are also many different implementations in various languages on the Rosetta Code site.

Optimization strategies

Until we profile the code we will not actually know where most of the time is spent, but we can already notice a few things:
  • The BASIC variable format uses floating point representation, do we need that?
  • What is the range of the variables, do they fit on one byte, do they get negative?
  • There are two multiplications and two divisions in the inner loop...
  • The loop test performs absolute value computations
  • Drawing 2x2 sized pixels requires additional masking when blitting individually
If you have seen the demo, you would also probably have noticed that the fractal was not rendered linearly row by row or column by column, instead if appeared as some kind of recursively drawn squares:

This is a common optimization seen in fractal rendering code, it's based on the observation that if all the points on the border of a section of the Mandelbrot3 set have the same color (depth) then all the points inside ALSO have the same depth, thus do not require additional calculation.

I did the same thing as I did with the scroller, and modified the demo so it jumped directly to the fractal part, without playing the music.


Something worth noting is that without the music playing, the fractal finishes just a fraction of time after displaying the WRATHDESIGNS group name, skipping the WWW.C64.ORG and WWW.ORIC.ORG names.

Indeed, comparing the computation time with and without the music gives us a running time of 1:30 without and 1:40 with the music.

Not a huge difference, but worth keeping in mind when doing adjustments at the end if we want the music to be in sync with the effects for example.

So let see how the demo code is written so we can see where to optimize.

Original implementation

The mandelbrot computation is called from main.c:

//
// Mandelbrot
//
Mandelbrot:
SwitchToHires();
ClearHiresScreen();

Mandel_InitDisplay();
Mandel_DrawFractal();

The file mandel.s contains the actual assembler code:

_Mandel_DrawFractal
lda #0
sta _x
lda #0
sta _y
lda #128
sta _w
jsr _Mandel_HLine

lda #0
sta _x
lda #64
sta _y
lda #128
sta _w
jsr _Mandel_HLine

lda #0
sta _x
lda #0
sta _y
lda #64
sta _h
jsr _Mandel_VLine

lda #128
sta _x
lda #0
sta _y
lda #64
sta _h
jsr _Mandel_VLine

jsr _Mandle_BlitBigBuffer

lda #0
sta _VblCounter

lda #0
sta _x
lda #0
sta _y
lda #128
sta _w
lda #64
sta _h
jsr _Mandel_Divide

jsr _Mandle_BlitBigBuffer
rts

What this routine does is to first compute the top, bottom, left and right side values, and then call a recursive routine.

While writing this article I noticed that some of the values seem to be computed more than once: When drawing a rectangle, the corner points are shared, and in this case it looks like the point (0,0) at least is computed more than once. If this happens on each recursive level, that may end up being a significant amount of time.

Another obvious thing is that despite being computed, we do not see the left vertical column being displayed!

So maybe we are actually computing more pixels than we are actually able to draw on screen...

Something else to check is if intermediate variables are using zero page locations or normal memory:

.zero
_x .dsb 1
_y .dsb 1
_w .dsb 1
_h .dsb 1

_x1 .dsb 2
_y1 .dsb 2
_xn .dsb 2
_yn .dsb 2

Everything seems alright, we are indeed using zero page for all these variables.

Horizontal drawing

So, let see how these subroutines are implemented:

_Mandel_HLine
jsr _Mandel_ComputeParameters

lda _w
pha

loop_x
clc
lda _x1+0
sta _xn+0
adc #2
sta _x1+0
lda _x1+1
sta _xn+1
adc #0
sta _x1+1

lda _y1+0
sta _yn+0
lda _y1+1
sta _yn+1

jsr _compute_outer

lda _n
ldy _x
sta (big_buff),y ; Store pixel in buffer
inc _x

dec _w
bne loop_x

pla
sta _w

rts

The lda _w/pha/pla/sta _w seem peculiar, but then we have to remember that the code is doing recursive calls, so it makes sense to store the original Width and Height values, else Bad Things Would Happen.

The rest of the code seems reasonable, first calling _Mandel_ComputeParameters to initialize some parameters, then iterate for each of the values along the requested _w parameters, for each calling _compute_outer and then storing the result in some intermediate buffer.

Let see how these two sub-routines looks like... dig down into the rabbit hole.

Span initialization


#define MANDEL_X_POS 32
#define MANDEL_Y_POS 92

_Mandel_ComputeParameters
ldy _y
lda _BufferAddrLow,y
sta big_buff+0
lda _BufferAddrHigh,y
sta big_buff+1

clc ; Compute X1=-64+(x*2)
lda _x
rol
sta _x1+0
lda #0
rol
sta _x1+1

sec
lda _x1+0
sbc #<MANDEL_X_POS
sta _x1+0
lda _x1+1
sbc #>MANDEL_X_POS
sta _x1+1

clc ; Compute y1=-64+(y*2)
lda _y
rol
sta _y1+0
lda #0
rol
sta _y1+1

sec
lda _y1+0
sbc #<MANDEL_Y_POS
sta _y1+0
lda _y1+1
sbc #>MANDEL_Y_POS
sta _y1+1

rts

What this routine does is not very complicated, it takes the vertical position (stored passed in the _y variable) and use it to compute which scanline in the buffer will be used to store the result of the computations, as well as computing the X1=X-100:Y1=Y-100 part of the BASIC program, except here 100 was replaced by 64 because we prefer power of two values when doing assembler code.

This routine can most probably be improved considering all the initial values are 8 bit, but not being part of the inner loop this would seem to be a premature optimization to me.

Iteration loop


#define MANDEL_ITER_BIS 19
#define MAX_MANDEL_ITER 512

_outer_exit
rts

_compute_outer
// n=pr
lda #MANDEL_ITER_BIS
sta _n

compute_outer_loop
jsr _compute_inner

dec _n
beq _outer_exit


lda _xn+1 ; Compute _axn=ABS(xn)
bmi xn_neg
sta _axn+1
lda _xn
sta _axn
jmp xn_abs
xn_neg
sec
lda #0
sbc _xn
sta _axn
lda #0
sbc _xn+1
sta _axn+1
xn_abs

lda _yn+1 ; Compute _ayn=ABS(yn)
bmi yn_neg
sta _ayn+1
lda _yn
sta _ayn
jmp yn_abs
yn_neg
sec
lda #0
sbc _yn
sta _ayn
lda #0
sbc _yn+1
sta _ayn+1
yn_abs

// sum axn+ayn => axn
// if axn+ayn>d3 exit loop
clc
lda _axn
adc _ayn
sta _axn
lda _axn+1
adc _ayn+1
cmp #>MAX_MANDEL_ITER
bcc compute_outer_loop

lda _axn
cmp #<MAX_MANDEL_ITER
bcc compute_outer_loop

end_outer
rts

This code is pretty much the straight conversion from the BASIC original code UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3).

First it calls _compute_inner then it checks if we have reached MANDEL_ITER_BIS (the end of iteration), or if the computed _xn and _yn values gets larger than MAX_MANDEL_ITER.

The reason why the code is so long compared to what it actually does, is that the 6502 only have 8 bit registers, and does not have any way to compute an absolute value, which means we manually need to check the sign of the most significant byte, eventually perform a subtraction from 0 to get the positive value.

Let's continue digging...

Depth computation


_compute_inner
clc ; Compute d=xn+yn
lda _xn
adc _yn
sta _d
lda _xn+1
adc _yn+1
sta _d+1

sec ; Compute b=xn-yn
lda _xn
sbc _yn
sta _b
lda _xn+1
sbc _yn+1
sta _b+1

lda _xn ; Compute c=xn*yn
sta op1
lda _xn+1
sta op1+1
lda _yn
sta op2
lda _yn+1
sta op2+1
jsr mul16i
stx _c
sta _c+1

ldx _c+1 ; Compute yn=c/d2-y1
ldy _c

lda _TableDiv32_low,x
and #%11111000
sta _tempfastdiv

lda _TableDiv32_low,y
and #%00000111
ora _tempfastdiv
sec
sta _c
sbc _y1
sta _yn
lda _TableDiv32_high,x
sbc _y1+1
sta _yn+1

lda _d ; Compute xn=d*b/d1-x1
sta op1
lda _d+1
sta op1+1
lda _b
sta op2
lda _b+1
sta op2+1
jsr mul16i
stx op1
sta op1+1

ldx op1+1
ldy op1

lda _TableDiv64_low,x
and #%11111100
sta _tempfastdiv

lda _TableDiv64_low,y
and #%00000011
ora _tempfastdiv
sec
sta op1
sbc _x1
sta _xn
lda _TableDiv64_high,x
sbc _x1+1
sta _xn+1

rts

Now we are talking.

This routine is both long, and full of slow code (Like calling mul16i twice).

This routine is called for every single point, up to 20 times, so any improvement done there will have a huge impact on the total running time.

Too Smart???

So we know that this particular implementation is using a recursive divide and conquer in order to speed up calculations by not having to recompute pixels belonging to the same depth level, but do we actually know if this improved the computation time?

Sure, from a mathematical point of view we can expect better results, but in the real world it means we may have to compute any particular value of any particular point on the Mandelbrot set, which means we can't really use any of the usual optimizations:

If you iterate horizontally or vertically, you can just move a pointer forward instead of having to recompute the position in the buffer.

So what about trying first to hack around a function that just try to fill the screen from top to bottom and see if there is any speed difference?

_Mandel_DrawFractalNaive
lda #0
sta _VblCounter

lda #0
sta _y
loop
lda #0
sta _x
lda #128
sta _w
jsr _Mandel_HLine ; h_line(0,0,largeur,XO,YO,PAS,buf,screen,pitch);

jsr _Mandle_BlitBigBuffer

inc _y
ldy _y
cpy #64
bne loop

rts

As you can see this is ultra basic code: I did not change anything else, all I did was to call the horizontal drawing loop code 64 times.

As expected the performance is not awesome (1 minute and 47 seconds), specially toward the bottom part (which was also slow on the original demo), but it could have been much worse, specially considering that we call the fullscreen buffer blitting after each scanline!


Keep It Simple Stupid

There are some clear benefits in using a simpler code: It is generally much easier to optimize and debug.

So I continued on the KISS4 road, and I modified the blit routine to only display the scanline we just computed (instead of blitting the entire buffer), and then adjusted the computed width from 128 to 114 pixels (the actually visible part).

This made the whole computation time sink to 1 minute and 20 seconds.

Basically what that mean is that all the convoluted recursive cha-bang is not actually faster on this particular set, or more exactly, the overhead of doing complicated recursive code is larger than the performance boost given by not doing some of the computations.

Another possibility is that there's a bug in the recursive call, but this code is quite complex, so I think changing the code to just draw horizontal lines is probably the way to go!

I guess I'll have to dig a bit more, and who knows, maybe there will be a part III to this article.

Optimizing Buggy Boy
 Part 1 Part 2 



1. I added spaces and tabs for readability.
2. The inner-most loop is where most of the time is spent, so always check there first because any optimisation done there is multiplicated by the number of iteration made by all the outer calls, which can be very many
3. On the borders of the set, this is not valid for the bounding box that contains the set itself.
4. See the title, it's a common idiom in programming, opposing complexity to simplicity
comments powered by Disqus

Coverity Scan Build Status