← Efficient rasterization
  Flood Fill
Sat 4th November 2023   
Today we will see how to convert a complicated algorithm from BASIC to C, then to 6502 assembler, and the optimization process involved.

It all started when Goyo asked in September 2021 if anyone had a fast floodfill routine to share.

What is Flood Fill

Flood Fill leakage
Flood Fill leakage
Flood Fill (or seed fill) is the technical name of the algorithm used to implement the "paint bucket" in your favorite painting program, where you click inside a closed area and all the pixels inside are changed to match the painting color.

If the shape is not properly closed, the filling will escape and propagate outside until possibly the entire screen get covered.

Modern machines have enough memory and color depth to allow for fancy features such as gradient filling, or applying patterns, but unfortunately on 8bit machines we are seriously limited on pretty much all criteria.

Technically, Flood Fill is not a very difficult thing to achieve, and you can consult Wikipedia for a pretty complete coverage of the various ways to implement that, complete with animated images showing the process.

So, let's go back to the Oric.

Known implementations

This is by no means an exhaustive coverage of all the programs implementing flood fills, but I personally can only remember two games ("The Hobbit" and "Le Retour du Dr Genius"), and two painting programs ("Lorigraph" and "Master Paint").

"Le Retour du Dr Genius" from Loriciels uses the distinctive "diamond shape" method, and the way it is used in the game is relatively efficient: You don't spend your time waiting for the scene to finish drawing.

If you are interested in the algorithm, Symoon took the time to disassemble the code, you can find it on the Defence Force forum.

Le Retour du Dr Genius - Loriciels 1984
Le Retour du Dr Genius - Loriciels 1984

This is unfortunately not the case for "The Hobbit" which has some pretty longgggggg sequences.

The Hobbit - Melbourne House 1983
The Hobbit - Melbourne House 1983

The first image with the hobbit house is reasonably fast, but at some point you reach a hidden path with trolls footsteps that take more than a minute to finish drawing.

The Hobbit - Hidden Path
The Hobbit - Hidden Path

We'll check in details how Lorigraph and Master Paint perform in a moment, but first we need to check another contender!

Geffers implementations

Then somebody mentioned there was such a routine in Geoff Phillips book Oric Atmos and Oric 1 Graphics and Machine code techniques.

Oric Atmos and Oric 1 Graphics and Machine code techniques
Oric Atmos and Oric 1 Graphics and Machine code techniques

The author happens to have written quite a lot of software for the Oric at Tansoft: Chess II, The House of Death, Trouble in Store, Oric Mon, Oric Munch, Zodiac, etc... definitely a serious person who knows his business.

His paint routine is a part of the "Faster High-Resolution graphics" section, and similarly to my previous article he uses tables to multiply by 40, divide by six and know the bit position of each pixel (Section 7.2).

So far so good.

We find the PAINT subroutines in section 7.8, starting by a bit of theory, and with an accompanying version of the code written in BASIC to make it easier to understand.

Then follows the assembler code, which I typed... and it did not work.

Some other people have typed it, and it did not work for them either, so at this point either the algorithm is not working, or there was some publishing issues, maybe the wrong version of the code was printed?

Just to be sure, I typed the BASIC version, and behold, it worked! Slowly, but it did work.

Here is what it looks like:

5 REM BASIC VERSION OF PAINT
10 DIM A(100):S=100:DOKE 630,0
15 INPUT X,Y
20 RF=0:S=S-1:A(S)=255:S=S-1:A(S)=255:GOTO 35
30 Y=A(S):S=S+1:X=A(S):S=S+1
35 IF X=255 THEN PRINT"BASIC Total time: ";65535-DEEK(630):GET K$:END
40 IF RF=0 THEN UF=TRUE:DF=TRUE
45 T=S:R=T
46 IF A(R)=255 THEN 50
47 IF A(R)=YANDA(R+1)=X THEN R=R-1:FOR K=R TO T STEP-1:A(K+2)=A(K):NEXT:S=S+2:GOTO 50
48 R=R+2:GOTO 46
50 CURSET X,Y,1
60 IF UF AND POINT(X,Y-1)=0 THEN S=S-1:A(S)=X:S=S-1:A(S)=Y-1
70 UF=POINT(X,Y-1)
80 IF DF AND POINT(X,Y+1)=0 THEN S=S-1:A(S)=X:S=S-1:A(S)=Y+1
90 DF=POINT(X,Y+1)
100 RF=0:IF POINT(X-1,Y)=0 THEN S=S-1:A(S)=X-1:S=S-1:A(S)=Y:RF=TRUE
120 IF POINT(X+1,Y)=0 THEN S=S-1:A(S)=X+1:S=S-1:A(S)=Y:RF=TRUE
130 GOTO 30

Instead of paraphrasing, here is what the author writes about the code:

This program runs very slowly because, for every point plotted, the routine must look in the four surrounding positions. Here is a summary of what is happening:
  1. The flag RF (right flag) is set to true whenever it is possible to move either left or right.
  2. The flags UF and DF (up and down flags) are set to the state of the pixels above and below the current dot position.

    Before doing this, the subroutine looks for an empty pixel above or below, and if the up or down flag is set as well, that position is put on the stack as a point to investigate.

    These flags are used in order to stop the stack from being saturated with unnecessary values.

    Since all dots along a line are investigated, it is not necessary to look at all the dots above and below, since any one of them will scan its own horizontal brothers.
  3. As each point is set, the stack is examined for any outstanding references to that point, and these are removed.


Thanks Geoff.

Having a working BASIC program is a good start, but this is never going to win any speed contest, so I converted it to C instead.

From BASIC to C

Here is my C conversion of the BASIC program:

void paint(int x,int y)    
//
// 5 REM BASIC VERSION OF PAINT
// 10 DIM A(100):S=100
{
int a[100];
int s=100;

// 20 RF=0:S=S-1:A(S)=255:S=S-1:A(S)=255:GOTO 35
int uf=0;
int df=0;
int rf=0;
int t=0;
int r=0;
int k=0;

a[--s]=255;
a[--s]=255;
goto line_35;

while (1)
{
y=a[s++]; // 30 Y=A(S):S=S+1:X=A(S):S=S+1
x=a[s++];

line_35:
if (x==255) // 35 IF X=255 THEN END
{
return;
}

if (rf==0) // 40 IF RF=0 THEN UF=TRUE:DF=TRUE
{
uf=1;
df=1;
}

t=s; // 45 T=S:R=T
r=t;

line_46:
if (a[r]==255) // 46 IF A(R)=255 THEN 50
{
goto line_50;
}

if ( (a[r]==y) && (a[r+1]==x) ) // 47 IF A(R)=Y AND A(R+1)=X THEN
{
// R=R-1
// FOR K=R TO T STEP-1
// A(K+2)=A(K)
// NEXT
// S=S+2
// GOTO 50
r--;
for (k=r;k>t;k--)
{
a[k+2]=a[k];
}
s+=2;
goto line_50;
}

r+=2; // 48 R=R+2:GOTO 46
goto line_46;

line_50:
curset(x,y,1); // 50 CURSET X,Y,1

if (uf && (point(x,y-1)==0)) // 60 IF UF AND POINT(X,Y-1)=0 THEN
{
a[--s]=x; // S=S-1:A(S)=X:S=S-1:A(S)=Y-1
a[--s]=y-1;
}
uf=point(x,y-1); // 70 UF=POINT(X,Y-1)

if (df && (point(x,y+1)==0)) // 80 IF DF AND POINT(X,Y+1)=0 THEN
{
a[--s]=x; // S=S-1:A(S)=X:S=S-1:A(S)=Y+1
a[--s]=y+1;
}
df=point(x,y+1); // 90 DF=POINT(X,Y+1)

rf=0; // 100 RF=O:IF POINT(X-1,Y)=0 THEN
if (point(x-1,y)==0)
{
a[--s]=x-1; // S=S-1:A(S)=X-1:S=S-1:A(S)=Y:RF=TRUE
a[--s]=y;
rf=1;
}

if (point(x+1,y)==0) // 120 IF POINT(X+1,Y)=0 THEN
{
a[--s]=x+1; // S=S-1:A(S)=X+1:S=S-1:A(S)=Y:RF=TRUE
a[--s]=y;
rf=1;
}
}
}


This is a literal conversion, complete with gotos instead of using the control flow features of the C.

A better C version

Since my idea was to eventually convert the routine to assembler, I did a few changes to help with the conversion.

There are a few changes compared to the BASIC version:
  • I renamed all the single letter variables to something more understandable
  • Instead of the single A array where both X and Y positions were pushed, I decided to use a dedicated array for each of the coordinates (xa and ya)
  • I replaced the gotos by some while and if
  • Instead of testing if X or Y is equal to 255 to detect the end, I check if we've reached the stack position 99
Other than this, the code does the same thing, in the same order, just considerably faster than the BASIC version.


unsigned char xa[100];
unsigned char ya[100];

void paint(int x,int y)
{
int stack_pointer=100;
int scan_above=0;
int scan_below=0;
int scan_horizontal=0;
int stack_top=0;
int r=0;
int p;

--stack_pointer;
xa[stack_pointer]=255;
ya[stack_pointer]=255;

--stack_pointer;
xa[stack_pointer]=x;
ya[stack_pointer]=y;

while (1)
{
if (stack_pointer==99)
{
return;
}

y=ya[stack_pointer];
x=xa[stack_pointer];
stack_pointer++;

stack_top=stack_pointer;
r=stack_pointer;

while (r!=99)
{
if ( (ya[r]==y) && (xa[r]==x) )
{
do
{
r--;
xa[r+1]=xa[r];
ya[r+1]=ya[r];
}
while (r>=stack_top);
stack_pointer+=1;
break;
}
r+=1;
};

if (scan_horizontal==0)
{
scan_above=1;
scan_below=1;
}
scan_horizontal=0;

curset(x,y,1);

p=point(x,y-1);
if (scan_above && (p==0))
{
--stack_pointer;
xa[stack_pointer]=x;
ya[stack_pointer]=y-1;
}
scan_above=p;

p=point(x,y+1);
if (scan_below && (p==0))
{
--stack_pointer;
xa[stack_pointer]=x;
ya[stack_pointer]=y+1;
}
scan_below=p;

if (point(x-1,y)==0)
{
--stack_pointer;
xa[stack_pointer]=x-1;
ya[stack_pointer]=y;
scan_horizontal=1;
}

if (point(x+1,y)==0)
{
--stack_pointer;
xa[stack_pointer]=x+1;
ya[stack_pointer]=y;
scan_horizontal=1;
}
}
}

At this point I had a routine running at an acceptable speed, and I gave it a few test pictures to try to see if everything worked fine.

A few test pictures
A few test pictures

Having a small set of test images allows to test for regressions in performance, catch bugs, and see how the code performs generally speaking.

Optimizing The Hobbit

I was quite intrigued by why The Hobbit was so slow: After all the game is written 100% in assembly code, and really it should not have been that slow.

By using the Oricutron debugger, it was quite easy to find the multiple causes of the problem.

The first one, is that the code perform loops to multiply by 40 when it need to computer the address of the scanline for a specific Y coordinate.

The Hobbit - Multiply by 40
The Hobbit - Multiply by 40

And similarly, the code perform loops to find the byte position for a specific X coordinate (which on the Oric is a multiple of 6).

The Hobbit - Multiply by 40
The Hobbit - Multiply by 40

After having discussed with a number of people, including the original authors of the game, the only explanation is that they were very very tight in term of memory use, and they did not have the luxury to use look up tables.

40 years later, we can afford to build versions of games which require Floppy Disk drives, allowing us to do all kinds of trickery, including the overlay memory!

So it's what I did, and I basically modified the existing disk version of the game by dynamically patching the code after loading by adding the usual two tables.

Then the original code calls this function instead when it needs to compute information about pixels:
ComputeScreenAddress
.(
; Mul 40
ldx $31
lda _HiresAddrLow,x
sta $38
lda _HiresAddrHigh,x
sta $39

; Div 6
ldx $30
lda _Div6,x
tay
lda _Mod6,x

tax
lda ($38),y
rts
.)

_Div6 .dsb 256 ; Values divided by 6
_Mod6 .dsb 256 ; Values modulo 6

_HiresAddrLow .dsb 128 ; Values multiplied by 40 + a000
_HiresAddrHigh .dsb 128 ; Values multiplied by 40 + a000

You can see the details in this video which shows a comparison of the speed difference of the two versions, disassembled code, etc...


The tldr; is that the original version takes about 1 minute to paint the "hidden path" image, and the patched version takes a few seconds.

I also liked to point out that the fact we got such a huge speedup means that the actual flood fill routine they use was quite efficient!

Performance comparison

Since the original objective was to come up with a proper generic flood fill routine which could be easily integrated in new projects, I continued my investigations.

So, how did actually Lorigraph and Masterpaint compare in practice?

There was a small hurdle to handle: Neither Masterpaint or Lorigraph use standard file format, and unfortunately to do proper comparisons we need to test them on the same images!

Fortunately now with emulators we can load and save binary files at arbitrary locations... including in the screen area, which is what I did.

I made some quick and dirty image, converted it with PictConv, loaded it in the two programs, and I recorded the operation.

Speed comparison of Lorigraph and Master Paint
Speed comparison of Lorigraph and Master Paint

This picture shows the lenght of the various clips in my video editing software:
  • Master Paint was the fastest
  • Lorigraph was twice slower
  • My naive conversion of Geffer's routine was twice slower than Lorigraph!

    Just to add insult to the injury, I'd like to point out that the Master Paint routine supports pattern fills!

    Master Paint pattern filling
    Master Paint pattern filling

    In order to perform this feat, the program makes a copy of the image before the filling, so it knows which pixels were modified, which also means it requires an additional 8000 bytes.

    The actual flood fill routine is doing quite a lot of smart things, including copying multiple batches of bytes from the scanline directly into the zero page to improve the performance of bit and pixel operations, they get copied back into the video memory when the work is done.

    So the question is: Can we do better?

    Lode's routine

    The "Lode's Computer Graphics Tutorial" page has quite a few interesting variants of routines, including the Scanline Floodfill Algorithm With Stack which looks like that:


    //The scanline floodfill algorithm using stack instead of recursion, more robust
    void floodFillScanlineStack(int x, int y, int newColor, int oldColor)
    {
    if(oldColor == newColor) return;

    int x1;
    bool spanAbove, spanBelow;

    std::vector<int> stack;
    push(stack, x, y);
    while(pop(stack, x, y))
    {
    x1 = x;
    while(x1 >= 0 && screenBuffer[y * w + x1] == oldColor) x1--;
    x1++;
    spanAbove = spanBelow = 0;
    while(x1 < w && screenBuffer[y * w + x1] == oldColor)
    {
    screenBuffer[y * w + x1] = newColor;
    if(!spanAbove && y > 0 && screenBuffer[(y - 1) * w + x1] == oldColor)
    {
    push(stack, x1, y - 1);
    spanAbove = 1;
    }
    else if(spanAbove && y > 0 && screenBuffer[(y - 1) * w + x1] != oldColor)
    {
    spanAbove = 0;
    }
    if(!spanBelow && y < h - 1 && screenBuffer[(y + 1) * w + x1] == oldColor)
    {
    push(stack, x1, y + 1);
    spanBelow = 1;
    }
    else if(spanBelow && y < h - 1 && screenBuffer[(y + 1) * w + x1] != oldColor)
    {
    spanBelow = 0;
    }
    x1++;
    }
    }
    }


    It's basically the equivalent of the traditional recursive routine, but it uses an actual dedicated buffer to act as a stack instead of using the CPU stack, which is good for a machine like the Oric which only has a 256 bytes CPU stack.

    I converted the routine to assembler and added it to the speed test, and here is what the result looks like:

    Master Paint vs Lode (194 bytes)
    Master Paint vs Lode (194 bytes)

    As you can see, the Lode routine is a tiny bit faster than Master Paint on my first test image, and a tiny bit slower on The Hobbit image, so it definitely is a viable replacement and we can safely ignore Lorigraph and Geffer's routines!

    Size wise, this Lode routine is about 194 bytes long (plus the lookup tables), but I kept poking at it to see if I could make it faster.

    I added some checks to see how much stack memory the routine used, and tried it on a few images:

  • Osdk logo: 8 bytes
  • Test picture: 44 bytes
  • The Hobbit Troll path: 24 bytes

    The memory usage is very dependent of pictures, and you can easily create images which will make specific flood fill routines to blow up if you now how the behave, but the bottom line here is that we could technically use the 6502 stack to push the values, so it's what I checked next!

    Replacing the manual buffer by pha/pla made it possible to reduce the routine size from 194 to 156 bytes. The speed did not increase in any measurable way, but the code became more compact.

    Here is the final result.

    Master Paint vs Lode (141 bytes)
    Master Paint vs Lode (141 bytes)

    My few last optimizations were enough to make the routine consistently faster than Master Paint - not by much, but still -.

    Here is a working version of the project if you want to play with the code.

    This was mostly some test/development code, so the naming is far from optimal (like the x and y variables...), but basically the idea is that you first need to call GenerateTables once to initialize all the necessary tables, then you can just call FillLoop after having set proper values for the x and y coordinates.

    I have some more advanced versions, but they were not tested and may contain bugs, so better safe than sorry!

    Have fun :)
  • comments powered by Disqus

    Coverity Scan Build Status