What is Flood FillKnown implementationsGeffers implementationsFrom BASIC to CA better C versionOptimizing The HobbitPerformance comparisonLode's routineToday we will see how to convert a complicated algorithm from BASIC to C, then to 6502 assembler, and the optimization process involved.
It all started when Goyo asked in September 2021 if anyone had a fast floodfill routine to share.
What is Flood Fill
If the shape is not properly closed, the filling will escape and propagate outside until possibly the entire screen get covered.
Modern machines have enough memory and color depth to allow for fancy features such as gradient filling, or applying patterns, but unfortunately on 8bit machines we are seriously limited on pretty much all criteria.
Technically, Flood Fill is not a very difficult thing to achieve, and you can consult Wikipedia for a pretty complete coverage of the various ways to implement that, complete with animated images showing the process.
So, let's go back to the Oric.
Known implementationsThis is by no means an exhaustive coverage of all the programs implementing flood fills, but I personally can only remember two games ("The Hobbit" and "Le Retour du Dr Genius"), and two painting programs ("Lorigraph" and "Master Paint").
"Le Retour du Dr Genius" from Loriciels uses the distinctive "diamond shape" method, and the way it is used in the game is relatively efficient: You don't spend your time waiting for the scene to finish drawing.
If you are interested in the algorithm, Symoon took the time to disassemble the code, you can find it on the Defence Force forum.
This is unfortunately not the case for "The Hobbit" which has some pretty longgggggg sequences.
The first image with the hobbit house is reasonably fast, but at some point you reach a hidden path with trolls footsteps that take more than a minute to finish drawing.
We'll check in details how Lorigraph and Master Paint perform in a moment, but first we need to check another contender!
Geffers implementationsThen somebody mentioned there was such a routine in Geoff Phillips book Oric Atmos and Oric 1 Graphics and Machine code techniques.
The author happens to have written quite a lot of software for the Oric at Tansoft: Chess II, The House of Death, Trouble in Store, Oric Mon, Oric Munch, Zodiac, etc... definitely a serious person who knows his business.
His paint routine is a part of the "Faster High-Resolution graphics" section, and similarly to my previous article he uses tables to multiply by 40, divide by six and know the bit position of each pixel (Section 7.2).
So far so good.
We find the PAINT subroutines in section 7.8, starting by a bit of theory, and with an accompanying version of the code written in BASIC to make it easier to understand.
Then follows the assembler code, which I typed... and it did not work.
Some other people have typed it, and it did not work for them either, so at this point either the algorithm is not working, or there was some publishing issues, maybe the wrong version of the code was printed?
Just to be sure, I typed the BASIC version, and behold, it worked! Slowly, but it did work.
Here is what it looks like:
5 REM BASIC VERSION OF PAINT
10 DIM A(100):S=100:DOKE 630,0
15 INPUT X,Y
20 RF=0:S=S-1:A(S)=255:S=S-1:A(S)=255:GOTO 35
35 IF X=255 THEN PRINT"BASIC Total time: ";65535-DEEK(630):GET K$:END
40 IF RF=0 THEN UF=TRUE:DF=TRUE
46 IF A(R)=255 THEN 50
47 IF A(R)=YANDA(R+1)=X THEN R=R-1:FOR K=R TO T STEP-1:A(K+2)=A(K):NEXT:S=S+2:GOTO 50
48 R=R+2:GOTO 46
50 CURSET X,Y,1
60 IF UF AND POINT(X,Y-1)=0 THEN S=S-1:A(S)=X:S=S-1:A(S)=Y-1
80 IF DF AND POINT(X,Y+1)=0 THEN S=S-1:A(S)=X:S=S-1:A(S)=Y+1
100 RF=0:IF POINT(X-1,Y)=0 THEN S=S-1:A(S)=X-1:S=S-1:A(S)=Y:RF=TRUE
120 IF POINT(X+1,Y)=0 THEN S=S-1:A(S)=X+1:S=S-1:A(S)=Y:RF=TRUE
130 GOTO 30
Instead of paraphrasing, here is what the author writes about the code:
This program runs very slowly because, for every point plotted, the routine must look in the four surrounding positions. Here is a summary of what is happening:
- The flag RF (right flag) is set to true whenever it is possible to move either left or right.
- The flags UF and DF (up and down flags) are set to the state of the pixels above and below the current dot position.
Before doing this, the subroutine looks for an empty pixel above or below, and if the up or down flag is set as well, that position is put on the stack as a point to investigate.
These flags are used in order to stop the stack from being saturated with unnecessary values.
Since all dots along a line are investigated, it is not necessary to look at all the dots above and below, since any one of them will scan its own horizontal brothers.
- As each point is set, the stack is examined for any outstanding references to that point, and these are removed.
Having a working BASIC program is a good start, but this is never going to win any speed contest, so I converted it to C instead.
From BASIC to CHere is my C conversion of the BASIC program:
void paint(int x,int y)
// 5 REM BASIC VERSION OF PAINT
// 10 DIM A(100):S=100
// 20 RF=0:S=S-1:A(S)=255:S=S-1:A(S)=255:GOTO 35
y=a[s++]; // 30 Y=A(S):S=S+1:X=A(S):S=S+1
if (x==255) // 35 IF X=255 THEN END
if (rf==0) // 40 IF RF=0 THEN UF=TRUE:DF=TRUE
t=s; // 45 T=S:R=T
if (a[r]==255) // 46 IF A(R)=255 THEN 50
if ( (a[r]==y) && (a[r+1]==x) ) // 47 IF A(R)=Y AND A(R+1)=X THEN
// FOR K=R TO T STEP-1
// GOTO 50
r+=2; // 48 R=R+2:GOTO 46
curset(x,y,1); // 50 CURSET X,Y,1
if (uf && (point(x,y-1)==0)) // 60 IF UF AND POINT(X,Y-1)=0 THEN
a[--s]=x; // S=S-1:A(S)=X:S=S-1:A(S)=Y-1
uf=point(x,y-1); // 70 UF=POINT(X,Y-1)
if (df && (point(x,y+1)==0)) // 80 IF DF AND POINT(X,Y+1)=0 THEN
a[--s]=x; // S=S-1:A(S)=X:S=S-1:A(S)=Y+1
df=point(x,y+1); // 90 DF=POINT(X,Y+1)
rf=0; // 100 RF=O:IF POINT(X-1,Y)=0 THEN
a[--s]=x-1; // S=S-1:A(S)=X-1:S=S-1:A(S)=Y:RF=TRUE
if (point(x+1,y)==0) // 120 IF POINT(X+1,Y)=0 THEN
a[--s]=x+1; // S=S-1:A(S)=X+1:S=S-1:A(S)=Y:RF=TRUE
This is a literal conversion, complete with gotos instead of using the control flow features of the C.
A better C versionSince my idea was to eventually convert the routine to assembler, I did a few changes to help with the conversion.
There are a few changes compared to the BASIC version:
- I renamed all the single letter variables to something more understandable
- Instead of the single A array where both X and Y positions were pushed, I decided to use a dedicated array for each of the coordinates (xa and ya)
- I replaced the gotos by some while and if
- Instead of testing if X or Y is equal to 255 to detect the end, I check if we've reached the stack position 99
unsigned char xa;
unsigned char ya;
void paint(int x,int y)
if ( (ya[r]==y) && (xa[r]==x) )
if (scan_above && (p==0))
if (scan_below && (p==0))
At this point I had a routine running at an acceptable speed, and I gave it a few test pictures to try to see if everything worked fine.
Having a small set of test images allows to test for regressions in performance, catch bugs, and see how the code performs generally speaking.
Optimizing The HobbitI was quite intrigued by why The Hobbit was so slow: After all the game is written 100% in assembly code, and really it should not have been that slow.
By using the Oricutron debugger, it was quite easy to find the multiple causes of the problem.
The first one, is that the code perform loops to multiply by 40 when it need to computer the address of the scanline for a specific Y coordinate.
And similarly, the code perform loops to find the byte position for a specific X coordinate (which on the Oric is a multiple of 6).
After having discussed with a number of people, including the original authors of the game, the only explanation is that they were very very tight in term of memory use, and they did not have the luxury to use look up tables.
40 years later, we can afford to build versions of games which require Floppy Disk drives, allowing us to do all kinds of trickery, including the overlay memory!
So it's what I did, and I basically modified the existing disk version of the game by dynamically patching the code after loading by adding the usual two tables.
Then the original code calls this function instead when it needs to compute information about pixels:
; Mul 40
; Div 6
_Div6 .dsb 256 ; Values divided by 6
_Mod6 .dsb 256 ; Values modulo 6
_HiresAddrLow .dsb 128 ; Values multiplied by 40 + a000
_HiresAddrHigh .dsb 128 ; Values multiplied by 40 + a000
You can see the details in this video which shows a comparison of the speed difference of the two versions, disassembled code, etc...
The tldr; is that the original version takes about 1 minute to paint the "hidden path" image, and the patched version takes a few seconds.
I also liked to point out that the fact we got such a huge speedup means that the actual flood fill routine they use was quite efficient!
Performance comparisonSince the original objective was to come up with a proper generic flood fill routine which could be easily integrated in new projects, I continued my investigations.
So, how did actually Lorigraph and Masterpaint compare in practice?
There was a small hurdle to handle: Neither Masterpaint or Lorigraph use standard file format, and unfortunately to do proper comparisons we need to test them on the same images!
Fortunately now with emulators we can load and save binary files at arbitrary locations... including in the screen area, which is what I did.
I made some quick and dirty image, converted it with PictConv, loaded it in the two programs, and I recorded the operation.
This picture shows the lenght of the various clips in my video editing software:
Just to add insult to the injury, I'd like to point out that the Master Paint routine supports pattern fills!
In order to perform this feat, the program makes a copy of the image before the filling, so it knows which pixels were modified, which also means it requires an additional 8000 bytes.
The actual flood fill routine is doing quite a lot of smart things, including copying multiple batches of bytes from the scanline directly into the zero page to improve the performance of bit and pixel operations, they get copied back into the video memory when the work is done.
So the question is: Can we do better?
Lode's routineThe "Lode's Computer Graphics Tutorial" page has quite a few interesting variants of routines, including the Scanline Floodfill Algorithm With Stack which looks like that:
//The scanline floodfill algorithm using stack instead of recursion, more robust
void floodFillScanlineStack(int x, int y, int newColor, int oldColor)
if(oldColor == newColor) return;
bool spanAbove, spanBelow;
push(stack, x, y);
while(pop(stack, x, y))
x1 = x;
while(x1 >= 0 && screenBuffer[y * w + x1] == oldColor) x1--;
spanAbove = spanBelow = 0;
while(x1 < w && screenBuffer[y * w + x1] == oldColor)
screenBuffer[y * w + x1] = newColor;
if(!spanAbove && y > 0 && screenBuffer[(y - 1) * w + x1] == oldColor)
push(stack, x1, y - 1);
spanAbove = 1;
else if(spanAbove && y > 0 && screenBuffer[(y - 1) * w + x1] != oldColor)
spanAbove = 0;
if(!spanBelow && y < h - 1 && screenBuffer[(y + 1) * w + x1] == oldColor)
push(stack, x1, y + 1);
spanBelow = 1;
else if(spanBelow && y < h - 1 && screenBuffer[(y + 1) * w + x1] != oldColor)
spanBelow = 0;
It's basically the equivalent of the traditional recursive routine, but it uses an actual dedicated buffer to act as a stack instead of using the CPU stack, which is good for a machine like the Oric which only has a 256 bytes CPU stack.
I converted the routine to assembler and added it to the speed test, and here is what the result looks like:
As you can see, the Lode routine is a tiny bit faster than Master Paint on my first test image, and a tiny bit slower on The Hobbit image, so it definitely is a viable replacement and we can safely ignore Lorigraph and Geffer's routines!
Size wise, this Lode routine is about 194 bytes long (plus the lookup tables), but I kept poking at it to see if I could make it faster.
I added some checks to see how much stack memory the routine used, and tried it on a few images:
The memory usage is very dependent of pictures, and you can easily create images which will make specific flood fill routines to blow up if you now how the behave, but the bottom line here is that we could technically use the 6502 stack to push the values, so it's what I checked next!
Replacing the manual buffer by pha/pla made it possible to reduce the routine size from 194 to 156 bytes. The speed did not increase in any measurable way, but the code became more compact.
Here is the final result.
My few last optimizations were enough to make the routine consistently faster than Master Paint - not by much, but still -.
Here is a working version of the project if you want to play with the code.
This was mostly some test/development code, so the naming is far from optimal (like the x and y variables...), but basically the idea is that you first need to call GenerateTables once to initialize all the necessary tables, then you can just call FillLoop after having set proper values for the x and y coordinates.
I have some more advanced versions, but they were not tested and may contain bugs, so better safe than sorry!
Have fun :)