RasterizationCost of AbstractionWhy is it slow?Let's look at CURSETA C equivalentLet's optimize!My bytes have 6 bits!Let's modulateFinding the scanlineAre we done yet?A little bit more?Final notes

I was writing an article about how to create an **efficient**Flood Fill routine on the Oric, but then I realized that many of the optimizations were

**generic**and could be part of a dedicated article about how to write efficient rasterization code.

## Rasterization

When doing computer graphics, on modern machines as well as on retro computers, it is important to**exploit the architecture**of the hardware to our advantage.

Generally speaking this involves understanding how the hardware is structured, and then write code which will benefit from this structure, like how the

**pixels**are organized, the location of the

**video memory**, but also the features of the processing units available

^{1}

Writing code at a high level of abstraction will tend to be quite inefficient compared to code designed to closely match the hardware.

## Cost of Abstraction

A typical example of**abstraction**is the "Put Pixel" function.

In a perfect mathematical world, you can just have very high level functions (like "draw a pixel") and glue them together to create more complex shapes like "lines" and "circles".

In practice, this means passing parameters, calling functions, iterating, etc... and ignoring all the possible optimizations coming from knowing that pixels are topologically set up in patterns we can exploit:

- If the next pixel is on the
**same byte**, can we write them together? - If the next pixel is on the next scanline, can't we just do "+40" on the pointer instead of
**recomputing the address**? - If we draw a large fill rectangle which fits neatly on
**multiples of 6**, why not just "memset" the whole area?

This simple program should illustrate the point easily:

10 HIRES:PAPER 4:INK 3On the

20 ' Draw rectangle using FILL command

30 CURSET 30,75,3:FILL 50,10,64+1+4+16

40 ' Draw rectangle using CURSET loop

50 FOR Y=75 TO 125:FOR X=150 TO 210 STEP 2:CURSET X,Y,1:NEXT:NEXT

**left side**, the rectangle is drawn using the

**FILL**command of the Oric BASIC ROM

^{2}, which simply fills with a byte value a rectangular area defined by a

**number of scanlines**and

**columns of bytes**.

It is not very flexible, but it is

**very fast**at doing anything involving changing the content of rectangles on the screen, as long as you can express that in a single byte value to replicate, which in this case is made of 64 (which defines "a byte with pixel content") plus 1+4+16 which draws one pixel every two.

On the

**right side**, a similar looking rectangle is drawn, but this time done instead by calling the

**CURSET**function

^{3}for each of the pixels we want to change.

As you can see, it is

**considerably slower**, but on the other hand we could have done a circle or triangle using this method, while this would have been

**impossible using FILL**.

## Why is it slow?

There are**multiple reasons**for this horrible performance, the first being that in is written in BASIC, an

**interpreted language**where each variable is actually a

**floating point**value which needs to be

**converted to integer**screen coordinates for every single pixel we draw.

*By the way, I gave the BASIC its best chances by having all the code on the same line - did not help much, did it?*

So, how better would it be if we

**replaced BASIC by C**, and kept the rest of the code identical?

Let's try that!

#include <lib.h>

void main()

{

int x, y;

hires();

paper(4);

ink(3);

// Draw rectangle using FILL command

curset(30, 75, 3);

fill(50, 10, 64 + 1 + 4 + 16);

// Draw rectangle using CURSET loop

for (y = 75; y <= 125; y++)

{

for (x = 150; x <= 210; x+=2)

{

curset(x, y, 1);

}

}

}

Here is what the result looks like:

It's

**definitely faster**than the BASIC, but still nothing too get too impressed with, and if we were to convert the code to assembler, the speed up would be even more

**negligible**, because the problem is the curset function itself.

## Let's look at CURSET

The simplest way to look at why**CURSET is so slow**, is to just look at how it is

**implemented**.

You could just

**press F2**in Oricutron and disassemble from the address $F0C8, or you could just look in the existing

**commented disassembly**listings of the ROM, such as the one in the Oric Advanced User Guide, l'Oric a Nu ou Au Coeur de l'Oric Atmos

^{4}.

The

**first issue**comes from the way

**parameters**are passed, which we can here see in the C OSDK library calls:

_cursetBasically, the graphic functions are using some variables in page 2 (in the $2e0 area), so before calling the CURSET function we need first to extract the coordinates from the callee language, and write them down in this buffer.

jmp _curset

ldx #3 ;Get three parms

jsr getXparm

jsr $f0c8 ;curset

jmp grexit ;common exit point

getXparm ; Get X params (16bit) from stack

ldy #0 ; X is the number of params

sty $2e0 ; Zero error indicator.

stx tmp ; store X in storage byte

ldx #0

getXloop

lda (sp),y

sta $2e1,x

inx

iny

lda (sp),y

sta $2e1,x

inx

iny

dec tmp ; decrement pointer

bne getXloop

rts

Or as the Oric user manual says:

The CURSET function has three parameters: X, Y and FB. That last one is used to define if the pixel should be SET, UNSET, INVERTED or not drawn at all

^{5}.

All these parameters would naturally fit on an unsigned 8 bit value, but because of the parameter passing conventions they have to be passed as 16 bit values, which is never going to win any speed contest!

Here in summary is what the function does:

We start in F0C8

**$F0C8**Check each of the parameters (x,y,fb) by calling $F2F8 to see if they are in the accepted ranges (less than 240, 200 and 4 respectively), and if not sets the error code in the system variable $2E0, else it continues to $EEE8^{6}**$EEE8**This function saves the registers A and Y on the stack and then sucessively call $EEDE, $F049 and finally $F024 before restoring the registers and returning to the caller**$EEDE**All it does is to shift and rotate the content of $212 before returning**$F049**Calls $F731 to compute the screen offset from Y coordinates, adds $A000 and stores the final results in $10/$11, then calls $EFC8 to compute X divided by 6 and X modulo 6 and uses that to compute both the bit mask (stored in $215) and the offset in the screen scanline (added to $10/$11)**$F024**Loads the byte at the address pointed in $10, tests if it's an attribute by checking the bit 6, and then based on FB's value will perform a OR, AND or EOR before writing back the value.**$F731**Multiply the content of A by 40, the result is stored in A and Y**$EFC8**Divides the content of $0C/0D by the content of $200 and stores the quotient in $0C/$0D and the remainder in $0E/$0F

That's a lot of code, many function calls, many pushes and pops, many loops to compute values.

If you are curious about how long exactly this routines takes to execute, you can look at the previous article about Performance Profiling, here are the results I found.

When calling

**curset(120, 100, 1);**, the JSR CURSET takes exactly

**4662 clock cycles**to execute, and just so you realize how bad that is, it's about a quarter of the total number of clock cycles you have during a 50hz screen refresh.

Of course, since it's all in the 16KB of ROM, it makes sense that the code tries to be as compact as possible, but the cost is real, and in our own routines we can gain a humongous amount of time by using

**tables**instead of computations.

## A C equivalent

An actual**equivalent**in C language would be the following:

char CURSET(unsigned char x, unsigned char y, unsigned char fb)

{

if (fb >= 4) return 1;

if (x >= 240) return 1;

if (y >= 200) return 1;

{

int div6 = (x / 6);

int mod6 = x-(div6*6);

unsigned char bit_mask = (32 >> mod6);

int address = 0xa000 + div6 + (y*40);

unsigned char value = peek(address);

if (fb == 0) value &= ~bit_mask;

else

if (fb == 1) value |= bit_mask;

else

if (fb == 2) value ^= bit_mask;

poke(address, value);

}

return 0;

}

Let's test this C version against the BASIC ROM routine!

Ouch,

**304**hundreds of a second vs

**612**, this means our C routine is about

**two times slower**than the ROM routine!

## Let's optimize!

Maybe we can**improve**the code a bit, like for example by

**removing**the things we don't need: We are drawing pixels, so

**FB**is always set to 1, and we don't use the

**return value**anyway.

Here is our new version:

void CURSET(unsigned char x, unsigned char y)

{

if (x >= 240) return;

if (y >= 200) return;

{

int div6 = (x / 6);

int mod6 = x-(div6*6);

unsigned char bit_mask = (32 >> mod6);

int address = 0xa000 + div6 + (y*40);

unsigned char value = peek(address);

poke(address, value | bit_mask);

}

}

The results?

Well, we still have

**304**for the BASIC obviously, but the C version is now down to

**576**.

Still some way to go.

We don't need to

**check**if the coordinates are

**valid**, and instead of peek and poke we can just use pointer arithmetic directly.

void CURSET(unsigned char x, unsigned char y)

{

int div6 = (x / 6);

int mod6 = x-(div6*6);

unsigned char bit_mask = (32 >> mod6);

unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40);

*address |= bit_mask;

}

The C version is now down to

**548**.

We do not actually need to have a

**function**, so instead we can make it a

**macro**, which will remove the parameters passing.

#define CURSET(x,y) \

{ \

int div6 = (x / 6); \

int mod6 = x-(div6*6); \

unsigned char bit_mask = (32 >> mod6); \

unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \

*address |= bit_mask; \

}

This brings us down to

**447**.

At this point, we can look at the

**generated code**(in TMP/linked.s) where most of the time

^{7}is spent in the

**divide**and

**multiply**routines:

**// div6 = (x / 6)**

lda reg0 : sta tmp0 : lda reg0+1 : sta tmp0+1

lda tmp0 : sta op1 : lda tmp0+1 : sta op1+1

lda #<6 : sta op2 : lda #>6 : sta op2+1

**jsr div16i**: stx tmp0 : sta tmp0+1

**// mod6 = x-(div6*6)**

lda tmp0 : sta reg2 : lda tmp0+1 : sta reg2+1

lda reg0 : sta tmp0 : lda reg0+1 : sta tmp0+1

lda reg2 : sta tmp1 : lda reg2+1 : sta tmp1+1

lda #<6 : sta op1 : lda #>6 : sta op1+1

lda tmp1 : sta op2 : lda tmp1+1 : sta op2+1 :

**jsr mul16i**: stx tmp1 : sta tmp1+1

**// bit_mask = (32 >> mod6)**

sec : lda tmp0 : sbc tmp1 : sta tmp0

lda tmp0+1 : sbc tmp1+1 : sta tmp0+1

lda tmp0 : sta reg3 : lda tmp0+1 : sta reg3+1

lda reg3 : sta tmp0 : lda reg3+1 : sta tmp0+1

lda #<32 : sta tmp : lda #>32 : ldx tmp0 : beq *+8 : lsr : ror tmp : dex : bne *-4

ldx tmp : stx tmp0 : sta tmp0+1

**// address = (unsigned char*)0xa000 + div6 + (y*40)**

lda tmp0 : sta reg4

lda reg1 : sta tmp0 : lda reg1+1 : sta tmp0+1

lda #<40 : sta op1 : lda #>40 : sta op1+1 : lda tmp0 : sta op2 : lda tmp0+1 : sta op2+1

**jsr mul16i**: stx tmp0 : sta tmp0+1

lda reg2 : sta tmp1 : lda reg2+1 : sta tmp1+1

clc : lda #<$a000 : adc tmp1 : sta tmp1 : lda #>$a000 : adc tmp1+1 : sta tmp1+1

clc : lda tmp0 : adc tmp1 : sta tmp0 : lda tmp0+1 : adc tmp1+1 : sta tmp0+1

lda tmp0 : sta reg5 : lda tmp0+1 : sta reg5+1

lda reg5 : sta tmp0 : lda reg5+1 : sta tmp0+1

**// *address |= bit_mask**

ldy #0 : lda (tmp0),y : sta tmp1

lda tmp1 : sta tmp1 : lda #0 : sta tmp1+1

lda reg4 : sta tmp2

lda tmp2 : sta tmp2 : lda #0 : sta tmp2+1

lda tmp1 : ora tmp2 : sta tmp1 : lda tmp1+1 : ora tmp2+1 : sta tmp1+1

ldy #0 : lda tmp1 : sta (tmp0),y

So, what can we do to

**improve this mess**?

Well, considering the Oric does not have any faster way to compute a divide or a multiply, the usual solution is to

**use tables**, in this case we need three:

- A table for Y
**multiplied by 40**(to access the right scanline) - A table for X
**divided by 6**(to find the right byte on the scanline) - A table for X
**modulo 6**(to get the bit position in the byte)

## My bytes have 6 bits!

One of the peculiarities of the Oric, is that there are only**6 pixels**in each of the

**bytes**of the screen

^{8}, and a consequence of that is that we end-up having to deal with the number six, which is

**not**a power of two, so you

**can't**just shift around to find where to draw a pixel.

Given an horizontal coordinate between 0 and 240, we need to divide the value by 6 to find which byte on the scanline contains the pixel we are interested in.

Let's start with the

**div 6 table**and change the code to the following:

unsigned char Div6[240]; // Values divided by 6

void GenerateTables()

{

int x;

for (x = 0; x < 240; x++)

{

Div6[x] = x / 6;

}

}

#define CURSET(x,y) \

{ \

int div6 = Div6[x]; \ x/6

int mod6 = x-(div6*6); \

unsigned char bit_mask = (32 >> mod6); \

unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \

*address |= bit_mask; \

}

We were down to

**447**, and now...

**292**!

**We've now beaten the ROM routines which were at 304.**## Let's modulate

Obviously, if we had to divide by six to find the proper byte, it means we can**use the remainder**of the operation to get a value between 0 and 5 which will indicate the bit we are interested in.

Let's continue with the

**modulo 6 table**.

unsigned char Mod6[240]; // Values modulo 6

void GenerateTables()

{

int x;

for (x = 0; x < 240; x++)

{

Mod6[x] = x % 6;

}

}

#define CURSET(x,y) \

{ \

int div6 = Div6[x]; \

int mod6 = Mod6[x]; \ x%6

unsigned char bit_mask = (32 >> mod6); \

unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \

*address |= bit_mask; \

}

New record:

**186**

## Finding the scanline

Next on the list, the**multiply by 40**table. That one will have to be slightly different since the values will

**not fit**on a 8bit result.

unsigned int Mul40[200]; // Values multiplied by 40

void GenerateTables()

{

int y;

for (y = 0; y < 200; y++)

{

Mul40[y] = y * 40;

}

}

#define CURSET(x,y) \

{ \

int div6 = Div6[x]; \

int mod6 = Mod6[x]; \

unsigned char bit_mask = (32 >> mod6); \

unsigned char* address = (unsigned char*)0xa000 + div6 + Mul40[y]; \ // y*40

*address |= bit_mask; \

}

**Our final result is 90, so about 3.3 times faster than the ROM routine.**## Are we done yet?

We could totally stop there, enjoying the snow falling outside, until we start having a**nagging feeling**that maybe there are

**more optimizations**we could unleash.

Up to this point we had some relatively

**generic tables**(multiply by 6, modulo 6, multiply by 40), but if these routines are ever going to be used for graphical routines, why not just

**push**even farther?

Let's look at this line for example:

unsigned char* address = (unsigned char*)0xa000 + div6 + Mul40[y];

We compute the final address by adding together the base address of the screen, the offset to reach the proper scanline, and the offset to access the proper byte, but as it happens,

**the base address never change**so we could just put it in the table as well!

Here is what the slightly

**reorganized**code looks like:

unsigned char* Scanlines[200]; // Values multiplied by 40 + a000

void GenerateTables()

{

int y;

for (y = 0; y < 200; y++)

{

Scanlines[y] = (unsigned char*)0xa000 + y*40;

}

}

#define CURSET(x,y) \

{ \

unsigned char* address = Scanlines[y] + Div6[x]; \

*address |= (32 >> Mod6[x]); \

}

Which brings us down to

**73**(4.16 times faster that the ROM)

But we can still do better, like by realizing that the

**bit mask**is no stored in the same

**order**as the modulo value, but instead of shifting we could just put that in the table instead.

Let's do that.

unsigned char Mod6[240]; // Values modulo 6 (inverted)

void GenerateTables()

{

int x;

for (x = 0; x < 240; x++)

{

Mod6[x] = 32 >> (x % 6);

}

}

#define CURSET(x,y) \

{ \

unsigned char* address = Scanlines[y] + Div6[x]; \

*address |= Mod6[x]; \

}

This small change brings us down to

**64**(4.75 faster).

## A little bit more?

At this point we are**limited**by the C compiler

**inefficient**code, and the fact we have no control on the memory location of our tables.

The 6502 typically will spend more time accessing data when it has to cross a 256 bytes page boundary, so it is quite important to try to

**align things in memory**.

Another important thing is that it is a 8 bit processor, it does not have any native way to handle

**16 bit values**, so things like our integer arrays of screen addresses would be better replaced by two 8 bit arrays, one for the high byte of the address and one for the lower byte.

Unfortunately at this optimization level it become very difficult to fight against the

**C language**rules such as "implicit integer promotion", so we have to turn to

**assembler**.

And intermediate step is to use

**inline assembler**!

unsigned char ScanlinesHigh[200]; // High byte Values multiplied by 40 + a000

unsigned char ScanlinesLow[200]; // Low byte Values multiplied by 40 + a000

unsigned char PosX; // Temporary: Can't access locals from inline ASM

unsigned char PosY; // Temporary: Can't access locals from inline ASM

void GenerateTables()

{

int y;

for (y = 0; y < 200; y++)

{

unsigned int address = 0xa000 + y*40;

ScanlinesHigh[y] = (address>>8);

ScanlinesLow[y] = (address & 255);

}

}

#define CURSET(x,y) \

{ \

PosX = x; \

PosY = y; \

asm( \

"ldy _PosY;" \

"lda _ScanlinesLow,y;" \

"sta tmp0+0;" \

"lda _ScanlinesHigh,y;" \

"sta tmp0+1;" \

"ldx _PosX;" \

"ldy _Div6,x;" \

"lda (tmp0),y;" \

"ora _Mod6,x;" \

"sta (tmp0),y;" \

); \

}

This change from C to assembler bring us down to

**29**(10.4 times faster than the ROM).

Just to get more accurate numbers, let's look at

**exactly**how many

**clock cycles**we are talking about.

The original ROM CURSET took

**4662**clock cycles to execute, and our improved one only uses

**85**clock cycles, which is actually

**54 times faster**

^{9}, which means we could draw 235 pixels per frame instead of 4.

Normally we would not need the PosX and PosY variables, but the inline assembler syntax of the compiler is quite limited and it does not have access to the local variables, so we had to do a temporary copy.

The assembler itself is

**quite simple**:

- Use the Y value to get a byte from _ScanlinesLow and _ScanlinesHigh that we store in the 16 bit
**zero page pointer**tmp0 - Use the X value to get from the Div6 table the
**scanline offset**into the Y register **Load**the byte from the video memory using the screen address stored in tmp0 and the**column offset**in the Y register- Mask in the
**bit pattern**we get from the Mod6 table for this X coordinate **Write back**the modified byte at the same video memory location

**eight 6502 assembler instructions**.

## Final notes

Something important to consider, is that a**pixel**drawing routine is rarely used to just draw a single dot. Most of the time it's part of a more

**complex shape**like a line, circle, polygon, etc...

In such a context, there are many optimizations you can apply, which will make the pixel drawing even

**faster**, like for example:

- If all the pixels are on the
**same vertical line**, the Div6 and Mod6 values stay the same, so you can just fetch it once and reuse it all over - If all the pixels are on the
**same horizontal scanline**, the pointer value does not change, so you can skip the ScanLineHigh and Low table accesses - In this situation, you can also just draw the left and right extremities using some precomputed masks for each of the 6 possible lengths, and
**fill in between**all the bytes with the value 127 (64+1+2+4+8+16+32)

I'm probably forgetting things, so feel free to ask question or point out inefficiencies!

Anyway, I hope that was interesting and that you learnt a thing or two :)

1. On the Oric we only have a 6502, but some machines have graphic cards, Blitting chips, coppers, etc...↩

2. See page 129 of the Oric Atmos User Manual↩

3. See page 120 of the Oric Atmos User Manual↩

4. All these books are available for download in the Oric Library.↩

5. This last case is useful for when you need to move the current cursor position before drawing something else like a CIRCLE↩

6. Also the code does JSR $EEE8 followed by RTS instead of doing a JMP which would be faster↩

7. Yes, the OSDK compiler generates horrible code, but here it's the div16i and mul16i which dwarf the rest↩

8. Which is why we get 240 pixels horizontaly using 40 bytes per scanline↩

9. The discrepancy between the high level counter and the cycle profiler is caused by the time spent in the two for loops↩