← Global Game Jam 2018Optimizing Buggy Boy (1) →
  Performance Profiling
Sun 11th November 2018   
A few weeks ago NekoNoNiaow asked on the Defence Force forums if there was any profiling tools available for Oric programs "aside from the reliable but quite basic OSDK clock() function".

I had already worked on the topic, but no actual official release came out of it: Time to fix that!

The Forum Post

If you are interested by the topic, you can read the original discussion on this forum thread.

What came out of the discussion is that there was two main possibilities:
  • Use the built-in cycle counting feature of Oricutron
  • Use my 6502 profiler routines
Chema also pointed out that having a debugging expansion to allow Oricutron to communicate with a modern IDE would be nice, resulting in NekoNoNiaow writing a ticket requesting external debugger support.

We will see how it goes, if you are interested the post is here.

Why should I care?

On modern machines you have heaps of CPU power, many cores, GPUs that can alleviate the work made by the other processors, sound cards that can play audio on their own.

On the machines like the Oric, not so much luck: There is no dedicated GPU, no DMA access to replay music, no sprites, no copper list. Everything that moves or plays is entirely made through actual 6502 code running... and this processor is not the fastest around.

By counting clock cycles, manually or automatically, you can actually figure a number of things that matter:
  • How far is your current code from running at an acceptable framerate
  • Is your new optimized routine actually faster that your previous one
  • It helps decide between two different ways of doing things
So let's get some numbers.

First, there is the concept of Frame.

The Oric is generally connected to a CRT screen running at 50hz1, meaning that the screen gets refreshed every 50th of a second2.

Our processor is a 1mhz MOS 6502, and using simple arithmetic we can come up with some numbers:

A 1mhz processor can execute 1.000.000 (1 million) cycles per second, so if you divide that my 50 (because of the 50hz refresh frequency), you get 20.0000 (20 thousand) cycles per frame.

The actual real value is 19968 cycles, which is important for things like setting the correct VIA interrupt values, but as far as profiling goes, 20000 is easier to remember.

The choice is yours :)

Another value we can derive from these numbers, is the rough number of clock cycles per scanline: 19968/312=64: We can execute 64 cpu cycles during the drawing of each of the horizontal screen lines.

What this all means, is that if your main loop runs in 19968 clock cycles or less, then your program can refresh the screen at 50 frames per second, giving some nice and smooth animation and blazing fast keyboard and joystick reaction.

When you reach the 40000 cycles per frame, you are now in the 25 fps range, which is still acceptable for most games.

When you start to drop in the 10 or less fps range, you can forget games based on reaction time and smooth animation, they will just be painful to play.

Oricutron's cycle counter

For a complete description of Oricutron's debugger, please refer to the actual documentation page.

The core of the profiling feature is the clock cycle counter, visible on the main debugger page.

4 video modes

You can switch between the normal execution mode and the debugger mode by pressing F2 at anytime3

The idea is that each time Oricutron executes a 6502 instruction, it increases the value of the CY counter by the correct amount: If you write down the value at the start of an operation and then compare with the value at the end of the operation, the difference is the number of clock cycles spent running it.

Since doing subtractions is somewhat boring and error prone, you can just use the F9 key to reset the counter :)

Please note that the value is displayed in decimal instead of hexadecimal: What you read is really what you got!

Additionally, if you have a version of Oricutron released after November 20164, it may support a new feature added by iss: CPU breakpoints now support flags that can be used to automate the cycle counting:
  • z to set the Cycle Counter to zero
  • c to continue the execution after the breakpoint has been reached
Used together, you can set two breakpoints, the first will reset the counter and continue the execution, and the second one will actually do break, allowing you to read the value:
bs 1000 zc
bs 2000
You can read more about the feature on the forum post.

6502 profiler

The Oricutron built-in cycle counter is ideal to get quick results easily, but sometimes it is nice to have something that gives a bit more information, or is able to analyze the performance of more than one routine at time, or do statistics over time.

This is where the profiler routine comes in.

As of today, this is not yet integrated in the OSDK, this is just some test code I had written for myself5 which I'm now sharing. Hopefuly I can figure out a clean and easy way to integrate that in the OSDK as a native feature.

A the moment, the code is available in my SVN user folder.

Conceptually the way it works (at the moment at least) is that you decide which routines you want to profile (up to 256), then a set of macros will expand to statically allocated arrays containing informations about the function names, number of time they have been called, cpu time spent in them, etc...

Some other strategically placed macros and function calls will take care of tracking all that and finally report the results on the screen (in the three last lines of TEXT) and on the printer output (which on most emulators gets printed into a nice text file).

Typically, an instrumented program looks like this:

#include "lib.h"
#include "profile.h"

void DoSomething()
{
static int x=1;
static int y=1;
static int r=1;

PROFILE_ENTER(ROUTINE_DOSOMETHING);

if (r<99)
{
curset(x,y,1);
circle(r,2);

r+=1;
x+=1;
y+=1;
}

PROFILE_LEAVE(ROUTINE_DOSOMETHING);
}


void RandomSubroutine()
{
PROFILE_ENTER(ROUTINE_SUBROUTINE);
PROFILE_LEAVE(ROUTINE_SUBROUTINE);
}


void TestSpeed()
{
PROFILE_ENTER(ROUTINE_TESTSPEED);
RandomSubroutine();
PROFILE_LEAVE(ROUTINE_TESTSPEED);

PROFILE_ENTER(ROUTINE_TESTSPEED);
RandomSubroutine();
PROFILE_LEAVE(ROUTINE_TESTSPEED);

PROFILE_ENTER(ROUTINE_TESTSPEED);
RandomSubroutine();
PROFILE_LEAVE(ROUTINE_TESTSPEED);
}



void DoSomethingElse()
{
static int x=238;
static int y=1;
static int r=1;

PROFILE_ENTER(ROUTINE_DOSOMETHINGELSE);

if (r<99)
{
RandomSubroutine();
curset(x,y,1);
RandomSubroutine();
circle(r,1);
RandomSubroutine();

r+=1;
x-=1;
y+=1;
}
PROFILE_LEAVE(ROUTINE_DOSOMETHINGELSE);
}


void main()
{
ProfilerInitialize();

while (1)
{
ProfilerNextFrame();

PROFILE_ENTER(ROUTINE_GLOBAL);
TestSpeed();
DoSomething();
DoSomethingElse();
PROFILE_LEAVE(ROUTINE_GLOBAL);

ProfilerDisplay();
}

ProfilerTerminate();
}

So, what does this code do?

At the minimum, you need to initialize the system by calling ProfilerInitialize, and then each new frame you need to call the ProfilerNextFrame function, and at the end call the ProfilerDisplay function6.

When you want to profile a particular block of code, you just need to call PROFILE_ENTER(id) and PROFILE_LEAVE(id) with the id of the functions you defined earlier, in this case:

#define ROUTINE_GLOBAL 0
#define ROUTINE_DOSOMETHING 1
#define ROUTINE_DOSOMETHINGELSE 2
#define ROUTINE_SUBROUTINE 3
#define PROFILER_ROUTINE_COUNT 4

FUNCTIONNAME(ROUTINE_GLOBAL,"Global")
FUNCTIONNAME(ROUTINE_DOSOMETHING,"DoSomething")
FUNCTIONNAME(ROUTINE_DOSOMETHINGELSE,"DoSomethingElse")
FUNCTIONNAME(ROUTINE_SUBROUTINE,"SubRoutine")
FUNCTIONNAME(PROFILER_ROUTINE_COUNT,0) // End marker

These functions are available for both C and Assembler routines thanks to the use of macro expansion.

For assembler modules you can streamline the code by using PROFILE(id), the LEAVE will automatically be called on the first RTS encountered in the code.

Real world use

Here is what it looks like on one of my old test projects:

Profiling the racing game
Profiling the racing game

And here is the content of the exported printer file:

Profiling started

Frame:0001 Time=1BA9BE
0x01 1BA952 MainLoop
1x01 1B35DA DrawRoad
2x01 00ABD7 ScrollColors
3x01 006F5E TurnLoop
4x01 1A870D DrawLoop

Frame:0002 Time=1BACCD
0x01 1BAC61 MainLoop
1x01 1B38E9 DrawRoad
2x01 00ABD7 ScrollColors
3x01 006F5E TurnLoop
4x01 1A8A1C DrawLoop

Frame:0003 Time=1BAD0C
0x01 1BACA0 MainLoop
1x01 1B3928 DrawRoad
2x01 00ABD7 ScrollColors
3x01 006F5E TurnLoop
4x01 1A8A5B DrawLoop

Frame:0004 Time=1BADF7
0x01 1BAD8B MainLoop
1x01 1B3A13 DrawRoad
2x01 00ABD7 ScrollColors
3x01 006F5E TurnLoop
4x01 1A8B46 DrawLoop

From this we can see that the main loop takes around 1812926 (1BA9BE hexa) cycles, which is roughly equal to 90 frames (1812926/19968), so almost 2 seconds per screen refresh!

This is not surprising considering this was a C prototype containing some particularly inefficient code.

The two fastest routines in the profiled code are ScrollColors and TurnLoop with respectively about 44000 (2.2 frames) and 28510 cycles (1.5 frame).

MainLoop should not have been present, it's basically duplicating the work done by the ProfilerNextFrame and ProfilerDisplay:

while (1)
{
ProfilerNextFrame();
PROFILE_ENTER(ROUTINE_MAIN_LOOP);
(...)
PROFILE_LEAVE(ROUTINE_MAIN_LOOP);
ProfilerDisplay();
}

But it allows us to see what the cost of the profiler is: The actual profiler loop on the first frame was 1BA9BE and the MainLoop was reported as 1BA952. This is a difference of 108 cycles, it's not going to kill you, but if you have to call the same small routine very often this may start to take a significant amount of time.

That being said, when profiling you are interested by the relative cost of things, and how they improve, so that's not too much of a problem in practice... plus the fact that when disabling the profiler the code will actually run faster :)

The two actual biggest offenders are DrawRoad (1B35DA) and DrawLoop (1A870D), which take almost as much time as the main frame time! How is that even possible?

Well, as it happens, DrawLoop is a part of DrawRoad, so it makes perfect sense:

void RoadDraw()
{
PROFILE_ENTER(ROUTINE_DRAW_ROAD);
ScrollColors();

PROFILE_ENTER(ROUTINE_DRAW_LOOP);
for (y=0;y<128;y++)
{
(...)
}
PROFILE_LEAVE(ROUTINE_DRAW_LOOP);
PROFILE_LEAVE(ROUTINE_DRAW_ROAD);
}

Optimization example

Normally you would optimize first what takes the more time to run, but another viable strategy is to try to fix a smaller routine that has similar type of code as the large one, and extrapolate the performance gain of the smaller routine to evaluate if it makes sense to even bother optimizing the larger one.

So, let's look at this TurnLoop routine.

Right now it's reported as about 006F5E by the profiler.

Here is the code:

void TurnLeft(unsigned char count)
{
unsigned char y;
PROFILE_ENTER(ROUTINE_TURN_LOOP);
while (count--)
{
for (y=0;y<128;y++)
{
RoadMiddleTable[y]+=RoadOffsetTable[y];
}
}
PROFILE_LEAVE(ROUTINE_TURN_LOOP);
}

void TurnRight(unsigned char count)
{
unsigned char y;
PROFILE_ENTER(ROUTINE_TURN_LOOP);

while (count--)
{
for (y=0;y<128;y++)
{
RoadMiddleTable[y]-=RoadOffsetTable[y];
}
}
PROFILE_LEAVE(ROUTINE_TURN_LOOP);
}

These are actually two different routines, one that adds and the other one that subtracts, but they use the same profiler ID because their performance profile is identical.

Considering how simple this code is, there's not much optimization that can be done if we stick to C, so let's turn to 6502 assembler instead!

Generally, efficient assembler requires dedicated data structures and often benefits from loop unrolling, but in this case the RoadMiddleTable and RoadOffsetTable are unfortunately standard integer arrays:
int RoadMiddleTable[128];
int RoadOffsetTable[128];
also the TurnLeft and TurnRight functions are expecting a parameter passed on the stack, which is not normally how we do things in assembler world, so the quick and dirty way I came up with is to just call an assembler inner loop routine from the old C routine:

extern void TurnLeftSimple();

void TurnLeft(unsigned char count)
{
unsigned char y;
PROFILE_ENTER(ROUTINE_TURN_LOOP);
while (count--)
{
TurnLeftSimple();
}
PROFILE_LEAVE(ROUTINE_TURN_LOOP);
}

_TurnLeftSimple
.(
ldy #0
loop
clc
lda _RoadMiddleTable,y
adc _RoadOffsetTable,y
sta _RoadMiddleTable,y
iny
lda _RoadMiddleTable,y
adc _RoadOffsetTable,y
sta _RoadMiddleTable,y
iny
bne loop
rts
.)

And the winner is:
Optimizing the racing game
Optimizing the racing game
As you can see, TurnLoop has reduced from 006F5E to 0012CC (and the global frame time from 1BA9BE to 1B456A), which is a reduction by more than 23000 cycles, more than a full frame difference!

Frame:0001 Time=1B456A
0x01 1B44FE MainLoop
1x01 1B2E5A DrawRoad
2x01 00ABD7 ScrollColors
3x01 0012CC TurnLoop
4x01 1A7F8D DrawLoop
This new code is indeed almost 6 times faster than the old one, despite not being optimal at all.

Now, would that be enough to have this whole code run at full speed without the emulator turbo mode?

No, with 89 frames in total we have much more work to do, but considering the rendering loop is using CURSET and DRAW, there is a lot of room for improvement!

Integration in the OSDK

The reason why this code was not integrated in the OSDK was that there was still a lot of work in progress and things requiring manual hacking:
  • Editing the "profile.h" to add macros is not super elegant
  • If you profile more than 5 routines, the display gets garbled
  • No easy/automatic way to enable/disable the profiler
An integration will probably happen at some point, but that will require some work to make it easy to use by everybody.

I hope at least you learnt some things from this article, and that you now have a better idea on how to measure where you lost the CPU time in your program!




1. You can use attributes that can set the frequency at 60hz, but the concept is the same
2. Yes, it's more complicated, half frames, interlace, yada yada
3. You may also enter it automatically by crashing the program.
4. The OSDK version will be updated as soon as a new official release is available
5. With some help from Jonathan Bristow
6. Technically one single function could be used to do both, but by separating them we can just profile some specific sequences of code
comments powered by Disqus

Coverity Scan Build Status