Writing a Goboy: running ROMs

August 27, 2019

Finally the opcodes are implemented. I took the approach of getting all of the opcodes implemented and unit-tested before moving on to other components. In retrospect I wouldn’t recommend this approach because you go through a long period with nothing really to show for it. On the other hand, once the opcodes were done all I had to do was implement basic ROM loading and all of the blargg test ROMs would pass straight away, right?

The implementation of most instructions is fairly tedious and not worth going into in much detail. Architecturally, though, the project has matured a fair bit. I had reached the point where I had to take the emulator from something that just iterated through an array of instructions to something that emulated a proper computing system. It had also outgrown my previous “throw everything into package cpu” approach so it was time to refactor.

Here’s the state of the code for this post.

Decoding and assembling

Writing the instruction decoding logic became fairly routine after a while. Pattern-match on the opcode, create a corresponding Instruction struct, add the execution logic, add decoding and execution tests. The times when things were tricky were usually when I tried to wrap multiple opcodes into one instruction struct (e.g. Rotate) that would have flags to identify the different flavours of the instruction: rotate right, rotate left, rotate left with carry etc etc. This always ended up being fiddly and made it harder to see which opcodes corresponded to which structs. I ended up replacing complicated decoding logic like this:

func GetDirection(opcode byte) Direction {
	if opcode&RotateDirectionMask > 0 {
		return Right
	return Left

func GetWithCopyRotation(opcode byte) bool {
	return opcode&RotateCopyMask == 0

func Decode(...) {
// ...
    case op&in.RotateMask == in.RotateAPattern:
		// RLCA. 0b0000 0111
   		// RLA. 0b00001 0111
		// RRCA. 0b0000 1111
	    // RRA. 0b0001 1111
		handle(in.RotateA{Direction: in.GetDirection(op), WithCopy: in.GetWithCopyRotation(op)})
// ...

With simple checks against each individual rotate instruction’s opcode. It still made sense to me to treat things like source registers as arguments to be decoded but in general I erred on the side of keeping things dumb and simple.

About two-thirds of the way through the opcode implementations I was beginning to despair and decided to shift my attention for a bit. One utility that I wanted to have was a disassembler. The implementation should be straightforward because all I had to do is take the binary opcodes, decode them into my intermediate struct format and then dump that to stdout. The disassembler would only need to have access to the instructions and the decoder function - no need to drag in the actual CPU implementation. Writing the disassembler was therefore a good task to drive refactoring the various components into their own packages.

The main problem I encountered was that a large number of isntructions are double- or triple-length. They consist of an opcode and one or two bytes of immediate values. I had previous got around this by having the decoder only decode the first byte into the instruction and then let the CPU implementation fetch the immediate values and add them to the instruction struct. This had always felt a bit inadequate because it made it impossible to test decoding multi-byte instructions without including the CPU implementation. It felt odd to be “decoding” opcodes but leaving work to the CPU. The situation became untenable when my decoding work reached double-length instructions that consisted of a prefix (0xCB) and then the actual opcode in the second byte. It would be terrible if the decoder only decoded the prefix and then left the CPU to do the rest.

So, it was clear that the decoder needed to be able to move through the list of opcodes itself, rather than relying on the CPU to feed them one by one. However, because I wanted to be able to use the decoder by itself in the disassembler, the decoder couldn’t use any CPU methods like cpu.Next() or whatever. The Golang solution to this problem is to use an interface. The decoder package defines the Interator interface:

type Iterator interface {
	Next() byte

It follows the Golang tradition of keeping iterators simple. The decoder can move to the next opcode whenever it needs to:

func Decode(il Iterator) in.Instruction {
	op := il.Next()
	var instruction in.Instruction
	switch {
    // ...
	case op == in.Prefix:
		operand := il.Next()
		switch { ... }
    // ...

Admittedly I need to improve the names of op and operand. The CPU implements this interface with a function that handles incrementing the program counter and cycle counts:

func (cpu *CPU) Next() byte {
	value := cpu.memory.get(cpu.GetPC())
	return value

Whereas the disassembler can use a simpler implementation that only has to iterate through an array holding the opcodes:

func (il *List) Next() byte {
	if il.currentPosition >= len(il.Instructions) {
		return 0
	instruction := il.Instructions[il.currentPosition]
	il.currentPosition += 1
	return instruction

Loading and debugging test ROMs

When a ROM is inserted into a Gameboy it loads some of the ROM’s data to a memory block starting at 0x100. The Gameboy includes a BIOS/bootstrap program that is initially loaded at address 0x0, sets up the stack, clears the video RAM and then starts running the ROM code at 0x100. For now I’ve just started the PC at 0x100 and manually set the SP to 0xFFFE. Video RAM is already zeroed as its an empty array. I think the blargg tests include empty space at the beginning so that they can be loaded beginning at 0x0 and the actual code starts at 0x100. At least, things ran when I did it like that.

Things ran but they didn’t stop running. At first I thought my Goboy implementation was absymally slow - was I failing to use pointer receivers and inadvertently copying large chunks of memory around? I did have a few instances of that but it was concerning that a modern Macbook Pro would struggle to emulate a Gameboy. Dumping the executing instructions showed that I was getting stuck in an infinite loop. The blargg tests were running but because stuff was fundamentally broken they weren’t able to report failure.

Fixing this required quite a lot of sleuthing. I’ve been holding off implementing debugging functionality until I get some kind of working screen so my technique was to step through the test ROM using BGB’s emulator and compare its progress to the execution logs of my emulator. Since the tests involve lots of loops this was really pretty tedious but eventually I found a point of divergence. The reference BGB emulator executed a jump that was something like this:

JP 0x1234

But mine was doing a jump like this:

JP 0x3412

Backing up a bit, I saw that the address to jump to was generated by a LOAD instruction and I was loading the higher and lower bytes of the immediate value the wrong way around. This meant the emulator jumped to the wrong place and this happened to cause the infinite loop.

Suitably chastened, I was delighted that my tests ran correctly now. None of them actually passed but at least they were reporting that they were failing! In most of the individual blargg CPU tests literally every test failed apart from in the first test ROM, where at least a few passed. The failing test was this one:

     set_test 5,"POP AF"
     ld   bc,$1200
-    push bc
     pop  af
     push af
     pop  de
     ld   a,c
     and  $F0
     cp   e
     jp   nz,test_failed
     inc  b
     inc  c
     jr   nz,-

It loads the value 0x1200 into register pair BC, pushes it on and off the stack a bit, does an AND operation on A and then checks whether the comparison of A and E results in the zero flag being set. Since this had to do with the stack I spent a long time checking that my stack implementation was correct and then spent another long time checking that I was setting the flags correctly and performing the AND correctly. Eventually I looked for help online and found this. The problem was that pop AF would write to the flags register F and only the top four bits of F are used. Any writes to the bottom four should be ignored. I had missed this since all of the flag-checking code worked fine no matter what the bottom four bits had in them. This seems like a good example of how writing an emulator can be tricky. You’re not just checking that the flags work just like the Gameboy flags, you’re checking that the flags work just like the Gameboy flags and don’t do anything else.

Masking out the lower bits when writing to F fixed this particular bug. Even without a screen I was able to see the output of the tests by intercepting writes to 0xFF01:

func (m *Memory) set(address uint16, value byte) byte {
	switch {
    // ...
	case address >= 0xFF00 && address <= 0xFF7F:
		if address == 0xFF01 {
			fmt.Printf("%c", value)
		m.ioram[address-0xFF00] = value


By this stage I’d got the test ROMs running and the first one passing. The unit tests were passing too, apart from the instruction cycle tests that I hadn’t got around to updating to work correctly with the new and improved CPU implementation. However, as a next step I was keen to get a screen going and start seeing some lovely pixels.