Blargg's 6502 Emulation Notes

I have written several efficient 8-bit CPU emulators in C that each compile to under 6K of code and data. These notes describe several techniques I used to implement a NES 6502 CPU emulator; they apply in varying degrees to other 8-bit CPUs.

The techniques are demonstrated primarily with incomplete code examples; to maintain simplicity, some finer points of NES emulation aren't handled. Measure the actual performance benefit of each technique on your system before adopting it, since some might not help enough to justify their complexity.

Emulate multiple instructions in a large switch block

Emulate multiple instructions within a single function containing a large switch block to decode the opcode. This allows the compiler to optimize all the code as a single unit and assign registers to often-used variables, and allows manual optimization of the loop. The current values of CPU registers won't be available externally during an emulation run, but they aren't normally needed.

int clocks_remain;

int emulate_cpu( int clock_count )
{
    clocks_remain = clock_count;
    
    int pc = cpu.pc;
    int a = cpu.a;
    ...
    while ( clocks_remain > 0 )
    {
        int opcode = read_memory( pc );
        switch ( opcode )
        {
            case 0xA9: // LDA immediate
                a = read_memory( pc + 1 );
                set_nz( a );
                pc = pc + 2;
                clocks_remain = clocks_remain - 2;
                break;
        }
    }
    
    cpu.pc = pc;
    cpu.a = a;
    ...
    return clocks_remain;
}
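
The examples in these notes call set_nz() without defining it. A minimal sketch of a straightforward, non-deferred version follows. It assumes the status register is kept in hardware format in a file-scope int named status (zero flag = 0x02, negative flag = 0x80); the flag names and the use of a global are assumptions made for illustration, and a real emulator would more likely keep status in a local variable as described above.

```c
// Hypothetical helper, not part of the original notes.
enum { FLAG_Z = 0x02, FLAG_N = 0x80 }; // 6502 status bit positions

static int status; // assumed: status register in hardware format

// Update the zero and negative flags from an 8-bit result
static void set_nz( int value )
{
    status = status & ~(FLAG_Z | FLAG_N);
    if ( (value & 0xFF) == 0 )
        status = status | FLAG_Z;
    status = status | (value & FLAG_N); // bit 7 of the result is the negative flag
}
```

The section "Defer status flag calculation" describes a cheaper alternative that avoids doing this work on every instruction.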

Have the loop controlled only by the number of clock cycles executed, and provide a way for the loop to be stopped externally. Don't insert checks for interrupts, external hardware synchronization, etc. after every instruction. Instead, determine in advance the earliest time the next interrupt might occur and synchronize external hardware only when actually necessary. Also, stop emulation when CLI or PLP clears the interrupt inhibit (I) status flag, since this might require action if an IRQ is already pending.

int clocks_remain;

void stop_cpu()
{
    clocks_remain = 0;
}

int emulate_cpu( int clock_count )
{
    clocks_remain = clock_count;
    ...
    while ( clocks_remain > 0 )
    {
        int opcode = read_memory( pc );
        switch ( opcode )
        {
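            case 0x78: // SEI
                status = status | 0x04;
                pc = pc + 1;
                clocks_remain = clocks_remain - 2;
                break;
            
            case 0x58: // CLI
                pc = pc + 1;
                clocks_remain = clocks_remain - 2;
                if ( status & 0x04 )
                {
                    // I flag was set and is now cleared; stop so the caller
                    // can check for an already-pending IRQ
                    status = status & ~0x04;
                    goto stop;
                }
                break;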
        }
    }
stop:
    ...
    return clocks_remain;
}

Be very careful in designing the loop control so that unnecessary operations are minimized. It might be better to keep clocks_remain in a local variable and write it to a global variable once per loop, for example.
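
The idea of running the CPU up to the earliest possible interrupt, rather than polling after every instruction, can be sketched as a small outer loop. Everything here other than the name emulate_cpu() is hypothetical: next_irq_time and frame_end_time stand in for whatever the surrounding emulator knows about upcoming events, the I flag is ignored, and emulate_cpu() is replaced by a stub that only consumes clocks so the sketch compiles by itself.

```c
#include <limits.h>

// Hypothetical outer loop; names other than emulate_cpu are illustrative.
static long cpu_time;       // clocks elapsed since the start of the frame
static long next_irq_time;  // assumed: earliest clock at which an IRQ could occur
static long frame_end_time; // assumed: clock at which the frame ends
static int irqs_handled;

// Stub standing in for the real emulator: pretends every instruction
// takes 2 clocks and returns the (possibly negative) clocks remaining.
static int emulate_cpu( int clock_count )
{
    int clocks_remain = clock_count;
    while ( clocks_remain > 0 )
        clocks_remain = clocks_remain - 2;
    return clocks_remain;
}

static void run_frame( void )
{
    while ( cpu_time < frame_end_time )
    {
        // Run only until the earliest upcoming event
        long target = frame_end_time;
        if ( next_irq_time < target )
            target = next_irq_time;
        
        if ( target > cpu_time )
        {
            int count = (int) (target - cpu_time);
            int remain = emulate_cpu( count );
            cpu_time = cpu_time + (count - remain); // remain may be negative
        }
        
        if ( cpu_time >= next_irq_time )
        {
            // A real emulator would vector to the IRQ handler here
            irqs_handled = irqs_handled + 1;
            next_irq_time = LONG_MAX; // assumed: no further IRQ scheduled
        }
    }
}
```

Because the last instruction can overshoot the requested count, the time actually consumed is computed from the return value rather than assumed to equal the request.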

Defer status flag calculation

All access to system state is done through emulation. This allows the current system state to be kept in whatever format is best for emulation and converted to/from the hardware format only when accessed by the emulated system. Many instructions set several flags in the status register, based on the result of the operation, but most of the time the flags aren't used. Defer calculation of these often-changing status flags by keeping the most recent value each particular flag is based on. For example, keep the most recent value the zero flag is based on and test this value for zero when the flag's value is actually needed:

    int not_zero = !(cpu.status & 0x02);
    ...
    switch ( opcode )
    {
        // CMP
        not_zero = a - operand;
        ...
        
        // INX
        x = (x + 1) & 0xFF; // X is an 8-bit register
        not_zero = x;
        ...
        
        // BNE
        if ( (not_zero & 0xFF) != 0 )
            ... // branch taken
        ...
    }
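
When the emulated program reads or writes the status register as a whole (PHP, PLP, BRK, RTI), the deferred value has to be converted to and from the hardware format. A minimal sketch for the zero flag alone, following the not_zero convention of the fragment above; the function names are hypothetical.

```c
enum { FLAG_Z = 0x02 }; // zero flag bit in the 6502 status register

// Build the hardware-format status byte (e.g. for PHP) from the other
// flags kept in 'status' and the deferred zero-flag value.
static int pack_status( int status, int not_zero )
{
    status = status & ~FLAG_Z;
    if ( (not_zero & 0xFF) == 0 )
        status = status | FLAG_Z;
    return status;
}

// Recover a deferred value (e.g. for PLP) from a hardware-format byte.
// Any nonzero result means "zero flag clear".
static int unpack_not_zero( int status )
{
    return !(status & FLAG_Z);
}
```

The same pattern would extend to other deferred flags, each with its own stored value and its own test at pack time.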

Factor out common operations

Many instructions share common behavior, like storing a value in memory or setting the status flags. Factor out common instruction endings and jump to them using goto:

    int addr;
    
    case 0xBD: // LDA absolute,X
        addr = read_memory( pc + 2 ) * 0x100 + read_memory( pc + 1 );
        addr = (addr + x) & 0xFFFF;
        goto lda_addr;
    
    case 0xAD: // LDA absolute
        addr = read_memory( pc + 2 ) * 0x100 + read_memory( pc + 1 );
    lda_addr:
        a = read_memory( addr );
        set_nz( a );
        pc = pc + 3;
        break;

Most instructions have at least one operand byte. Always read this along with the opcode and many new opportunities for re-use emerge:

    int opcode = read_memory( pc );
    int data = read_memory( pc + 1 );
    
    switch ( opcode )
    {
        case 0xB5: // LDA zero-page,X
            data = (data + x) & 0xFF;
            goto lda_zp;
        
        case 0xBD: // LDA absolute,X
            data = data + x;
            // fall through to next case
        
        case 0xAD: // LDA absolute
            data = data + read_memory( pc + 2 ) * 0x100;
            data = data & 0xFFFF;
            pc = pc + 1;
            // fall through to next case
            
        lda_zp:
        case 0xA5: // LDA zero-page
            data = read_memory( data );
            // fall through to next case
        
        case 0xA9: // LDA immediate
            a = data;
            set_nz( a );
            pc = pc + 2;
            break;
    }

Wrapping it up, a few tweaks can be made. Increment the program counter after reading the opcode, since it needs to be done eventually. Place the most-often used instruction endings at the beginning of the emulation loop:

    int data;
    int opcode;
    
    goto loop; // skip over instruction endings
    
branch_taken:
    pc = pc + (signed char) data;
increment_pc:
    pc = pc + 1;
loop:
    
    opcode = read_memory( pc );
    pc = pc + 1;
    data = read_memory( pc );
    
    switch ( opcode )
    {
        case 0xB0: // BCS relative
            if ( status & 0x01 )
                goto branch_taken;
            goto increment_pc;
        
        case 0x38: // SEC
            status = status | 0x01;
            goto loop;
    }

Access non-I/O memory directly

The NES CPU (like some others) has no input/output (I/O) instructions, so I/O devices must be made to look like memory to allow the CPU to communicate with them. Usually one or more registers are mapped to specific memory addresses. This means that in general, emulated access to memory must be checked for addresses mapped to I/O.

Some memory accesses can be known in advance to never use addresses mapped to I/O. On the NES, zero-page and stack accesses always map to memory, and (to my knowledge) there are no cartridges that do anything special when those areas of memory are accessed. Optimize zero-page and the stack to directly access the memory these are stored in:

    byte low_mem [0x800]; // The NES has 2K of RAM starting at 0
    
    // STA zero-page
    int addr = read_memory( pc + 1 );
    low_mem [addr] = a;

    // PHA
    low_mem [s + 0x100] = a;
    s = (s - 1) & 0xFF;

Instruction opcode and operand memory reads can also be optimized to directly access memory, since execution of the values in I/O registers will usually result in chaos and is thus unlikely to be done intentionally. There is usually some kind of memory mapping scheme, which allows more than 64K of memory to be used by breaking the logical address space of the CPU into banks and allowing each bank to be mapped independently to physical addresses of the larger ROM chip. This mapping must be taken into account when reading instructions directly from memory. Use an array of pointers covering the entire 64K address space, and point unused entries to a dummy page:

    byte unmapped_page [0x1000] = { 0 };
    byte* code_map [16]; // each page covers 4K of address space
    
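    // Note: low_mem is only 2K, so code at 0x0800-0x0FFF would read past
    // its end; RAM mirroring is one of the finer points not handled here
    code_map [0] = low_mem;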
    code_map [2] = unmapped_page;
    ...
    code_map [8] = rom;
    code_map [9] = rom + 0x1000;
    ... // etc.
    
    // Warning: addr is evaluated twice, so it must not have any side-effects
    #define READ_CODE( addr ) code_map [(addr) >> 12] [(addr) & 0x0FFF]
    
    int opcode = READ_CODE( pc );
    switch ( opcode )
    {
        case 0xA9: // LDA immediate
            a = READ_CODE( pc + 1 );
            set_nz( a );
            pc = pc + 2;
            break;
        ...
    }

The final masking of the address with 0x0FFF can be eliminated by "biasing" the page pointers so that the address can be used directly as the index into the page. This isn't strictly portable C (forming a pointer that lies outside its array is undefined behavior), but it will work on most systems.

    code_map [0] = -0x0000 + low_mem;
    code_map [2] = -0x2000 + unmapped_page;
    ...
    code_map [8] = -0x8000 + rom;
    code_map [9] = -0x9000 + rom + 0x1000;
    
    #define READ_CODE( addr ) code_map [(addr) >> 12] [addr]
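
When a mapper register is written, only the affected entries of the table need to change. A hedged sketch using the portable, unbiased form of code_map follows; init_code_map(), map_code_bank() and the 4K bank granularity are assumptions for illustration, not a description of any particular cartridge.

```c
typedef unsigned char byte;

enum { CODE_PAGE_SIZE = 0x1000, CODE_PAGE_COUNT = 16 }; // 16 pages of 4K = 64K

static byte unmapped_page [CODE_PAGE_SIZE];
static byte* code_map [CODE_PAGE_COUNT];

// Warning: addr is evaluated twice, so it must not have any side-effects
#define READ_CODE( addr ) code_map [(addr) >> 12] [(addr) & 0x0FFF]

// Point every page at the dummy page so stray reads are harmless
static void init_code_map( void )
{
    int i;
    for ( i = 0; i < CODE_PAGE_COUNT; i++ )
        code_map [i] = unmapped_page;
}

// Map 'count' consecutive 4K pages starting at CPU address 'addr' to the
// ROM data beginning at 'rom_offset' (hypothetical helper).
static void map_code_bank( int addr, int count, byte* rom, long rom_offset )
{
    int page = addr >> 12;
    int i;
    for ( i = 0; i < count; i++ )
        code_map [page + i] = rom + rom_offset + (long) i * CODE_PAGE_SIZE;
}
```

For example, map_code_bank( 0x8000, 2, rom, 0x4000 ) would make CPU addresses 0x8000-0x9FFF read from ROM offsets 0x4000-0x5FFF. With biased pointers, the same helper would also subtract the page's base address from each entry.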

Optimize for often-used instructions

Some instructions occur over 10% of the time, so optimizing these yields many times the benefit of optimizing other instructions. Keep the processor state in a format that is most suited to these instructions, and put them near the beginning of the switch block to make them more likely to be in the cache. Optimize the remaining instructions for code size to reduce impact on the cache.

Below are some profiles of relative instruction frequency for a handful of NES games. The percentages are relative to the total number of instructions executed; for example, for every ten instructions executed, on average one was a BNE.

    11.82%  $A5 LDA zero-page
    10.37%  $D0 BNE
     7.33%  $4C JMP absolute
     6.97%  $E8 INX
     4.46%  $10 BPL
     3.82%  $C9 CMP immediate
     3.49%  $30 BMI
     3.32%  $F0 BEQ
     3.32%  $24 BIT zero-page
     2.94%  $85 STA zero-page
     2.00%  $88 DEX
     1.98%  $C8 INY
     1.77%  $A8 TAY
     1.74%  $E6 INC zero-page
     1.74%  $B0 BCS
     1.66%  $BD LDA absolute,X
     1.64%  $B5 LDA zero-page,X
     1.51%  $AD LDA absolute
     1.41%  $20 JSR absolute
     1.38%  $4A LSR A
     1.37%  $60 RTS
     1.35%  $B1 LDA (zero-page),Y
     1.32%  $29 AND immediate
     1.27%  $9D STA absolute,X
     1.24%  $8D STA absolute
     1.08%  $18 CLC
     1.03%  $A9 LDA immediate
     ...
     (rest of data omitted)
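
A profile like the one above can be gathered by counting opcodes as they are decoded. A minimal sketch follows; the helper names are hypothetical, and count_opcode() is assumed to be called once per instruction from the emulation loop in a profiling build only.

```c
#include <stdio.h>

static unsigned long opcode_counts [256];
static unsigned long total_count;

// Call once per executed instruction (profiling builds only)
static void count_opcode( int opcode )
{
    opcode_counts [opcode & 0xFF]++;
    total_count++;
}

// Percentage of all executed instructions that were 'opcode'
static double opcode_percent( int opcode )
{
    if ( total_count == 0 )
        return 0.0;
    return 100.0 * opcode_counts [opcode & 0xFF] / total_count;
}

// Print every opcode executed at least once, in opcode order
static void print_profile( void )
{
    int i;
    for ( i = 0; i < 256; i++ )
        if ( opcode_counts [i] != 0 )
            printf( "%6.2f%%  $%02X\n", opcode_percent( i ), (unsigned) i );
}
```

Sorting the output by percentage and adding mnemonics is left out to keep the sketch short.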
