NES boot loader specification

This page specifies a protocol and behavior for a flexible NES boot loader. See NES boot loader usage for examples, tools, and implementations.

Introduction

A boot loader is a tiny program which receives a larger program from a PC connected to the NES via RS-232. The larger program is loaded into zero-page and executed there, where it can then communicate with the PC to determine what to do next.

In most cases a full loader will be sent to the boot loader, which will then receive the full program. There are many benefits to this arrangement:

Program block format

The program sent to the boot loader is contained in a 256-byte program block. The format allows multiple loader implementations that make differing tradeoffs between code size and robustness, without any changes to the program block format. This allows the PC to work with them without having to know which is being used. The block consists of a signature, checksums, and code.

Offset  Size  Content
+0      4     Signature: $E2 $5D $CC $75
+4      1     8-bit checksum: -$E2 - (sum of all 256 bytes, treating this one as zero)
+5      2     16-bit CRC of 249-byte user data. See below for calculation.
+7      249   User code/data, copied to $07-$FF in zero-page
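The 8-bit checksum row implies a simple rule: $E2 plus the sum of all 256 bytes of a valid block is zero modulo 256. The following is a minimal sketch of that check, not part of the specification. The name checksum8_ok is hypothetical, and the function expects the block before bit reversal and complement are applied.

```c
/* Returns 1 when the 8-bit checksum of an untransformed block is valid.
   Since checksum = -$E2 - (sum of the other 255 bytes), $E2 plus the
   sum of all 256 bytes must be zero modulo 256. */
static int checksum8_ok(const unsigned char block[256])
{
    unsigned sum = 0xE2;
    int i;

    for (i = 0; i < 256; i++)
        sum += block[i];

    return (sum & 0xFF) == 0;
}
```

Because the sum covers every byte, a loader can accumulate it while receiving and needs no special case for the header.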

Before sending a program block, all 256 bytes must be transformed. This involves bit order reversal and complement. For example, the byte $21 must be sent as $7B ($21 bit reversed is $84, and complemented is $7B).
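The per-byte transformation can be sketched as a small C function. This is only an illustration of the rule stated above; the name to_wire is hypothetical.

```c
/* Wire form of one byte: reverse the bit order, then complement. */
static unsigned char to_wire(unsigned char b)
{
    unsigned char r = 0;
    int n;

    for (n = 0; n < 8; n++)
        r = (unsigned char)((r << 1) | ((b >> n) & 1));

    return (unsigned char)(r ^ 0xFF);
}
```

For the example above, to_wire(0x21) gives 0x7B. Applying the function twice returns the original byte, so the same routine can be used on the PC to inspect data that has already been transformed.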

To accommodate a minimal loader which simply receives 256 bytes without checking for the signature, nothing should be sent before the program block.

The CRC is calculated in such a way that a normal CRC-16 calculated on the last 251 bytes of the block (the two CRC bytes followed by the 249 bytes of user code/data) yields a zero CRC at the end. The calculation starts the CRC out at zero, rather than the usual $FFFF.
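As a cross-check, the forward calculation that a verifying loader would perform can be sketched in C. This is a minimal sketch under a stated assumption, not part of the specification: it takes "normal CRC-16" to mean the MSB-first CRC with polynomial $1021, which is the update that the constant 0x8810 in make_block inverts. The name block_crc is hypothetical, and the function expects the block before bit reversal and complement are applied.

```c
#include <stdint.h>

/* Forward CRC over block[5..255]: the two stored CRC bytes followed by
   the 249 bytes of user code/data, starting from zero.
   ASSUMPTION: MSB-first CRC-16 with polynomial 0x1021. */
static uint16_t block_crc(const unsigned char block[256])
{
    uint16_t crc = 0;
    int i, n;

    for (i = 5; i < 256; i++) {
        crc ^= (uint16_t)(block[i] << 8);
        for (n = 0; n < 8; n++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;     /* zero for a valid block */
}
```

Under that assumption, a block whose CRC bytes were produced by the end-to-beginning calculation gives block_crc(block) == 0, and any single flipped bit in those 251 bytes gives a nonzero result.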

The following C code accepts a pointer to a 256-byte block that has user code/data beginning at offset 7 (bytes before that are overwritten with the header). After returning, that 256-byte block contains a valid program block, ready to be sent to a boot loader.

/* Converts 256-byte block of user code (beginning at offset 7)
into 256-byte program block ready to send to boot loader */
void make_block( unsigned char block [256] )
{
    unsigned char header [7] = { 0xE2, 0x5D, 0xCC, 0x75, 0, 0, 0 };
    long crc;
    int i, n;
    
    /* Write header to beginning of block */
    for ( i = 0; i < 7; i++ )
        block [i] = header [i];
    
    /* Calculate 16-bit CRC that will cancel out in the end */
    crc = 0;
    for ( i = 255; i >= 5; i-- )
    {
        for ( n = 0; n < 8; n++ )
            crc = (crc >> 1) ^ ((crc & 1) * 0x8810);
        
        crc = crc ^ (block [i] << 8);
    }
    block [5] = crc >> 8 & 0xFF;
    block [6] = crc      & 0xFF;
    
    /* Calculate 8-bit checksum AFTER 16-bit CRC */
    n = -0xE2;
    for ( i = 0; i < 256; i++ )
        n = n - block [i];
    block [4] = n;
    
    /* Reverse bit order and complement bits */
    for ( i = 0; i < 256; i++ )
    {
        int flipped = 0;
        for ( n = 0; n < 8; n++ )
            flipped = (flipped << 1) | (block [i] >> n & 1);
        
        block [i] = flipped ^ 0xFF;
    }
}

Boot loader operation

A boot loader performs the following actions:

  1. Wait for signature (optional). It can wait for the full 4-byte signature, just the first byte or two, or simply assume that nothing comes before the program block.
  2. Receive checksums.
  3. Receive program data. The 249 bytes of user code/data are written to $07-$FF in zero-page.
  4. Verify checksum (optional). It may verify the 8-bit and/or the 16-bit checksum. If a checksum is invalid, it should go back to the first step.
  5. Begin running program at $0007.
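The steps above can be modeled on the PC side for testing a sending tool. The following is a hypothetical sketch, not an NES implementation: it models a loader that waits for the full signature and verifies only the 8-bit checksum. The names from_wire and model_loader, and the convention of passing the received stream as an array, are assumptions made for illustration.

```c
#include <stddef.h>

/* Undo the wire transformation (complement and bit reversal). */
static unsigned char from_wire(unsigned char w)
{
    unsigned char r = 0;
    int n;
    for (n = 0; n < 8; n++)
        r = (unsigned char)((r << 1) | ((w >> n) & 1));
    return (unsigned char)(r ^ 0xFF);
}

/* Model of a loader that checks the full signature and 8-bit checksum.
   Writes block bytes 4..255 to zp[4..255]; zp[7..255] is the program.
   Returns 1 on success, 0 if the stream ends before a valid block. */
static int model_loader(const unsigned char *wire, size_t len,
                        unsigned char zp[256])
{
    static const unsigned char sig[4] = { 0xE2, 0x5D, 0xCC, 0x75 };
    size_t pos = 0;

    for (;;) {
        unsigned sum;
        int matched = 0, i;

        /* Step 1: wait for the signature, skipping anything else */
        while (matched < 4) {
            unsigned char b;
            if (pos >= len)
                return 0;
            b = from_wire(wire[pos++]);
            if (b == sig[matched])
                matched++;
            else
                matched = (b == sig[0]);
        }

        /* Steps 2-3: receive 3 checksum bytes and 249 program bytes */
        if (len - pos < 252)
            return 0;
        sum = 0xE2 + sig[0] + sig[1] + sig[2] + sig[3];
        for (i = 4; i < 256; i++) {
            zp[i] = from_wire(wire[pos++]);
            sum += zp[i];
        }

        /* Step 4: on a bad checksum, go back to the first step */
        if ((sum & 0xFF) != 0)
            continue;

        /* Step 5: a real loader would now jump to $0007 */
        return 1;
    }
}
```

Like a real loader, the model cannot rewind the stream: after a failed checksum it resumes waiting for a signature at the next unread byte.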

A boot loader may take up to 17600 additional CPU cycles after receiving the last byte of the program block before it starts executing the program. This gives it up to 70 cycles per byte to calculate the checksum.

No other register or memory initialization is necessary before running the program. In particular, the stack pointer and the first 7 bytes of zero-page may be left uninitialized.

If a loader verifies the full 4-byte signature, it is best not to store the signature in the loader's code in unmodified form. This prevents the loader itself from looking like a program block.

Execution environment

The received program begins executing in the following environment:

PC             $0007
A, X, Y, P, S  Uninitialized
$00-$06        Uninitialized
$07-$FF        User code/data
$100-$7FF      Uninitialized

Execution begins no more than 17600 cycles after the last byte of the program block was received.

RS-232 serial interface

Serial data is received at 57600 bits per second, 8 data bits, and no parity. The number of stop bits must be at least 1. This data rate is reasonably fast and has proven very reliable over many years of use. If a higher rate is needed, it can be switched to after sending the program block.
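To put the data rate in terms of CPU time, the cycles available per bit can be worked out as follows. This is a small sketch under an assumption the specification does not state: an NTSC CPU clock of about 1.789773 MHz. The function names are hypothetical.

```c
/* CPU cycles that elapse during one serial bit. */
static double cycles_per_bit(double cpu_hz, double baud)
{
    return cpu_hz / baud;
}

/* CPU cycles per serial frame: 1 start bit, 8 data bits, 1 stop bit. */
static double cycles_per_frame(double cpu_hz, double baud)
{
    return 10.0 * cycles_per_bit(cpu_hz, baud);
}
```

With cpu_hz = 1789773.0 and baud = 57600.0, this gives roughly 31 cycles per bit and roughly 311 cycles per byte frame. A PAL console has a different CPU clock, so the figures differ there.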

The Data 0 pin of the second controller port is used to receive data. It should be inverted from RS-232 levels, so that the Mark level (-12V) drives the Data 0 pin high. Standard RS-232-to-TTL converters like the MAX232 chip and the FTDI USB-TTL cable do this. If connecting RS-232 using discrete components, use an inverting circuit similar to the following.

Design rationale

At least 256 bytes: A loader needs to receive a good number of bytes and then execute them as code. It could receive a small number, say 100, but there's no reason not to receive 256 bytes, given 6502 indexing.

No more than 256 bytes: Beyond 256 bytes, more code is needed on the 6502. This adds quite a few bytes to the loader code. It also leads to the inevitable desire to specify the destination address, and then the size of the received data, and finally support for multiple blocks of data to different regions of memory.

Load into zero-page: The received program is likely to be a loader itself, albeit a more capable one than the boot loader. This means that it shouldn't occupy a region of memory into which it will be loading code. Since most programs use zero-page for variables rather than code, zero-page is an obvious choice. This also allows the received program to be self-modifying and to use the more compact zero-page addressing to do so. Finally, the received program can be the first 256 bytes of a larger 512-byte program at $000-$1FF, with the initial half receiving the rest on its own.

Checksum of data: The loader must be able to verify that it received the program block without error, so that it doesn't execute corrupt data and produce unpredictable results. The received program cannot reliably checksum itself, because the checksum code would itself have to be intact; if corrupted, it might accept an incorrect checksum as correct.

Signature at beginning: A signature at the beginning allows a loader to ignore any other data it might receive before the program block. The checksum alone might seem able to handle this, but while it would prevent a malformed block from running, the loader would also end up ignoring the real program block. With a signature, the loader can wait until it finds the signature and then receive the rest of the program block, so junk data before the block doesn't cause the block itself to be ignored.

Multi-byte signature: The signature must consist of multiple bytes, not just a single one. This greatly reduces the possibility of random data containing the signature. A two-byte signature is still somewhat likely to occur, while a four-byte signature is extremely unlikely. The particular values for the signature have been chosen after scanning lots of NES code and data for the sequences least likely to occur.

8-bit and 16-bit checksums: An 8-bit checksum can be implemented with very little code, and is a lot better than nothing. A 16-bit checksum is much more robust, so it's also included. The 8-bit checksum is calculated after the 16-bit checksum, and therefore covers it, because the loader that verifies only the 8-bit checksum is meant to be the most compact. It's simpler for that loader to checksum all the data, rather than all but two bytes.

The 16-bit checksum starts out as zero rather than the usual $FFFF. This allows the loader to clear the CRC as a side-effect of checking the signature. The 16-bit CRC is stored with the most significant byte first. This allows the loader to run its CRC calculation over the stored CRC together with the data and have the result cancel out to zero if the block is correct.

Signature and checksums as part of 256 bytes, rather than in addition: Fundamentally, a loader must keep track of how many bytes of the program block it has received. If the block is larger than 256 bytes, it must use more than 8 bits to keep track of the position. Handling more than 8 bits requires more code, and prevents keeping the state in a single register. The 4-byte signature and 3 bytes of checksums are thus put into the 256-byte block, rather than added before and after it. This allows a minimal loader to receive exactly 256 bytes and then begin executing, without leaving any bytes unread or having to skip any. It also allows putting variables into zero-page so that they are initialized as a part of receiving data, reducing loader size.

Program begins at $07: Since the signature and checksums are part of the 256-byte program block, the user code is smaller than 256 bytes. If it were placed at address 0, it wouldn't go all the way to the end of zero-page. So the user code is loaded at $07 in zero-page, so that it covers all bytes through the end of zero-page. User code can easily receive more code at $100 and have it connect seamlessly. It can then use $00-$06 for variables. A minimal loader can easily achieve this load address by writing the first byte of the program block to $00, so that the program data at offset 7 gets written to $07 in zero-page.

Extra time for loader to calculate checksum: The loader is given extra time to calculate the 16-bit checksum after receiving the data. While it's possible to calculate the checksum as each serial byte is being received, doing so is more involved and makes the loader larger. This additional time means that the received code must re-synchronize with the serial stream, but it must do that anyway, since beginning execution already takes long enough that at least one serial byte will be lost.

Uninitialized registers on program entry: Not specifying initial register values on entry to the received program means that it can't assume they are cleared. If it needed A, X, and Y clear, it would need four bytes of code to do so. In most cases it might need one cleared, which adds two bytes. Also, many programmers will still clear things at the beginning anyway, just to be more robust. Guaranteeing cleared registers would add a byte to the smallest boot loader implementation.

Data sent complemented and bit-reversed: By complementing the data and bit-reversing it, the loader's serial receive code can be simplified. This data transformation isn't a problem on the PC side, since the checksums must be calculated in a custom tool anyway.

Contrast the two receive loops. First, the loop when the data is bit-reversed and complemented on the PC side:

        lda #$01        ; Wait for start bit
start:  bit $4017
        beq start
        ldy #6          ; Delay from start bit to middle of data bit
dbit:   dey
        bne dbit
        ldy #3          ; Delay between bits
        nop
        nop
        lsr $4017       ; Read data bit
        rol a
        bcc dbit

Second, the loop that would be necessary if the data were sent unmodified by the PC:

        ldy #6          ; Delay from start bit to middle of data bit
        lda #$01        ; Wait for start bit
start:  bit $4017
        beq start
        lda #$80        ; *** Added
dbit:   dey
        bne dbit
        ldy #3          ; Delay between bits
        nop
        nop
        lsr $4017       ; Read data bit
        ror a
        bcc dbit
        eor #$FF        ; *** Added

This adds two instructions, and reduces the available processing time between received bytes by four cycles.
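The effect of the shorter loop can be checked with a small C model. This is a hypothetical sketch, not NES code: it assumes the UART sends the least significant bit first and that the port reads each data bit inverted, as the start-bit test in the listings above implies. The names wire_encode and model_receive are made up for illustration.

```c
/* PC side: reverse the bit order, then complement. */
static unsigned char wire_encode(unsigned char b)
{
    unsigned char r = 0;
    int n;
    for (n = 0; n < 8; n++)
        r = (unsigned char)((r << 1) | ((b >> n) & 1));
    return (unsigned char)(r ^ 0xFF);
}

/* Model of the first listing's bit loop for one byte on the wire.
   ASSUMPTIONS: least significant bit arrives first; each bit reads
   inverted at the port. */
static unsigned char model_receive(unsigned char wire)
{
    unsigned a = 0x01;      /* lda #$01: sentinel bit */
    int bit = 0;
    int carry;

    do {
        int read = !((wire >> bit) & 1);    /* lsr $4017: bit into carry */
        bit++;
        a = (a << 1) | (unsigned)read;      /* rol a */
        carry = (int)((a >> 8) & 1);        /* sentinel falls out on 8th rol */
        a &= 0xFF;
    } while (!carry);                       /* bcc dbit */

    return (unsigned char)a;
}
```

Under these assumptions, model_receive(wire_encode(x)) returns x for every byte value, with no extra instructions needed to fix the bit order or polarity.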

Signature and checksums at beginning: By putting the extra data at the beginning of the block, the user code/data is at the same offset in the block as it will be in zero-page. This simplifies thinking about loading, removing an unnecessary complication in implementation. It also eliminates the possibility that a boot loader could start executing the program before it's all been received, since the last byte of the block is part of the program rather than the checksum as before. This change didn't increase the size of any of the loaders, though it did require figuring out how to do a CRC-16 calculation from end to beginning.

Change log