C640: 2020

Saturday, June 6, 2020

The ALU

The 65020's ALU is its most complex component. It must be able to add and subtract in both binary and decimal modes, perform logical AND, OR, and exclusive OR, shift and rotate values by arbitrary distances, and also set, clear, toggle, and test individual bits. All of these operations must work on 8, 16, or 32 bit values, and produce appropriate values for the flags.

To do all of this, I've broken it down into three smaller sub-components.

The A input is usually the first operand, and B is usually the second operand. B can be optionally shifted (not shifting is the same as shifting by 0) or inverted. The two operands go into the Add/Logic sub-component, which can do binary or BCD addition, or logical and/or/exclusive or. Subtraction is done by inverting the second operand before adding it to the first.

So ADC A0, #123 is handled by sending the contents of A0 to input A and the constant 123 to input B, setting the shift input to 0, leaving B uninverted, and then adding the two.

SBC A0, $1234 has A = A0, B = value from memory, shift = 0, B inverted, then adding.

The various shift and rotate instructions take their operand on the B input. A is set to 0, and the adder/logic unit performs an OR, to pass the shifted result to the output.

LDA doesn't look like an instruction that uses the ALU, but I have it using the same microcode and passing the loaded value through the ALU. A is set to 0, there is no shift or invert, and the Add/Logic sub-component performs an OR. This sends the loaded value through unchanged, but allows flags to be set.

CLC also doesn't look like an ALU instruction. But on the 65020, it's a special case of a more general set of bit-clearing instructions. These have the bit number encoded in the instruction, and the bit to be cleared can be in any register or in memory. The A input takes the register or memory value, B is set to 1, and the bit number goes to the shift input. The shifter output is inverted, and then ANDed with the value. SEC is done in a very similar way, but using OR and not inverting the shifter output.
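A rough Python sketch of that datapath may help make the examples concrete. It is not the real implementation: the function name alu, the fixed 32 bit width, the left-only shifter, and the binary-only adder are all assumptions made for illustration, and the carry input follows the 6502 convention where carry set means "no borrow".

```python
MASK32 = 0xFFFFFFFF

def alu(a, b, shift=0, invert=False, op="add", carry_in=0):
    """Hypothetical model: shift B, optionally invert it, then combine with A."""
    b = (b << shift) & MASK32            # shifter (left shift only in this sketch)
    if invert:
        b = ~b & MASK32                  # inverter (binary mode only)
    if op == "add":
        total = a + b + carry_in
        return total & MASK32, total >> 32   # (result, carry out)
    if op == "and":
        return a & b, 0
    if op == "or":
        return a | b, 0
    if op == "eor":
        return a ^ b, 0
    raise ValueError(op)

# ADC A0, #123 with A0 = 1000 and carry clear
adc_result, adc_carry = alu(1000, 123)
# SBC: invert B, then add; carry_in=1 means "no borrow"
sbc_result, sbc_carry = alu(1000, 123, invert=True, carry_in=1)
# LDA: A = 0, no shift or invert, OR passes B through unchanged
lda_result, _ = alu(0, 0x42, op="or")
# Clear bit 0: B = 1 shifted by the bit number, inverted, then ANDed
clear_result, _ = alu(0b1011, 1, shift=0, invert=True, op="and")
# Set bit 2: same path, but OR and no invert
set_result, _ = alu(0b1011, 1, shift=2, op="or")
```

The point of the sketch is that one fixed path (shift, invert, add/logic) covers every instruction mentioned above; only the control inputs change.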

The Inverter

The inverter is a very simple component, but there are a couple of interesting points. If we didn't have to support decimal mode, it would be a simple array of XOR gates, with one input of each connected to the 'invert' control signal.

To support BCD subtraction, it also needs to be able to give us the 9's complement of a value: this is obtained by subtracting each nybble from 9. At first glance, this is simple: take the input 4 bits at a time, add the 'invert' and 'decimal' control signals, and that's a perfect fit for the Spartan 6's 6-input LUTs. A 32 bit inverter will take 32 LUTs, or 8 slices.

But we can do better than that. Each LUT actually has two outputs, allowing two functions of 5 and 6 inputs respectively. If we can use only five inputs, we can have two independent functions implemented in a single LUT.

input  binary  decimal
0000   1111    1001
0001   1110    1000
0010   1101    0111
0011   1100    0110
0100   1011    0101
0101   1010    0100
0110   1001    0011
0111   1000    0010
1000   0111    0001
1001   0110    0000
1010   0101    1111
1011   0100    1110
1100   0011    1101
1101   0010    1100
1110   0001    1011
1111   0000    1010

Examining the truth table of the inverter, it is apparent that the bottom two bits of the output depend on only the bottom two bits of the input. And the top two bits of the output depend on only the top three. Add the two control lines, and that's 4 inputs for one pair of outputs and 5 for the other pair. That means we can do the whole inverter in only 16 LUTs, or 4 slices.
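The dependency claim can be checked by brute force. The sketch below is only a software model of the truth table, not the LUT contents: invert_nybble and depends_only_on are hypothetical helper names, and the decimal column is modelled as (9 - x) mod 16, which matches the table above including the rows for inputs above 9.

```python
def invert_nybble(x, invert, decimal):
    """Model of one 4 bit slice of the inverter."""
    if not invert:
        return x
    if decimal:
        return (9 - x) & 0xF     # 9's complement, wrapping for inputs above 9
    return ~x & 0xF              # plain bitwise inversion

def depends_only_on(out_mask, in_mask):
    """True if the selected output bits are a function of only the selected
    input bits, for every setting of the two control signals."""
    for invert in (False, True):
        for decimal in (False, True):
            seen = {}
            for x in range(16):
                key = x & in_mask
                val = invert_nybble(x, invert, decimal) & out_mask
                if seen.setdefault(key, val) != val:
                    return False
    return True

low_pair_ok = depends_only_on(0b0011, 0b0011)    # bottom two outputs, bottom two inputs
high_pair_ok = depends_only_on(0b1100, 0b1110)   # top two outputs, top three inputs
high_pair_needs_bit1 = not depends_only_on(0b1100, 0b1100)
```

The last check shows why the top pair needs three data inputs rather than two: in decimal mode, output bit 2 changes with input bit 1.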

Monday, April 6, 2020

A New Simulator

It's taken a while, but that was worth doing. The new hardware-style simulator has clarified the structure needed for the FPGA version, and as a bonus given me the contents of the microcode ROM.

The final list of components:

Registers
ALU
MulDivMod
PCReg
SPReg
FlagsReg
OpcodeReg
InReset
Cycle
BranchCycle
MicrocodeROM
NanocodeROM
Address
Operand
OperandAddr
MemoryInterface

InReset stores a single bit indicating that the CPU is in reset. While it is set, the memory writes that BRK performs as it pushes the flags and PC are disabled.

PC, SP, and Flags are registers, but since they have special functions and can be written and read outside the standard register access, they get implemented separately.

Cycle is a 3 bit counter which stores the number of the cycle within execution of each instruction. It normally increments on each cycle, but nanocode can request a conditional jump to a different cycle. This allows a single nanocode routine to implement instructions with one or two byte addresses (requiring one or two cycles), and operands of one or two bytes. If an instruction only needs a one byte address, for example, then its microcode includes a 'BaseAddr16' flag which signals to the nanocode to skip the cycle which fetches the second address byte.
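As a sketch of how the counter and the conditional jump interact, here is a small Python model. The routine layout is invented for illustration (the real nanocode routines are not shown here), and next_cycle, ROUTINE, and trace are hypothetical names.

```python
def next_cycle(cycle, cond, target, signals):
    """3 bit cycle counter: jump if the condition holds, otherwise increment."""
    if cond is not None and signals.get(cond, False):
        return target & 0b111
    return (cycle + 1) & 0b111

# Invented routine: cycle 1 fetches the first address byte, cycle 2 the second.
# Cycle 1 carries a jump to cycle 3 that is taken when BaseAddr16 is signalled.
ROUTINE = {
    1: ("BaseAddr16", 3),
    4: ("Always", 0),        # last cycle: go back to cycle 0
}

def trace(signals):
    """Return the list of cycles visited during one instruction."""
    signals = {"Always": True, **signals}
    cycle, visited = 0, []
    while True:
        visited.append(cycle)
        cond, target = ROUTINE.get(cycle, (None, 0))
        cycle = next_cycle(cycle, cond, target, signals)
        if cycle == 0:
            return visited

short_form = trace({"BaseAddr16": True})    # skips cycle 2
long_form = trace({"BaseAddr16": False})    # runs every cycle
```

With the flag signalled the instruction visits cycles 0, 1, 3, 4; without it, cycle 2 runs as well.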

Branch instructions are too complex for this simple system. For example, a simple branch with a one byte offset will go directly from the fetch of the first offset byte to fetching the next opcode (from either the next instruction or from the destination address, depending on the branch condition). If the branch has the 'link' bit set, then it must go from the offset fetch to the cycles that push the current PC. If 'indirect' is set instead, then it goes to the cycles that fetch the destination address from memory.

To handle all this complexity, BranchCycle is a 128x3 ROM. The address is made from bits from the opcode (Link, Indirect, OffsetWidth), the branch condition, and the current cycle number. The output is the next cycle. If the low 5 bits of the opcode are 10000 (a branch instruction), then this ROM overrides the usual cycle selection.
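The address formation might look something like the following Python sketch. The bit order within the address is a guess, as is treating OffsetWidth and the branch condition as one bit each (that is simply what makes 7 address bits, and so 128 entries, work out). Only the low-5-bits test for a branch opcode comes directly from the description above.

```python
from itertools import product

def is_branch(opcode):
    """Branch instructions have 10000 in the low 5 bits of the opcode."""
    return (opcode & 0b11111) == 0b10000

def branch_cycle_address(link, indirect, offset_width, condition, cycle):
    """Pack the inputs into a 7 bit ROM address (bit order is an assumption)."""
    return ((link << 6) | (indirect << 5) | (offset_width << 4)
            | (condition << 3) | (cycle & 0b111))

# Every combination of inputs should land on a distinct ROM entry.
addresses = {
    branch_cycle_address(l, i, w, c, cy)
    for l, i, w, c in product((0, 1), repeat=4)
    for cy in range(8)
}
```

Each of the 128 entries would then hold the 3 bit number of the next cycle.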

The microcode ROM has 512 entries: one for each opcode and one for each alternate (the same opcode with bit 15 set). The outputs are:

ALUCIn: Selects the source of the ALU's C input (carry in). 0 and 1 select constants (0 for 'add without carry', 1 for 'CMP', for example). C selects the carry flag (for 'add with carry'). Ext and Rot select one end of the shifted value or the other, and are used for shifts and rotates.
ALUInvB: If set, inverts the B input of the ALU. This is used to implement subtraction and the BIC (bit-clear) instruction.
ALUOp: Selects the ALU operation. It can be Add, And, Eor, InB (output the B input unmodified), Neg, Or, ShiftL, ShiftR. InB allows the nanocode routine that handles ADC, EOR, and so on to also implement LDA, LDX, and LDY. ShiftL and ShiftR implement all of the shift and rotate instructions through the choice of the C input.
BaseAddr16: If set, signals to the nanocode to skip fetching of the second address byte.
BitNum: Selects which bit instructions like CLC and SED operate on. It can be 0, 2, 3, or 6, and is combined with bits from the opcode extension to select any of the 32 bits.
DataWidthSel: Either '32' to signal that this instruction always works with 32 bit data, or '8_9' to use bits 8 and 9 of the opcode to select the data width.
DefaultReg: Selects the main register that the instruction uses. It can be A0, X0, Y0, P, or SP.
RegMod: The choice of main register can be modified by bits from the opcode, and this field determines which ones are used. It can be None (don't modify), MOV (special modification for the MOV instructions), 8_11, 10_12, 10_13, 11_14, 13_14, or 13_15.
DefaultIndex: Selects the second, or index, register. It can be A0, X0, Y0, P, SP, or PC. Instructions with indexed addressing modes use this as the index register.
IndexMod: Which opcode bits modify the choice of index register. It can be None, MOV (again, special handling for MOV instructions), 8_11, or 10_12.
FlagWrite: Four separate flags to enable writing to the C, Z, V, and N flags.
MulDivOp: Selects which of the MUL, DIV, and MOD instructions this is.
NoRegWrite: If set, disables the usual write to the destination register. Instructions like CMP behave almost identically to other ALU instructions like SBC. This allows them to use the same nanocode.
NSel: Some instructions have a small constant encoded in the opcode. This field selects where it is. It can be 1 (the constant is 1) or 13_14 (the constant is encoded in bits 13 and 14).

The nanocode fields are:

AluASel: Select the source for the A input to the ALU. This can be from Operand or from the A output of Registers.
AluBSel: Select the source for the B input to the ALU. This can be Operand, the B output of Registers, OperandOrReg (the choice between the two is made by bit 14 of the opcode, for read-modify-write instructions like LSR, which can use an immediate operand either directly as a shift amount, or as the number of a register containing the shift amount), N (for instructions with a small constant encoded in the opcode), or BitNum (the bit selected by the BitNum field of the microcode).
CycleCond: The condition for jumps to other nanocode instructions. This can be Always, BaseAddr16, Data16, Branch, or MulDivRunning. MUL, DIV, and MOD are expected to take more cycles than Cycle can count, so MulDivRunning lets us repeat a single nanocode instruction until the MulDiv unit has finished.
CycleJump: The destination for Cycle to jump to if the condition specified by CycleCond is true.
ExitReset: Clear the InReset flag, starting normal operation.
AddressInputSel: Which value to send to the address of the memory interface. It can be OperandAddr, PC, SP, or Vector (the address is determined by bits from the opcode, for the BRK instruction)
AddressInc: Add 1 to the address. This is used for accessing two-byte values.
MemWriteDataSel: Selects the source of the data to be written to memory. It can be ALUOutL, ALUOutH, RegAOutL, RegAOutH, selecting the low or high bytes of either the ALU output or the A output of Registers.
MemWriteDataWidth: The size of data to write to memory. 16, 32, or D. 32 bit writes are implemented as two separate write cycles, with (for example) RegAOutL and RegAOutH selecting which half to write. But the odd layout of 32 bit data required for compatibility with the 6502 means different parts of the value are written depending on whether a cycle is a 16 bit write or the first half of a 32 bit write. D means 'take the size from bits 8 and 9 of the opcode'.
WriteEnable: If set, this cycle writes to memory.
OpcodeLoad: If set, load the memory read data into the Opcode register.
OperandLoad: If set, load the memory read data into the Operand register.
OperandExtend: If set, combine the memory read data with the current contents of the Operand register to make a 32 bit value.
OperandAddrLoad: Load the OperandAddr register.
OperandAddrExtend: Extend the OperandAddr register.
OperandAddrExtendFromOperand: This combines the memory read data with the contents of Operand, but writes the result to OperandAddr. This lets us load a two byte address into OperandAddr from a location given by OperandAddr, using Operand as temporary storage for the first byte.
PCInc: Increment PC.
PCLoad: Copy OperandAddr into PC.
SPDec: Decrement SP.
SPInc: Increment SP.
RegASel: Which register to select for the A port of Registers: P, PC, or the register given by the microcode DefaultReg and RegMod fields.
RegBSel: Which register to select for the B port of Registers: Index (use DefaultIndex and IndexMod from microcode), Operand (the register number is in the low 4 bits of Operand, used for the immediate mode of read-modify-write instructions that store a register number in immediate data), or Zero (used by branch instructions, but I can't remember why)
RegBIsIndex: If set, the B port of Registers is used as an index register. It is automatically added to the memory address, and if the flags register P is selected, 0 is used instead.
RegWriteSel: Selects the source of data to be written to a register. It can be ALU (ALU output), Data (memory read data), or MulDiv (MulDiv unit output).
RegWriteEnable: If set, and the microcode hasn't selected NoRegWrite, writes a value to the register selected by DefaultReg and RegMod.
FlagsWriteEnable: Enables writing to flags. The flags that get written are chosen by microcode.
RunMulDiv: Starts the MulDiv unit.
SetB: Clear the B flag if an interrupt is being handled, set the B flag if it isn't. Only used by the BRK instruction. Interrupts are handled by loading a BRK instruction into the Opcode register.
SetI: Sets the I flag. This happens in BRK.

The simulator is now running both CPU simulations in parallel, comparing all register values after each instruction. Commodore 64 BASIC and my simple graphics tests run fine, with no differences between the simulations. It's ready for the FPGA!
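The lockstep comparison can be sketched in a few lines of Python. ToyCPU is a stand-in invented for this example, with a deliberately planted bug; the real simulator's CPU interface is not shown in this post, so step() and registers() are assumed method names.

```python
class ToyCPU:
    """Stand-in for a CPU implementation; not the real simulator interface."""
    def __init__(self, bug_at=None):
        self.a0 = 0
        self.count = 0
        self.bug_at = bug_at

    def step(self):
        """Execute one 'instruction': here, just increment A0."""
        self.count += 1
        self.a0 += 2 if self.count == self.bug_at else 1

    def registers(self):
        return {"A0": self.a0}

def run_lockstep(reference, candidate, max_instructions):
    """Step both CPUs together; return the number of the first instruction
    after which their registers differ, or None if they never diverge."""
    for n in range(1, max_instructions + 1):
        reference.step()
        candidate.step()
        if reference.registers() != candidate.registers():
            return n
    return None

first_mismatch = run_lockstep(ToyCPU(), ToyCPU(bug_at=3), 10)
clean_run = run_lockstep(ToyCPU(), ToyCPU(), 10)
```

Comparing after every instruction means a divergence is reported at the instruction that caused it, rather than many instructions later when the symptoms show up.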

Tuesday, March 17, 2020

A New Start

One day I'll learn to listen to my own advice. In the previous post, I had a very ugly nested-if implementation of the handful of instructions I needed to run a simple test program. That worked, but it clearly wasn't a good way to continue. It was obvious that I needed to take a much more hardware-oriented approach, and definitely use a microcode ROM.

So of course I didn't. I continued with the nested-if style, implementing more and more instructions. Synthesis was taking longer and longer. Eventually, with fewer than half the instructions done and each iteration taking about 15 minutes, I stopped and looked at the log. That showed 105% FPGA resource use. I can only assume that Place-And-Route was doing some heroic optimisation to squeeze it all in.

So I need a new implementation. It wasn't clear what architecture would be needed - what components there should be, what internal busses, and how it should all connect together. My usual approach is to do a rough first draft, then keep tweaking it as I fill in the details. I'm comfortable doing that in software, but I still find writing VHDL enough of an effort that I was reluctant to try it. But I still needed to know what the implementation should look like before I could start.

So I'm back to the software simulator for a while. I've re-worked the code a little so it now supports two separate implementations of the CPU interface. One is the old simulator, which will serve as a reference. The new one is written to have the same structure as the (eventual) hardware. There's a class for each type of component, and they communicate through explicit signal variables. It's all controlled by a two-level microcode/nanocode component.

First, there's a 256 entry microcode ROM which gives global information about each instruction - what registers it uses, the structure of the opcode extension, and so on.

Then there's a 32x8 entry nanocode ROM, which provides cycle-by-cycle control of the execution of each instruction. Instructions can take up to 7 cycles, and there are 25 different types. Rounding that up to powers of 2, we get 32x8 = 256 entries.
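One plausible way to index such a ROM is shown below in Python. Putting the instruction type in the high bits and the cycle in the low bits is an assumption, as is the name nanocode_address; the sizes come from the figures above.

```python
TYPE_SLOTS = 32    # 25 instruction types, rounded up to a power of 2
CYCLE_SLOTS = 8    # up to 7 cycles, rounded up to a power of 2

def nanocode_address(instr_type, cycle):
    """Index into the nanocode ROM: one row of 8 cycle slots per type."""
    if not 0 <= instr_type < TYPE_SLOTS:
        raise ValueError("instruction type out of range")
    if not 0 <= cycle < CYCLE_SLOTS:
        raise ValueError("cycle out of range")
    return instr_type * CYCLE_SLOTS + cycle

rom_size = TYPE_SLOTS * CYCLE_SLOTS
last_used = nanocode_address(24, 6)   # 25th type, 7th cycle
```

With this layout the cycle counter can feed the low 3 address bits directly, and the unused slots are simply never reached.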

Each nanocode instruction has a conditional jump, to allow skipping of some cycles under various conditions. That allows, for example, LDA abs,Y and ADC zp,X to use the same type. Microcode selects the index register, and ADC zp,X can skip the cycle that fetches the high byte of the base address.

Right now, only one instruction type is implemented, and that type has only one instruction: BRK. The original 6502 implemented its reset sequence as a variant of BRK - the usual writes of P and PC to the stack are suppressed (although their cycles still take place), and the vector is fetched from $fffc instead of $fffe. I'm doing the same, loading $0100 into the opcode register and setting a flag that disables writes until the end of the next instruction. The 65020's BRK instruction has a 4 bit vector selection field in its extension bits, so it can select $fffc through that instead of using extra logic. Here's the nanocode for BRK:

AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutH | WriteEnable | SPDec
AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_SP | RegASel_P | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_Vector | OperandAddrLoad
AddressInputSel_Vector | AddressInc | OperandAddrExtend
AddressInputSel_OperandAddr | PCInputSel_OperandAddr | PCLoad
AddressInputSel_PC | OpcodeLoad | PCInc | CycleCond_Always | CycleJump0

Each line represents one cycle. The first three push PC and P to the stack: the address output selects the SP register, the register file output A selects PC or P, the memory write data bus selects either the high or low half of the selected register, a memory write is requested, and SP is decremented.
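To illustrate how a nanocode word like these could be interpreted, here is a Python sketch covering just the three push cycles. It is an illustration only: encoding every signal as its own bit in an IntFlag, splitting the register into 16 bit halves, and the run_pushes helper are all assumptions, not the simulator's actual code.

```python
from enum import IntFlag, auto

class Nano(IntFlag):
    # One bit per signal is a simplification made for this sketch.
    AddressInputSel_SP = auto()
    RegASel_PC = auto()
    RegASel_P = auto()
    MemWriteDataSel_RegAOutH = auto()
    MemWriteDataSel_RegAOutL = auto()
    WriteEnable = auto()
    SPDec = auto()

PUSH_CYCLES = [
    Nano.AddressInputSel_SP | Nano.RegASel_PC | Nano.MemWriteDataSel_RegAOutH | Nano.WriteEnable | Nano.SPDec,
    Nano.AddressInputSel_SP | Nano.RegASel_PC | Nano.MemWriteDataSel_RegAOutL | Nano.WriteEnable | Nano.SPDec,
    Nano.AddressInputSel_SP | Nano.RegASel_P | Nano.MemWriteDataSel_RegAOutL | Nano.WriteEnable | Nano.SPDec,
]

def run_pushes(cycles, pc, p, sp, memory, in_reset=False):
    """Execute the push cycles; writes are suppressed while in reset."""
    for word in cycles:
        reg = pc if Nano.RegASel_PC in word else p
        if Nano.MemWriteDataSel_RegAOutH in word:
            data = (reg >> 16) & 0xFFFF      # high half of the register
        else:
            data = reg & 0xFFFF              # low half of the register
        if Nano.WriteEnable in word and not in_reset:
            memory[sp] = data
        if Nano.SPDec in word:
            sp -= 1                          # the cycle still happens in reset
    return sp

stack = {}
sp_after = run_pushes(PUSH_CYCLES, pc=0x00012345, p=0x30, sp=0x1FF, memory=stack)
reset_stack = {}
sp_after_reset = run_pushes(PUSH_CYCLES, 0x00012345, 0x30, 0x1FF, reset_stack, in_reset=True)
```

In the reset case the stack pointer still moves but nothing is written, which mirrors how the 6502 reset sequence behaves.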

In the next two cycles, a vector address (generated from the opcode extension) is placed on the address bus, and the data read from memory is loaded into the OperandAddr register. This takes two cycles because the register is 32 bits wide, but the data bus is only 16. The first cycle loads the low 16 bits of the register and sets the high 16 bits to 0. The second cycle (OperandAddrExtend) takes the 16 bits already loaded and combines them with 16 new bits to make a 32 bit address.
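The two steps amount to the following, sketched in Python with hypothetical function names. The only assumption beyond the description above is that the second word read becomes the high half of the address.

```python
def operand_addr_load(data16):
    """First cycle: low 16 bits from memory, high 16 bits cleared."""
    return data16 & 0xFFFF

def operand_addr_extend(current, data16):
    """Second cycle: keep the low half, put the new word in the high half."""
    return ((data16 & 0xFFFF) << 16) | (current & 0xFFFF)

operand_addr = operand_addr_load(0x1234)
operand_addr = operand_addr_extend(operand_addr, 0xABCD)
```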

Next, OperandAddr is sent to the address bus (this is probably not needed) and PC is loaded with the contents of OperandAddr. If PC was given the same ability to load and extend as OperandAddr, this whole cycle could be removed. That sort of refinement is the whole purpose of writing this new simulator.

On the last cycle, PC is sent to the address bus and incremented, the Opcode register is loaded from memory, and we unconditionally jump to cycle 0 to start execution of the instruction that was just loaded.

The rest of the simulator is still set up to load the Commodore 64's ROMs, and the first instruction in their reset sequence is $a2 $ff: LDX #$ff. So that will be the next instruction. Since each nanocode routine handles all instructions that need the same sequence of operations, that's going to end up implementing the immediate mode of all of the 'main group' of instructions: LDA, ADC, CPX, and so on.