So of course I didn't. I continued with the nested-if style, implementing more and more instructions. Synthesis was taking longer and longer. Eventually, with fewer than half the instructions done and each iteration taking about 15 minutes, I stopped and looked at the log. That showed 105% FPGA resource use. I can only assume that Place-And-Route was doing some heroic optimisation to squeeze it all in.
So I needed a new implementation. It wasn't clear what architecture it should have: what components there should be, what internal busses, and how it should all connect together. My usual approach is to do a rough first draft, then keep tweaking it as I fill in the details. I'm comfortable doing that in software, but I still find writing VHDL enough of an effort that I was reluctant to try it there. Even so, I needed to know what the implementation should look like before I could start.
So I'm back to the software simulator for a while. I've re-worked the code a little so it now supports two separate implementations of the CPU interface. One is the old simulator, which will serve as a reference. The new one is written to have the same structure as the (eventual) hardware. There's a class for each type of component, and they communicate through explicit signal variables. It's all controlled by a two-level microcode/nanocode component.
First, there's a 256 entry microcode ROM which gives global information about each instruction - what registers it uses, the structure of the opcode extension, and so on.
Then there's a 32x8 entry nanocode ROM, which provides cycle-by-cycle control of the execution of each instruction. Instructions can take up to 7 cycles, and there are 25 different types. Rounding that up to powers of 2, we get 32x8 = 256 entries.
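As a rough illustration of that sizing, here is a minimal Python sketch of how a 5-bit instruction type and a 3-bit cycle counter could be packed into one 8-bit nanocode ROM address. The bit layout (type in the high bits, cycle in the low bits) and the names `nanocode_address` and `nanocode_rom` are assumptions made for the example, not details taken from the simulator.

```python
NANO_TYPES = 32   # 25 instruction types, rounded up to a power of 2
NANO_CYCLES = 8   # up to 7 cycles per instruction, rounded up to a power of 2


def nanocode_address(instr_type: int, cycle: int) -> int:
    """Pack a 5-bit type and a 3-bit cycle into one 8-bit ROM address.

    Assumed layout: type in bits 7..3, cycle in bits 2..0.
    """
    if not 0 <= instr_type < NANO_TYPES:
        raise ValueError("instruction type out of range")
    if not 0 <= cycle < NANO_CYCLES:
        raise ValueError("cycle out of range")
    return (instr_type << 3) | cycle


# Hypothetical ROM contents: one control word per (type, cycle) pair.
nanocode_rom = [0] * (NANO_TYPES * NANO_CYCLES)
```

With this layout each instruction type owns a contiguous block of 8 entries, so the cycle counter can simply be the low three address bits.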
Each nanocode instruction has a conditional jump, to allow skipping of some cycles under various conditions. That allows, for example, LDA abs,Y and ADC zp,X to use the same type. Microcode selects the index register, and ADC zp,X can skip the cycle that fetches the high byte of the base address.
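The cycle-skipping idea can be sketched in a few lines of Python. Everything here is hypothetical apart from the idea itself: the `NanoEntry` class, the `ZeroPage` condition name, and the four-cycle routine are invented for illustration and are not the real nanocode for any instruction type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NanoEntry:
    name: str
    cond: Optional[str] = None  # condition to test; None means never jump
    jump: int = 0               # cycle to go to when the condition holds


def next_cycle(entry: NanoEntry, cycle: int, conditions: dict) -> int:
    """Take the jump if the entry's condition is true, else fall through."""
    if entry.cond is not None and conditions.get(entry.cond, False):
        return entry.jump
    return cycle + 1


# Hypothetical shared routine for an indexed-address instruction type.
ROUTINE = [
    NanoEntry("fetch base low", cond="ZeroPage", jump=2),
    NanoEntry("fetch base high"),
    NanoEntry("add index, read operand"),
    NanoEntry("execute, fetch next opcode", cond="Always", jump=0),
]


def trace(conditions: dict) -> list:
    """Return the names of the cycles executed for one instruction."""
    conditions = {"Always": True, **conditions}
    cycle, visited = 0, []
    while True:
        entry = ROUTINE[cycle]
        visited.append(entry.name)
        cycle = next_cycle(entry, cycle, conditions)
        if cycle == 0:  # back at cycle 0: the instruction is finished
            return visited
```

Calling `trace({"ZeroPage": False})` walks all four cycles, while `trace({"ZeroPage": True})` skips the high-byte fetch, which is how two addressing modes could share one routine.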
Right now, only one instruction type is implemented, and that type has only one instruction: BRK. The original 6502 implemented its reset sequence as a variant of BRK - the usual writes of P and PC to the stack are suppressed (although their cycles still take place), and the vector is fetched from $fffc instead of $fffe. I'm doing the same, loading $0100 into the opcode register and setting a flag that disables writes until the end of the next instruction. The 65020's BRK instruction has a 4 bit vector selection field in its extension bits, so it can select $fffc through that instead of using extra logic. Here's the nanocode for BRK:
AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutH | WriteEnable | SPDec
AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_SP | RegASel_P | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_Vector | OperandAddrLoad
AddressInputSel_Vector | AddressInc | OperandAddrExtend
AddressInputSel_OperandAddr | PCInputSel_OperandAddr | PCLoad
AddressInputSel_PC | OpcodeLoad | PCInc | CycleCond_Always | CycleJump0
Each line represents one cycle. The first three push P and PC to the stack: the address output selects the SP register, the register file output A selects PC or P, the memory write data bus selects either the high or low half of the selected register, a memory write is requested, and SP is decremented.
In the next two cycles, a vector address (generated from the opcode extension) is placed on the address bus, and the data read from memory is loaded into the OperandAddr register. This takes two cycles because the register is 32 bits wide, but the data bus is only 16. The first cycle loads the low 16 bits of the register and sets the high 16 bits to 0. The second cycle (OperandAddrExtend) takes the 16 bits already loaded and combines them with 16 new bits to make a 32 bit address.
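The load-then-extend behaviour can be modelled as a small register class. This is a minimal Python sketch, not the simulator's code: the class and method names are made up, and it assumes the 16 new bits in the extend cycle become the high half of the address (the text only says they are combined with the bits already loaded).

```python
class OperandAddr:
    """Toy model of a 32-bit register filled from a 16-bit data bus."""

    def __init__(self) -> None:
        self.value = 0

    def clock(self, data_in: int, load: bool = False, extend: bool = False) -> None:
        data_in &= 0xFFFF
        if load:
            # First cycle: low 16 bits from the bus, high 16 bits cleared.
            self.value = data_in
        elif extend:
            # Second cycle: keep the low half, put the new bits on top.
            # (Assumption: the second word read is the high half.)
            self.value = (data_in << 16) | (self.value & 0xFFFF)


reg = OperandAddr()
reg.clock(0xE000, load=True)    # value is now 0x0000E000
reg.clock(0x0001, extend=True)  # value is now 0x0001E000
```

A side effect of clearing the high half on load is that a plain 16-bit address needs only the first cycle.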
Next, OperandAddr is sent to the address bus (this is probably not needed) and PC is loaded with the contents of OperandAddr. If PC was given the same ability to load and extend as OperandAddr, this whole cycle could be removed. That sort of refinement is the whole purpose of writing this new simulator.
On the last cycle, PC is sent to the address bus and incremented, the Opcode register is loaded from memory, and we unconditionally jump to cycle 0 to start execution of the instruction that was just loaded.
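To make the whole routine concrete, here is one way the seven BRK control words could be written down in Python. The signal names come from the listing above, but giving every signal its own bit in an `IntFlag` is a simplification assumed for this sketch (a real encoding would more likely pack the selectors into multi-bit fields), and `memory_writes` is a hypothetical helper showing the effect of the reset flag that disables writes.

```python
from enum import IntFlag, auto


class Ctl(IntFlag):
    """Control signals from the BRK listing, one bit each (an assumption)."""
    AddressInputSel_SP = auto()
    AddressInputSel_Vector = auto()
    AddressInputSel_OperandAddr = auto()
    AddressInputSel_PC = auto()
    RegASel_PC = auto()
    RegASel_P = auto()
    MemWriteDataSel_RegAOutH = auto()
    MemWriteDataSel_RegAOutL = auto()
    WriteEnable = auto()
    SPDec = auto()
    OperandAddrLoad = auto()
    OperandAddrExtend = auto()
    AddressInc = auto()
    PCInputSel_OperandAddr = auto()
    PCLoad = auto()
    PCInc = auto()
    OpcodeLoad = auto()
    CycleCond_Always = auto()
    CycleJump0 = auto()


C = Ctl
BRK = [
    C.AddressInputSel_SP | C.RegASel_PC | C.MemWriteDataSel_RegAOutH | C.WriteEnable | C.SPDec,
    C.AddressInputSel_SP | C.RegASel_PC | C.MemWriteDataSel_RegAOutL | C.WriteEnable | C.SPDec,
    C.AddressInputSel_SP | C.RegASel_P | C.MemWriteDataSel_RegAOutL | C.WriteEnable | C.SPDec,
    C.AddressInputSel_Vector | C.OperandAddrLoad,
    C.AddressInputSel_Vector | C.AddressInc | C.OperandAddrExtend,
    C.AddressInputSel_OperandAddr | C.PCInputSel_OperandAddr | C.PCLoad,
    C.AddressInputSel_PC | C.OpcodeLoad | C.PCInc | C.CycleCond_Always | C.CycleJump0,
]


def memory_writes(routine, writes_disabled: bool = False) -> int:
    """Count cycles that really write memory; reset sets writes_disabled."""
    if writes_disabled:
        return 0
    return sum(1 for word in routine if Ctl.WriteEnable in word)
```

In this model a normal BRK performs three writes and a reset performs none, yet both run the same seven cycles.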
The rest of the simulator is still set up to load the Commodore 64's ROMs, and the first instruction in their reset sequence is $a2 $ff: LDX #$ff. So that will be the next instruction. Since each nanocode routine handles all instructions that need the same sequence of operations, that's going to end up implementing the immediate mode of all of the 'main group' of instructions: LDA, ADC, CPX, and so on.