Monday, November 15, 2021

Long Overdue Progress

In the very first post on this project, I listed my design constraints.  The most important of these came last:

  • Working on it must make me happy
It turns out that writing VHDL testbeds does not make me happy.  And the complete ALU needed a large and complicated testbed.

After a great deal of procrastination (including writing a transpiler for a language that was much more pleasant to work in than VHDL, but never got to the point of being able to create the testbed the ALU needed), I made myself sit down and just do it.  After a lot more work, the ALU finally passed all of its tests.

The high-level structure from the previous post is still true.  There are two main inputs, A and B.  B is fed through the shifter (with its own 'shift amount' input), the inverter, and then both go into the adder.  In addition to ADD, AND, OR, and EOR, the adder can also pass the B input directly to the output.  Not shown are a collection of multiplexors to select a source for each of the inputs, and the large number of control signals to get it to perform the right operation.

There's also another block that wasn't shown on the diagram, but which turned out to be a fairly major chunk of logic.  This takes the inputs, the adder output (including carry out), shifter carry output, and a collection of control lines, and generates values for the flags.  It has to work on 8, 16, and 32 bit operations, and also handle the N and V flags for the BIT instruction.  Here it is, squeezed into as few LUTs as I could manage
The 65020's ALU flags component
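
As a rough illustration of what that block has to compute, here is a minimal Python sketch of the four flags for a plain binary add at each width.  The name add_flags and its arguments are made up for this example.  It ignores decimal mode, the shifter carry, and the BIT special case, and it says nothing about how the real logic is packed into LUTs.

```python
def add_flags(a, b, carry_in, width):
    """Add two values at the given width (8, 16 or 32) and derive C, Z, V, N."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    total = a + b + carry_in
    result = total & mask
    flags = {
        "C": total > mask,                            # carry out of the top bit
        "Z": result == 0,                             # result is zero at this width
        "V": bool(~(a ^ b) & (a ^ result) & sign),    # signed overflow
        "N": bool(result & sign),                     # top bit of the result
    }
    return result, flags
```

The same inputs give different flags at different widths: $ff + 1 sets C and Z as an 8 bit operation, but neither as a 16 bit one.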

With the ALU complete, all that was needed was to construct the rest of the CPU.  That was a much simpler and quicker task: it took far less than the year of procrastination that the ALU did.  I already had the structure of it worked out in the C++ simulator, so most of it was just a matter of turning those components into VHDL, with a bit of fixing up where things weren't a good fit for an FPGA.

And now... it's working.  There was a lot of debugging, of course, but it didn't take too long to get it running its first instructions.  Since that milestone it's been a pleasant process of writing software for it and working out why it doesn't work.  As the fault can be in the software, the assembler, or the CPU itself (and sometimes in all three), that's been a lot of fun.

You'll be wanting a screenshot, of course.  Here it is
The C640 computer running a variety of colourful tests

Each new test uses some instructions that the others hadn't, and these occasionally throw up new bugs to be fixed.  Fortunately these problems seem to be getting less frequent.  It's almost looking like a computer.

The next step is getting the PS/2 keyboard interface working.  It's close - the FPGA side appears to be good, but for some reason the software is sometimes dropping key presses.

Saturday, June 6, 2020

The ALU

The 65020's ALU is its most complex component.  It must be able to add and subtract in both binary and decimal modes, perform logical and, or, and exclusive or, shift and rotate values arbitrary distances, and also set, clear, toggle, and test individual bits.  All of these operations must work on 8, 16, or 32 bit values, and produce appropriate values for the flags.

To do all of this, I've broken it down into three smaller sub-components.


The A input is usually the first operand, and B is usually the second operand.  B can be optionally shifted (not shifting is the same as shifting by 0) or inverted.  The two operands go into the Add/Logic sub-component, which can do binary or BCD addition, or logical and/or/exclusive or.  Subtraction is done by inverting the second operand before adding it to the first.

So ADC A0, #123 is handled by sending the contents of A0 to input A, the constant 123 to input B, setting the shift input to 0, not inverting it, and then adding the two.

SBC A0, $1234 has A = A0, B = value from memory, shift = 0, B inverted, then adding.

The various shift and rotate instructions take their operand on the B input.  A is set to 0, and the adder/logic unit performs an OR, to pass the shifted result to the output.

LDA doesn't look like an instruction that uses the ALU, but I have it using the same microcode and passing the loaded value through the ALU.  A is set to 0, there is no shift or invert, and the Add/Logic sub-component performs an OR.  This sends the loaded value through unchanged, but allows flags to be set.

CLC also doesn't look like an ALU instruction.  But on the 65020, it's a special case of a more general set of bit-clearing instructions.  These have the bit number encoded in the instruction, and the bit to be cleared can be in any register, or in memory.  The A input takes the register or memory value, B is set to 1, and the bit number goes to the shift input.  The shifter output is inverted, and then ANDed with the value.  SEC is done in a very similar way, but using OR and not inverting the shifter output.
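
The examples above can be modelled in a few lines of Python.  This is only a sketch of the data path as described here: the function alu and its parameters are invented for illustration, the shifter only shifts left, there is no decimal mode, and the carry-in of 1 for subtraction is an assumption borrowed from the usual 6502 convention (carry set means no borrow).

```python
def alu(a, b, shift=0, invert=False, op="add", carry_in=0, width=8):
    """Shift B, optionally invert it, then combine it with A."""
    mask = (1 << width) - 1
    a &= mask
    b = (b << shift) & mask      # shifter (left shifts only in this sketch)
    if invert:
        b = ~b & mask            # inverter, binary mode only
    if op == "add":
        return (a + b + carry_in) & mask
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "eor":
        return a ^ b
    raise ValueError(f"unknown op: {op}")


adc = alu(10, 123)                                  # ADC A0, #123 with A0 = 10
sbc = alu(200, 50, invert=True, carry_in=1)         # SBC: 200 - 50
lda = alu(0, 0x42, op="or")                         # LDA: value passes through
clr = alu(0xFF, 1, shift=3, invert=True, op="and")  # clear bit 3
set_ = alu(0x00, 1, shift=3, op="or")               # set bit 3
```

Here adc is 133, sbc is 150, lda is $42, clr is $f7, and set_ is $08.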

The Inverter

The inverter is a very simple component, but there are a couple of interesting points.  If we didn't have to support decimal mode, it would be a simple array of XOR gates, with one input of each connected to the 'invert' control signal.

To support BCD subtraction, it also needs to be able to give us the 9's complement of a value: this is obtained by subtracting each nybble from 9.  At first glance, this is simple: take the input 4 bits at a time, add the 'invert' and 'decimal' control signals, and that's a perfect fit for the Spartan 6's 6-input LUTs.  A 32 bit inverter will take 32 LUTs, or 8 slices.

But we can do better than that.  Each LUT actually has two outputs: it can implement either one function of 6 inputs, or two functions that share the same 5 inputs.  If we can use only five inputs, we can have two independent functions implemented in a single LUT.

input   binary  decimal 

0000    1111    1001
0001    1110    1000
0010    1101    0111
0011    1100    0110

0100    1011    0101
0101    1010    0100
0110    1001    0011
0111    1000    0010

1000    0111    0001
1001    0110    0000
1010    0101    1111
1011    0100    1110

1100    0011    1101
1101    0010    1100
1110    0001    1011
1111    0000    1010

Examining the truth table of the inverter, it is apparent that the bottom two bits of the output depend on only the bottom two bits of the input.  And the top two bits of the output depend on only the top three.  Add the two control lines, and that's 4 inputs for one pair of outputs and 5 for the other pair.  That means we can do the whole inverter in only 16 LUTs, or 4 slices.
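
The dependency claim is easy to check by brute force.  The Python below is a sketch written for this purpose, not part of the design: invert_nybble reproduces the truth table, and depends_only_on (a made-up helper) tests whether a group of output bits is a function of only a given group of input bits, for every combination of the two control lines.

```python
def invert_nybble(n, invert, decimal):
    """One nybble of the inverter: pass through, 1's complement, or 9's complement."""
    if not invert:
        return n
    if decimal:
        return (9 - n) & 0xF     # values above 9 wrap, matching the table
    return ~n & 0xF


def depends_only_on(out_mask, in_mask):
    """True if the masked output bits are determined by the masked input bits alone."""
    for invert in (False, True):
        for decimal in (False, True):
            seen = {}
            for n in range(16):
                key = n & in_mask
                out = invert_nybble(n, invert, decimal) & out_mask
                if seen.setdefault(key, out) != out:
                    return False
    return True


low_ok = depends_only_on(0b0011, 0b0011)     # bottom two outputs from bottom two inputs
high_ok = depends_only_on(0b1100, 0b1110)    # top two outputs from top three inputs
high_two_only = depends_only_on(0b1100, 0b1100)

luts = (32 // 4) * 2     # two dual-output LUTs per nybble
slices = luts // 4       # four LUTs per slice
```

Both low_ok and high_ok come out true, while high_two_only is false: in decimal mode the top pair of outputs really does need the third input bit.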

Monday, April 6, 2020

A New Simulator

It's taken a while, but that was worth doing.  The new hardware-style simulator has clarified the structure needed for the FPGA version, and as a bonus given me the contents of the microcode ROM.

The final list of components:

  • Registers
  • ALU
  • MulDivMod
  • PCReg
  • SPReg
  • FlagsReg
  • OpcodeReg
  • InReset
  • Cycle
  • BranchCycle
  • MicrocodeROM
  • NanocodeROM
  • Address
  • Operand
  • OperandAddr
  • MemoryInterface

InReset stores a single bit indicating that the CPU is in reset.  If set, BRK's writes to memory as it pushes flags and PC are disabled.

PC, SP, and Flags are registers, but since they have special functions and can be written and read outside the standard register access, they get implemented separately.

Cycle is a 3 bit counter which stores the number of the cycle within execution of each instruction.  It normally increments on each cycle, but nanocode can request a conditional jump to a different cycle.  This allows a single nanocode routine to implement instructions with one or two byte addresses (requiring one or two cycles), and operands of one or two bytes.  If an instruction only needs a one byte address, for example, then its microcode includes a 'BaseAddr16' flag which signals to the nanocode to skip the cycle which fetches the second address byte.

Branch instructions are too complex for this simple system.  For example, a simple branch with a one byte offset will go directly from the fetch of the first offset byte to fetching the next opcode (from either the next instruction or from the destination address, depending on the branch condition).  If the branch has the 'link' bit set, then it must go from the offset fetch to the cycles that push the current PC.  If 'indirect' is set instead, then it goes to the cycles that fetch the destination address from memory.

To handle all this complexity, BranchCycle is a 128x3 ROM.  The address is made from bits from the opcode (Link, Indirect, OffsetWidth), the branch condition, and the current cycle number.  The output is the next cycle.  If the low 5 bits of the opcode are 10000 (a branch instruction), then this ROM overrides the usual cycle selection.
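
To make the 128x3 size concrete, here is a small Python sketch of how such a ROM address could be assembled.  The order of the fields within the address, and OffsetWidth being a single bit, are assumptions made for the example (they are simply what adds up to 7 address bits); the function names are also invented.

```python
def is_branch(opcode):
    """Branch instructions have 10000 in the low 5 bits of the opcode."""
    return (opcode & 0b11111) == 0b10000


def branch_cycle_address(link, indirect, offset_width, condition, cycle):
    """Pack the lookup fields into a 7 bit address (field order is assumed)."""
    assert link in (0, 1) and indirect in (0, 1)
    assert offset_width in (0, 1) and condition in (0, 1)
    assert 0 <= cycle < 8
    return (link << 6) | (indirect << 5) | (offset_width << 4) | (condition << 3) | cycle


# Every combination of inputs lands on its own ROM entry.
addresses = {
    branch_cycle_address(l, i, w, c, cyc)
    for l in (0, 1) for i in (0, 1) for w in (0, 1) for c in (0, 1)
    for cyc in range(8)
}
```

The set holds 128 distinct addresses, one per ROM entry.  The branch opcodes in the test program from the earlier post ($0010, $00d0, $80f0) all pass is_branch, and $01a2 (ldx.w) does not.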

The microcode ROM has 512 entries: one for each opcode and their alternates (instructions with bit 15 of the opcode set).  The outputs are

  • ALUCIn: Selects the source of the ALU's C input (carry in).  0 and 1 select constants (0 for 'add without carry', 1 for 'CMP', for example).  C selects the carry flag (for 'add with carry').  Ext and Rot select one end of the shifted value or the other, and are used for shifts and rotates.
  • ALUInvB: If set, inverts the B input of the ALU.  This is used to implement subtraction and the BIC (bit-clear) instruction.
  • ALUOp: Selects the ALU operation.  It can be Add, And, Eor, InB (output the B input unmodified), Neg, Or, ShiftL, ShiftR.  InB allows the nanocode routine that handles ADC, EOR, and so on to also implement LDA, LDX, and LDY.  ShiftL and ShiftR implement all of the shift and rotate instructions through the choice of the C input.
  • BaseAddr16: If set, signals to the nanocode to skip fetching of the second address byte.
  • BitNum: Selects which bit instructions like CLC and SED operate on.  It can be 0, 2, 3, or 6, and is combined with bits from the opcode extension to select any of the 32 bits.
  • DataWidthSel: Either '32' to signal that this instruction always works with 32 bit data, or '8_9' to use bits 8 and 9 of the opcode to select the data width.
  • DefaultReg: Selects the main register that the instruction uses.  It can be A0, X0, Y0, P, or SP.
  • RegMod: The choice of main register can be modified by bits from the opcode instruction, and this field determines which ones are used.  It can be None (don't modify), MOV (special modification for the MOV instructions), 8_11, 10_12, 10_13, 11_14, 13_14, or 13_15.
  • DefaultIndex: Selects the second, or index, register.  It can be A0, X0, Y0, P, SP, or PC.  Instructions with indexed addressing modes use this as the index register.
  • IndexMod: Which opcode bits modify the choice of index register.  It can be None, MOV (again, special handling for MOV instructions), 8_11, or 10_12.
  • FlagWrite: Four separate flags to enable writing to the C, Z, V, and N flags.
  • MulDivOp: Selects which of the MUL, DIV, and MOD instructions this is.
  • NoRegWrite: If set, disables the usual write to the destination register.  Instructions like CMP behave almost identically to other ALU instructions like SBC.  This allows them to use the same nanocode.
  • NSel: Some instructions have a small constant encoded in the opcode.  This field selects where it is.  It can be 1 (the constant is 1) or 13_14 (the constant is encoded in bits 13 and 14).

Nanocode fields are

  • AluASel: Select the source for the A input to the ALU.  This can be from Operand or from the A output of Registers.
  • AluBSel: Select the source for the B input to the ALU.  This can be Operand, the B output of Registers, OperandOrReg (the choice between the two is made by bit 14 of the opcode, for read-modify-write instructions like LSR, which can use an immediate operand either directly as a shift amount, or as the number of a register containing the shift amount), N (for instructions with a small constant encoded in the opcode), or BitNum (the bit selected by the BitNum field of the microcode).
  • CycleCond: The condition for jumps to other nanocode instructions.  This can be Always, BaseAddr16, Data16, Branch, or MulDivRunning (it is anticipated that MUL, DIV, and MOD will take more cycles than Cycle can handle.  This lets us repeat a single nanocode instruction until the MulDiv unit has finished).
  • CycleJump: The destination for Cycle to jump to if the condition specified by CycleCond is true. 
  • ExitReset: Clear the InReset flag, starting normal operation.
  • AddressInputSel: Which value to send to the address of the memory interface.  It can be OperandAddr, PC, SP, or Vector (the address is determined by bits from the opcode, for the BRK instruction)
  • AddressInc: Add 1 to the address.  This is used for accessing two-byte values.
  • MemWriteDataSel: Selects the source of the data to be written to memory.  It can be ALUOutL, ALUOutH, RegAOutL, RegAOutH, selecting the low or high bytes of either the ALU output or the A output of Registers.
  • MemWriteDataWidth: The size of data to write to memory.  16, 32, or D.  32 bit writes are implemented as two separate write cycles, with (for example) RegAOutL and RegAOutH selecting which half to write.  But the odd layout of 32 bit data required for compatibility with the 6502 means different parts of the value are written depending on whether a cycle is a 16 bit write or the first half of a 32 bit write.  D means 'take the size from bits 8 and 9 of the opcode'.
  • WriteEnable: If set, this cycle writes to memory.
  • OpcodeLoad: If set, load the memory read data into the Opcode register.
  • OperandLoad: If set, load the memory read data into the Operand register.
  • OperandExtend: If set, combine the memory read data with the current contents of the Operand register to make a 32 bit value.
  • OperandAddrLoad: Load the OperandAddr register.
  • OperandAddrExtend: Extend the OperandAddr register.
  • OperandAddrExtendFromOperand: This combines the memory read data with the contents of Operand, but writes the result to OperandAddr.  This lets us load a two byte address into OperandAddr from a location given by OperandAddr, using Operand as temporary storage for the first byte.
  • PCInc: Increment PC.
  • PCLoad: Copy OperandAddr into PC.
  • SPDec: Decrement SP.
  • SPInc: Increment SP.
  • RegASel: Which register to select for the A port of Registers: P, PC, or the register given by the microcode DefaultReg and RegMod fields.
  • RegBSel: Which register to select for the B port of Registers: Index (use DefaultIndex and IndexMod from microcode), Operand (the register number is in the low 4 bits of Operand, used for the immediate mode of read-modify-write instructions that store a register number in immediate data), or Zero (used by branch instructions, but I can't remember why).
  • RegBIsIndex: If set, the B port of Registers is used as an index register.  It is automatically added to the memory address, and if the flags register P is selected, 0 is used instead.
  • RegWriteSel: Selects the source of data to be written to a register.  It can be ALU (ALU output), Data (memory read data), or MulDiv (MulDiv unit output).
  • RegWriteEnable: If set, and the microcode hasn't selected NoRegWrite, writes a value to the register selected by DefaultReg and RegMod.
  • FlagsWriteEnable: Enables writing to flags.  The flags that get written are chosen by microcode.
  • RunMulDiv: Starts the MulDiv unit.
  • SetB: Clear the B flag if an interrupt is being handled, set the B flag if it isn't.  Only used by the BRK instruction.  Interrupts are handled by loading a BRK instruction into the Opcode register.
  • SetI: Sets the I flag.  This happens in BRK.

The simulator is now running both CPU simulations in parallel, comparing all register values after each instruction.  Commodore 64 BASIC and my simple graphics tests run fine, with no differences between the simulations.  It's ready for the FPGA!

Tuesday, March 17, 2020

A New Start

One day I'll learn to listen to my own advice.  In the previous post, I had a very ugly nested-if implementation of the handful of instructions I needed to run a simple test program.  That worked, but it clearly wasn't a good way to continue.  It was obvious that I needed to take a much more hardware-oriented approach, and definitely use a microcode ROM.

So of course I didn't.  I continued with the nested-if style, implementing more and more instructions.  Synthesis was taking longer and longer.  Eventually, with fewer than half the instructions done and each iteration taking about 15 minutes, I stopped and looked at the log.  That showed 105% FPGA resource use.  I can only assume that Place-And-Route was doing some heroic optimisation to squeeze it all in.

So I need a new implementation.  It wasn't clear what architecture would be needed - what components there should be, what internal busses, and how it should all connect together.  My usual approach is to do a rough first draft, then keep tweaking it as I fill in the details.  I'm comfortable doing that in software, but I still find writing VHDL enough of an effort that I was reluctant to try it.  But I still needed to know what the implementation should look like before I could start.

So I'm back to the software simulator for a while.  I've re-worked the code a little so it now supports two separate implementations of the CPU interface.  One is the old simulator, which will serve as a reference.  The new one is written to have the same structure as the (eventual) hardware.  There's a class for each type of component, and they communicate through explicit signal variables.  It's all controlled by a two-level microcode/nanocode component.

First, there's a 256 entry microcode ROM which gives global information about each instruction - what registers it uses, the structure of the opcode extension, and so on.

Then there's a 32x8 entry nanocode ROM, which provides cycle-by-cycle control of the execution of each instruction.  Instructions can take up to 7 cycles, and there are 25 different types.  Rounding that up to powers of 2, we get 32x8 = 256 entries.

Each nanocode instruction has a conditional jump, to allow skipping of some cycles under various conditions.  That allows, for example, LDA abs,Y and ADC zp,X to use the same type.  Microcode selects the index register, and ADC zp,X can skip the cycle that fetches the high byte of the base address.
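
The skipping mechanism can be sketched in Python as follows.  The four-cycle routine here is invented purely to show the idea and is not the real nanocode; the condition name BaseAddr16 is borrowed from the field list in the later post, and run_cycles is a hypothetical helper.

```python
# Each entry is (condition, jump target); None means "just go to the next cycle".
ROUTINE = [
    ("BaseAddr16", 2),   # cycle 0: fetch low address word, maybe skip cycle 1
    (None, None),        # cycle 1: fetch high address word
    (None, None),        # cycle 2: read the operand
    ("Always", 0),       # cycle 3: do the operation, then restart at cycle 0
]


def run_cycles(routine, conditions, max_cycles=16):
    """Return the cycle numbers visited until the routine jumps back to cycle 0."""
    visited = []
    cycle = 0
    for _ in range(max_cycles):
        visited.append(cycle)
        cond, target = routine[cycle]
        taken = cond is not None and (cond == "Always" or conditions.get(cond, False))
        if taken:
            if target == 0:
                break            # next instruction starts here
            cycle = target
        else:
            cycle += 1
    return visited


short_form = run_cycles(ROUTINE, {"BaseAddr16": True})
long_form = run_cycles(ROUTINE, {"BaseAddr16": False})
```

With the condition set the routine visits cycles 0, 2, 3; without it, 0, 1, 2, 3.  Both instructions share one routine and differ only in a microcode bit.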

Right now, only one instruction type is implemented, and that type has only one instruction: BRK.  The original 6502 implemented its reset sequence as a variant of BRK - the usual writes of P and PC to the stack are suppressed (although their cycles still take place), and the vector is fetched from $fffc instead of $fffe.  I'm doing the same, loading $0100 into the opcode register and setting a flag that disables writes until the end of the next instruction.  The 65020's BRK instruction has a 4 bit vector selection field in its extension bits, so it can select $fffc through that instead of using extra logic.  Here's the nanocode for BRK:
AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutH | WriteEnable | SPDec
AddressInputSel_SP | RegASel_PC | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_SP | RegASel_P | MemWriteDataSel_RegAOutL | WriteEnable | SPDec
AddressInputSel_Vector | OperandAddrLoad
AddressInputSel_Vector | AddressInc | OperandAddrExtend
AddressInputSel_OperandAddr | PCInputSel_OperandAddr | PCLoad
AddressInputSel_PC | OpcodeLoad | PCInc | CycleCond_Always | CycleJump0
Each line represents one cycle.  The first three push P and PC to the stack: the address output selects the SP register, the register file output A selects PC or P, the memory write data bus selects either the high or low half of the selected register, a memory write is requested, and SP is decremented.

In the next two cycles, a vector address (generated from the opcode extension) is placed on the address bus, and the data read from memory is loaded into the OperandAddr register.  This takes two cycles because the register is 32 bits wide, but the data bus is only 16.  The first cycle loads the low 16 bits of the register and sets the high 16 bits to 0.  The second cycle (OperandAddrExtend) takes the 16 bits already loaded and combines them with 16 new bits to make a 32 bit address.

Next, OperandAddr is sent to the address bus (this is probably not needed) and PC is loaded with the contents of OperandAddr.  If PC was given the same ability to load and extend as OperandAddr, this whole cycle could be removed.  That sort of refinement is the whole purpose of writing this new simulator.

On the last cycle, PC is sent to the address bus and incremented, the Opcode register is loaded from memory, and we unconditionally jump to cycle 0 to start execution of the instruction that was just loaded.

The rest of the simulator is still set up to load the Commodore 64's ROMs, and the first instruction in their reset sequence is $a2 $ff: LDX #$ff.  So that will be the next instruction.  Since each nanocode routine handles all instructions that need the same sequence of operations, that's going to end up implementing the immediate mode of all of the 'main group' of instructions: LDA, ADC, CPX, and so on.

Sunday, November 10, 2019

First Instructions


There's not much apparent difference between this and the last screenshot.  But it's an important one.  The cursor is blinking.  It's blinking under software control.  We have a working CPU!

I've now got a simple test program in ROM.  It copies the screen data from ROM (doing an ASCII to CBM conversion on the way), then sits in a loop turning the cursor on and off.  Here's the relevant part of the code:
0000e4ea:                         reset
0000e4ea: 01a2 03e7               ldx.w #999
0000e4ec: 20a9 000e               lda a1, #14
0000e4ee:                         copyScreen
0000e4ee: 00b4 e000               ldy initScreen,x
0000e4f0: 10a5 e3e8               lda asciiToCBM,y
0000e4f2: 0095 0400               sta $0400,x
0000e4f4: 2095 d800               sta a1, $d800,x
0000e4f6: 01ca                    dex.w
0000e4f7: 0010 00f5               bpl copyScreen
0000e4f9:                         loop
0000e4f9: 00a9 0020               lda #32
0000e4fb: 0085 04f0               sta cursor-initScreen+$400
0000e4fd: 02a2 0823 007a          ldx.l #555555
0000e500:                         delay1
0000e500: 02ca                    dex.l
0000e501: 00d0 00fd               bne delay1
0000e503: 00a9 00a0               lda #32+128
0000e505: 0085 04f0               sta cursor-initScreen+$400
0000e507: 02a2 0823 007a          ldx.l #555555
0000e50a:                         delay2
0000e50a: 02ca                    dex.l
0000e50b: 00d0 00fd               bne delay2
0000e50d: 80f0 00ea               bra loop
Because I'm not attempting full compatibility with the original 6502, instruction timing is a little different.  DEX is a single cycle.  Branches take two cycles, whether they're taken or not.  There's no penalty for crossing a page boundary.  Thus the delay loop for blinking the cursor takes 3 cycles per iteration, and the loop count of 555,555 gives 3 blinks in 2 seconds at the C640's 5MHz.
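
The blink rate follows directly from those cycle counts.  Here is the arithmetic as a short Python sketch; the variable names are mine, and the handful of cycles spent on the lda, sta, ldx, and bra around each delay loop are ignored.

```python
CLOCK_HZ = 5_000_000          # C640 system clock
CYCLES_PER_ITERATION = 1 + 2  # dex.l (1 cycle) + bne (2 cycles)
LOOP_COUNT = 555_555

delay_seconds = LOOP_COUNT * CYCLES_PER_ITERATION / CLOCK_HZ
blink_seconds = 2 * delay_seconds      # one delay with the cursor off, one with it on
blinks_in_two_seconds = 2 / blink_seconds
```

Each delay loop lasts about a third of a second, so a full off/on blink takes about two thirds of a second, which is 3 blinks in 2 seconds.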

The VHDL is still very brute-force and very ugly.  Only a handful of opcodes are supported (the ones needed for this very basic test program), and it's done through nested case and if statements, deciding what to do on each phase of each cycle for each individual opcode.

That's clearly not going to be a sustainable way of implementing the full CPU.  A few things stand out - opcodes 85 (sta zp,0) and 95 (sta zp,x) are really the same instruction, but with different index registers.  And b4 (ldy zp,x) only differs in destination register.  A lot of the work that I'm currently cut-and-pasting between instructions could and probably should be shared.

It would be good for development to put most of the complicated parts into a microcode ROM.  That way changes can be made, and new instructions implemented, by simply building a new ROM and inserting it into the bitstream, leaving the VHDL alone.  When the time comes to develop software for the ROM, it will be a relief to avoid the increasingly lengthy VHDL synthesis whenever possible.

Monday, September 23, 2019

A Character Display

The C640 showing off its character display.  It says
*** C640 COMPUTER SYSTEM ***
2MB RAM SYSTEM  38911 BASIC BYTES FREE
That took a lot longer than I wanted it to.

Last time, we had VIC displaying the contents of RAM as a bitmap, and a dummy CPU component copying ROM into RAM.  It sort of worked, but I wasn't happy with the SRAM interface, which was writing corrupted data to memory.

A little tweaking of the memory timing - what happens on which clock phases - fixed the memory problems.  I still wasn't comfortable with it: why did it stop working with some signals delayed 6ns, when everything was happening at half the speed that should have worked?  But never mind.  It worked, and I was eager to move on to the character display.

This required getting a number of things to work.  First up, VIC doesn't have enough memory bandwidth to read all the information it needs (well, it does in the 320x200 mode I'm using here.  It wouldn't at higher resolution).  The Commodore 64 would stop the CPU every 8 lines so VIC could fetch character pointers from screen memory at $0400 into an internal buffer.  It could then use those pointers to create addresses for bitmap data, which is read on every line.

So the C640 needs DMA.  It needs to be able to pause the CPU, allow VIC to use the CPU's half of the cycle to access memory, have an internal buffer to store it, and create addresses from it to fetch bitmap data.  There's a fairly long pipeline there, and the sequence must start early enough that the bitmap data is ready for display before the border ends.

I didn't think to take any screenshots of the early attempts.  They weren't pretty, and it took weeks of not-at-all-intensive debugging to fix all the problems.

The first one was that SRAM writes immediately broke.  Thinking that it was obviously a timing problem, and that the extra logic I'd added had pushed something past its limit, I dug through Xilinx's documentation and discovered the trce tool for generating a timing report after place and route.  That's the only stage at which a timing report can be expected to give accurate results, as a significant part of the total delay is the time it takes to get a signal from one part of the FPGA to another.

The report revealed a large number of timing violations, mostly in the clock enable signals.

mclk, system clock phase, and some of the C640's clock enables.
This is from a later (working) version of the design, so there are only 16 clock phases
Since I didn't want the extra complexity of dealing with multiple clocks, I'm using a single clock (160MHz at this point, and called 'mclk') with enable signals to tell various parts of the design when they should pay attention to it.  Most enables are only active once in every 5MHz system cycle.  To make them active at the right times, I have a "system clock phase" counter, which says how far through the 5MHz cycle this particular 160MHz clock pulse is.  So that the clock phase is stable around the rising edge of mclk (which is the edge that everything else is latched on), this counter is incremented on the falling edge of mclk.

That means there's 3.125ns to increment the counter, combine it with whatever other logic is required for the clock enable in question, and get the result to the clock enable input of the register.  The Spartan 6 is fast, but it's not that fast.  Many clock enables were arriving too late.
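
For reference, here is where that number comes from, as a small Python sketch.  The function phase_timing is invented for this example, and the 80MHz figure assumes the same falling-edge scheme is kept at the lower clock rate, which the text doesn't state.

```python
def phase_timing(mclk_hz, system_hz=5_000_000):
    """Return (phases per system cycle, nanoseconds from falling to rising edge)."""
    phases = mclk_hz // system_hz
    period_ns = 1e9 / mclk_hz
    return phases, period_ns / 2


phases_160, budget_160 = phase_timing(160_000_000)
phases_80, budget_80 = phase_timing(80_000_000)
```

At 160MHz there are 32 phases and a 3.125ns budget; at 80MHz there are 16 phases and the half-cycle budget doubles to 6.25ns.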

So it's back to an 80MHz clock, and this time the memory controller uses both edges to generate control signals for the SRAM.  Memory access is now rock solid, and the timing report has no constraint violations.

There then followed far too much fiddling around trying to get the right sequence of actions to make DMA work.  For weeks I had an almost correct display, but there was always something wrong.  The first character on each row would be duplicated, the last character would appear at the start, the first character would have bitmap data from a different character displayed on its first line, ... my poor software brain was at its limit trying to deal with a system where everything happens simultaneously, but everything must happen in exactly the right sequence.

But, as you can see, I finally got there.  There is now a working character display, and I feel that I'm starting to get the hang of this FPGA thing.

Next, it's time to start on the CPU.  I can see significant failure ahead.

Saturday, July 27, 2019

It's Back!

It's been a while.

The work last year had taken the design as far as it needed to go, and the simulator had done its job in proving that it all fitted together.  It was time to turn to hardware.  My chosen platform is the Papilio Duo from Gadget Factory - a Spartan 6 FPGA coupled with an Arduino and a 2MB static RAM.  The large SRAM was attractive (it's much easier to interface than the usual DDR), and it has a "Classic Computing Shield" add-on with all the necessary ports to turn it into a classic 1980s computer.  I'm ignoring the Arduino side of it.

With great enthusiasm I got started.  And that's when I hit a very solid brick wall.

It didn't take much work to get a basic VGA display

There's a little more going on in this photo than it might appear, but also a lot less.  The FPGA contains three main components: a clock generator, a memory controller (including character ROM), and a video generator.

The clock generator multiplies the Papilio Duo's 32MHz oscillator up to 80MHz (I have since increased this to 160MHz, for reasons that will be explained below).  Why such a high frequency?  I need a 40MHz pixel clock for 800x600 VGA (which becomes 640x400 with a border), so it makes sense to start with a multiple of that.  The pixel clock becomes 20MHz in 320 mode, and with 8 pixels per system clock (like the Commodore 64), that implies a 5MHz system clock.

In a real, and by this point rather impractical, mid 1980s computer, we would have the CPU and VIC both accessing memory every cycle.  VIC has two data busses, so each 5MHz cycle needs to support three independent 16 bit accesses.  Since the SRAM on the Papilio Duo is only 8 bits wide, that means six memory accesses per cycle, which I've rounded up to 8.  The extra slot might get used for some kind of DMA in the future.

My original plan for accessing the SRAM needed two clock cycles per access (writes need /WE to be low and then high, so it can't be done in one), so that works out to 80MHz.
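
The chain of reasoning from bus widths to clock frequency can be written out as a short Python sketch.  The function and parameter names are made up for the example; the numbers are the ones given above.

```python
def required_mclk_hz(system_hz=5_000_000, accesses_16bit=3, sram_width_bits=8,
                     slots=8, clocks_per_access=2):
    """Return (SRAM accesses needed per system cycle, master clock frequency in Hz)."""
    needed = accesses_16bit * (16 // sram_width_bits)   # each 16 bit access is two 8 bit ones
    assert needed <= slots, "not enough memory slots per system cycle"
    return needed, slots * clocks_per_access * system_hz


needed, mclk_hz = required_mclk_hz()
```

This gives 6 accesses needed per cycle and an 80MHz master clock for 8 slots of two clocks each.  Calling it with clocks_per_access=4 gives 160MHz.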

But this is still pretending to be an old computer, and old computers didn't do anything at 80MHz.  I didn't want to attempt a design with multiple clock domains on my first serious FPGA project, so I'm using a single clock which can be gated by the different modules.  These clock enable signals are also generated by the clock generator.  There are two for the CPU, reflecting the two phases of the clock, and two for VIC.

The next module is the memory manager.  This takes memory access requests from the CPU and VIC, and translates them into the right signals for the SRAM.  It also contains a character bitmap ROM stored in an FPGA BRAM.

Finally, there is the video generator, VIC.  At the moment it is a very simple design, just generating VGA timing signals and reading a bitmap from memory.  For this screenshot, it's configured to read from character ROM.

The next obvious step is to get SRAM working.  My plan was to build a very simple fake CPU that simply copied character ROM into RAM, then get VIC to read from RAM instead of ROM.  That's when it all fell apart.  It didn't work, and nothing I tried could change that.  Motivation drained away, I moved onto other things, and the project was stalled.


But then... I recently bought a Digilent Digital Discovery.  It does a number of things, but for me the most important function is the 32 channel logic analyser.  Being able to see the real signals on the real hardware should make debugging this thing possible.

And it did!  After a little bit of work, I discovered a number of problems.  First, a bit of re-jigging in VIC had resulted in it always displaying a blank screen no matter what data it was getting from memory.  Oops.  I had also been rather optimistic in the way I was writing to SRAM.

The original design presented address and data, then pulled /WE low for a cycle, then returned it high.  The datasheet suggested that this might work, as the relevant setup and hold times were all zero.  But changing outputs on an FPGA and receiving those as inputs on the SRAM are different things.  I couldn't guarantee that the address or data weren't changing a little bit later than /WE, and the tracks on the PCB were definitely not all the same length.

That prompted the clock doubling.  The FPGA is now running at 160MHz, giving me four clocks for each memory access.  That lets me stagger the signals in a way that has a better chance of fitting the timing.

And it almost does.  Here's what I get now

Can you spot the difference?  Those pixels in the bottom right of each character are written by the fake CPU as it copies data, so I can be sure that VIC is fetching data from RAM.  But zoom in closer and look at the right hand side of the Hs.  There are a few missing pixels there.

It's worse in real time.  Pixels flicker on and off all over the screen.  It starts out OK, and gets worse as the chips warm up.  Clearly I'm not quite meeting some timing somewhere.  I added a third phase to the CPU to test this: now it reads ROM, writes to RAM, then reads RAM and compares.  If the result is different, it turns on an error LED.  At full speed the LED is always on.  If I reduce the clock speed, there are no errors at all.

So that's where it is now.  There's a little more work to do on the memory controller, because there's no point trying to continue if I can't trust SRAM writes to work.  The next step will be making VIC a little bit closer to the real design, using DMA to read character pointers and colours.  And then, with the ability to display proper data, I can finally start work on making a real CPU.