Sunday, November 10, 2019

First Instructions


There's not much apparent difference between this and the last screenshot.  But it's an important one.  The cursor is blinking.  It's blinking under software control.  We have a working CPU!

I've now got a simple test program in ROM.  It copies the screen data from ROM (doing an ASCII to CBM conversion on the way), then sits in a loop turning the cursor on and off.  Here's the relevant part of the code:
0000e4ea:                         54 reset
0000e4ea: 01a2 03e7               55 ldx.w #999
0000e4ec: 20a9 000e               56 lda a1, #14
0000e4ee:                         57 copyScreen
0000e4ee: 00b4 e000               58 ldy initScreen,x
0000e4f0: 10a5 e3e8               59 lda asciiToCBM,y
0000e4f2: 0095 0400               60 sta $0400,x
0000e4f4: 2095 d800               61 sta a1, $d800,x
0000e4f6: 01ca                    62 dex.w
0000e4f7: 0010 00f5               63 bpl copyScreen
0000e4f9:                         64 loop
0000e4f9: 00a9 0020               65 lda #32
0000e4fb: 0085 04f0               66 sta cursor-initScreen+$400
0000e4fd: 02a2 0823 007a          67 ldx.l #555555
0000e500:                         68 delay1
0000e500: 02ca                    69 dex.l
0000e501: 00d0 00fd               70 bne delay1
0000e503: 00a9 00a0               71 lda #32+128
0000e505: 0085 04f0               72 sta cursor-initScreen+$400
0000e507: 02a2 0823 007a          73 ldx.l #555555
0000e50a:                         74 delay2
0000e50a: 02ca                    75 dex.l
0000e50b: 00d0 00fd               76 bne delay2
0000e50d: 80f0 00ea               77 bra loop
Because I'm not attempting full compatibility with the original 6502, instruction timing is a little different.  DEX is a single cycle.  Branches take two cycles, whether they're taken or not.  There's no penalty for crossing a page boundary.  Thus the delay loop for blinking the cursor takes 3 cycles per iteration, and the loop count of 555,555 gives 3 blinks in 2 seconds at the C640's 5MHz.
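The blink-rate arithmetic can be checked with a few lines of Python.  This is only a sketch of the calculation, using the cycle counts and clock speed quoted above; the function names are made up for illustration, and the handful of instructions outside the delay loops are ignored.

```python
# Figures taken from the text: DEX is 1 cycle, a branch is 2 cycles,
# the system clock is 5 MHz, and each delay loop counts down from 555,555.
CLOCK_HZ = 5_000_000
LOOP_COUNT = 555_555
CYCLES_PER_ITERATION = 1 + 2  # dex.l + bne


def delay_seconds(count=LOOP_COUNT):
    """Time spent in one delay loop, ignoring the setup instructions."""
    return count * CYCLES_PER_ITERATION / CLOCK_HZ


def blinks_per_second(count=LOOP_COUNT):
    """One blink is the cursor-off delay followed by the cursor-on delay."""
    return 1 / (2 * delay_seconds(count))


print(f"one delay:  {delay_seconds():.4f} s")
print(f"blink rate: {blinks_per_second():.3f} Hz")
```

Each delay comes out at a third of a second, so one off/on pair takes two thirds of a second, which matches the 3 blinks in 2 seconds.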

The VHDL is still very brute-force and very ugly.  Only a handful of opcodes are supported (the ones needed for this very basic test program), and it's done through nested case and if statements, deciding what to do on each phase of each cycle for each individual opcode.

That's clearly not going to be a sustainable way of implementing the full CPU.  A few things stand out - opcodes 85 (sta zp,0) and 95 (sta zp,x) are really the same instruction, but with different index registers.  And b4 (ldy zp,x) only differs in destination register.  A lot of the work that I'm currently cut-and-pasting between instructions could and probably should be shared.

It would be good for development to put most of the complicated parts into a microcode ROM.  That way changes can be made, and new instructions implemented, by simply building a new ROM and inserting it into the bitstream, leaving the VHDL alone.  When the time comes to develop software for the ROM, it will be a relief to avoid the increasingly lengthy VHDL synthesis whenever possible.
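To make the idea concrete, here is a rough Python model (not VHDL, and not the C640's actual microcode format) of what table-driven decoding might look like for the three opcodes mentioned above.  The `MicroOp` fields and the `DECODE` table are hypothetical; only the opcode numbers come from the listing.  Addresses are not wrapped, on the assumption that the operand is a full 16 bit word as in the listing.

```python
from typing import NamedTuple


class MicroOp(NamedTuple):
    operation: str  # "load" or "store"
    register: str   # register that is read (store) or written (load)
    index: str      # index register added to the operand, or "" for none


# Hypothetical decode table: one row per opcode instead of one case branch each.
DECODE = {
    0x85: MicroOp("store", "a", ""),   # sta zp
    0x95: MicroOp("store", "a", "x"),  # sta zp,x
    0xB4: MicroOp("load", "y", "x"),   # ldy zp,x
}


def execute(opcode, operand, regs, memory):
    """Run one instruction through the shared addressing path; return the address used."""
    op = DECODE[opcode]
    address = operand + (regs[op.index] if op.index else 0)
    if op.operation == "store":
        memory[address] = regs[op.register]
    else:
        regs[op.register] = memory[address]
    return address


regs = {"a": 0x20, "x": 3, "y": 0}
memory = {}
execute(0x95, 0x0400, regs, memory)  # sta $0400,x
execute(0xB4, 0x0400, regs, memory)  # ldy $0400,x
print(hex(regs["y"]))
```

Adding another instruction of the same shape then means adding a row to the table rather than another branch of logic, which is the property a microcode ROM would give.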

Monday, September 23, 2019

A Character Display

The C640 showing off its character display.  It says:
*** C640 COMPUTER SYSTEM ***
2MB RAM SYSTEM  38911 BASIC BYTES FREE
That took a lot longer than I wanted it to.

Last time, we had VIC displaying the contents of RAM as a bitmap, and a dummy CPU component copying ROM into RAM.  It sort of worked, but I wasn't happy with the SRAM interface, which was writing corrupted data to memory.

A little tweaking of the memory timing - what happens on which clock phases - fixed the memory problems.  I still wasn't comfortable with it: why did it stop working with some signals delayed 6ns, when everything was happening at half the speed that should have worked?  But never mind.  It worked, and I was eager to move on to the character display.

This required getting a number of things to work.  First up, VIC doesn't have enough memory bandwidth to read all the information it needs (well, it does in the 320x200 mode I'm using here; it wouldn't at higher resolution).  The Commodore 64 would stop the CPU every 8 lines so VIC could fetch character pointers from screen memory at $0400 into an internal buffer.  It could then use those pointers to create addresses for bitmap data, which is read on every line.

So the C640 needs DMA.  It needs to be able to pause the CPU, allow VIC to use the CPU's half of the cycle to access memory, have an internal buffer to store it, and create addresses from it to fetch bitmap data.  There's a fairly long pipeline there, and the sequence must start early enough that the bitmap data is ready for display before the border ends.
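The two stages described above (fetch a row of pointers by DMA, then build a bitmap address from each pointer on every line) can be sketched in Python.  This follows the Commodore 64 scheme of 40 columns and 8 bytes per character; the function names and the `char_base` parameter are assumptions for illustration, and the real C640 pipeline has timing concerns that this model ignores.

```python
COLUMNS = 40          # characters per row in 320x200 mode
LINES_PER_CHAR = 8    # raster lines per character row
SCREEN_BASE = 0x0400  # screen memory address given in the text


def fetch_row_pointers(memory, char_row):
    """DMA step: copy one row of character pointers into the internal buffer."""
    start = SCREEN_BASE + char_row * COLUMNS
    return [memory[start + col] for col in range(COLUMNS)]


def bitmap_addresses(line_buffer, raster_line, char_base=0):
    """Per-line step: turn the buffered pointers into bitmap byte addresses."""
    line_in_char = raster_line % LINES_PER_CHAR
    return [char_base + code * LINES_PER_CHAR + line_in_char
            for code in line_buffer]


# Fill a fake screen with dummy character codes and look at raster line 11,
# which is line 3 of the second character row.
memory = {SCREEN_BASE + i: i % 256 for i in range(COLUMNS * 25)}
buffer = fetch_row_pointers(memory, 11 // LINES_PER_CHAR)
print(bitmap_addresses(buffer, 11)[:4])
```

The pointer fetch only has to happen once per character row, while the address generation runs on all 8 lines of that row, which is why the CPU only needs to be stopped every 8 lines.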

I didn't think to take any screenshots of the early attempts.  They weren't pretty, and it took weeks of not-at-all-intensive debugging to fix all the problems.

The first one was that SRAM writes immediately broke.  Thinking that it was obviously a timing problem, and that the extra logic I'd added had pushed something past its limit, I dug through Xilinx's documentation and discovered the trce tool for generating a timing report after place and route.  That's the only stage at which a timing report can be expected to give accurate results, as a significant part of the total delay is the time it takes to get a signal from one part of the FPGA to another.

The report revealed a large number of timing violations, mostly in the clock enable signals.

mclk, system clock phase, and some of the C640's clock enables.
This is from a later (working) version of the design, so there are only 16 clock phases
Since I didn't want the extra complexity of dealing with multiple clocks, I'm using a single clock (160MHz at this point, and called 'mclk') with enable signals to tell various parts of the design when they should pay attention to it.  Most enables are only active once in every 5MHz system cycle.  To make them active at the right times, I have a "system clock phase" counter, which says how far through the 5MHz cycle this particular 160MHz clock pulse is.  So that the clock phase is valid around the rising edge of mclk (which is the edge that everything else is latched on), this counter is incremented on the falling edge of mclk.

That means there's 3.125ns to increment the counter, combine it with whatever other logic is required for the clock enable in question, and get the result to the clock enable input of the register.  The Spartan 6 is fast, but it's not that fast.  Many clock enables were arriving too late.
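The numbers here follow directly from the clock frequencies.  The Python below is a sketch of that arithmetic plus a toy model of a one-phase-per-cycle enable; the function names are invented, and the 80MHz half-period figure assumes the counter would still be updated on the falling edge.

```python
SYSTEM_HZ = 5_000_000  # C640 system clock from the text


def phases_per_system_cycle(mclk_hz):
    """Number of mclk pulses, and so phase counter values, per system cycle."""
    return mclk_hz // SYSTEM_HZ


def half_period_ns(mclk_hz):
    """Time from a falling edge (counter update) to the next rising edge."""
    return 1e9 / mclk_hz / 2


def enable_pattern(mclk_hz, active_phase):
    """Enable value seen at each rising edge across one system cycle."""
    phases = phases_per_system_cycle(mclk_hz)
    return [phase == active_phase for phase in range(phases)]


for mclk_hz in (160_000_000, 80_000_000):
    print(mclk_hz, phases_per_system_cycle(mclk_hz), half_period_ns(mclk_hz))
```

At 160MHz that gives 32 phases and a 3.125ns window; at 80MHz there are 16 phases and the window doubles.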

So it's back to an 80MHz clock, and this time the memory controller uses both edges to generate control signals for the SRAM.  Memory access is now rock solid, and the timing report has no constraint violations.

There then followed far too much fiddling around trying to get the right sequence of actions to make DMA work.  For weeks I had an almost correct display, but there was always something wrong.  The first character on each row would be duplicated, the last character would appear at the start, the first character would have bitmap data from a different character displayed on its first line, ... my poor software brain was at its limit trying to deal with a system where everything happens simultaneously, but everything must happen in exactly the right sequence.

But, as you can see, I finally got there.  There is now a working character display, and I feel that I'm starting to get the hang of this FPGA thing.

Next, it's time to start on the CPU.  I can see significant failure ahead.

Saturday, July 27, 2019

It's Back!

It's been a while.

The work last year had taken the design as far as it needed to go, and the simulator had done its job in proving that it all fitted together.  It was time to turn to hardware.  My chosen platform is the Papilio Duo from Gadget Factory - a Spartan 6 FPGA coupled with an Arduino and a 2MB static RAM.  The large SRAM was attractive (it's much easier to interface than the usual DDR), and it has a "Classic Computing Shield" add-on with all the necessary ports to turn it into a classic 1980s computer.  I'm ignoring the Arduino side of it.

With great enthusiasm I got started.  And that's when I hit a very solid brick wall.

It didn't take much work to get a basic VGA display.

There's a little more going on in this photo than it might appear, but also a lot less.  The FPGA contains three main components: a clock generator, a memory controller (including character ROM), and a video generator.

The clock generator multiplies the Papilio Duo's 32MHz oscillator up to 80MHz (I have since increased this to 160MHz, for reasons that will be explained below).  Why such a high frequency?  I need a 40MHz pixel clock for 800x600 VGA (which becomes 640x400 with a border), so it makes sense to start with a multiple of that.  The pixel clock becomes 20MHz in 320 mode, and with 8 pixels per system clock (like the Commodore 64), that implies a 5MHz system clock.

In a real, and by this point rather impractical, mid 1980s computer, we would have the CPU and VIC both accessing memory every cycle.  VIC has two data busses, so each 5MHz cycle needs to support three independent 16 bit accesses.  Since the SRAM on the Papilio Duo is only 8 bits wide, that means six memory accesses per cycle, which I've rounded up to 8.  The extra slot might get used for some kind of DMA in the future.

My original plan for accessing the SRAM needed two clock cycles per access (writes need /WE to go low and then high, so it can't be done in one), so that works out to 80MHz.
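The clock budget can be written out as a short calculation.  This is a sketch in Python with made-up function names; the power-of-two rounding rule is my assumption about how 6 became 8, and everything else comes from the figures above.

```python
SYSTEM_HZ = 5_000_000  # system clock from the text
BUS_WIDTH_BITS = 8     # Papilio Duo SRAM data width


def sram_accesses_per_cycle(ports=3, port_width_bits=16):
    """8 bit SRAM accesses per system cycle for the CPU plus two VIC buses."""
    return ports * port_width_bits // BUS_WIDTH_BITS


def round_up_to_power_of_two(n):
    """Assumed rounding rule: 6 accesses become 8 slots."""
    return 1 << (n - 1).bit_length()


def required_mclk_hz(clocks_per_access):
    """Master clock needed to fit every slot into one system cycle."""
    slots = round_up_to_power_of_two(sram_accesses_per_cycle())
    return SYSTEM_HZ * slots * clocks_per_access


print(required_mclk_hz(2))  # two clocks per access, the original plan
print(required_mclk_hz(4))  # four clocks per access
```

Two clocks per access gives the 80MHz figure, and allowing four clocks per access gives the 160MHz mentioned earlier.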

But this is still pretending to be an old computer, and old computers didn't do anything at 80MHz.  I didn't want to attempt a design with multiple clock domains on my first serious FPGA project, so I'm using a single clock which can be gated by the different modules.  These clock enable signals are also generated by the clock generator.  There are two for the CPU, reflecting the two phases of the clock, and two for VIC.

The next module is the memory manager.  This takes memory access requests from the CPU and VIC, and translates them into the right signals for the SRAM.  It also contains a character bitmap ROM stored in an FPGA BRAM.

Finally, there is the video generator, VIC.  At the moment it is a very simple design, just generating VGA timing signals and reading a bitmap from memory.  For this screenshot, it's configured to read from character ROM.

The next obvious step is to get SRAM working.  My plan was to build a very simple fake CPU that simply copied character ROM into RAM, then get VIC to read from RAM instead of ROM.  That's when it all fell apart.  It didn't work, and nothing I tried could change that.  Motivation drained away, I moved on to other things, and the project was stalled.


But then... I recently bought a Digilent Digital Discovery.  It does a number of things, but for me the most important function is the 32 channel logic analyser.  Being able to see the real signals on the real hardware should make debugging this thing possible.

And it did!  After a little bit of work, I discovered a number of problems.  First, a bit of re-jigging in VIC had resulted in it always displaying a blank screen no matter what data it was getting from memory.  Oops.  I had also been rather optimistic in the way I was writing to SRAM.

The original design presented address and data, then pulled /WE low for a cycle, then returned it high.  The datasheet suggested that this might work, as the relevant setup and hold times were all zero.  But changing outputs on an FPGA and receiving those as inputs on the SRAM are different things.  I couldn't guarantee that the address or data weren't changing a little bit later than /WE, and the tracks on the PCB were definitely not all the same length.

That prompted the clock doubling.  The FPGA is now running at 160MHz, giving me four clocks for each memory access.  That lets me stagger the signals in a way that has a better chance of fitting the timing.

And it almost does.  Here's what I get now:

Can you spot the difference?  Those pixels in the bottom right of each character are written by the fake CPU as it copies data, so I can be sure that VIC is fetching data from RAM.  But zoom in closer and look at the right hand side of the Hs.  There are a few missing pixels there.

It's worse in real time.  Pixels flicker on and off all over the screen.  It starts out OK, and gets worse as the chips warm up.  Clearly I'm not quite meeting some timing somewhere.  I added a third phase to the CPU to test this: now it reads ROM, writes to RAM, then reads RAM and compares.  If the result is different, it turns on an error LED.  At full speed the LED is always on.  If I reduce the clock speed, there are no errors at all.
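The logic of that three-phase test is simple enough to model in software.  The Python below is only a behavioural sketch, not the VHDL in the FPGA: `make_ram` and its `stuck_low_mask` parameter are hypothetical, standing in for an SRAM whose writes sometimes lose bits.

```python
def copy_and_verify(rom, ram_write, ram_read):
    """Copy rom into RAM byte by byte; return True if any readback differs."""
    error_led = False
    for address, value in enumerate(rom):  # phase 1: read ROM
        ram_write(address, value)          # phase 2: write to RAM
        if ram_read(address) != value:     # phase 3: read back and compare
            error_led = True               # latched: the LED stays on
    return error_led


def make_ram(stuck_low_mask=0):
    """Hypothetical RAM model; stuck_low_mask clears bits on every write."""
    cells = {}

    def write(address, value):
        cells[address] = value & ~stuck_low_mask & 0xFF

    def read(address):
        return cells[address]

    return write, read


rom = bytes(range(256))
print(copy_and_verify(rom, *make_ram()))      # healthy RAM
print(copy_and_verify(rom, *make_ram(0x01)))  # RAM that drops bit 0
```

A latched error flag like this only says that at least one write went wrong, not how many or where, which fits the "always on at full speed" observation without saying much about the cause.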

So that's where it is now.  There's a little more work to do on the memory controller, because there's no point trying to continue if I can't trust SRAM writes to work.  The next step will be making VIC a little bit closer to the real design, using DMA to read character pointers and colours.  And then, with the ability to display proper data, I can finally start work on making a real CPU.