C640: Some 65020 code

Since the simulator can now handle C64 bitmap graphics, I thought it would be good to write some graphics primitives in proper 65020 code, using the new CPU features.

My little test program starts by entering graphics mode and clearing the screen:

sbt #5, $d011
lda #$18
sta $d018
lda #$15
bra.l clear

That first instruction is actually one of the old ones with a new addressing mode. The 6502 has instructions to set and clear flags within the status register. The 65020 uses four of the extension bits to select different destination registers (for the original opcodes, which are now reinterpreted as acc mode), or an index register for an indexed addressing mode. Three others are combined with the bit number implied in the original opcode's flag to select any of the 32 bits in the word. SBT #5 gets assembled as a variant of SEI.

The other new feature here is the bra.l instruction. This is BEQ, with the P bit set to select a different condition (in this case, it becomes "always"). The .l modifier tells it to push the current value of PC to the stack before branching. This turns it into a JSR-like instruction, with relative addressing. Branch instructions also have four bits to select a base register, if you want to branch relative to something other than PC. We'll see a use for that later.

The clear routine is straightforward. The only thing worth noting is that wider registers make it a lot less wordy than the original 6502:

clear
ldx.w #999 ; indices 999..0 cover the 1000 screen cells
clear1
sta $0400,x
dex.w
bpl clear1
ldx.w #7999 ; indices 7999..0 cover the 8000 bitmap bytes
lda #0
clear2
sta $2000,x
dex.w
bpl clear2
rts

One of the annoyances of Commodore 64 bitmap graphics is the way that the bitmap is arranged in memory. Most hardware does this linearly, with the first N bytes representing pixels in the first line, from left to right, the next N bytes being the second line, and so on. The Commodore 64 does things differently, as a result of re-using some of the hardware that handles the character screen. The first byte represents the first 8 pixels in the first line (line 0). But then the second byte is the first 8 pixels of line 1. This continues, with byte 7 being the start of line 7. Then byte 8 jumps back up to the second set of 8 pixels on line 0. This continues until byte 319, which is the last 8 pixels of line 7. The next 320 bytes repeat the same pattern 8 lines lower.
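
The layout is easier to see by going the other way, from a byte offset back to a screen position. Here is a small Python sketch of that mapping; the function name is made up for illustration and is not part of the 65020 code.

```python
def offset_to_position(offset):
    """Map a byte offset in the bitmap to (pixel line, byte column)."""
    # Each band of 8 pixel lines occupies 320 consecutive bytes.
    band, within_band = divmod(offset, 320)
    # Inside a band, 8 consecutive bytes are stacked vertically in one column.
    column, line_in_band = divmod(within_band, 8)
    return band * 8 + line_in_band, column


for offset in (0, 1, 7, 8, 319, 320):
    print(offset, offset_to_position(offset))
```

Offsets 0 to 7 run down the first column, offset 8 jumps back up to line 0 in the second column, and offset 320 starts the next band at line 8.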

An essential function of any graphics code is going to be a routine to map pixel coordinates to the address of a byte, and a mask for the right bit within that byte. Here's the 65020 way:

getPixelAddr
phx.l x1
; a0 = y & 7
mov.w a0, x1
and a0, #7
; x1 = y/8
lsr.w #3, x1
; addr = 320*(y/8) = (4*(y/8) + (y/8)) * 64
mov.w a1, x1
asl.w #2, a1
add.w a1, x1
asl.w #3, a1
asl.w #3, a1
; addr += low bits of y
add.w a1, a0
; a0 = x & 7 (index into bitmasks)
mov.w a0, x0
and a0, #7
; pre-subtract low bits of x (so adding later only adds the high bits)
sub.w a1, a0
lda a0, bitmasks, a0
; addr += high bits of x
add.w a1, x0
plx.l x1
rts
bitmasks
.byte 128, 64, 32, 16, 8, 4, 2, 1

getPixelAddr takes the x coordinate in X0 and the y coordinate in X1. It returns the offset within the bitmap in A1, and the bit mask for the pixel in A0.
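
For checking the arithmetic, the same steps can be mirrored in Python. This is only a sketch: the variables are named after the registers used in the listing, Python integers stand in for 16 bit registers (so no wrap-around is modelled), and the function name is an invention for this example.

```python
BITMASKS = [128, 64, 32, 16, 8, 4, 2, 1]


def get_pixel_addr(x, y):
    """Return (offset, mask), following the assembly step by step."""
    a0 = y & 7             # line within the 8-line band
    x1 = y >> 3            # band number, y/8
    a1 = x1
    a1 = (a1 << 2) + x1    # 5 * (y/8)
    a1 <<= 3               # 40 * (y/8)
    a1 <<= 3               # 320 * (y/8)
    a1 += a0               # add the low bits of y
    a0 = x & 7             # bit position within the byte
    a1 -= a0               # pre-subtract the low bits of x
    a0 = BITMASKS[a0]      # the leftmost pixel is the high bit
    a1 += x                # adding all of x now only adds x & ~7
    return a1, a0


print(get_pixel_addr(9, 10))     # pixel 1 of byte column 1, line 2 of band 1
print(get_pixel_addr(319, 199))  # bottom right pixel
```

The result should agree with the direct formula 320*(y/8) + (y&7) + (x&~7) for every pixel on the 320 by 200 screen.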

There are a few things worth noting here. The 65020 has 12 main registers (A0-A3, X0-X3, Y0-Y3). The A and X registers are almost completely interchangeable. The Y registers can be used as the destination of ALU instructions, but not the source. It took a few false starts, but I've eventually settled on a convention for register use that I think might suffice for the future. X registers are parameters, and their values are not changed by the routine. A registers are results, and their values can be changed. Y registers are for pointers, and I'm not sure whether they should be saved by routines or not. They probably should be.

So getPixelAddr starts by saving the value of the one X register that it modifies, and ends by restoring it. The rest is the usual sort of bit-fiddling that you expect to see in code like this. I originally wrote it to calculate the mask first, then multiply y/8 by 320, then add all the components together. A bit of re-arranging followed to make it use fewer registers, and also to work around the awkwardness of the two-operand instructions. Being able to have a destination register different from the sources is a very useful feature if you're writing code by hand, but the 65020 just doesn't have enough opcode bits to do it.

The 2 bit constant in acc mode instructions like ASL is occasionally useful. But you can see here I needed a shift by 6 bits, which doesn't quite fit and has to be done in two batches of 3.

getPixelAddr is a lot shorter and faster than the equivalent in 8 bit 6502 code. But it's not the sort of thing you want to call more often than necessary. For drawLine we'd like to call it once at the start of the line, and then use simpler modifications of the bit mask and address to move from the current pixel to one of its neighbours. drawLine uses four helper routines to move right, left, down, and up:

drawLine_incX
rrb a0
bcc drawLine_incX_exit
add.w a1, #8
drawLine_incX_exit
rts

drawLine_decX
rlb a0
bcc drawLine_decX_exit
sub.w a1, #8
drawLine_decX_exit
rts

drawLine_incY
inx.w a1
mov a2, a1
and a2, #7
bne drawLine_incY_skip
add.w a1, #312
drawLine_incY_skip
rts

drawLine_decY
mov a2, a1
and a2, #7
bne drawLine_decY_skip
sub.w a1, #312
drawLine_decY_skip
dex.w a1
rts

To move right or left, I use the RRB and RLB instructions. These are similar to the old ROR and ROL instructions, but instead of rotating a 9 bit value including the carry flag, they rotate within the 8, 16, or 32 bits selected by the instruction width (here it's 8 bit). RRB A0 rotates the low 8 bits of A0 one bit to the right. The old right-most bit moves into bit 7, and is also copied to the C flag. Most of the time we can move right or left by just doing this rotate. If C is clear after it, nothing more needs to be done. If C is set, that means we've stepped outside this byte and need to move on to the next. Because we're using RRB and RLB instead of ROR and ROL, the bit that was shifted out has already been shifted into the other end, and all we need to do is add 8 to the offset (moving right) or subtract 8 from it (moving left). The addition is done with the ADD instruction, which is an add without carry in. It's the old ADC instruction with the P bit of the extension set.
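
The four helpers can be modelled in Python to check that stepping always lands on the same offset and mask that a fresh address calculation would give. This is a sketch only: the function names are made up, and the rotates are written from the description above rather than from the simulator.

```python
def rrb8(value):
    """Rotate the low 8 bits right by one; return (result, carry)."""
    carry = value & 1
    return ((value >> 1) | (carry << 7)) & 0xFF, carry


def rlb8(value):
    """Rotate the low 8 bits left by one; return (result, carry)."""
    carry = (value >> 7) & 1
    return ((value << 1) & 0xFF) | carry, carry


def inc_x(offset, mask):
    mask, carry = rrb8(mask)
    if carry:                # the bit wrapped round: move to the next byte column
        offset += 8
    return offset, mask


def dec_x(offset, mask):
    mask, carry = rlb8(mask)
    if carry:                # the bit wrapped round: move to the previous column
        offset -= 8
    return offset, mask


def inc_y(offset, mask):
    offset += 1
    if offset & 7 == 0:      # stepped past line 7 of the band
        offset += 312        # 320 - 8: same column, first line of the next band
    return offset, mask


def dec_y(offset, mask):
    if offset & 7 == 0:      # on line 0 of the band
        offset -= 312        # with the -1 below, lands on line 7 of the band above
    return offset - 1, mask


print(inc_x(7, 1))     # pixel (7, 7) -> pixel (8, 7)
print(inc_y(7, 128))   # pixel (0, 7) -> pixel (0, 8)
```

Moving right from the last pixel of a byte gives a mask of 128 in the next column, and moving down from line 7 jumps 313 bytes ahead in total.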

Now for drawLine itself. First, a bit of set-up:

drawLine
phx.l x0
phx.l x1
phx.l x2
phx.l x3
ldy.l y2, #drawLine_incX
sbx.w x2, x0 ; x2 = dx
bpl drawLine_noswap
ldy.l y2, #drawLine_decX
lda a3, #0
sub.w a3, x2
mov.w x2, a3
drawLine_noswap
bra.l getPixelAddr
ldy.l y3, #drawLine_incY
; get displacement of endpoint
sbx.w x3, x1 ; x3 = dy
bpl drawLine_ypositive
; y negative
ldy.l y3, #drawLine_decY
lda a3, #0
sub.w a3, x3
mov.w x3, a3
drawLine_ypositive
mov.w y0, x2 ; loop count = dx x major
cpx.w x2, x3
bgt drawLine_xmajor
mov.w y0, x3 ; loop count = dy y major
; swap x2, x3
mov.w a2, x2
mov.w x2, x3
mov.w x3, a2
; swap y2, y3
mov.w a2, y2
mov.w y2, y3
mov.w y3, a2
drawLine_xmajor
mov.w a3, x2 ; a3 = error
sub.w a3, x3
adx.w x2, x2 ; x2 = 2dx (dy)
adx.w x3, x3 ; x3 = 2dy (dx)

drawLine takes (x1, y1) in X0 and X1, (x2, y2) in X2 and X3.

Save the X registers that are changed, then load Y2 with the address of the appropriate horizontal movement helper: if x2 >= x1 we increment x, otherwise we decrement. After calling getPixelAddr for the starting point, we load Y3 with the helper routine for vertical movement: increment y if y2 >= y1, otherwise decrement.

As a side-effect of those tests, we've also calculated dx = abs(x2-x1) and dy = abs(y2-y1) and put them in X2 and X3. If dx is not greater than dy, we need to swap dx and dy, and the pointers to the helper routines in Y2 and Y3.

Finally, we initialise A3 to the difference between dx and dy, and double X2 and X3. Now we're ready to draw:

drawLine_loop
mov a2, a0
ora a2, $2000, a1
sta a2, $2000, a1
bra.l 0,y2
sub.w a3, x3
bpl drawLine_skip
bra.l 0,y3
add.w a3, x2
drawLine_skip
dey.w y0
bne drawLine_loop
plx.l x3
plx.l x2
plx.l x1
plx.l x0
rts

Set the current pixel, call the helper routine that Y2 points to (which will step right or left for x-major lines, down or up for y-major). Then subtract X3 from A3 (this will be double either dx or dy). If A3 goes negative, we need to step on the other axis and add X2. Do this in a loop, and the whole line is drawn. All that's left is to restore the registers we saved at the start and return.
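
The set-up and loop together can be sketched in Python. To keep the example short it tracks (x, y) coordinates instead of a bitmap offset and mask, and returns the list of pixels instead of writing to memory; the function and variable names are illustrative only. The assembly loop decrements the count before testing it, so this sketch only matches it for lines with at least one step.

```python
def draw_line(x1, y1, x2, y2):
    """Return the pixels plotted, following the steps of the assembly."""
    step_major = (1, 0)                  # Y2: horizontal helper
    dx = x2 - x1
    if dx < 0:
        step_major, dx = (-1, 0), -dx
    step_minor = (0, 1)                  # Y3: vertical helper
    dy = y2 - y1
    if dy < 0:
        step_minor, dy = (0, -1), -dy
    count = dx                           # Y0: loop count for an x-major line
    if not dx > dy:                      # y-major: swap deltas and helpers
        count = dy
        dx, dy = dy, dx
        step_major, step_minor = step_minor, step_major
    error = dx - dy                      # A3
    dx, dy = 2 * dx, 2 * dy              # X2, X3
    x, y = x1, y1
    pixels = []
    for _ in range(count):
        pixels.append((x, y))            # set the current pixel
        x, y = x + step_major[0], y + step_major[1]
        error -= dy
        if error < 0:                    # step on the other axis as well
            x, y = x + step_minor[0], y + step_minor[1]
            error += dx
    return pixels


print(draw_line(0, 0, 5, 2))
```

Going by the listing, the loop runs once per step along the major axis and plots before it moves, so the start point is drawn and the end point is not. That works out well for closed shapes, where the next line starts on that pixel.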

BRA.L 0, Y2 uses a different base register to call a routine whose address is stored in Y2. Normally, a branch instruction will add an offset to PC and jump to that address. Here, the offset is 0, and the base is stored in Y2. This is a very useful feature.

Branches have another extension bit, which I haven't used yet. This enables indirection. Instead of register+offset pointing to the branch destination directly, indirection makes it point to a memory location that contains the address of the destination. This allows tables of function pointers, which could be used to implement virtual functions in a language like C++. If Y0 points to an object, we might say

ldy y1, (0,y0)
bra.il 10, y1

Y0 points to the start of the object. Its first word (two 16-bit bytes) is a pointer to the virtual table. LDY Y1, (0, Y0) loads this pointer into Y1. The next instruction adds 10 to this pointer, loads the address of the routine we want to call, and calls it.
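
The two levels of lookup can be illustrated with a toy Python model. It is heavily simplified: memory is a dictionary in which each entry holds a whole pointer (rather than two 16-bit bytes), and all three addresses are invented for the example.

```python
# Hypothetical addresses, chosen only for illustration.
OBJECT = 0x3000     # address held in Y0
VTABLE = 0x4000     # address of the virtual table
ROUTINE = 0x5000    # address stored at offset 10 in the table

memory = {
    OBJECT: VTABLE,         # first word of the object points to the table
    VTABLE + 10: ROUTINE,   # table entry at offset 10
}


def virtual_target(memory, y0, offset):
    """Return the address that the indirect branch would call."""
    y1 = memory[y0 + 0]          # ldy y1, (0,y0)
    return memory[y1 + offset]   # bra.il offset, y1 reads the destination


print(hex(virtual_target(memory, OBJECT, 10)))
```

Without the indirection bit, the branch would go to VTABLE + 10 itself instead of to the address stored there.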

Throughout drawLine, I'm using the MOV instruction to copy values from one register to another. This isn't a real instruction, but gets translated by the assembler into the appropriate transfer instruction. A move from an X register to an A register, for example, will become TXA. The existing transfer instructions cover a subset of moves between A, X, Y, and S. Two extension bits for each of the source and destination give us A0-A3, X0-X3, Y0-Y3, and P, Z, SP, or PC. Another extension bit for each of source and destination changes the group. A becomes Y, Y becomes A, X becomes S, and S becomes X. This gives a complete set of moves from any register to any other register, but the encoding is complicated enough that it's best to leave it to the assembler.

I think the same applies to the ALU instructions. I currently have different instructions for ADD, ADX, and ADY, which do the same operation (add) on A, X, or Y registers. Since I'm always explicitly specifying the register, it would be much nicer to just say ADD for all of them, and let the assembler choose the right opcode.

The drawLine routine has also highlighted some missing instructions. NEG (negate) and ABS (absolute value) would be very useful. LEA (load effective address, which loads the address specified by an addressing mode into a register, so it can be used later) would also be useful. I also need to add variants of the shift and rotate instructions that allow a variable shift amount. And drawLine would benefit from a SWP instruction to swap the contents of two registers (or possibly a register and a memory location). I'm not sure about that one - it would require adding another write port to the register file, just for one instruction. It's probably not worth it.

The image at the top was created by a simple test program. Most of it isn't worth describing, but the pattern in the bottom left brought up one surprise:

; Drawing lines in each octant
ldy y0, #0
octantLoop
ldx.w x0, octantLines,y0
cpx.w x0, #$ffff
beq octant_done
ldx.w x1, octantLines+1,y0
ldx.w x2, octantLines+2,y0
ldx.w x3, octantLines+3,y0
phy y0
bra.l drawLine
ply y0
iny #4, y0
bra octantLoop
octant_done

This tests the drawLine routine, getting it to draw a line in each octant. The surprise wasn't in the code, but in the table of coordinates:

octantLines
.byte 10, 180, 40, 190
.byte 40, 190, 50, 160
.byte 50, 160, 20, 150
.byte 20, 150, 10, 180
.byte 20, 190, 50, 180
.byte 50, 180, 40, 150
.byte 40, 150, 10, 160
.byte 10, 160, 20, 190
.byte $ffff

These are all 16 bit values (even though most of them are less than 256), but they're being assembled with .byte. I had originally used .word, but of course that won't work. For compatibility with existing source code, .word has to assemble values to two bytes. Since bytes now contain 16 bits, it splits a 16 bit value into two 8 bit parts, and puts one part in the low 8 bits of each of two bytes. That's not what you want when you're writing new 65020 code. The .w modifier on instructions tells the CPU to use all 16 bits from the byte in question. So .byte has to deal with 16 bit values. .long will handle 32 bit values, putting them in two adjacent bytes. .word will very rarely be used in new code.
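
The difference between the three directives can be sketched in Python, with each list element standing for one 16-bit memory location. The function names are made up, and the order of the two halves (low part first, as on the 6502) is an assumption rather than something stated above.

```python
def emit_byte(value):
    """.byte: one memory location holding a full 16-bit value."""
    return [value & 0xFFFF]


def emit_word(value):
    """.word: two locations, with 8 bits in the low half of each."""
    return [value & 0xFF, (value >> 8) & 0xFF]   # assumed low part first


def emit_long(value):
    """.long: a 32-bit value across two adjacent locations."""
    return [value & 0xFFFF, (value >> 16) & 0xFFFF]   # assumed low part first


print([hex(v) for v in emit_byte(0xFFFF)])
print([hex(v) for v in emit_word(0x1234)])
print([hex(v) for v in emit_long(0x12345678)])
```

With .word, the coordinate table would take twice as many locations and the offsets +1, +2, +3 in the test loop would no longer line up with the values.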


Saturday, July 7, 2018
