CD16 User's Manual

The CD16 processor

The CD16 is a cross between a stack CPU and a register CPU. It's a strange critter, like the Nickelodeon cartoon character "CatDog". It's also a 16-bit chip, hence the name CD16. Both CPU camps have strong adherents, most of whom are influenced by their language of choice. Strong optimizers transcend language, mapping the chosen language to the underlying hardware using reduction techniques.

A strongly optimized Forth compiler was developed shortly before the CPU design opportunity presented itself. So, the ISA is optimized for the output of the compiler. The CPU will run equally well with C, if targeted by a good C optimizer.

The CD16 was designed with the following goals:

  • Perform most operations in one clock cycle.
  • Keep the pipeline shallow, so branches and calls aren't too expensive.
  • Provide a relatively rich instruction set.
  • Provide good coprocessor support, so you can add application specific instructions.

Being not heavily pipelined, it's not especially fast compared to RISCs.

The gain to be had by adding application specific instructions swamps any possible gain to be had by deeper pipelining.

The CPU (synthesized by Xilinx XST) uses about 300 slices (150 CLBs) in a Xilinx Spartan. The smallest Xilinx part a CD16-based SOC can be built in is a Spartan XC2S30, which costs about $10. With external Flash program memory, configuration logic and a crystal, the BOM is about $20 in small quantity (1Q03 pricing).

So how fast is it? After a low-effort place-and-route using Xilinx ISE 5.1's XST synthesis tool, a Spartan-based system runs at 30 to 40 MHz depending on the part. A Virtex runs about 10 MHz faster. That's using Block RAM for the stack, program and data memories. Obviously, off-chip program memory will slow things down a bit, but not too much if time-critical code is kept on-chip.

SOC Basic Architecture

A minimal CD16-based SOC configuration uses a CD16 connected to:

  • A program ROM
  • A data RAM
  • A stack RAM
  • A do-nothing coprocessor stub

The architecture presented here is designed to fit larger applications held in external memory. The CPU, data memory and some program memory reside on-chip.

Stack RAM implementation

The stack RAM is a Dual Port RAM with asynchronous read and synchronous write. Implementing it is somewhat awkward in most FPGAs. However, Xilinx and Altera parts provide Block RAM that can be made to work nicely. Block RAM is synchronous, so you have to clock it at double the CPU clock speed and trigger a read early in the CPU's clock cycle. With a 2x clock, you can read at 50% and write at 100%. Starting the read halfway through the CPU cycle is reasonable: the time required to decode an instruction and set up the stack address is roughly equal to the time required to get data from the stack and pass it through the ALU.
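
For reference, the behavior the CPU expects from the stack RAM can be modeled in a few lines of VHDL. The sketch below is only illustrative (the generic, depth and port names are invented, and DPRAM.VHD remains the authoritative model), but it shows the two-port, asynchronous-read, synchronous-write behavior described above.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity stack_ram_sketch is
  generic (AWIDTH : integer := 10);                        -- depth is illustrative
  port (clk   : in  std_logic;
        a_adr : in  std_logic_vector(AWIDTH-1 downto 0);   -- port A
        a_we  : in  std_logic;
        a_di  : in  std_logic_vector(15 downto 0);
        a_do  : out std_logic_vector(15 downto 0);
        b_adr : in  std_logic_vector(AWIDTH-1 downto 0);   -- port B
        b_we  : in  std_logic;
        b_di  : in  std_logic_vector(15 downto 0);
        b_do  : out std_logic_vector(15 downto 0));
end stack_ram_sketch;

architecture behavioral of stack_ram_sketch is
  type ram_t is array (0 to 2**AWIDTH-1) of std_logic_vector(15 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- synchronous write on both ports
  process (clk)
  begin
    if rising_edge(clk) then
      if a_we = '1' then ram(to_integer(unsigned(a_adr))) <= a_di; end if;
      if b_we = '1' then ram(to_integer(unsigned(b_adr))) <= b_di; end if;
    end if;
  end process;
  -- asynchronous (combinational) read on both ports
  a_do <= ram(to_integer(unsigned(a_adr)));
  b_do <= ram(to_integer(unsigned(b_adr)));
end behavioral;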

Since the RAM address will set up quickly, a faster RAM clock might squeeze out more speed. For example, the stack RAM could be clocked at 3x the CPU frequency, starting a read 1/3 way through a cycle. In this case, the synthesis tool's timing constraints should be set up to accommodate certain multi-cycle paths. Otherwise, it won't optimize for the highest speed.

In an ASIC, you can make the required RAM out of asynchronous RAM as shown in Figure 1.

Figure 1. Stack RAM formed from Asynchronous Dual Port RAM

The stack memory can't use a full clock cycle to perform a write. Instead, the RAM is written in the first half of the clock cycle and read in the second half.

The waveforms associated with this configuration are shown below. 

Latching the read address after it has settled may be a good idea from a power consumption point of view, since it avoids a lot of thrashing of the RAM address logic as the stack address adders settle. In many FPGAs, you're forced to do this anyway. Read enables are supplied by the CD16. When a port's read enable is low, you can put anything on the bus because it isn't needed. With FPGA block RAMs, the output doesn't change when read enable is low, so there's less switching and less power consumption.

Very few instructions write to both ports of the RAM at the same time; they are the auto-incrementing memory fetch instructions, and they are not allowed to write to the same address. This condition is flagged as an error in the simulation, as shown below.
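
In a VHDL model, that check amounts to a single concurrent assertion placed in the stack RAM architecture, something like the following (the signal names follow the sketch above, not necessarily those used in DPRAM.VHD):

-- simulation check: both ports written at the same address in the same cycle
assert not (a_we = '1' and b_we = '1' and a_adr = b_adr)
  report "Stack RAM: simultaneous write to the same address on both ports"
  severity error;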

Many other FPGA architectures are available which don't support true dual port RAM. For these, you can use multiple clock cycles and a state machine to emulate the required RAM. Since the memory is much faster than the ALU and other delays in the FPGA, performance isn't degraded too much. The file DPRAM.VHD is a model of the stack memory.
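
One simple form of that idea, sketched below with invented names, time-multiplexes a single read/write port between the two logical ports on a 2x clock: port A gets the first half of the CPU cycle, port B the second. This is only an illustration; the actual scheme in DPRAM.VHD may differ.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dpram_emulated is
  generic (AWIDTH : integer := 10);
  port (clk2x        : in  std_logic;                            -- twice the CPU clock
        phase        : in  std_logic;                            -- '0' = serve port A, '1' = serve port B
        a_adr, b_adr : in  std_logic_vector(AWIDTH-1 downto 0);
        a_we,  b_we  : in  std_logic;
        a_di,  b_di  : in  std_logic_vector(15 downto 0);
        a_do,  b_do  : out std_logic_vector(15 downto 0));
end dpram_emulated;

architecture rtl of dpram_emulated is
  type ram_t is array (0 to 2**AWIDTH-1) of std_logic_vector(15 downto 0);
  signal ram : ram_t := (others => (others => '0'));
  signal adr : std_logic_vector(AWIDTH-1 downto 0);
  signal we  : std_logic;
  signal di  : std_logic_vector(15 downto 0);
begin
  -- select which logical port owns the single physical port this half-cycle
  adr <= a_adr when phase = '0' else b_adr;
  we  <= a_we  when phase = '0' else b_we;
  di  <= a_di  when phase = '0' else b_di;

  process (clk2x)
  begin
    if rising_edge(clk2x) then
      if we = '1' then
        ram(to_integer(unsigned(adr))) <= di;
      end if;
      -- register the read data into whichever port is being served
      if phase = '0' then
        a_do <= ram(to_integer(unsigned(adr)));
      else
        b_do <= ram(to_integer(unsigned(adr)));
      end if;
    end if;
  end process;
end rtl;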

Program Memory

The program counter spans 64K words of program space. So, the maximum code size is 128K bytes. In a system with a RAM-based FPGA, a flash memory can hold both the bitstream and the program code. Usually, a $1 CPLD can be used to control bitstream loading. Today's flash memories are pretty big, so you're likely to have more than 64Kx16 of code space available. This is where banking is useful.

The CD16 has a BANK instruction (0111 0001 01nn nnnn) that loads the BANK register. During instruction fetch, the upper address bits of the ROM will be zero. During a data fetch (or store) the upper address comes from the BANK register. So if you have a lot of ROM data or a large RAM requirement, you can use banking to cover it.
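
The address extension itself is just a multiplexer on the upper ROM address bits. The sketch below assumes a 22-bit external address and invented signal names; only the 6-bit width of the bank field (the nn nnnn bits of the BANK opcode) comes from the instruction format above.

library ieee;
use ieee.std_logic_1164.all;

entity bank_mux is
  port (fetch     : in  std_logic;                       -- '1' during instruction fetch
        pc        : in  std_logic_vector(15 downto 0);   -- program counter
        data_addr : in  std_logic_vector(15 downto 0);   -- data fetch/store address
        bank_reg  : in  std_logic_vector(5 downto 0);    -- loaded by the BANK instruction
        rom_addr  : out std_logic_vector(21 downto 0));  -- address presented to the ROM
end bank_mux;

architecture rtl of bank_mux is
begin
  -- instruction fetches see bank 0; data accesses use the BANK register
  rom_addr <= "000000" & pc when fetch = '1'
              else bank_reg & data_addr;
end rtl;

If the full 6-bit field is implemented, data fetches can reach 64 banks of 64K words each, while instruction fetches always see bank 0.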

In an FPGA, a large application will generally have to run from off-chip code storage. 55ns, 3.3V Flash memory is cheap and common. With a 25 MHz clock, it requires one wait state. However, the program ROM can be partitioned between on-chip block RAM and off-chip Flash such that time-critical code runs on-chip. Execution starts at program address 0. The program ROM uses synchronous read. 

Figure 2. Synchronous program or data RAM formed from asynchronous RAM
Program and data memories use synchronous read and write. The read is delayed: when you perform a read instruction, you get the data read out by the previous read instruction, not the current one.

Avoid doing a read immediately after a write. If a read operation is performed immediately after a write operation, the result is indeterminate; that is, the next read instruction might return the wrong data.
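
For simulation, the registered-read behavior described for Figure 2 can be modeled with the usual inferred-RAM template shown below (names and depth are illustrative). The point to notice is that read data is clocked into dout, so it is valid one cycle after the address is presented.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sync_ram is
  generic (AWIDTH : integer := 12);
  port (clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  std_logic_vector(AWIDTH-1 downto 0);
        din  : in  std_logic_vector(15 downto 0);
        dout : out std_logic_vector(15 downto 0));
end sync_ram;

architecture rtl of sync_ram is
  type ram_t is array (0 to 2**AWIDTH-1) of std_logic_vector(15 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= din;
      end if;
      dout <= ram(to_integer(unsigned(addr)));  -- registered read: data appears next cycle
    end if;
  end process;
end rtl;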

Data Memory

Data space is the place to put I/O ports and on-chip peripherals, as well as your data memory. The compiler supports bit set and bit clear operations by assuming that the upper part of data space consists of I/O ports:

Address range   Wait clocks   Usage
0000-7FFF       0             Data RAM
8000-FFFF       0             I/O ports

I/O ports are mapped to data memory starting at address 0x8000. If you deal with individual bits, assign a different bit to each address and connect it to D15. This eliminates the need to mask off bits when changing the state of one port pin.
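
As an illustration of that convention, here is a hypothetical output port block in which each of 16 consecutive data-space addresses controls one pin from D15. The base address (0x8000 here), the decode width and all signal names are assumptions for the sketch, not part of the CD16 itself.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bit_port is
  port (clk  : in  std_logic;
        wr   : in  std_logic;                        -- data-space write strobe
        da   : in  std_logic_vector(15 downto 0);    -- data address from the CPU
        dob  : in  std_logic_vector(15 downto 0);    -- data bus from the CPU
        pins : out std_logic_vector(15 downto 0));   -- one output pin per address
end bit_port;

architecture rtl of bit_port is
  signal reg : std_logic_vector(15 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- decode addresses 0x8000..0x800F: each one writes D15 into its own bit
      if wr = '1' and da(15 downto 4) = x"800" then
        reg(to_integer(unsigned(da(3 downto 0)))) <= dob(15);
      end if;
    end if;
  end process;
  pins <= reg;
end rtl;

Setting or clearing a pin is then a single store of 0x8000 or 0x0000 to that pin's address, with no read-modify-write.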

Since memory is synchronous read, the result of a read is delayed by one clock. When there are multiple input sources, you need to select between them using a delayed version of DA. Typically, you would use something like DA_R, which is clocked when RD=1.
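
Concretely, the read-side multiplexing might look like the sketch below. DA_R and RD follow the text above; the peripheral set, the 4-bit decode and the rest of the names are assumptions.

library ieee;
use ieee.std_logic_1164.all;

entity io_read_mux is
  port (clk        : in  std_logic;
        rd         : in  std_logic;                       -- data-space read strobe
        da         : in  std_logic_vector(15 downto 0);   -- data address
        ram_dout   : in  std_logic_vector(15 downto 0);   -- data RAM read data
        uart_dout  : in  std_logic_vector(15 downto 0);   -- example peripheral
        timer_dout : in  std_logic_vector(15 downto 0);   -- example peripheral
        din        : out std_logic_vector(15 downto 0));  -- read data back to the CPU
end io_read_mux;

architecture rtl of io_read_mux is
  signal da_r : std_logic_vector(3 downto 0) := (others => '0');
begin
  -- capture the upper address bits when the read is issued; the data arrives
  -- one clock later, so the mux select must be this delayed address
  process (clk)
  begin
    if rising_edge(clk) then
      if rd = '1' then
        da_r <= da(15 downto 12);
      end if;
    end if;
  end process;

  with da_r select
    din <= uart_dout  when "1000",    -- 0x8xxx
           timer_dout when "1001",    -- 0x9xxx
           ram_dout   when others;    -- everything else is data RAM
end rtl;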

Interrupts

The CD16 supports seven prioritized interrupts. INT(1) through INT(7) are rising-edge triggered interrupt request pins. Interrupt triggers must be wide enough to be caught by the clock. In an asynchronous system this means two or more clocks. In a synchronous system, one clock will do. INT(1) has the highest priority.

The CD16 handles interrupts a little differently than traditional processors. Interrupts are only serviced upon execution of a RET or RETD instruction. Since Forth is very call intensive, not much time elapses between calls, so you don't necessarily have a lot of latency. If a CODE word eats up a lot of time, make sure it occasionally executes a RET instruction so that interrupts are serviced in a timely manner.

The advantage of this scheme is that between calls, the ISR is allowed to trash many registers. For example, W, carry, overflow and G4..G7 are used internally by Forth words but aren't needed externally, so the ISR is free to modify them. Plus, you don't have to worry about interrupts occurring during critical sections of code -- just don't put a CALL or RET there.

An ISR can use RET to relinquish control to other interrupts or to the main program. Prioritization determines who's next when two or more interrupts are pending.

The cost of servicing an interrupt is small. The shortest possible ISR (a single RETI instruction) consumes two clock cycles total. The cost of context switching is greatly reduced, at the expense of increased interrupt latency. However, since this is a soft CPU many options are available (such as buffering) for meeting hard realtime requirements.

Addresses 8 thru 15 of stack memory contain the addresses of interrupt service routines.

Stack Addr   Vector
08           Level 7 interrupt (lowest)
09           Level 6 interrupt
0A           Level 5 interrupt
0B           Level 4 interrupt
0C           Level 3 interrupt
0D           Level 2 interrupt
0E           Level 1 interrupt (highest)
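
The table above, together with the edge-triggered request pins, suggests the kind of front-end logic involved: catch rising edges on INT(7) down to INT(1), priority-encode the pending requests (INT(1) wins), and form the low bits of the vector's stack address. The sketch below is a stand-alone illustration with invented names, not the CD16 core's actual implementation.

library ieee;
use ieee.std_logic_1164.all;

entity int_prio is
  port (clk, reset : in  std_logic;
        int_in     : in  std_logic_vector(7 downto 1);   -- rising-edge interrupt requests
        int_ack    : in  std_logic_vector(7 downto 1);   -- clears a pending request
        pending    : out std_logic;                      -- any interrupt pending
        vector     : out std_logic_vector(3 downto 0));  -- low bits of the vector's stack address
end int_prio;

architecture rtl of int_prio is
  signal prev, req : std_logic_vector(7 downto 1) := (others => '0');
begin
  process (clk, reset)
  begin
    if reset = '1' then
      prev <= (others => '0');
      req  <= (others => '0');
    elsif rising_edge(clk) then
      prev <= int_in;
      -- latch rising edges, clear on acknowledge
      req  <= (req or (int_in and not prev)) and not int_ack;
    end if;
  end process;

  pending <= '1' when req /= "0000000" else '0';

  -- INT(1) wins over INT(2) and so on; vector stack address = 0x0F - level
  vector <= x"E" when req(1) = '1' else
            x"D" when req(2) = '1' else
            x"C" when req(3) = '1' else
            x"B" when req(4) = '1' else
            x"A" when req(5) = '1' else
            x"9" when req(6) = '1' else
            x"8";
end rtl;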

Coprocessor

COP16.VHD defines a coprocessor with two input ports and two output ports plus some control lines. The port list is:

entity coprocessor is
  port (reset, clk : in  std_logic;
        CPA    : out std_logic_vector(15 downto 0);  -- Address for data RAM
        CPO    : out std_logic_vector(15 downto 0);  -- Output to stack
        YB     : in  std_logic_vector(15 downto 0);  -- Input from data memory
        DI     : in  std_logic_vector(15 downto 0);  -- Input from stack
        CPctrl : in  std_logic_vector(6 downto 0));  -- Control
end coprocessor;

CPctrl(6)='1' when a coprocessor instruction is being decoded; when it is, CPA is routed to the data memory's address bus. Various IR bits control stack writes. The instruction format is 0010 cccc ccPW bbbb, where W writes CPO back to stack location B and P post-increments the stack pointer. All coprocessor instructions cause a data memory read cycle. When the coprocessor is configured as a MAC unit, an operand can be pulled from data memory and stack memory at the same time; a FIR filter can run at 1 MAC per clock this way.
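
As an example of what can live behind that port list, the following architecture sketches a simple MAC unit for the coprocessor entity above. The accumulate-on-every-coprocessor-instruction behavior, the free-running operand address counter and the choice of returning the accumulator's high half on CPO are assumptions made for illustration; they are not the documented behavior of COP16.VHD or of the remaining CPctrl bits.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

architecture mac_sketch of coprocessor is
  signal acc  : signed(31 downto 0)   := (others => '0');
  signal addr : unsigned(15 downto 0) := (others => '0');
begin
  process (clk, reset)
  begin
    if reset = '1' then
      acc  <= (others => '0');
      addr <= (others => '0');
    elsif rising_edge(clk) then
      if CPctrl(6) = '1' then                     -- a coprocessor instruction is executing
        acc  <= acc + (signed(DI) * signed(YB));  -- one MAC per clock: stack op * memory op
        addr <= addr + 1;                         -- step through the coefficient table
      end if;
    end if;
  end process;

  CPA <= std_logic_vector(addr);                  -- address of the data-memory operand
  CPO <= std_logic_vector(acc(31 downto 16));     -- high half of the accumulator back to the stack
end mac_sketch;

With a coefficient table in data RAM and samples coming from the stack, this is the 1-MAC-per-clock FIR arrangement mentioned above.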


CD16 internal signals

Here are some tables that help make sense out of how some internal CD16 signals are decoded.

Stack param A addressing

rexx   AA                       xfb                      usage (bump/postinc/predec)
0000   XP+IR[6:4] or IR[6:4]    XP+IR[7:4] unsigned      param A
0001   XP-1                     XP-1                     pre-decrement
0010   XP                       XP+1                     post-increment
0011   XP                       XP-1                     post-decrement
0100   XP+IR[7:4] unsigned      XP+IR[7:4] unsigned
0101   XP+IR[7:4] unsigned      XP+IR[7:4] unsigned
0110   XP                       XP+IR[7:4] unsigned
0111   XP                       XP-1                     post-decrement
1000   XP+IR[9:4] unsigned      XP+IR[9:4] unsigned
1001   XP-1                     XP-1                     pre-decrement
1010   XP                       XP+1                     post-increment
1011   XP                       XP-1                     post-decrement
1100   XP+IR[9:4] signed        XP+IR[9:4] signed
1101   XP+IR[9:4] signed        XP+IR[9:4] signed
1110   XP                       XP+IR[9:4] signed
1111   XP                       XP-1                     post-decrement

Stack param B shifts

ShiftOp   00       01        10          11
0000      B        B+1       B+2         B-1
0100      Not B    Neg B     -B+1        -B-2
1000      B<<2     ROLC B    B*2+W(n)    ROL B
1100      B>>1     RORC B    W(n)+B/2    B/2

 
