FPGA CPU Cores

 

It's funny how the requirements placed on a processor can shape its instruction set.

Want a high clock frequency? For some applications this is the only solution that makes sense. If the application has a low level of instruction parallelism, not many instructions can execute at once. What's needed is a processor that can execute instructions sequentially, one at a time, at a high rate, and the only way to do that is with a high clock frequency. Some applications are highly complex in nature and not suited to parallel execution of instructions. For these applications a superscalar processor is overkill: the nature of the application limits performance to sequential, single-instruction execution, so the extra issue width goes unused. For some embedded applications a processor with a high clock frequency may also make sense if only a limited number of clock sources are available. If there is only a single 100 MHz clock available, then the processor has to be able to run at 100 MHz. A processor with a small footprint is also often desirable for embedded applications.

Need better overall performance? Can you make use of some instruction parallelism? Are the instructions to be executed somewhat independent of each other, with limited changes in instruction flow? Is a somewhat lower clock frequency acceptable, and are more logic resources available? Maybe a processor with an overlapped pipeline is in order.

Do you really need maximum performance? Is the nature of the application highly parallel? Is a lower clock frequency acceptable? Maybe a superscalar processor is in order.

Want a high clock frequency? Look at using a simple sequential design. Use flag registers for branching. Don't worry too much about instruction interactions, as the instructions execute sequentially, one at a time. Make the instructions powerful. Use code compression techniques such as variable length instruction encodings. Keep the design small.

Want a processor with an overlapped pipeline for better performance? Take a serious look at eliminating instruction inter-dependencies, in particular the flags register commonly found in sequential, non-overlapped designs. Dependent instructions slow the processor down because the pipeline has to stall until the dependency is resolved. Also take a serious look at making all the instructions a fixed size, with just a few formats, for decoding simplicity.
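The cost of dependent instructions can be illustrated with a toy model. The sketch below is not taken from any of the cores on this page: it assumes a single-issue, in-order pipeline in which a result becomes usable a fixed number of cycles (`latency`, a hypothetical parameter) after its instruction issues, and it simply counts the cycles spent waiting. The register names, including `flags`, are placeholders.

```python
def count_stalls(program, latency=2):
    """Count stall cycles in a toy single-issue, in-order pipeline.

    program: list of (dest, sources) pairs; dest may be None (e.g. a branch).
    A result is assumed usable `latency` cycles after its instruction issues.
    """
    ready = {}    # register name -> first cycle in which its value can be read
    cycle = 0     # cycle in which the next instruction would like to issue
    stalls = 0
    for dest, sources in program:
        # An instruction cannot issue before all of its sources are ready.
        earliest = max([ready.get(reg, 0) for reg in sources] + [cycle])
        stalls += earliest - cycle
        cycle = earliest
        if dest is not None:
            ready[dest] = cycle + latency
        cycle += 1
    return stalls


# Each instruction needs the result of the one before it.
dependent = [("r1", ["r2", "r3"]), ("r4", ["r1", "r5"]), ("r6", ["r4"])]
# No instruction reads a result produced by another.
independent = [("r1", ["r2", "r3"]), ("r4", ["r5", "r6"]), ("r7", ["r8"])]
# A compare writes the flags and a branch reads them straight away.
compare_branch = [("flags", ["r1", "r2"]), (None, ["flags"])]

for name, prog in [("dependent", dependent),
                   ("independent", independent),
                   ("compare + branch", compare_branch)]:
    print(f"{name:18s} stall cycles: {count_stalls(prog)}")
```

In this model the independent sequence never waits, while every back-to-back dependency costs a cycle. A real pipeline with forwarding would behave differently in detail, but the trend is the same.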

Want a superscalar processor for maximum performance? Take a serious look at predicated instructions. Predicated instructions are almost mandatory for a processor capable of fetching and executing multiple instructions at a time. The issue they deal with is the penalty paid when a branch is mispredicted. Predicated instructions eliminate some of the branches from the instruction stream, and therefore eliminate some of the branch misses that would occur. Branch misses are expensive because in a superscalar processor a number of instructions have already been fetched, queued and issued by the time the miss is detected. On a branch miss the pipeline must be flushed and a new set of instructions fetched from memory. When a branch isn't present because of instruction predication, there is nothing to mispredict and no flush is needed, and hence performance is increased.
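A rough way to see why removing branches helps is a simple cycle-count model. The sketch below is only an illustration under stated assumptions: every parameter value (branch fraction, miss rate, flush penalty, share of branches removed) is made up, and the model ignores the extra work done by predicated instructions whose predicate turns out false.

```python
def estimated_cycles(instructions, base_cpi, branch_fraction, miss_rate, miss_penalty):
    """Base execution cycles plus a pipeline-flush cost for every mispredicted branch."""
    mispredicted = instructions * branch_fraction * miss_rate
    return instructions * base_cpi + mispredicted * miss_penalty


def with_predication(instructions, base_cpi, branch_fraction, miss_rate,
                     miss_penalty, removed_fraction):
    """Same model after predication replaces `removed_fraction` of the branches.

    Simplification: the instruction count is left unchanged, i.e. each removed
    branch is assumed to be paid for by predicated instructions that always issue.
    """
    remaining = branch_fraction * (1.0 - removed_fraction)
    return estimated_cycles(instructions, base_cpi, remaining, miss_rate, miss_penalty)


# All numbers below are hypothetical and chosen only for illustration.
params = dict(instructions=1_000_000, base_cpi=0.75,
              branch_fraction=0.20, miss_rate=0.10, miss_penalty=8)
branchy = estimated_cycles(**params)
predicated = with_predication(removed_fraction=0.5, **params)
print(f"with branches:    {branchy:,.0f} cycles")
print(f"with predication: {predicated:,.0f} cycles")
```

The larger the flush penalty and the miss rate, the more each removed branch is worth, which is why the technique matters most on wide, deeply queued designs.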

As the performance level goes up, the clock frequency of the processor tends to go down. The following chart is for a hypothetical 64 bit processor.

Max Clock Frequency   Clocks per Instruction   MIPS   Logic Cells   Processor Architecture
100 MHz               3                        33     2000          Sequential, non-overlapped
60 MHz                1.5                      40     10000         Overlapped pipeline
40 MHz                0.75                     53     100000        Superscalar 2 way

The increase in the complexity of the processor makes the processor larger, and the maximum achievable clock rate is therefore lower. The overall performance of the processor nevertheless increases with complexity. In the chart, the superscalar processor delivers about 1.6 times the performance of the sequential processor (53 versus 33 MIPS) while operating at less than half the frequency.
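The MIPS column follows directly from the other two: MIPS is the clock frequency in MHz divided by the clocks per instruction. A minimal sketch that recomputes the column from the chart's figures and shows each design's performance relative to the sequential one (the function names are my own):

```python
# Rows copied from the chart: (architecture, max clock in MHz, clocks per instruction)
DESIGNS = [
    ("Sequential, non-overlapped", 100, 3.0),
    ("Overlapped pipeline", 60, 1.5),
    ("Superscalar 2 way", 40, 0.75),
]


def mips(clock_mhz, cpi):
    """Millions of instructions per second = clock rate in MHz / clocks per instruction."""
    return clock_mhz / cpi


def speedup_table(designs):
    """Return (name, MIPS, performance relative to the first design) for each row."""
    base = mips(designs[0][1], designs[0][2])
    return [(name, mips(f, cpi), mips(f, cpi) / base) for name, f, cpi in designs]


for name, rate, speedup in speedup_table(DESIGNS):
    print(f"{name:28s} {rate:6.1f} MIPS  {speedup:4.2f}x")
```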

 

 

 
Processors
Thor
  64 bit superscalar (work in progress). 2 way fetch, queue, issue, commit.
  Variable length instructions, eight entry queue.

RTF6809
  32 bit addressing version of the 6809. Backwards compatible with the 6809.

RTF65002
  32 bit CPU with 16 registers and a 65C02 emulation mode.
  See GitHub robfinch/Cores/RTF65002 for sources.

X11G
  11 bit CISC, 50+ MHz, 2800 LUTs.

C101
  32 bit RISC, 60 MHz, 2000 LUTs (incomplete).

rtf8088
  8088 compatible. 60 MHz! 5000 LUTs (latest synthesis, incomplete).

xxx32
  32-bit (work in progress).

Raptor64
  64-bit (work in progress).
  Multi-context processor, 7 stage overlapped pipeline.

Tripu32
  32-bit 3-way superscalar with 3 parallel pipelines. Executes a maximum of 3 instructions per clock cycle (untested).
  Source: Tripu32.v

bc6502
  8-bit, 6502 compatible.
  Tested and working! One known, easily fixed bug (flag pull from stack) was encountered.
  Source: bc6502.zip

FT816
  16-bit 65816 compatible core (running in an FPGA, working AFAICT).

bc65816
  16-bit 65816 compatible core (untested). I had several requests to post the code, even though it is untested.
  Source: bc65816.zip

bc65000
  16/32 bit, 68000 source code compatible.
  Hardcoded state machine (no microcode).
  Little endian. DBxx decrements the whole register.
  No MOVEP instruction. MOVEM works differently.
  Branches shift the displacement left once, doubling the branch range.
  Bus cycles are two clock cycles rather than the 68000's four, making the processor twice as fast at the same clock.
  Reset SSP is fetched from $FFFFFFF0, reset PC from $FFFFFFF4.
  50 MHz (= 100 MHz+ 68k), 20,000 LUTs.
  Source: bc65000.v

Butterfly
  32 bit RISC CPU.
  Small size, reasonable performance.
  Tested and working!

MMU
  Paged memory management unit. Maps virtual to real addresses using a 16 entry, 4-way set associative TLB (code not working yet, but it looks pretty).
  Source: mmu.v

bus_arb
  Five way SoC bus arbiter.
  Source: bus_arb.v