Finitron - HDL Artistry

 

 

 

September 17, 2024

Most recently I have been working on Q+, a superscalar processor core with vector registers, and came up with the following approach for implementing the RAT and register renamer.

 

Implementing a RAT in an FPGA with a Circular-List Renamer

© 2024 Robert Finch

 

Introduction

The RAT, or register alias table, is a component in a superscalar design that tracks mappings between architectural (logical) register numbers and physical register numbers. Mapping logical registers onto physical ones removes false dependencies between instructions. The RAT typically must operate within one processor clock cycle to avoid increasing the depth of the pipeline. The register renamer is the component that provides a physical register name for a logical one.

 

RAT Dependency Logic

WAW (write-after-write) and RAW (read-after-write) dependencies must be detected and accounted for when mapping registers. This typically involves priority logic so that the newest register mapping takes precedence over older ones when loaded into the RAT. The prioritizing logic acts as a spatial multiplexer. By making use of FPGA characteristics, this can be turned into a temporal prioritizer, which is more hardware efficient: the prioritizing logic can be removed, leaving only the multiplexer.

 

Implementation

RAMs in an FPGA are just about as fast as logic, given that the logic itself is implemented with RAMs. So, within a certain logic depth, the RAM ports can be prioritized temporally while still meeting the CPU clock requirements. In the Q+ design, the RAT write ports are multiplexed using a five-times overclock of the CPU clock, providing up to four write ports per CPU clock cycle. This takes the place of priority encoding the inputs and handles WAW dependencies. By carefully choosing the order in which the ports are processed, newer RAT entries overwrite older ones, so the RAT mappings always reflect the newest association. Older mappings are still loaded into the RAT, but they are effectively overwritten by newer ones.
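The write-ordering idea can be sketched behaviorally (illustrative Python, not the Q+ RTL; the names here are mine). Four write requests arrive per CPU clock, and applying them oldest-first, one per overclock sub-cycle, means the newest mapping for a logical register simply overwrites older ones with no priority encoder:

```python
def apply_rat_writes(rat, writes):
    """Apply one CPU cycle's worth of RAT writes.

    `writes` is ordered oldest-first; one write is applied per overclock
    sub-cycle, so the newest mapping for a logical register overwrites
    any older one (WAW resolved purely by write order)."""
    for logical, physical in writes:
        rat[logical] = physical
    return rat

rat = {}
# Two instructions in the same group write logical r3 (a WAW conflict):
apply_rat_writes(rat, [(3, 70), (5, 71), (3, 72), (8, 73)])
assert rat[3] == 72   # the newest mapping survives
```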

 

RAM Reduction

Temporally prioritizing the write ports reduces the amount of RAM required to represent the RAT by a factor of four. In Q+ the RAT requires eight write ports; the multiplexing reduces this to two. A RAM with two write ports is easily handled by duplicating the RAM and using a live-value table to track which copy contains the valid mapping. A side effect of significantly reducing the RAM requirements is that the RAT is smaller, which may help timing.
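A live-value table can be modeled behaviorally like this (illustrative Python; the real design duplicates block RAMs, one copy per write port, with the LVT steering the read mux):

```python
class LVTRam:
    """Two write ports built from two single-write-port RAM copies plus
    a live-value table recording which copy holds the newest value."""

    def __init__(self, depth):
        self.bank = [[0] * depth, [0] * depth]  # one RAM copy per write port
        self.lvt = [0] * depth                  # which bank is live, per entry

    def write(self, port, addr, data):
        self.bank[port][addr] = data
        self.lvt[addr] = port                   # remember where the live value went

    def read(self, addr):
        return self.bank[self.lvt[addr]][addr]  # LVT steers the read mux

ram = LVTRam(16)
ram.write(0, 5, 111)   # port 0 writes address 5
ram.write(1, 5, 222)   # port 1 later overwrites address 5
assert ram.read(5) == 222
```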

 

Circular List (CL) Renamer

Regarding the CL-renamer: the typical FIFO-based approach allows "perfect" allocation of registers that never stalls the pipeline, because registers are always available from a pool. For Q+ an imperfect, economy renamer is used, which needs a minimum number of LUTs. It is implemented as a simple circular list of registers using FPGA shift registers rather than a FIFO. The register list has an available bit associated with each register. Every clock cycle when a register name is needed, the list is rotated. The register name rotated into view may not be available, in which case the CPU's pipeline is stalled and the register list is rotated again. Because the oldest registers are rotated into view first, they are typically available for allocation, having been freed at the commit stage long ago. The rename list holds ¼ of the physical registers, and four separate lists are used to make rename registers available for four targets each clock cycle. Because there are 127 registers in each list, it is 127 instruction groups before the same register comes up for allocation again; in most cases the register will already be free. The implementation is not far enough along to measure the impact of this sort of renamer.
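A rough behavioral model of the circular list (Python sketch, not the shift-register RTL; the real lists hold 127 registers each, a 4-entry ring keeps the demo small):

```python
from collections import deque

class CLRenamer:
    def __init__(self, regs):
        self.ring = deque((r, True) for r in regs)  # (register, available)

    def allocate(self):
        """Rotate until an available register comes into view.

        Returns (register, stall_cycles); each failed rotation models a
        pipeline stall. (A completely full ring would spin forever here;
        the hardware simply keeps stalling.)"""
        stalls = 0
        while True:
            reg, available = self.ring[0]
            self.ring.rotate(-1)                    # rotate the list
            if available:
                self.ring[-1] = (reg, False)        # mark as allocated
                return reg, stalls
            stalls += 1

    def free(self, reg):
        """Mark a register available again (done at the commit stage)."""
        for i, (r, _) in enumerate(self.ring):
            if r == reg:
                self.ring[i] = (r, True)
                return

ren = CLRenamer(range(4))
assert ren.allocate() == (0, 0)   # oldest register comes up first, no stall
assert ren.allocate() == (1, 0)
```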

 

Renamer Overclock

The circular-list renamer (CL-renamer) is simple enough that it has the potential to operate at a multiple of the CPU clock rate; FPGA shift registers are extremely fast. This would allow stalls to be avoided by skipping over registers not available for allocation while remaining within the same CPU clock cycle.

 

Software Stall Removal

Another approach to avoiding stalls in the CL-renamer would be for the compiler to insert MOV instructions periodically for long-lived values. For instance, if a value is resident in a register for more than 100 instructions, it could be moved to the same register, forcing a rename operation and releasing the old physical register. However, this would require the compiler to know about the move instruction during the optimization stage, so that the MOV does not get optimized away.

 

March 25, 2024

Is there a calculus of rules? I asked this question on reddit a while ago and got a couple of references to various forms of calculus.

https://www.reddit.com/r/calculus/comments/118qoqp/is_there_a_calculus_of_rules/

 

Rather than post my notion on reddit again, I thought I would just put it on my webpage. Basically, the idea comes from the fact that in mathematics there are rules, and then more general rules. If there are rules, and more general rules, is it possible to graduate between the rules at a fine level, as is done for other forms of calculus?

 

Just thinking about the calculus of rules some more, and noting that irrational numbers are defined by rules. The difference between two irrational numbers can be taken and might be used to establish a set of rules in between. So, take the difference between pi and e for instance, where f_pi() is the rule determining pi and f_e() is the rule determining e. Then f_pi() − f_e() = f_pime(), and f_e() plus this difference in rules, f_pime(), would be equal to f_pi(). One could differentiate the rule to infinity by taking the limit as the size of the differences in rule approaches zero: the limit of (f_pi() − f_e())/n as n goes to infinity. I think that would result in a continuous set of rules between e and pi. <- I think this may just be a calculus of variances though, not of rules.

 

December 15, 2023

Started working on a new CPU core in November called Qupls, or Q+. It is a four-way out-of-order superscalar core taking a different approach than the previous Thor cores. While being four-way and having a 32-entry ROB, the core is smaller than the two-way Thor core. The core supports a 64-entry general purpose register file. Instructions are placed in blocks and processed in groups of four. More documentation for the core can be found on GitHub at: https://github.com/robfinch/Qupls/tree/main/doc

 

Also updated the website to remove advertising. It was not generating any revenue and maintaining it is a pita.

 

March 20, 2022

I have been working on Thor2022 since January. The number of registers was reduced from Thor2021's 64 to 32, and a few of the instructions were modified; many of the opcodes remain the same. My most recent work has been on a hash table for virtual memory. The hash table is implemented entirely with block RAM, using about 1/3 of the RAM in the FPGA. The virtual memory pages are 64kB in size, which means the translation table is reasonably small for the 512MB RAM on the board. There are 16384 entries allowed for in the hash table, grouped together into groups of eight, so there are 2048 groups of entries. A hash of the virtual address selects a group of entries; then all eight entries are searched in parallel for a translation match. Given the large size of a page, I came up with a way of using 1kB sections of the page. Multiple virtual addresses may map to different sections in the same physical page. It has kept me busy for a while.
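A behavioral sketch of the lookup (Python; the hash function and field widths here are my own illustrative choices, not the actual design's):

```python
PAGE_SHIFT = 16          # 64 kB pages
GROUPS = 2048            # 16384 entries / 8 per group

def vpn_hash(vpn):
    # Placeholder hash; the real design's hash function isn't shown here.
    return (vpn ^ (vpn >> 11)) % GROUPS

table = [[None] * 8 for _ in range(GROUPS)]   # groups of (vpn, ppn) entries

def insert(vpn, ppn):
    group = table[vpn_hash(vpn)]
    for i, entry in enumerate(group):
        if entry is None:
            group[i] = (vpn, ppn)
            return True
    return False          # group full: would need eviction or a fault

def translate(vaddr):
    vpn = vaddr >> PAGE_SHIFT
    for entry in table[vpn_hash(vpn)]:        # 8-way parallel compare in HW
        if entry and entry[0] == vpn:
            return (entry[1] << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
    return None           # translation miss

insert(0x123, 0x45)
assert translate((0x123 << 16) | 0xBEEF) == (0x45 << 16) | 0xBEEF
```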

 

December 7, 2021

Most recently I have been working on the Thor processing core, version 2021. I post in a blog style almost daily at anycpu.org. Thor is another 64-bit processor. The ISA features support for vector operations. Instructions are 16, 32, 48, or 64 bits, but most instructions are 48-bit. There are 64 GPRs. The project is located on GitHub.

 

April 13, 2020

Added a chapter from a book I’ve been working on to the flash fiction section of the website. Lately I’ve been delving into number formats, in particular the posit number format, which promises higher accuracy and greater dynamic range than regular floating-point numbers.

 

February 27, 2020

I’ve been browsing the S100Computers.com site. Wanting to expand the FPGA computer into multiple boards, I was searching for a suitable bus standard to connect the boards. Several bus standards come to mind, PCI Express being the most recent. I like the high-speed serial nature of PCI Express; it makes a lot of sense to save pin counts, and it offers great performance as well. It is, however, somewhat of a closed bus standard. Other bus standards that come to mind are PCI, ISA, NuBus, and the ECB bus. I studied them all. They all have features I like and dislike. I’d like to see a bus with 48-bit addressing, for instance. It’s possible to modify the bus standards slightly to get 48-bit addressing. I’m leaning towards developing my own bus standard for hobby use. I want prototype boards to be small, and several of the bus standards have connectors that are too large. I’ve sketched out a high-speed serial bus that uses HDMI transceivers. Not as fast as PCI Express, but hopefully still useful. The interesting thing about using HDMI is the clock and signal recovery capabilities. Using differential pair signalling (TMDS) should help with noise immunity. Just a thought for now.

 

February 9, 2020

Most recently I have been working on the Petajon project, a 64-bit machine, with the intention of debugging the Femtiki operating system. Less recently, I created a 32-bit version of the 6522, via6522, to go along with the uart6551. As usual the cores are under my GitHub account.

 

July 15, 2019

The most recent update of the website is a reference to the uart6551 core on the 6502 page. The uart core is 32-bit, with the low-order eight bits register-compatible with a 6551. It features FIFOs (not present on a 6551) and extended baud rate selection.

 

June 8, 2019

I spent some time researching and coming up with versions of a reciprocal square root estimate. The estimate is used in computer graphics, in particular for shading effects. One implementation of reciprocal square root was used in the Quake game. I coded a pretty direct version of this algorithm in Verilog using floating-point components previously developed for multiply and add/subtract. Then I created a second version using a state machine to eliminate a couple of the multipliers by using them sequentially. Finally, I came up with a table version of the reciprocal square root estimate that's not based on the algorithm. The table lookup version is the smallest and fastest core; it's also the least accurate, though still accurate to better than 3%. All three versions were placed in the same file and are selectable via constant definitions. The code for the core is available on my GitHub account under the rtfItanium core, floating-point unit.
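For reference, the Quake-style reciprocal square root estimate looks roughly like this when ported to Python (the magic constant and single Newton-Raphson step are the well-known trick; this is not the Verilog version):

```python
import struct

def fast_rsqrt(x):
    """Quake-style 1/sqrt(x) estimate for positive normal floats."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]  # float32 bits as uint32
    i = 0x5f3759df - (i >> 1)                         # magic initial guess
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    y = y * (1.5 - 0.5 * x * y * y)                   # one Newton-Raphson step
    return y

est = fast_rsqrt(4.0)
assert abs(est - 0.5) / 0.5 < 0.01   # within ~1% after one iteration
```

With one Newton step the estimate is good to roughly 0.2%, comfortably inside the 3% figure quoted for the table version.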

 

The table lookup version makes use of two tables and conditional logic which forces the output to infinity for suitably small values of input. It's interesting because a 16k-entry table was compressed down to 8k entries by noting where values were infinite. At entry 8129 of 16384 and beyond, values become meaningful. The 64 entries before 8192 are stored in a separate small table (68 entries), then entries 8192 to 16383 are stored in an 8192-entry table. What is stored? The tables can be compressed by noting that the estimate won't be accurate to more than six bits, because that's all the mantissa bits that can be used to index into the table. So the table only stores nine bits of the mantissa of the estimate. Seven bits of the table are used to store the exponent of the estimate. In almost all cases only seven bits of the exponent need to be stored, because the top bit is set to zero by the square root operation. The cases where this isn't true are fit into the smaller 68-entry lookup table.

(Figure: reciprocal square root estimate tables)

 

 

June 3, 2019

It’s been a while since I updated the site. I’ve been busy working on the FT64 and then the rtfItanium. The rtfItanium is a three-way superscalar core capable of executing up to three instructions at a time. In some ways it’s simpler than the FT64, though. It makes use of 40-bit instructions in a 128-bit bundle with an 8-bit template. The instruction decoder is simpler than in FT64. The source code for the core is located under my GitHub account.

October 8, 2018

The Goldschmidt divider is fast and reasonable to implement. It makes use of two multipliers and a shift register. The Goldschmidt divider is described adequately in several different webpages. A key to getting the divider to work well is to choose the first factor carefully.
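As a reference for the algorithm (not the FPGA datapath, which would work on fixed-point mantissas and derive the first factor from a table), the Goldschmidt iteration looks like this: normalize so the divisor is near 1, then repeatedly multiply both operands by F = 2 − d; d converges quadratically to 1 and n converges to the quotient. The "first factor" remark above corresponds to how close the initial d is to 1.

```python
import math

def goldschmidt_div(num, den, iterations=6):
    """Goldschmidt division sketch (assumes a positive divisor).

    Each iteration uses two multiplies, matching the two hardware
    multipliers; convergence is quadratic in (1 - d)."""
    e = math.frexp(den)[1]          # den = m * 2**e with m in [0.5, 1)
    d = den / (2.0 ** e)            # normalized divisor, close to 1
    n = num / (2.0 ** e)            # numerator scaled the same way
    for _ in range(iterations):
        f = 2.0 - d                 # the Goldschmidt factor
        n *= f                      # multiply 1
        d *= f                      # multiply 2
    return n                        # d -> 1, so n -> num/den

assert abs(goldschmidt_div(355.0, 113.0) - 355.0 / 113.0) < 1e-9
```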

The FT64 project is coming along nicely. A test system has been built and loaded into an FPGA. It’s still a challenge to get everything working. The test system includes a GPU which executes a subset of the FT64 instruction set.

 

March 12, 2018

Most recently I've been working on a project to emulate the 6567 chip using an FPGA and Verilog code. The project is called FAL6567, and the project outputs are in my GitHub account. Part of the project is a 65816-based low-power computer.

Slightly less recently, the FT64 project was born. In its current rendition it's a two-way super-barrel processor with 32 hardware threads. FT64 uses a 36-bit instruction encoding. Also on GitHub.

October 13, 2017

I've stayed away from the tech side of things for a couple of months now. It's good to take a break from whatever you're doing once in a while. I've been busy playing video games, setting up a stock trading account, and doing illustrations for a book. The most recently added flash fiction is a story called "Birth of a Mutant". Nothing like a home-made nuclear reactor to warm up those cold winter nights. Like most of my stories it's a mash of real and fictional events.

June 30, 2017

Most recently I've been experimenting with grid computers. FPGAs are large enough to support multiple processing cores. The first grid computer was made from multiple (56) Butterfly16 cores. Each node in the grid has access to 8kB of RAM and ROM along with a router. The grid computer doesn't do much at the moment besides display the results of ping operations from the master to other nodes.

January 8, 2017

Added an updated PSG32 (programmable sound generator, or sound interface device) to the audio cores section. It uses 32-bit frequency accumulators rather than 24-bit, allowing a higher range of input clock frequencies to be used. Registers are now 32 bits wide. A couple of new features have been added to the core, including FM synthesis and reverse sawtooth waveforms.
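The frequency-accumulator arithmetic works out as follows (the 50 MHz input clock here is an assumed example, not necessarily the core's actual clock): an N-bit accumulator incremented by `inc` every clock produces a tone of f_clk · inc / 2^N.

```python
ACC_BITS = 32                # PSG32's accumulator width
F_CLK = 50_000_000           # assumed 50 MHz input clock (illustrative)

def increment_for(freq_hz):
    """Phase increment producing the requested output frequency."""
    return round(freq_hz * (1 << ACC_BITS) / F_CLK)

def output_freq(increment):
    """Actual tone frequency for a given phase increment."""
    return F_CLK * increment / (1 << ACC_BITS)

inc = increment_for(440.0)                   # concert A
assert abs(output_freq(inc) - 440.0) < 0.01  # ~0.0116 Hz resolution at 50 MHz
```

The wider 32-bit accumulator is what buys the fine resolution: at the same clock, a 24-bit accumulator's step would be 256 times coarser.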

December 13, 2016

I've moved back to playing with "firm" ware. My most recent endeavour is the DSD7 (Dark Star * Dragon Seven) core. It's a 32-bit core with 80-bit extended double precision arithmetic. I really wanted the core to help validate an FPU. It seems to work okay. I'd like the next core to support the 80-bit format, and one thought is to just make the entire core 80 bits. There are a lot of problems dealing with 80-bit quantities when memory is often a power of two in size (32/64/128 bits). For DSD7 the problem was dealt with by using triple precision (96 bits) as a storage format even though the hardware only supports 80 bits. Using 96 bits rather than 80 bits had the benefit of keeping the stack word aligned. Why not just use 64-bit double precision? Sometimes it doesn't have enough precision; it's only about 16 digits. Suppose you want to work with numbers accurate to six decimal places: that leaves only 10 digits to the left of the decimal available. In some circumstances that isn't quite enough. 80-bit precision gives a few extra digits. One thought I have for an FPGA-based FPU is to use 88-bit precision. The multipliers in the FPGA can produce efficient 72-bit results, which would be good for the mantissa. That's about 21 digits. A 72-bit mantissa + 15-bit exponent + 1 sign bit is 88 bits. A larger exponent isn't really needed; the 128-bit IEEE format uses only a 15-bit exponent. Going with more mantissa bits in an FPGA uses resources less efficiently.
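The ~16-digit limit mentioned above can be checked numerically:

```python
import math

# An IEEE-754 double carries a 53-bit mantissa, about 15.95 decimal
# digits of significance in total.
assert 15.9 < 53 * math.log10(2) < 16.0

# With 12 digits to the left of the decimal point, increments in the
# sixth decimal place fall below half an ulp and are silently lost:
x = 123456789012.123456
assert x + 0.000001 == x
```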

October 16, 2016

The latest batch of work has been on a simple .MNG file viewer. It is capable of viewing MNG files in the simplest format. Finray has been extended with the ability to loop back and parse multiple frames of information. From this it can generate simple animation. It stores sequences of PNG bitmaps which can then be loaded with FNG (The MNG file viewer) and turned into simple MNG files.

April 27, 2016

For the past couple of weeks bitmap controllers have been on my mind. It's amazing how something fundamentally simple can get to be fairly complex. The basic operation can be summed up in a single line: a bitmap controller reads through memory in a linear fashion and outputs to a display. However, once you throw in options to support multiple display resolutions and color depths, things start to get complex. For the latest bitmap controller, added on top of the simple display capability is pixel plotting and fetching. Pixel plot / fetch is a reasonable operation to perform in a bitmap controller, as bus arbitration for memory is already present. Depending on the color depth, pixels may fit unevenly into memory locations. This can result in complicated software to fetch or store a pixel. Software can be made simpler by providing a hardware pixel plot and fetch.

March 08, 2016

I've been experimenting with ray-tracing and have come up with a "simple" ray-tracing program. The program uses a ray-tracing script file (.finray) to generate images. The script language supports generation of random vectors so that random colors and positions may be used. It also supports composite objects and repeat blocks: the display of a group of objects may be repeated a number of times. The image below shows some sample output.

March 03, 2016

I took a break from FPGA cpu's for a bit to develop some games. I created a rendition of the venerable asteroids game. It's available for download in the software directory.

Jan 09, 2016

I've been experimenting with error-correction for the memory components of the latest system. I found a bad bit in the host system and the way to work around it was to use error correcting memory components. The diagram below shows the error correction associated with DRAM memory. It stores an eight bit byte plus five syndrome bits in a sixteen bit memory cell. The reason I chose to error correct on a byte basis rather than a word basis is that correcting on a byte basis doesn't require implementation of read-modify-write cycles.

Once error checking is included, there is some justification for using bytes larger than eight bits. A five-bit syndrome can provide error correction information for up to eleven data bits, not just eight. Eleven data bits plus five bits for error checking fit nicely into 16 bits. One would likely be using a 16-bit path to store an eight-bit byte plus five-bit syndrome to memory anyway, so why not use all the bits and go with eleven-bit bytes instead?
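A toy Hamming code shows where those numbers come from: r check bits correct single-bit errors in up to 2^r − r − 1 data bits, so 4 Hamming bits cover 11 data bits, and the fifth bit in the byte scheme above is the overall parity that upgrades single-error correction to SEC-DED. This is an illustrative Python sketch of the SEC part, not the DRAM controller logic:

```python
def hamming_encode(data_bits):
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # number of check bits needed
        r += 1
    code = [0] * (m + r + 1)         # 1-indexed codeword; index 0 unused
    j = 0
    for i in range(1, m + r + 1):
        if i & (i - 1):              # non-power-of-two positions hold data
            code[i] = data_bits[j]
            j += 1
    for k in range(r):               # power-of-two positions hold parity
        p = 1 << k
        code[p] = sum(code[i] for i in range(1, m + r + 1) if i & p) % 2
    return code[1:]

def hamming_correct(codeword):
    code = [0] + list(codeword)
    syndrome, p = 0, 1
    while p <= len(codeword):
        if sum(code[i] for i in range(1, len(code)) if i & p) % 2:
            syndrome |= p            # this parity group failed
        p <<= 1
    if syndrome:
        code[syndrome] ^= 1          # the syndrome points at the bad bit
    return code[1:]

data = [1, 0, 1, 1, 0, 0, 1, 0]      # 8 data bits -> 12-bit SEC codeword
cw = hamming_encode(data)
cw[5] ^= 1                           # inject a single-bit error
assert hamming_correct(cw) == hamming_encode(data)
```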

 

October 24, 2015

Most recently I've been working on porting Fig Forth 6502 to the RTF6809 and converting it to use 32-bit Forth words. It doesn't quite work yet, but it's close. Forth is an interpretive computer language. I hope to have it make use of the RTF6809's 32-bit address space. The work is posted on my GitHub account.

June 25, 2015

I've started yet another FPGA processor project called Dark Star Dragon One (D.S.D.1), featuring variable-length instructions, segment registers, branch registers, and multiple condition code registers. Yes, this does mean I'm shelving the FISA64 project for now.

The author is of the opinion that any serious processor will have variable length instructions; the improvement in code density and cache usage is just too great to pass up. 16-bit instructions were added to FISA64 and improved code density by about 20%. Having an inherently variable length architecture should improve things even more.

Segment registers do get used in general purpose applications. DSD1 will be reusing some of the segmentation model from the Table888 project.

The branch register set is really just a collection of registers that are specially defined in most instruction sets. This set includes the program counter, the exception program counter, the return address register, and others. In this design they are given their own explicit register array.

April 21, 2015

FISA64 is continuing to occupy my time. I've been posting about it frequently in BLog style at anycpu.org. Yesterday's work was on the compiler try/catch mechanism and getting CTRL-C events to be handed to tasks. In the past month I've written a system emulator for the FISA64 test system and have been using it to test out software. I added then removed bounds registers from the processor design, then added a simpler check (CHK) instruction instead.

March 15, 2015

Tonight’s quandary is a design decision that leaves the same FISA64 branch instruction branching to one of two different locations depending on whether or not it’s predicted taken. FISA64 makes use of immediate prefixes to extend immediate values beyond the 15-bit limit set by the instruction format.

Branch instructions can’t make proper use of an immediate prefix because, to keep the hardware simpler, they don’t detect an immediate prefix at the IF stage. (There is no requirement for conditional branching beyond a 15-bit range.) However, a branch instruction uses the same immediate value that is calculated for other instructions in the EX stage. This could lead to a branch going to two different locations if an immediate prefix is used with it.

For example, suppose a prefix is used with a branch, BEQ *+$100010 for instance (the $100000 displacement would require a prefix). Then the branch will branch to *+$10 if it is predicted taken (ignoring the prefix), but to *+$100010 if it’s predicted not taken and then taken later in the EX stage.

If the branch is predicted taken, it branches using the 15-bit displacement field from the instruction. If the branch is predicted not taken, but is taken later in the EX stage, it branches using the full immediate value, which with prefixes could be up to 64 bits. The solution is that the assembler never outputs branches with prefixes. There is no hardware protection against using an immediate prefix with a branch.

In the IF stage, rather than look at the previous instructions for an immediate prefix, the processor simply ignores the fact a prefix is present and sign-extends the branch displacement in the instruction without taking the prefix into account.

IF stage:

    if (iopcode==`Bcc && predict_taken) begin
        pc <= pc + {{47{insn[31]}},insn[31:17],2'b00};   // Ignores potential immediate prefix
        dbranch_taken <= TRUE;
    end

However, the EX stage uses a full immediate including any prefix, also to simplify hardware.

EX stage:

    `Bcc:   if (takb & !xbranch_taken)
                update_pc(xpc + {imm,2'b00});   // This uses a "full" immediate value

 

December 28, 2014

Addressing modes in a modern processor are boring. For the typical RISC processor only a single address mode is supported, because it's the minimum needed. That address mode is register indirect with displacement: a register is added to a displacement to form the memory address. Sometimes indexed addressing using two registers is also supported. Few new processors have memory indirect addressing modes available. The plethora of addressing modes on an older processor like the 680x0 series made the processor interesting. The key benefit of memory indirect addressing modes is that they allow pointers stored in memory to be larger than the size of a register. This is put to good use in the 6502 processor. In the latest ISA, FISA64, memory indirect address modes are available to experiment with. A 128-bit address space is supported using the memory indirect address modes.
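To make the memory-indirect idea concrete, here is a behavioral sketch (Python; the 16-bit pointer on a machine with 8-bit registers mirrors the 6502's (zp),Y mode, where the pointer in memory is wider than any register):

```python
mem = bytearray(65536)

def load_indirect(ptr_addr, index=0):
    """Memory-indirect load: the effective address is itself in memory."""
    # First memory access: fetch the 16-bit pointer (little-endian).
    target = mem[ptr_addr] | (mem[ptr_addr + 1] << 8)
    # Second memory access: fetch the operand at pointer + index.
    return mem[(target + index) & 0xFFFF]

mem[0x0010] = 0x00          # pointer low byte
mem[0x0011] = 0x80          # pointer high byte -> points at 0x8000
mem[0x8003] = 42
assert load_indirect(0x0010, index=3) == 42
```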

December 12, 2014

Tonight's lesson is one about clock gating. When a clock is gated it introduces a buffer delay to that clock tree. If the ungated version of the clock is also being used, the buffer delay in the ungated version needs to be matched with that of the gated clock. Otherwise, if the buffer delay isn't matched, the P&R tools may have a heck of a time trying to meet timing requirements.

December 10, 2014

IEEE standard for floating point isn't the simplest thing to get working, or so I'm finding out. I've spent some time recently working with floating point units both standard and non-standard. One can do a lot of computing without floating point. Many early micro-processors didn't support floating point at all. How to incorporate floating point into an older system using an eight bit micro came to mind. FT816Float is a memory mapped floating point device oriented towards byte oriented processing. It's a bit non-standard and makes use of a two's complement mantissa rather than a sign-magnitude one.

November 7,2014

Yet another ISA was born this past week. FISA64 is a 64-bit ISA that attempts to overcome the shortcomings encountered with the Scarerob-V ISA. Rather than having a segmentation model that works automatically behind the scenes, the FISA64 ISA requires "manual" manipulation of the segment registers. This is made possible by supporting two modes of operation: kernel and application. In kernel mode the address space is a flat, unsegmented one. This allows the segment values to be manipulated without affecting the processor's addressing. The segmentation model supports up to a 128-bit address. The processor does not support a paging system.

November 3, 2014

I spent the past week or so working on a new ISA. Well, I synthesised an implementation of it, and it's too big: 122% of the size of the FPGA. It's a shame because it had a nice segmentation and protection model, similar to the x86 series. Projects tend to get bigger with bug fixes, so there's no way to shoehorn it into the FPGA. So for now it's another project that's being shelved. Time to get back to a basic, simple 32-bit ISA. Why not RISC-V? I'm not overly fond of the ISA layout and the branch model. There are also fewer instructions than I like to see in the base model. Sure, the ISA can be extended with brownfield or greenfield extensions, but then there's the issue of compatibility. If one is going to go to the trouble of extending the ISA and developing toolset changes to support the extended ISA, why not just start one's own ISA? The point of using an existing ISA is to leverage its toolset.

October 31, 2014

Scary Halloween. They're back: the nightmare of segment registers. I wasn't going to include them in the latest ISA design, but I've changed my mind after reading up on how they are used in a modern OS. Normally segment registers (CS, DS, SS) are initialized to zero and left alone. However, other segment registers (FS, GS) can be used like an additional index register in an instruction to quickly point to thread-local storage and global storage areas. So I've added segment base registers to be used in this fashion to the latest ISA design. The latest ISA in the works is called Scarerob-V, given that it's Halloween, and other recent events. The Scarerob-V ISA makes use of variable length instructions which are much shorter than those of Table888.

October 22, 2014

The RISC-V ISA (riscv.org) has a lot going for it. Variable length instructions, extensibility with 32/64/ and 128 bit versions. A simple base ISA and a number of standard extensions. It seems to be one worth studying and I've spent some time studying this recently. It's become an implementation project on my todo list. The RISC-V ISA is an ISA that attempts to please all. It'll be interesting to see how well it works in practice.

October 4, 2014

A couple of Flash-Fiction stories have been added recently to the website. A page for character descriptions has also been added. The Finitron verse is slowly expanding.

August 19, 2014

Back to the drawing board. I've started working on yet another soft processor core, expanding my toybox further. The instruction set will be similar to Table888's. Support for a segmentation model is not going to be provided. Also dropped is index scaling on the indexed addressing mode. The new core will stay with a 40-bit fixed size opcode and 256 registers.

July 15, 2014

I've taken a break from my normal HDL artistry to work on a piece of software that generates artificial maps. The basic map generator is based on something called a Voronoi fracture map. The fracture map simulates lumps of matter composing the planet. Previously the map generator was based on a fractal generator which generated nice-looking maps, but they weren't very realistic (it placed mountains in the centre of continents). Now mountains are along the coast and where there is an extreme difference in elevation, more in line with reality.

July 10, 2014

Learning more about the .ELF file format and how to link object files together was the order of the day. .ELF files are a popular standard file format used to represent executable and relocatable files. I was looking at the extended ELF64 file format developed by HP/Intel with the intent of supporting the format for the Table888 project. The A64 assembler can output .ELF files in addition to binary and listing files. In theory the L64 linker can link together .ELF relocatable files produced by the assembler. It's the first time I ever wrote a linker, and there's still a couple of issues to resolve with it.

July 01, 2014

Got hung up on mnemonics. The compiler called the exclusive-or function XOR and the assembler recognized only EOR. A quick fix to the assembler allows it to recognize XOR as well as EOR for the same instruction. I can never make up my mind on that one, so I'll just support it both ways. All kinds of different mnemonics are used to represent essentially the same instructions in different assembly languages. Is that ADDC the same as the ADC in another instruction set? One has to research carefully sometimes while working with assembler code. Is SED set decimal mode or set the direction flag?

June 21, 2014

I needed something small and simple to test the C64 compiler with and I needed some sort of file system available for my system. Luckily I found ChaN's FatFs which fits the bill. ChaN's system provides all the basics for a FAT file system operating in an embedded system. All one needs to do is to supply a few interface routines to the low level disk access. I've been busy working towards a simple SD Card access system. My current goal is to be able to load and run a file from the card. After a few compiler fixes I've got as far as being able to display a directory. It's a slow going circus dance.

June 12, 2014

FPP (Finch's macro pre-processor) has been updated with some bug fixes. It's undergoing testing by compiling the MINIX system. The fixes include an operator precedence problem fix and a macro expansion bug fix. The pre-processor was originally written in 1992 so it's now 22 years old. Recent work has been on the C64 compiler, modifying it to support the Table888 processor.

June 8, 2014

Tonight's escape is clock throttling. Clock throttling, or controlling the clock rate, can be used to control power consumption. The lower the clock frequency, the less power is used; dynamic power, as we all know, is proportional to frequency. Being able to control power consumption is one place where a gated clock might be used. Generally speaking, gating clocks is not a good idea, but occasionally it is done. Fortunately the FPGA vendor provides a clock gate specifically for handling gated clocks. Incorporated into Table888 (the latest processor work) is a clock gating register. This register is filled with a pattern that controls the clock gate, for power control.

June 3, 2014

NOP ramps are my latest craze. In order to avoid really complicated hardware, the concept of NOP ramps can be used. I'm talking about what happens when instructions cross page boundaries in a system with memory management. The problem with instructions spanning page boundaries is classic. If there is a memory-management page miss, the instruction needs to be re-executed once the missing page is brought into memory. In order to ensure proper operation, both the missing page and the previous page need to be in memory. Re-executing instructions can be a non-trivial problem. Fortunately what I'm working on only has a handful of instructions that can cross page boundaries. Rather than attempt to re-execute the instructions, the assembler just forces the instruction into the next page of memory by inserting NOP instructions. Hence it's the NOP instructions that span the page boundary. If there is a need to re-execute them, it is trivial to do so. NOP ramp example:

00008FF0      41 F8 2A 90 00             bne           fl0,kbdi2

00008FF5      16 01 24 00 00             ldi           r1,#36

00008FFA      EA EA EA EA EA       ; imm

00009000      EA EA EA EA EA

00009005      FD 70 FF 03 10

0000900A      A0 00 01 00 18             sb            r1,LEDS
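
The assembler's decision can be sketched as follows. This is a simplified model that pads only up to the page boundary (the listing above shows the real assembler also carrying the ramp past it for the imm-prefixed pair), and the 4 KB page size is an assumption.

```c
#include <stdint.h>

#define PAGE_SIZE  4096u   /* assumed MMU page size; the real one may differ */
#define NOP_OPCODE 0xEAu   /* the NOP byte seen in the listing */

/* If an instruction of 'len' bytes starting at 'pc' would cross a page
   boundary, return the number of NOP bytes the assembler must emit so
   the instruction instead begins at the start of the next page. */
unsigned nop_ramp_len(uint32_t pc, unsigned len)
{
    uint32_t first_page = pc / PAGE_SIZE;
    uint32_t last_page  = (pc + len - 1) / PAGE_SIZE;
    if (first_page == last_page)
        return 0;                        /* fits in one page: no ramp   */
    return PAGE_SIZE - (pc % PAGE_SIZE); /* pad up to the page boundary */
}
```

In the listing above, the imm-prefixed pair that would have started at $8FFA gets rolled forward so only NOPs straddle the $9000 boundary.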

May 15, 2014

One can write a lot of code using just three registers if one codes in assembly language and is careful. With just a few registers to work with, a byte-code processor can offer high code density. This is great for microcontroller-type applications where memory space may be constrained. What if one wants more registers in order to support a compiled language? RISC processors were originally designed for high performance with compiled languages. The typical RISC processor uses a fixed-size instruction format. Unfortunately, one size does not fit all instructions, and the result is that code density for the typical RISC style suffers. To improve code density one can look at the typical operations performed and encode them in as few bits as possible. Allowing the size of instructions to vary in a design based on a RISC processor results in a kind of hybrid processor; the worst of both worlds: lower code density and higher complexity. Unfortunately processors become complex anyway when they have to support legacy systems. Trends for currently popular architectures include variable instruction sizes (ARM, Intel) and flags registers (ARM, Intel, SPARC). If one removes the limitation of a fixed-size instruction set, one can optimize instructions for code density. It's amazing how adequate a branch instruction composed of an eight-bit opcode and an eight-bit displacement is. This sixteen-bit instruction covers 90+% of the cases where a branch would be used. The RTF65003 strives for a good mix of legacy support while adding additional registers and increasing the addressing space. It is necessarily more complex than a new design.
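
The sixteen-bit branch mentioned above can be sketched as an encoder that succeeds when the displacement fits in eight signed bits and falls back to a longer form otherwise. The opcode value and the choice of displacement base (end of the two-byte instruction) are assumptions for illustration.

```c
#include <stdint.h>

/* Sketch: encode a two-byte branch (8-bit opcode + 8-bit signed
   displacement), the form that covers the vast majority of branches.
   Returns 1 and fills *out on success, 0 when a long branch is needed.
   Opcode and displacement base are illustrative assumptions. */
int encode_short_branch(uint8_t opcode, uint32_t pc, uint32_t target,
                        uint16_t *out)
{
    int32_t disp = (int32_t)(target - (pc + 2)); /* relative to next insn */
    if (disp < -128 || disp > 127)
        return 0;                                /* needs the long form  */
    *out = (uint16_t)((opcode << 8) | (disp & 0xFF));
    return 1;
}
```

An assembler built this way simply emits the long form for the rare branch that reaches farther than 127 bytes.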

May 14, 2014

What's better than the RTF65002? The RTF65003. There are several things I don't like about the RTF65002, so I started working on a better version. One item is the branch target address. In native mode on the RTF65002 the target address is computed relative to the address of the instruction; this is different from the '02 and '816, where the address is computed relative to the address of the next instruction. The '003 follows the convention set by the '02. Another issue is the RTF65002's differing code and data addressing. The RTF65002 is a word-addressed machine for most data operations, and this makes it difficult to use with a compiled language like 'C'. I decided, before putting a lot more work into porting software, to create an improved version of the processor. The RTF65003 has byte-addressable memory operations and greater support for different operand sizes. Byte (8-bit) or character (16-bit) prefix codes can be applied to memory operations to override the default of a word-sized operation. Prefix codes are used to modify the behaviour of following instructions rather than creating a whole bunch of rarely used instructions.
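
The prefix-code scheme can be sketched as a decoder that checks for a size prefix immediately before a memory operation. The opcode values and the one-instruction-only prefix rule here are invented for illustration, not the RTF65003's actual encodings.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical size-prefix decoding: a byte or character prefix overrides
   the default word size for the next memory operation only. Opcode values
   are made up for illustration. */
#define BYTE_PFX 0x01
#define CHAR_PFX 0x02

enum { SZ_BYTE = 1, SZ_CHAR = 2, SZ_WORD = 4 };

/* Returns the operand size of the memory op at stream[i], accounting for
   an immediately preceding prefix byte. */
int operand_size(const uint8_t *stream, size_t i)
{
    if (i > 0 && stream[i - 1] == BYTE_PFX) return SZ_BYTE;
    if (i > 0 && stream[i - 1] == CHAR_PFX) return SZ_CHAR;
    return SZ_WORD;   /* default: word-sized operation */
}
```

The appeal of prefixes is visible even in this toy: one prefix byte replaces what would otherwise be byte- and character-sized copies of every memory opcode.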

May 12, 2014

Compiled code for the RTF65002 generated signed multiply instructions, which hadn't been added to the processor. Two possible solutions were to change the compiler so that it generated code to perform sign adjustments, or to modify the processor to include signed multiplies. Signed multiplies had been left out on the assumption they would add too much overhead; I decided to try adding them anyway. Well, lo and behold, adding the functionality made the processor smaller and faster (by about 5%!). I guess adding the opcodes simplified the instruction decoder. Encouraged by this good fortune I decided to try adding signed division and modulus operations as well. Doing this had almost no impact on the size or speed of the processor. So the RTF65002 now supports both signed and unsigned multiply, divide, and modulus operations.
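
For contrast, the compiler-side alternative that was considered (emitting sign adjustments around unsigned operations) would look something like the following for divide and modulus. This is a generic sketch of the technique, not RTF65002 code, and it ignores the INT32_MIN / -1 overflow corner.

```c
#include <stdint.h>

/* Signed division synthesized from unsigned division plus sign fixes:
   the extra code a compiler must emit when the processor only has
   unsigned divide. (INT32_MIN / -1 overflow is ignored in this sketch.) */
int32_t sdiv_via_udiv(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)(a < 0 ? -a : a);
    uint32_t ub = (uint32_t)(b < 0 ? -b : b);
    uint32_t uq = ua / ub;                        /* the unsigned divide */
    return ((a < 0) != (b < 0)) ? -(int32_t)uq : (int32_t)uq;
}

int32_t smod_via_umod(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)(a < 0 ? -a : a);
    uint32_t ub = (uint32_t)(b < 0 ? -b : b);
    uint32_t ur = ua % ub;
    return (a < 0) ? -(int32_t)ur : (int32_t)ur;  /* sign follows dividend */
}
```

Every compare-and-negate here is a few instructions at every call site, which is why doing it once in hardware is attractive.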

May 11, 2014

I love today's machines. They make it possible to do things that were impossible on the machines of yesteryear. Take for example a string-handling library. The descriptors for the strings can consume more memory than ever before. The current string library I've got makes each string a member of an all-strings list, so that all the strings can be garbage collected at once. Making a list like this isn't practical on a small machine because it would consume too much in the way of memory resources. I also blithely load entire text files into strings rather than processing a line at a time. It seems like poor programming practice, but it's really in the interest of simple algorithms.
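
The all-strings list amounts to threading every allocation onto one global list so cleanup is a single sweep. A minimal sketch of the idea is below; the names and node layout are illustrative, not the actual library's.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the "all strings on one list" scheme: every allocated string
   is linked into a global list so the whole lot can be freed in one pass.
   Names and layout are illustrative, not the real library's. */
typedef struct str_node {
    struct str_node *next;   /* link in the all-strings list */
    char *text;
} str_node;

static str_node *all_strings = NULL;

char *str_new(const char *src)
{
    str_node *n = malloc(sizeof *n);
    n->text = malloc(strlen(src) + 1);
    strcpy(n->text, src);
    n->next = all_strings;   /* push onto the global list */
    all_strings = n;
    return n->text;
}

/* Garbage-collect every string at once. */
void str_free_all(void)
{
    while (all_strings) {
        str_node *n = all_strings;
        all_strings = n->next;
        free(n->text);
        free(n);
    }
}

int str_count(void)
{
    int c = 0;
    for (str_node *n = all_strings; n; n = n->next) c++;
    return c;
}
```

The cost is one pointer per string, which is exactly the overhead that made this impractical on small machines.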

May 9, 2014

In order to implement firstcall blocks in a compiled language, auto-converting branches are used. An auto-converting branch (ACBR) acts like a NOP instruction (a branch-never) plus a store the first time it is executed, and it changes itself into a branch-always (BRA) instruction for subsequent executions. For this to work properly any instruction cache has to be disabled; this is likely desirable anyway for code that executes only once, so that it doesn't fill up the cache. Shown below is a sample usage and the resulting compiled code.

// High level language

firstcall {

     printf("This appears the first time only.\r\n");

}

start_tick = get_tick();

; Generated assembler code:  

     icoff               ; turns off instruction cache

    acbr L_9         ; auto-converting branch into a bra

    ld r5,#L_3>>2  ; get parameter for printf

    push r5

    jsr printf         ; call the printf() routine

    sub sp,#-1      ; dump the parameter

    ld r5,r1

L_9:

This is an excerpt taken from a prime number sieve program written in C32, a C-like language. It has been successfully compiled and run on the RTF65002 processor. The program was compiled, assembled, and the resulting binary placed on an SD card. It was subsequently loaded and run as a task in the test system.
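
The ACBR's self-modifying behaviour can be captured in a toy interpreter: the first time the opcode executes it falls through and stores a BRA over itself; afterwards it always branches past the firstcall body. The opcode values and two-field instruction format are invented for this sketch.

```c
#include <stdint.h>

/* Toy model of the auto-converting branch: on first execution the ACBR
   opcode falls through (branch never) and stores a BRA opcode over
   itself; thereafter it branches always. Opcode values are invented. */
enum { OP_HALT = 0x00, OP_ACBR = 0x10, OP_BRA = 0x20, OP_INC = 0x30 };

/* Each instruction is {opcode, branch-target index}. Runs the program
   and returns how many times OP_INC (the firstcall body) executed. */
int run(uint8_t code[][2], int len)
{
    int pc = 0, incs = 0;
    while (pc < len) {
        uint8_t op = code[pc][0], arg = code[pc][1];
        switch (op) {
        case OP_ACBR:
            code[pc][0] = OP_BRA;   /* self-modify: becomes branch-always */
            pc++;                   /* first pass: acts as a NOP          */
            break;
        case OP_BRA:  pc = arg; break;
        case OP_INC:  incs++; pc++; break;
        case OP_HALT: return incs;
        default:      return incs;
        }
    }
    return incs;
}
```

Running the same program twice shows the firstcall semantics: the body executes on the first pass and is skipped on every later one, which is also why the instruction cache must be off while the store takes effect.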

May 5, 2014

Supermon816 is now running on the RTF65002 in 65C816 emulation mode. Supermon816 is a monitor program contributed by BDD (Big Dumb Dinosaur) at 6502.org. It allows one to assemble and disassemble programs, dump memory, search for data, and more. The program can be activated by entering 'SU' at the '$' prompt on the test system. The 65C816 is an 8/16-bit processor found in a few systems such as the Apple IIgs and SNES. The RTF65002 test system has its own monitor program for native mode, which is slightly different from Supermon. Supermon816 is one of the first programs to run in '816 mode, and it helped verify that the emulation is working correctly.