Interfacing to External Static RAM

Robert Finch

rplaskitti@birdcomputer.ca

March, 2010

 

Abstract

This document covers the interfacing of an external asynchronous static RAM to a system on a programmable chip (SoC). It is assumed that the system is moderately complex and that the RAM is shared between two or more system devices.

 

Introduction

Many systems need to make use of an external RAM as the memory provided on the programmable chip may be insufficient to meet the needs of the application. Static RAM is often used because it offers the highest speed, low power consumption and simple interfacing. For true random access cycles such as when multiple devices access the same RAM, static RAM provides the highest performance.

The external static RAM is likely to be shared by several devices in the system, for example a cpu and separate DMA channels for a video controller and a disk controller. For the best system performance, it is desirable to obtain maximum use of the RAM’s memory bandwidth. One way to minimize the cycle time of the RAM is to drive it directly from flip-flops (ff’s) without any intervening logic, and to receive the output of the RAM directly into ff’s, once again without any intervening logic. Placing the RAM directly between ff’s allows the registers located close to the I/O pads to be used, which minimizes the propagation delay in transferring data between the FPGA and the external static RAM.

 

Interfacing Options

There are several different ways to interface to an external RAM. Probably the simplest method is to drive the RAM inputs directly from the SoC address bus and to place the RAM outputs directly on the SoC databus, where directly means without registering the signals beforehand. The advantage of this approach is that it is fairly straightforward to implement; the disadvantage is that it can be severely lacking in performance. Without registering, a single RAM access cycle must cover address multiplexing, routing and decoding, the RAM access time itself, and then databus routing and input multiplexing. The sum of these delays gives a long cycle time, and that cycle time limits the performance of the whole system. One way to decrease the cycle time, and so increase performance, is to break the RAM access across multiple shorter cycles rather than having one long cycle. The following example works through the cycle time calculations.

Example (cycle time):

Access time for read access by a cpu (the long cycle time):

  1. Request from cpu 0 ns
  2. Address bus mux 10 ns (muxing in other address sources such as video controller, disk DMA)
  3. Routing delay 10 ns for the address bus (assuming a reasonably sized system)
  4. I/O pad delay 5 ns (out to the RAM)
  5. RAM read access time 15 ns
  6. I/O pad delay 5 ns (back onto the FPGA)
  7. Databus routing delay 10 ns
  8. Databus input mux delay 10 ns

Total time for read access by cpu: 65 ns. With a 16 bit SRAM (two bytes per access) this limits system performance to about 30 MB/s.

Breaking the access into multiple cycles reduces the cycle time, because the cycle time is then set by the slowest stage rather than by the sum of all the delays.

Access time for read using registered inputs and outputs:

Stage 1:

  1. Request from cpu 0 ns
  2. Address bus mux 10 ns
  3. Routing delay 10 ns

Stage 2:

  1. Latched address from previous stage 0 ns
  2. I/O pad delay 5 ns
  3. RAM read access time 15 ns
  4. I/O pad delay 5 ns

Stage 3:

  1. Databus routing delay 10 ns
  2. Databus input mux delay 10 ns

 

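As can be seen from stage 2, the longest stage delay (which sets the cycle time) is 25 ns; stages 1 and 3 each need only 20 ns.

Cycle time for read access: 25 ns. With a 16 bit SRAM this limits system performance to 80 MB/s. The total time for a single read is three cycles, or 75 ns.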

 

Simply by breaking the RAM access into stages, the cycle time has dropped from 65 ns to 25 ns, an improvement of about 2.6 times. However, it now requires three clock cycles to access the RAM.
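The arithmetic in the two examples above can be checked with a few lines of simulation-only Verilog. This is a minimal sketch: the module and variable names are made up for illustration, and the delay figures are simply the ones listed in the examples.

```verilog
// Simulation-only sketch; all names here are illustrative.
module cycle_time_calc;
    integer t_single;                     // ns, unregistered single-cycle path
    integer t_stage1, t_stage2, t_stage3; // ns, per-stage delays
    integer t_cycle;                      // ns, cycle time when staged
    real mb_single, mb_staged;            // MB/s for a 16 bit RAM

    initial begin
        // mux + routing + pad + RAM access + pad + routing + mux
        t_single = 10 + 10 + 5 + 15 + 5 + 10 + 10;
        t_stage1 = 10 + 10;       // address bus mux + routing
        t_stage2 = 5 + 15 + 5;    // pad + RAM access + pad
        t_stage3 = 10 + 10;       // databus routing + input mux
        // the cycle time is set by the slowest stage
        t_cycle = t_stage1;
        if (t_stage2 > t_cycle) t_cycle = t_stage2;
        if (t_stage3 > t_cycle) t_cycle = t_stage3;
        // two bytes per access; bytes per ns times 1000 gives MB/s
        mb_single = 2000.0 / t_single;
        mb_staged = 2000.0 / t_cycle;
        $display("single cycle: %0d ns, %0.1f MB/s", t_single, mb_single);
        $display("staged:       %0d ns, %0.1f MB/s", t_cycle, mb_staged);
        $display("staged read latency: %0d ns", 3 * t_cycle);
    end
endmodule
```

The staged bandwidth figure assumes that a new access starts on every clock cycle, which is the subject of the next section.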

 

Pipelining

It would seem that spreading the RAM access across three clock cycles, rather than using a single clock cycle that is roughly three times as long, wouldn't have any impact on performance, but it does. Even if nothing else is done to the RAM interface, the clock cycle time can now be well under half of what it was before (25 ns rather than 65 ns in the example above). This means that operations that don't use the external static RAM can now occur about 2.6 times as fast. For example, a cpu that executes some code from ROM internal to the chip can now benefit from the faster clock cycle.

Another way to significantly increase performance is to pipeline access to the RAM, which becomes possible because the access is spread across multiple clock cycles and the RAM inputs and outputs are registered. The RAM access itself still occurs within a single clock cycle, but that cycle is now much shorter because of the registering. The only problem is the delay caused by the multiple stages. However, with the access broken into stages, each stage can be working on a different RAM access. For instance, stage one could be servicing a request from a video controller while stage two is performing an access for the cpu and stage three is returning a result to a DMA controller. When multiple devices access the RAM, good use can be made of the RAM bandwidth even when the devices themselves are not pipelined. While one device (bus master) is waiting for a response from the RAM, the RAM subsystem can be busy accessing data for another device.
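One way to picture the overlap described above is as a short shift register of request tags that moves in step with the RAM access. The following is a minimal sketch, not a full controller. The module name req_tag_pipe and the registers tag1 and tag2 are illustrative, and it is assumed that at most one bit of req is set in any cycle.

```verilog
// Sketch only: tracks which device each in-flight access belongs to.
module req_tag_pipe(rst, clk, req, ack);
    input rst;
    input clk;
    input [1:0] req;   // one bit per device, at most one bit set per cycle
    output [1:0] ack;  // same encoding, set when that device's data is ready

    reg [1:0] tag1;    // request whose address is registered (RAM access in progress)
    reg [1:0] tag2;    // request whose read data has just been captured

    always @(posedge clk)
        if (rst) begin
            tag1 <= 2'b00;
            tag2 <= 2'b00;
        end
        else begin
            tag1 <= req;   // a new request may enter on every clock
            tag2 <= tag1;  // the previous request moves on one stage
        end

    assign ack = tag2;
endmodule
```

Because a new tag can enter on every clock, two different devices can have accesses in flight at the same time, and each receives its acknowledge two clock edges after its request was accepted.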

Example:

In a small SoC the devices consist of a non-pipelined 6502 compatible processor and a bitmapped video display controller. The video display controller requires approximately 50% of the RAM bandwidth (an access every other cycle), so the cpu may get access to the RAM every other cycle. However, because the cpu is not pipelined, it must wait for an acknowledge from the RAM before proceeding, and it takes three clock cycles from the time the cpu requests access until a ready response is received from the RAM. The video display controller is pipelined, so its accesses are effectively hidden: they are interspersed with the cpu accesses in a pipelined fashion. The result is to effectively double the system performance over what would be obtainable without pipelined RAM access.
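A fixed time-slot arbiter is one simple way to give the video controller every other cycle, as in the example above. The sketch below is an assumption about how such an arbiter might look, not the arbiter of any particular system, and every name in it (slot_arbiter, cpu_req, cpu_grant, vid_addr and so on) is hypothetical. It assumes that the cpu holds cpu_req and cpu_addr steady until it sees cpu_grant, and that the video controller wants a read in every one of its slots.

```verilog
// Hypothetical fixed time-slot arbiter: video and cpu alternate cycles.
module slot_arbiter(rst, clk, cpu_req, cpu_wr, cpu_addr, vid_addr,
    cpu_grant, cs, req, wr, addr);
    input rst;
    input clk;
    input cpu_req;          // cpu wants an access; held until cpu_grant
    input cpu_wr;           // cpu access is a write
    input [17:0] cpu_addr;
    input [17:0] vid_addr;  // next display fetch address
    output cpu_grant;       // cpu request accepted in this cycle
    output cs;              // select to the RAM controller
    output [1:0] req;       // bit 0 = cpu, bit 1 = video
    output wr;
    output [17:0] addr;

    reg slot;               // 0 = video slot, 1 = cpu slot

    always @(posedge clk)
        if (rst)
            slot <= 1'b0;
        else
            slot <= ~slot;

    assign cpu_grant = slot & cpu_req;
    assign cs = slot ? cpu_req : 1'b1;            // video always fetches in its slot
    assign req = slot ? {1'b0, cpu_req} : 2'b10;
    assign wr = cpu_grant & cpu_wr;               // only the cpu writes
    assign addr = slot ? cpu_addr : vid_addr;     // the address bus mux
endmodule
```

The outputs are meant to feed the cs, req, wr and addr inputs of a pipelined RAM controller. The mux on addr corresponds to the address bus mux counted in stage 1 of the timing example.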

One nice thing about pipelined RAM access is that write cycles may return a ready status in a single cycle, as soon as they are posted to the RAM subsystem. Once the write address and data are latched into the system, there is no need to wait any longer. The write will eventually take place.
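Posting a write can be sketched as a ready signal that does not wait for the RAM on writes. The fragment below is an assumption about the logic on the bus-master side rather than part of any particular controller; cpu_grant, cpu_wr, ram_ack and rdy are hypothetical names.

```verilog
// Hypothetical ready generation for a bus master with posted writes.
module posted_write_rdy(cpu_grant, cpu_wr, ram_ack, rdy);
    input cpu_grant;  // the cpu's request was accepted in this cycle
    input cpu_wr;     // the accepted request is a write
    input ram_ack;    // read data for the cpu is available
    output rdy;       // the cpu may proceed

    // a write is complete, from the cpu's point of view, as soon as it is
    // accepted; a read is complete when the RAM controller acknowledges it
    assign rdy = (cpu_grant & cpu_wr) | ram_ack;
endmodule
```

In a controller where every request passes through the same pipeline in the order it was accepted, a read that follows a posted write reaches the RAM after that write does, so the read should see the new data.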

 

Sample Code

The following RAM controller is written in Verilog and allows pipelined access to an external asynchronous static RAM. One feature of the RAM controller is that it uses a bit-vector to track which device is requesting access, and to which device an acknowledge should be sent. In this case there are only two devices (a cpu and a video controller). It is assumed that an acknowledge is not required for write accesses, as the external system will provide a write acknowledge as soon as the write is posted. The sample code also interfaces an eight bit SoC bus to an external sixteen bit RAM, so there is some multiplexing involved. There are two sets of input / output signals, those that interface to the SoC and those that interface to the RAM, so the controller acts as a kind of bridge.

This interface has been tested at 28.636 MHz; it will probably work at upwards of 40 MHz.

 

module RAMCtrl4(rst, clk, clk90, cs, req, ack, addr, wr, din, dout,
    ram_we0, ram_we1, ram_we, ram_oe, ram_ce, ram_a, ram_d);
    // system side connections
    input rst; // reset
    input clk; // system clock
    input clk90; // 90 deg. phase shifted clock for write timing
    input cs; // circuit select
    input [1:0] req; // identifies device requesting access
    output [1:0] ack; // identifies device for which data is available
    reg [1:0] ack;
    input [17:0] addr; // address
    input wr; // write signal
    input [7:0] din; // data input
    output [7:0] dout; // data output
    reg [7:0] dout;
    // RAM side connections
    output ram_we0; // low byte write
    output ram_we1; // high byte write
    output ram_we; // generic write
    output ram_oe; // output enable
    output ram_ce; // chip enable
    output [17:0] ram_a; // address
    inout [15:0] ram_d; // data

reg [1:0] req1; // holds request id for intermediate pipeline stage
reg ram_ce;
reg ram_oe;
reg we; // registered signals
reg we0, we1;
reg wr0, wr1;
reg [17:0] ram_a;
reg [7:0] dol; // data output latch
reg addr0; // intermediate: address bit zero

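// byte lane selects: even addresses write the low byte, odd the high byte
wire wr0x = wr & ~addr[0];
wire wr1x = wr & addr[0];
// drive the data bus only on the byte lane being written
assign ram_d[7:0] = wr0 ? dol : 8'bz;
assign ram_d[15:8] = wr1 ? dol : 8'bz;
// active low write strobes, gated with clk90 so that the pulse sits in the
// middle of the cycle, after the registered address and data have settled
assign ram_we = ~(we & clk90);
assign ram_we0 = ~(wr0 & clk90);
assign ram_we1 = ~(wr1 & clk90);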

always @(posedge clk)
    if (rst) begin
        we <= 0;
        wr0 <= 0;
        wr1 <= 0;
        ram_ce <= 1;
        ram_oe <= 1;
        ram_a <= 0;
        dol <= 0;
        req1 <= 2'b0;
        ack <= 2'b0;
        addr0 <= 0;
    end
    else begin
        // register RAM inputs
        // stage 1
        addr0 <= addr[0];
        ram_a <= {1'b0, addr[17:1]}; // word address; upper RAM address bit is always zero
        ram_ce <= ~cs;
        ram_oe <= wr;
        we <= wr;
        // On a write cycle we assume no ack is required
        req1 <= wr ? 2'b00 : cs ? req : 2'b00;
        dol <= din;
        wr0 <= wr0x;
        wr1 <= wr1x;
        // stage 2
        dout <= addr0 ? ram_d[15:8] : ram_d[7:0];
        ack <= req1;
    end

endmodule
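
As a usage example, the controller above can be exercised with a small testbench. This is a sketch under several assumptions: the behavioural RAM model (the lo and hi byte arrays) is a simplification with zero access time that covers only the 128K words the controller can address, the 36 ns clock period is chosen only because it divides evenly by four, and all testbench names are illustrative. It posts one byte write to an odd address, reads the byte back as device 0, and compares the result.

```verilog
`timescale 1ns / 1ns
// Illustrative testbench for RAMCtrl4; the RAM model is idealized.
module RAMCtrl4_tb;
    reg rst, clk, cs, wr;
    reg [1:0] req;
    reg [17:0] addr;
    reg [7:0] din;
    wire clk90;
    wire [1:0] ack;
    wire [7:0] dout;
    wire ram_we0, ram_we1, ram_we, ram_oe, ram_ce;
    wire [17:0] ram_a;
    wire [15:0] ram_d;

    // behavioural asynchronous RAM, kept as two byte lanes
    reg [7:0] lo [0:131071];
    reg [7:0] hi [0:131071];
    assign ram_d = (!ram_ce && !ram_oe) ?
        {hi[ram_a[16:0]], lo[ram_a[16:0]]} : 16'bz;
    // level-sensitive byte writes while chip enable and a write strobe are low
    always @(ram_ce or ram_we0 or ram_a or ram_d)
        if (!ram_ce && !ram_we0) lo[ram_a[16:0]] = ram_d[7:0];
    always @(ram_ce or ram_we1 or ram_a or ram_d)
        if (!ram_ce && !ram_we1) hi[ram_a[16:0]] = ram_d[15:8];

    RAMCtrl4 uut(.rst(rst), .clk(clk), .clk90(clk90), .cs(cs), .req(req),
        .ack(ack), .addr(addr), .wr(wr), .din(din), .dout(dout),
        .ram_we0(ram_we0), .ram_we1(ram_we1), .ram_we(ram_we),
        .ram_oe(ram_oe), .ram_ce(ram_ce), .ram_a(ram_a), .ram_d(ram_d));

    // 36 ns clock; clk90 lags clk by a quarter period
    initial clk = 0;
    always #18 clk = ~clk;
    assign #9 clk90 = clk;

    initial begin
        rst = 1; cs = 0; wr = 0; req = 2'b00; addr = 0; din = 0;
        repeat (2) @(negedge clk);
        rst = 0;
        // post a write of A5 to byte address 3 (high byte of word 1)
        cs = 1; wr = 1; addr = 18'd3; din = 8'hA5;
        @(negedge clk);
        // read the same byte back as device 0
        wr = 0; req = 2'b01;
        @(negedge clk);
        cs = 0; req = 2'b00;
        wait (ack[0]);
        #1;
        if (dout === 8'hA5) $display("PASS: read back %h", dout);
        else $display("FAIL: read back %h", dout);
        $finish;
    end
endmodule
```

The stimulus changes on the falling clock edge so that the controller samples stable inputs on the rising edge. A real asynchronous RAM has a nonzero access time and write pulse requirements, so timing should still be checked against the data sheet of the part actually used.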