The mMIPS (= mini MIPS with network interface)

Introduction
Supported instruction set
Supported data types and soft ops
Changing the memory layout
Network interface
Cache

Introduction

The mMIPS (pronounced as "mini MIPS") is a simplified version of the MIPS processor. Compared to the MIPS it has a reduced instruction set which means that some operations need to be done in software. The mMIPS that comes with the package has been extended with a network interface. On this website the term mMIPS refers to the mMIPS with this network interface. This page also contains information about an instruction and data cache for the mMIPS. We do not use that functionality. The SystemC sources of the mMIPS are in ./noc/mmips. It is possible to change the sizes of the code or data memories as well as the number of mMIPS processors on the NOC by changing the dimensions of the network.

Supported instruction set

These instructions are supported by the mMIPS and the compiler lcc that we use in this project:

More information is available in the text file README.mMIPS in the ./lcc directory.

Supported data types and soft ops

Below is a summary of the sizes in bytes of the standard data types in lcc and the operations that are performed in software (called soft ops) due to the lack of a complete instruction set.

Changing the memory layout

The mMIPS implementation that comes with this package assumes two separate memories of 16 kilobytes, one for instructions/code and one for data structures. It may be desirable to change the size of these memories, e.g. to fit more processors on the FPGA. The memory sizes have however been hard coded in the SystemC sources of the mMIPS, in the lcc compiler and in the applications that run on them. Changing the memory sizes in one place means changes in the other locations as well. This paragraph explains what changes need to be made and how to make them.

Note: The address of the first byte in the data memory is 0x0 in C and thus the last byte is at address 0x3FFF. It is therefore not possible to (accidentally) address bytes in the instruction memory from C.

1. C Debugging library
  Change the output destination memory range of mprintf() in the C debugging library and recompile it. The memory range reserved for this purpose contains the bytes at address MPRINTF_START_ADDR up to and including MPRINTF_MAX_ADDR (these preprocessor symbols are defined in mtools.h). The data range reserved for mprintf() is 0xE00 through 0xFFF (= 512 bytes) in the version of mtools.h that comes with the package. The bytes at address 0x0 through 0xDFF have been reserved for the input / output image of the JPEG decoder. This means that the bytes up to and including 0xFFF will not be used by lcc. If MPRINTF_MAX_ADDR increases, then the C compiler lcc also needs to be reconfigured to prevent it from storing data structures in the area used for the JPEG image or by mprintf().
2. C compiler lcc
  The following files determine the memory layout as the compiler sees it:
  • ./lcc/lccdir/minimips.link
    Determines the code and data memory segment sizes and positions. By default the values are as shown in table 1.

    Table 1: Variables and structures in ./lcc/lccdir/minimips.link that affect the memory layout including their initial values.

    Variable         Initial value
    ROM_START_ADDR   0x00000
    ROM_LENGTH       16k
    RAM_START_ADDR   0x05000
    RAM_LENGTH       12288
    MEMORY           rom (rx)  : ORIGIN = 0x0, LENGTH = 16k
                     ram (!rx) : ORIGIN = 0x5000, l = 12288

    The MEMORY structure tells lcc that the 16 kilobytes up to 0x4000 should be used for instructions, that the 4 kilobytes from 0x4000 up to 0x5000 should not be used and that the 12 kilobytes from 0x5000 through 0x7FFF should be used for data structures. As explained in the previous item, the bytes from 0x4000 through 0x4DFF and 0x4E00 through 0x4FFF are used to store the JPEG image and the output of mprintf() respectively. These ranges are 0x0 through 0xDFF and 0xE00 through 0xFFF when referred to from C code. The variables ROM_START_ADDR, ROM_LENGTH, RAM_START_ADDR and RAM_LENGTH express the same situation. Both these variables and the MEMORY structure need to be changed whenever memory sizes change.

  • ./lcc/lib/crt0.s
    This file contains information for the boot loader (e.g. first instruction, last instruction, stack and heap size).

A recompilation of the lcc compiler is necessary only if you change crt0.s.

3. mMIPS SystemC implementation
  The following files in the SystemC implementation of the mMIPS determine the memory layout:
  • ./noc/mips/mips.h
    RAMSIZE and ROMSIZE determine the size of the data and code segments in bytes.
  • ./noc/mips/mmips.h
    Includes the files brom16k.h and bram16k.h that define the BROM16K and BRAM16K modules that are also instantiated in the mmips.h code. Replace all occurrences of BROM16K and BRAM16K by one of the preset memory sizes (8k or 64k) or create custom memories.
4. Gossip and JPEG decoder
  The script strings.sh writes the portion of the data segment that contains the output of mprintf() to stdout. If the output location of mprintf() was changed, then this script needs to be changed too. strings.sh is used by dolcc in ./c_libs/gossip and ./c_libs/djpeg_mmips. For the JPEG decoder it is also necessary to change the memory locations where parse.c and step3.c load and store the picture respectively. The first file expects the picture at the start of the data memory (address 0x0) and the last character at MAX_ADDR_IMAGE (address 0xDFF initially). In step3.c sunraster_header *FrameHeader is initialized to the starting address of the picture (0x0) and unsigned char *FrameBuffer to the first byte after the header of the image (0x20). See also: choosing another input image.
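
The default layout described in items 1 and 2 can be cross-checked with a few lines of C. The sketch below is only an illustration: MPRINTF_START_ADDR, MPRINTF_MAX_ADDR and MAX_ADDR_IMAGE carry the values quoted above, while the LINK_* names and both helper functions are hypothetical. It also assumes that data address 0x0 in C corresponds to linker address 0x4000, which is inferred from the ranges quoted in item 2.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the text for the default 16k/16k configuration. */
#define MAX_ADDR_IMAGE      0x0DFFu  /* last byte of the JPEG image area */
#define MPRINTF_START_ADDR  0x0E00u  /* first byte of the mprintf() area */
#define MPRINTF_MAX_ADDR    0x0FFFu  /* last byte of the mprintf() area  */

/* Hypothetical names for the linker-side values from table 1. */
#define LINK_ROM_LENGTH     0x4000u  /* ROM_LENGTH, 16k of code          */
#define LINK_RAM_START      0x5000u  /* RAM_START_ADDR                   */
#define LINK_RAM_LENGTH     12288u   /* RAM_LENGTH                       */

/* Assumption: data address 0x0 in C is linker address 0x4000. */
static uint32_t c_to_link_addr(uint32_t c_addr)
{
    return c_addr + LINK_ROM_LENGTH;
}

/* Returns 1 if the reserved areas and the lcc data segment fit
 * together without overlapping. */
static int layout_is_consistent(void)
{
    uint32_t reserved_end = c_to_link_addr(MPRINTF_MAX_ADDR) + 1u;
    uint32_t ram_end = LINK_RAM_START + LINK_RAM_LENGTH;

    return MAX_ADDR_IMAGE + 1u == MPRINTF_START_ADDR /* areas are adjacent     */
        && reserved_end <= LINK_RAM_START            /* lcc data starts later  */
        && ram_end == 2u * LINK_ROM_LENGTH;          /* data ends at 0x8000    */
}
```

After changing any of the sizes, a check of this kind makes it easier to see whether mtools.h and minimips.link still agree with each other.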

Network Interface

Since the network interface (N.I.) is memory mapped, it can be accessed through specific memory locations. The network interface is controlled in mMIPS assembler by appropriate stores/loads to/from these memory locations. A new module, MEMDEV, replaces the data memory of the original mMIPS and, based on the requested memory address, either reads/writes the RAM (for regular addresses) or performs the appropriate communication with the network interface (for device addresses). Note that the C communications library stdcomm handles all the internals of the N.I. This setup is depicted in the following figure.

 

Figure 1: The MEMDEV module accesses RAM or NETWORK_INTERFACE depending on the address.

The MEMDEV module recognizes two addresses assigned to the N.I.: 0x80000000 and 0x80000004 (note how the device access is indicated by the most significant bit of the address). The first address (data word address) is associated with N.I. data while the second address (control word address) is used for N.I. control. Reading from and writing to the data word results in reading/writing the internal buffers of the N.I. (these are physically separate buffers, so writing to the data word and reading it back will not return the same value). The read/write operations on the data word are always non-blocking, i.e. regardless of the state of the N.I. they read/write the N.I. buffers. However, depending on the state of the N.I., the read data may be invalid, or the written data may overwrite a packet in the send buffer that has not been sent yet. To monitor the status of the N.I., the control word of the N.I. can be accessed at memory address 0x80000004. The meaning of the bits in the word is explained in figure 2.
 

Figure 2: Meaning of the bits in the N.I. control word (device address: 0x80000004).

Note that some of the bits in the control word can only be written (0-15, 17 and 20 - they control the behavior of N.I.) while some can only be read (16, 18 and 19 - they report the status of N.I.).
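
The two device addresses and the split between writable and read-only bits can be written down as C constants. This is a minimal sketch: the macro and function names are hypothetical and are not taken from stdcomm, and only the addresses and bit numbers come from the text above. The meaning of the individual bits is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define NI_DATA_ADDR  0x80000000u  /* data word address (from the text)    */
#define NI_CTRL_ADDR  0x80000004u  /* control word address (from the text) */

/* Hypothetical mask names; the bit numbers are those listed above. */
#define NI_CTRL_WRITE_MASK (0x0000FFFFu | (1u << 17) | (1u << 20))
#define NI_CTRL_READ_MASK  ((1u << 16) | (1u << 18) | (1u << 19))

/* Device accesses are marked by the most significant address bit. */
static int is_device_address(uint32_t addr)
{
    return (addr & 0x80000000u) != 0;
}

/* Keep only the bits of a control word that software may write. */
static uint32_t ni_ctrl_writable(uint32_t value)
{
    return value & NI_CTRL_WRITE_MASK;
}

/* Keep only the status bits of a value read from the control word. */
static uint32_t ni_ctrl_status(uint32_t value)
{
    return value & NI_CTRL_READ_MASK;
}
```

On the mMIPS itself the control word would presumably be reached through a volatile pointer to 0x80000004, but application code normally leaves this to stdcomm.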

The status bits include:

The control bits:

The network interface is a module used to send and receive packets on the network. It is capable of sending and receiving packets whose length is an arbitrary multiple of 32 bits. When such a packet is sent over the network, it is first split into smaller parts called flits. A packet that carries a single 32-bit word consists of three flits, and the arriving packet is reconstructed by collecting these three flits. Each additional 32-bit word within the packet adds two more data flits. The two actions of sending and receiving a packet are performed by two independent processes within the network interface (i.e. the N.I. is able to receive and send simultaneously). The interface of the module is shown in figure 3.
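
The flit count that follows from this rule (three flits for the first word, two more for every additional word) can be computed as shown below. The function name is hypothetical and is not part of the package.

```c
#include <assert.h>

/* Number of flits for a packet of `words` 32-bit words (words >= 1):
 * three flits for the first word, two more for each additional word. */
static unsigned flits_per_packet(unsigned words)
{
    assert(words >= 1);
    return 3u + 2u * (words - 1u);
}
```

For example, a packet of four words occupies 3 + 2 * 3 = 9 flits.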

Figure 3: The network interface module.

On the network side, the interface is compatible with the network's router: it has two sets of data/req/ack signals, one in each direction.

On the processor side, the N.I. provides the signals needed to write the destination address and the packet word data (reg_data_in with write_addr and write_data), to read a received packet word (reg_data_out), to trigger packet sending (send), to confirm that a packet word has been read (read) and to report the communication status (send_rdy and data_rdy). In addition, two signals mark the last word of a packet: packet_end asserted together with send means that the last word of the packet is being sent, while rcv_packet_end active together with data_rdy means that the last word of the packet has arrived.

The interface is fully synchronous, i.e. all data and associated strobes are sampled on the rising clock edge.

The sending of a packet is performed in the following manner:

  1. Wait for send_rdy to become high
  2. Output packet word on the reg_data_in bus and assert write_data signal.
  3. Output destination address (relative) on the bottom 16 bits of the reg_data_in bus (X distance - bits [15:8], Y distance - bits [7:0]) and assert write_addr signal
  4. Assert send signal to trigger word sending. If this is the last word of the packet, assert packet_end as well.
  5. If more packet words remain to send, wait for send_rdy to become high and goto step 2.
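
The destination word written in step 3 can be assembled as in the sketch below. The helper names are hypothetical, and the distances are treated as raw 8-bit fields because the text does not state how negative relative distances are encoded.

```c
#include <assert.h>
#include <stdint.h>

/* Build the value driven on reg_data_in while write_addr is asserted:
 * X distance in bits [15:8], Y distance in bits [7:0]. */
static uint32_t ni_pack_dest(uint8_t x_dist, uint8_t y_dist)
{
    return ((uint32_t)x_dist << 8) | (uint32_t)y_dist;
}

/* Extract the X distance field from a destination word. */
static uint8_t ni_dest_x(uint32_t dest)
{
    return (uint8_t)((dest >> 8) & 0xFFu);
}

/* Extract the Y distance field from a destination word. */
static uint8_t ni_dest_y(uint32_t dest)
{
    return (uint8_t)(dest & 0xFFu);
}
```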

To read a received packet:

  1. Wait for data_rdy signal to become high
  2. Read data present at reg_data_out
  3. Assert read signal to free the buffer and allow receiving of new packet words.
    If rcv_packet_end is high, this is the last word of the packet.
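
The receive steps can be tried out against a small software stand-in for the processor-side signals. Everything in this sketch is hypothetical: the struct and the helper functions only mimic the behavior of data_rdy, reg_data_out, rcv_packet_end and read as described above, and they are not the SystemC module or the stdcomm code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software stand-in for the receive side of the N.I. */
struct ni_rx_model {
    const uint32_t *words;  /* words of one incoming packet          */
    size_t count;           /* number of words in that packet        */
    size_t pos;             /* index of the word currently presented */
};

static int data_rdy(const struct ni_rx_model *ni)
{
    return ni->pos < ni->count;
}

static uint32_t reg_data_out(const struct ni_rx_model *ni)
{
    return ni->words[ni->pos];
}

static int rcv_packet_end(const struct ni_rx_model *ni)
{
    return ni->pos + 1 == ni->count;
}

/* Models asserting the read signal: frees the buffer for the next word. */
static void read_strobe(struct ni_rx_model *ni)
{
    ni->pos++;
}

/* Receive one packet into buf (capacity cap words). Returns the number
 * of words read, or 0 if nothing is ready or the packet does not fit. */
static size_t receive_packet(struct ni_rx_model *ni, uint32_t *buf, size_t cap)
{
    size_t n = 0;

    while (data_rdy(ni)) {            /* step 1 (this model never blocks) */
        int last = rcv_packet_end(ni);

        if (n == cap)
            return 0;
        buf[n++] = reg_data_out(ni);  /* step 2 */
        read_strobe(ni);              /* step 3 */
        if (last)
            break;
    }
    return n;
}
```

In hardware, step 1 is a wait on data_rdy; the model simply stops when no word is available.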

The functions sc_send() and sc_receive() in the C communications library stdcomm encapsulate the internals of communication.

Cache

The data and instruction memories within MIPS were replaced with two identical cache memories. The block diagram of the cache memory is presented in figure 4.

Figure 4: Block diagram of the cache memory.

The implemented cache memory is a direct mapped, write-through memory with no-allocate-on-miss write policy. It has 256 blocks (8 bit block index), with four words per block (two bit word offset). The breakdown of the address is presented in figure 5.
 

Figure 5: Breakdown of the address.

For implementation reasons, only 15 out of the 20 bits of block tag can be stored. This limits the addressable memory space to 27 bits.
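
The address breakdown can be illustrated in C. The sketch assumes a 32-bit byte address with a two-bit byte offset in the lowest bits, which is consistent with the 20-bit tag, 8-bit index and 2-bit word offset mentioned above. It also assumes that the 15 stored tag bits are the least significant ones, which is what yields the 27-bit limit (15 + 8 + 2 + 2). The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cache_addr {
    uint32_t tag;    /* bits [31:12], 20 bits              */
    uint32_t index;  /* bits [11:4], block index, 8 bits   */
    uint32_t word;   /* bits [3:2], word offset, 2 bits    */
    uint32_t byte;   /* bits [1:0], byte offset (assumed)  */
};

/* Split a 32-bit byte address into the fields used by the cache. */
static struct cache_addr split_addr(uint32_t addr)
{
    struct cache_addr a;

    a.tag   = addr >> 12;
    a.index = (addr >> 4) & 0xFFu;
    a.word  = (addr >> 2) & 0x3u;
    a.byte  = addr & 0x3u;
    return a;
}

/* Only 15 tag bits are stored (assumed: the least significant ones). */
static uint32_t stored_tag(uint32_t tag)
{
    return tag & 0x7FFFu;
}
```

Under these assumptions, two addresses that differ only in bits 27 and above map to the same stored tag and cannot be told apart by the cache.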

At the heart of the cache memory is the memory module that stores the cache entries. Based on the selected block index and word offset, the memory outputs the appropriate word, together with the valid bit and the tag of the selected block. When writing, the word present at data input is written together with the tag and valid bit to the memory. The output of the memory is connected to byte_select module. In byte read mode, it replaces the most significant byte of the data word with the requested byte.

The inputs of the memory are connected to din_select and addr_split modules that generate appropriate memory control signals based on the input data and cache control signals.

The inputs to the cache are registered in the input register modules. These modules register the values present at their inputs (unless the disable signal is asserted) and have two outputs: dout reflects the current contents of the register, while dout_select, depending on the select input, outputs either the registered or the current input data.

The control signals of the cache are generated by the miss_ctrl module.

The interface of the cache has two sets of signals. The cache access signals are consistent with the interface of the regular memory and therefore allow seamless replacement of a memory with the cache. The second set of signals allows communication with the external memory or another module responsible for fetching data for the cache.

The cache can perform three different types of operations:

  1. Reading (word or byte). In this mode, addr_split splits the input address into the appropriate index/offset/byte fields and the memory is read. Since only synchronous memories are available in the target Xilinx FPGAs, the memory does not respond to a changed address until the following rising clock edge. To realize single-cycle memory access despite that fact, the address fed to the cache has to be sourced from before the pipeline register.

    After the memory is ready, miss_ctrl checks whether the entry is valid and whether the tag of the selected block matches the requested tag. If these conditions are not met, a miss occurs and the new block contents (four words) need to be fetched. miss_ctrl signals the miss condition by asserting miss_wait (which should freeze the processor) and initiates the fetch from main memory. The fetched words arrive one by one and are written to the memory by asserting the we signal, instructing din_select to use the fetched data (din_fetch) instead of the regular input data, and instructing addr_split to output the appropriate word offset (fetch_word). After the block has been fetched, miss_ctrl rereads the word that caused the miss and de-asserts the miss_wait signal. If a byte read was requested, byte_select reorders the bytes of the word and generates the requested data output.

    Since the cache input signals are not registered (they are sourced from before the pipeline register), the current inputs need to be stored if a miss occurs. The following read/write (after the miss has been handled) then uses the registered outputs of the input registers rather than the direct ones (miss_ctrl asserts the select signal).

  2. Word write. In word write mode, the input word is immediately written to the memory and the asynchronous write transfer to the main memory is launched. This approach is possible because of the read-before-write feature of the Xilinx memories. With this option enabled, the memory outputs the previous contents of a location while that location is being written. This way, if a write miss occurs, the previous contents of the location are available at the outputs and can be written back to the memory. In this case, miss_ctrl issues a single-cycle stall (miss_wait) and rewrites the memory location with the previous value. This is achieved by asserting the rewrite signal, which instructs din_select to output the previous memory value (din_mem) instead of the regular data input.

    NOTE: to enable read-before-write option in the synthesized hardware, the RAM modules instantiated in the Verilog source need to be annotated with appropriate attributes. This is illustrated in the following Verilog code, which should be inserted in CACHE_MEMORY.v file, which contains description of the memory module:

        RAMB16_S36_S36 bram1(.DOA(DOA1), ... );
    	  /* synopsys attribute 
    	  WRITE_MODE_A "READ_FIRST"
    	  WRITE_MODE_B "READ_FIRST" */
        RAMB16_S36_S36 bram0(.DOA(DOA0), ... );
    	  /* synopsys attribute 
    	  WRITE_MODE_A "READ_FIRST"
    	  WRITE_MODE_B "READ_FIRST" */
    

    For more details on using Xilinx BlockRAM see this Application Note.

    As mentioned before, the write-through transfer of the written word to the main memory is initiated together with the writing of the word. However, if the transfer cannot be performed (e.g. a network interface communicating with the main memory through network has full buffer), the cache stalls until the transfer can be reinitiated.

  3. Byte write. For implementation reasons, the memory module does not provide byte-level access to the stored data words. This means that to write a single byte to the cache, it is necessary to read the word first and to rewrite it to the memory with the requested byte replaced. In most cases this means that the byte write introduces a single-cycle cache stall, which is needed to rewrite the word. The exception is when the written word is not in the cache. In that case it is not necessary to write it back to the cache, and no stall is needed. The rewriting of the memory with the byte replaced is achieved by asserting the byte_rpl signal, which instructs din_select to replace the byte in the word supplied from the memory (din_mem) with the byte present at the input (din).

    As was the case for word write, main memory transfer is initiated for the byte, which may also result in stall, if the main memory interface is not ready.
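
The net effect of the read and word-write operations described above can be reproduced with a small behavioral model in C. This is only a sketch under stated assumptions: it models no timing, stalls or signals, it keeps the full 20-bit tag instead of the 15 stored bits, it omits byte accesses because the byte ordering is not specified here, and the size of the modeled main memory (MAIN_WORDS) is arbitrary. All names are hypothetical. Note that the hardware writes the word first and restores the old value on a write miss, whereas the model simply skips the cache update, which leaves the cache in the same state.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCKS           256u
#define WORDS_PER_BLOCK  4u
#define MAIN_WORDS       4096u  /* hypothetical main memory size in words */

struct cache_model {
    uint32_t data[BLOCKS][WORDS_PER_BLOCK];
    uint32_t tag[BLOCKS];
    uint8_t  valid[BLOCKS];
    uint32_t main_mem[MAIN_WORDS];
    unsigned misses;            /* number of read misses so far */
};

static void cache_init(struct cache_model *c)
{
    memset(c, 0, sizeof *c);
}

/* Hit: the indexed entry is valid and its tag matches the address. */
static int cache_hit(const struct cache_model *c, uint32_t addr)
{
    uint32_t index = (addr >> 4) & 0xFFu;

    return c->valid[index] && c->tag[index] == (addr >> 12);
}

/* Word read: on a miss, fetch the whole four-word block, then reread. */
static uint32_t cache_read_word(struct cache_model *c, uint32_t addr)
{
    uint32_t index = (addr >> 4) & 0xFFu;
    uint32_t word = (addr >> 2) & 0x3u;
    uint32_t i;

    if (!cache_hit(c, addr)) {
        uint32_t base = (addr >> 4) * WORDS_PER_BLOCK; /* first word of block */

        for (i = 0; i < WORDS_PER_BLOCK; i++)
            c->data[index][i] = c->main_mem[(base + i) % MAIN_WORDS];
        c->tag[index] = addr >> 12;
        c->valid[index] = 1;
        c->misses++;
    }
    return c->data[index][word];
}

/* Word write: always written through to main memory; the cache entry
 * is updated only on a hit (no allocation on a write miss). */
static void cache_write_word(struct cache_model *c, uint32_t addr, uint32_t value)
{
    uint32_t index = (addr >> 4) & 0xFFu;
    uint32_t word = (addr >> 2) & 0x3u;

    c->main_mem[(addr >> 2) % MAIN_WORDS] = value;
    if (cache_hit(c, addr))
        c->data[index][word] = value;
}
```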

The memory module that implements the cache memory is built out of two physical BlockRAM modules available in Xilinx FPGAs. The setup of the blocks is presented in figure 6.

Figure 6: The memory module consisting of two physical BlockRAM modules.

Each of the blocks is configured in 512x(32+4) mode, i.e. it contains 512 32-bit words plus 4 additional parity bits per word. The four words of a cache block with index X are stored at addresses X and X+1 in the first block and X and X+1 in the second block. Since the memories are dual-port, it is possible to access two words within the same memory block simultaneously. Thus, simultaneous access to all four words within a cache block is possible.

To store the tag and the valid bit associated with each block, the additional parity bits are used (the parity is not hardware-assisted, so the bits can be freely used by the application for other purposes than parity). For each cache block there are 4x4 parity bits available (4 bits for addresses X and X+1 in each of the blocks), which gives 15 tag bits and 1 valid bit.
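
One possible way of spreading the 15 tag bits and the valid bit over the four 4-bit parity fields is sketched below. The actual bit assignment used in the hardware is not given here, so the layout (valid bit in bit 15, tag in bits [14:0], lowest nibble in the first parity field) and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: valid bit in bit 15, 15-bit tag in bits [14:0]. */
static uint16_t pack_meta(uint16_t tag15, int valid)
{
    return (uint16_t)(((valid ? 1u : 0u) << 15) | (tag15 & 0x7FFFu));
}

/* Split the 16 bits into four 4-bit parity fields, lowest nibble first. */
static void meta_to_parity(uint16_t meta, uint8_t parity[4])
{
    int i;

    for (i = 0; i < 4; i++)
        parity[i] = (uint8_t)((meta >> (4 * i)) & 0xFu);
}

/* Reassemble the 16 bits from the four parity fields. */
static uint16_t parity_to_meta(const uint8_t parity[4])
{
    uint16_t meta = 0;
    int i;

    for (i = 0; i < 4; i++)
        meta = (uint16_t)(meta | ((parity[i] & 0xFu) << (4 * i)));
    return meta;
}
```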

For more information about the BlockRAM memory see this Application Note.