Day 9: Complete Pipelined RISC-V CPU Microarchitecture

The RISC-V core from the previous day is still incomplete: several instructions are not yet implemented, and the core still needs to be pipelined, with the resulting pipeline hazards handled.
To complete the CPU design, we need to:

- Pipeline the CPU, taking care of the data dependency & control flow hazards
- Complete the implementation of the remaining ALU instructions
- Implement DMEM & Load, Store instructions
- Implement the Unconditional Jump (JAL, JALR) instructions

Pipelining the RISC-V CPU Core

9.1 Pipelining the CPU: Using 3-Cycle $valid signal

First, we implement a simplified 3-stage pipeline driven by a 3-cycle valid signal. The logic is partitioned as follows:

- PC
- Instruction Fetch + Decode
- RF Read, ALU
- RF Write, Branch instruction logic

This implementation has an IPC of only about 1/3, because the valid signal is asserted once every 3 cycles (HLLHLL...), so only one valid instruction is in the pipe at any time. This step lets us partition the core logic into its pipeline stages first, without yet having to handle the pipeline hazards.
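The HLLHLL... pattern can be reproduced with a small behavioral model. This is only a sketch in Python, not the TL-V code: it assumes a hypothetical `start` pulse in the first cycle after reset deasserts, after which `valid` simply repeats its own value from 3 cycles earlier.

```python
def three_cycle_valid(reset_cycles, total_cycles):
    """Return the per-cycle valid pattern for a 3-cycle valid scheme.

    Assumed behavior (not taken from the original TL-V source):
    `start` pulses in the first cycle after reset deasserts, and
    `valid` then repeats its own value from 3 cycles earlier.
    """
    reset = [c < reset_cycles for c in range(total_cycles)]
    valid = []
    for c in range(total_cycles):
        prev_reset = reset[c - 1] if c >= 1 else False
        start = prev_reset and not reset[c]
        if reset[c]:
            v = False
        elif start:
            v = True
        else:
            v = valid[c - 3] if c >= 3 else False
        valid.append(v)
    return valid


pattern = three_cycle_valid(reset_cycles=2, total_cycles=11)
print("".join("H" if v else "L" for v in pattern))
```

Ignoring the two reset cycles, one cycle in every three carries a valid instruction, which is where the IPC of about 1/3 comes from.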

Waterfall Logic Diagram with 3-Cycle Valid
TL-V Logic Implementation Diagram

Makerchip-generated Block Diagram for 3-Cycle Valid design

9.2 Pipelining the CPU: Solving the data & control hazards

9.2.1 Data Hazard: Read-After-Write (RAW) in the Register File

There is a 2-cycle delay (by design) between the RF read and RF write operations.
Hence we have a Read-After-Write (RAW) data hazard when the current instruction reads a Register File (RF) index that the previous instruction writes, because the read happens before the new value is available in the RF.
To solve this, we add a Register File bypass mux at the ALU inputs that selects the previous ALU output whenever the previous instruction writes the RF index read by the current instruction.
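The selection made by the bypass mux can be sketched as a plain function. This is an illustrative Python model, not the TL-V implementation; the function and argument names are hypothetical stand-ins for the previous-cycle write enable, write index and ALU result.

```python
def bypass_mux(rd_index, rf_rd_data, prev_wr_en, prev_wr_index, prev_result):
    """Select the ALU operand: forwarded result on a RAW hazard, else RF data."""
    raw_hazard = prev_wr_en and prev_wr_index == rd_index
    return prev_result if raw_hazard else rf_rd_data


# Hypothetical back-to-back pair: ADDI x5, x0, 42 followed by ADD x6, x5, x5.
# The RF still holds a stale value of x5 when the ADD reads it.
stale_x5 = 10
operand = bypass_mux(rd_index=5, rf_rd_data=stale_x5,
                     prev_wr_en=True, prev_wr_index=5, prev_result=42)
print(operand)  # the forwarded value, not the stale RF contents
```

When the indices differ, or the previous instruction does not write the RF, the mux falls through to the value read from the RF.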

Register File Bypass Waterfall Logic Diagram
Register File Bypass TL-V Implementation

9.2.2 Control Hazard: Branch Instructions

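A control flow hazard occurs when a branch is taken: the instructions fetched after the branch come from the wrong path.
The PC logic is updated to redirect to the branch target when a branch is taken and to continue sequentially otherwise.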

Branch Instruction Control Hazard

9.3 Complete the ALU

The Instruction Decoder is updated to decode all the instructions, and the complete ALU is implemented. Note: all load instructions are treated the same as the LW instruction.

9.4 DMEM & Load, Store Instructions

9.4.1 DMEM

The DMEM is a single-port read/write memory with 16 entries, each 32 bits wide.
The DMEM is placed in the 4th pipeline stage.

DMEM

9.4.2 LOAD (LW, LH, LB, LHU, LBU) Instructions

LOAD rd, imm(rs1)
Loads the data at the DMEM address (rs1 + imm) into the destination register rd, i.e., rd <= DMEM(rs1 + imm)

9.4.3 STORE (SW, SH, SB) Instructions

STORE rs2, imm(rs1)
Stores the data from rs2 to the DMEM address given by (rs1 + imm), i.e., DMEM(rs1 + imm) <= rs2

The $dmem_addr[3:0] is generated by the ALU, which treats load and store instructions like the ADDI instruction.

i.e., the ALU performs the following:
LOAD/ STORE : ($is_load || $is_s_instr) ? ($src1_value + $imm) :
ADDI        :                 $is_addi  ? ($src1_value + $imm) :

Since the DMEM is 32 bits wide and is not byte- or halfword-addressable, the two least-significant bits of the byte address are dropped:
$dmem_addr[3:0] = $result[5:2];
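To make the address slicing concrete, here is a small Python sketch of a 16-entry, word-addressed memory. The `Dmem` class and its method names are hypothetical; only the `[5:2]` slice and the load/store semantics come from the text above, and all accesses are treated as full-word (LW/SW-like).

```python
class Dmem:
    """Behavioral sketch of a 16-entry, 32-bit wide, word-addressed memory."""

    def __init__(self):
        self.words = [0] * 16

    @staticmethod
    def word_index(byte_addr):
        # Equivalent of $result[5:2]: drop the 2 byte-offset bits, keep 4 index bits.
        return (byte_addr >> 2) & 0xF

    def store(self, rs1_value, imm, rs2_value):
        # DMEM(rs1 + imm) <= rs2
        self.words[self.word_index(rs1_value + imm)] = rs2_value & 0xFFFFFFFF

    def load(self, rs1_value, imm):
        # rd <= DMEM(rs1 + imm)
        return self.words[self.word_index(rs1_value + imm)]


dmem = Dmem()
dmem.store(rs1_value=0, imm=4, rs2_value=15)   # word-sized store to byte address 4
print(Dmem.word_index(4), dmem.load(rs1_value=0, imm=4))
```

Byte addresses 4 through 7 all map to the same entry (index 1), and because only 4 index bits are kept, addresses beyond the 16 entries wrap around in this model.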

Muxes need to be placed at the RF write index ($rf_wr_index) and RF write data ($rf_wr_data) ports to select between the ALU path and the load path, depending on whether a valid load is returning its data.
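A rough Python sketch of these two muxes follows. The data select mirrors the $rf_wr_data expression quoted in section 9.7.2; the index select (reusing the load's own rd from two cycles earlier) is an assumption, and the function and argument names are hypothetical.

```python
def rf_write_port(valid_load_2ago, rd_2ago, ld_data_2ago, rd, result):
    """Return (write index, write data) for the RF write port.

    If a valid load issued two cycles ago, its destination register and the
    data returned by the DMEM win; otherwise the current instruction's rd
    and ALU result are used.
    """
    if valid_load_2ago:
        return rd_2ago, ld_data_2ago
    return rd, result


# A load to r15 issued two cycles ago returns 0x2A; the current slot's
# rd/result belong to a squashed instruction and must be ignored.
print(rf_write_port(True, 15, 0x2A, rd=3, result=0x99))
# No load in flight: the ALU path is selected.
print(rf_write_port(False, 0, 0, rd=3, result=0x99))
```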

DMEM Load/ Store

Additionally, the Program Counter logic has to be updated for load redirects.

9.5 Unconditional Jump (JAL, JALR) Instructions

JAL : Jump to (PC + IMM), equivalent to an unconditional branch w.r.t the calculation of the target address.
JALR: Jump to (SRC1 + IMM)

The logic to calculate the branch target for JALR needs to be implemented.
The Program Counter logic also needs to be modified to handle the jumps.
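The redirect targets described in sections 9.2.2, 9.4 and 9.5 can be summarized in one small Python function. This is a sketch under assumptions, not the TL-V PC logic: the function name and the `kind` strings are hypothetical, and the load redirect to the load's own PC + 4 is assumed from the fact that the instructions in the load's shadow are squashed and must be refetched.

```python
MASK32 = 0xFFFFFFFF


def redirect_target(kind, pc, imm=0, src1_value=0, taken=False):
    """Return the PC to redirect to, or None if fetch simply continues.

    kind: "branch", "jal", "jalr", "load", or anything else for plain
    ALU instructions. All arithmetic wraps to 32 bits.
    """
    if kind == "branch" and taken:
        return (pc + imm) & MASK32          # taken branch: PC + IMM
    if kind == "jal":
        return (pc + imm) & MASK32          # same target calculation as a branch
    if kind == "jalr":
        # SRC1 + IMM as described above; the RISC-V spec additionally
        # clears bit 0 of this sum, which is omitted here.
        return (src1_value + imm) & MASK32
    if kind == "load":
        return (pc + 4) & MASK32            # assumed: refetch the instruction after the load
    return None


print(hex(redirect_target("branch", 0x10, imm=-8, taken=True)))
print(hex(redirect_target("jalr", 0x20, imm=4, src1_value=0x100)))
```

A not-taken branch or an ordinary ALU instruction returns None here, meaning the PC just keeps incrementing.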

9.6 Complete Pipelined RISC-V CPU Core Implementation in Makerchip

Click on the image below to open up the interactive svg file:

9.7 Bug found with the LW instruction and RF Read Bypass

Original Code: riscv_pipelined_with_LW_Bug.tlv

While functionally simulating, in the Makerchip IDE, the RTL of the RISC-V CPU core designed by following the steps in the lecture videos and slides, I noticed two issues:

9.7.1 Issue #1

During the execution of the LW instruction, the DMEM address gets written to the destination register in the first cycle.

(NOTE: This is a benign issue and not a concern)

Since LW is an I-type (immediate-type) instruction, $rd (the destination register) is valid during this phase, and therefore $rf_wr_en (Register File Write Enable) is asserted.

// Immediate
$is_i_instr = ($instr[6:2] == 5'b00000) ||
              ($instr[6:2] == 5'b00001) ||
              ($instr[6:2] == 5'b00100) ||
              ($instr[6:2] == 5'b00110) ||
              ($instr[6:2] == 5'b11001);
...
$is_load  = ($opcode == 7'b0000011);
...
$rd_valid = $is_r_instr | $is_i_instr | $is_u_instr | $is_j_instr;
...
$rf_wr_en =  ($rd_valid && $valid && $rd != 5'b0) || >>2$valid_load;

If we take the following example: m4_asm(LW, r15, r0, 00100).
This instruction is supposed to do just: r15 [31:0] <= DMEM [(r0 + 00100)] [31:0]

In our design, the DMEM address is generated by the ALU as DMEM_addr = (rs1 + imm). Hence the ALU output (the DMEM address) is written to the destination register first, and two cycles later the actual data from that DMEM address is written to the same register.
Since it takes two cycles for valid data to be fetched from the DMEM and written to the destination register, our implementation squashes the 2 instructions already in the pipe in the "shadow" of the load instruction.
Hence writing this intermediate value to the destination register is not a concern.
Nevertheless, to avoid this unnecessary RF write and get a cleaner implementation, we can deassert $rf_wr_en for these two cycles for a valid load instruction:
```
$rf_wr_en = (!$valid_load && !>>1$valid_load) && ($rd_valid && ($rd != 5'b0) && $valid) || >>2$valid_load;
```
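To check the effect of this change, the two $rf_wr_en expressions can be evaluated cycle by cycle for a single LW. The Python model below is a sketch: the three-cycle trace (the LW itself followed by two squashed slots) and the rd values in the squashed slots are made up for illustration.

```python
def rf_wr_en(valid, rd_valid, rd, valid_load, valid_load_1ago, valid_load_2ago,
             suppress_intermediate):
    """Evaluate the RF write enable, with or without the load suppression term."""
    normal_write = rd_valid and rd != 0 and valid
    if suppress_intermediate:
        normal_write = normal_write and not valid_load and not valid_load_1ago
    return bool(normal_write or valid_load_2ago)


# Per cycle: (valid, rd_valid, rd, valid_load, valid_load_1ago, valid_load_2ago)
LW_TRACE = [
    (True,  True, 15, True,  False, False),  # LW r15 executes
    (False, True, 3,  False, True,  False),  # squashed slot in the load's shadow
    (False, True, 4,  False, False, True),   # squashed slot; load data returns
]

original = [rf_wr_en(*c, suppress_intermediate=False) for c in LW_TRACE]
revised = [rf_wr_en(*c, suppress_intermediate=True) for c in LW_TRACE]
print(original)  # [True, False, True]
print(revised)   # [False, False, True]
```

The only difference is the first cycle: the revised expression no longer writes the DMEM address into the destination register, while the write of the loaded data two cycles later is unchanged.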

9.7.2 Issue #2

The instruction immediately following the LW instruction gets the wrong $src1_value and $src2_value

(NOTE: This is an actual BUG and breaks functionality)

This bug was found while checking whether the issue above causes a RAW hazard when the instruction immediately following the LW instruction accesses the destination register of the LW instruction.
It happens because of an incorrect RF Read Bypass in the original implementation:
```
$src1_value[31:0] = (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en
                    ? >>1$result : $rf_rd_data1 ;

$src2_value[31:0] = (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en
                    ? >>1$result : $rf_rd_data2 ;
```
- In this original code, the instruction immediately in the shadow of the LW instruction gets the wrong values for $src1_value and $src2_value, which are the inputs to the ALU.
- This is because we are not accounting for the fact that the data to be written to the RF can come from either the ALU ($result) or the DMEM ($ld_data):
  $rf_wr_data[31:0] = >>2$valid_load ? >>2$ld_data : $result;
  
  However, the bypass only considers the ALU output when forwarding data on a RAW hazard.

RF Read Bypass Bug

FIX 1: During initial debugging, I came up with the following fix, based on the simulation waveforms and the VIZ_JS debug prints.

riscv_pipelined_withBugFix_1.tlv
This explicitly considers the case of the instruction immediately succeeding LW.

// Handling Read-After-Write Hazard
$src1_value[31:0] = >>3$valid_load && (>>3$rf_wr_index == $rf_rd_index1) ? >>3$ld_data :
                    (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en ? >>1$result   :
                    $rf_rd_data1;

$src2_value[31:0] = >>3$valid_load && (>>3$rf_wr_index == $rf_rd_index2) ? >>3$ld_data :
                    (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en ? >>1$result   :
                    $rf_rd_data2;

FIX 2: Talking to Steve H. gave me a better understanding of the issue, and he suggested the following code change, which bypasses the value actually written to the RF ($rf_wr_data) instead of the ALU output ($result):

riscv_pipelined_withBugFix_2.tlv

// Handling Read-After-Write Hazard
$src1_value[31:0] = (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en
                    ? >>1$rf_wr_data : $rf_rd_data1;

$src2_value[31:0] = (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en
                    ? >>1$rf_wr_data : $rf_rd_data2;
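The difference between the original bypass and Fix 2 can be illustrated with a small Python model of the cycle in which the loaded data is written back. This is a sketch with made-up values: `PrevCycle` and both function names are hypothetical, and `result` in that cycle is assumed to be the ALU output of a squashed instruction.

```python
from dataclasses import dataclass


@dataclass
class PrevCycle:
    rf_wr_en: bool
    rf_wr_index: int
    result: int      # ALU output in the previous cycle
    rf_wr_data: int  # value actually written to the RF in the previous cycle


def src_value_original(rd_index, rf_rd_data, prev):
    """Original bypass: always forwards the ALU output."""
    hit = prev.rf_wr_en and prev.rf_wr_index == rd_index
    return prev.result if hit else rf_rd_data


def src_value_fixed(rd_index, rf_rd_data, prev):
    """Fix 2: forwards whatever was written to the RF (ALU result or load data)."""
    hit = prev.rf_wr_en and prev.rf_wr_index == rd_index
    return prev.rf_wr_data if hit else rf_rd_data


# Previous cycle: load data 0x2A is written to r15, while the ALU output
# belongs to a squashed instruction (0xDEAD is an arbitrary junk value).
load_writeback = PrevCycle(rf_wr_en=True, rf_wr_index=15,
                           result=0xDEAD, rf_wr_data=0x2A)
print(hex(src_value_original(15, 4, load_writeback)))  # 0xdead (wrong)
print(hex(src_value_fixed(15, 4, load_writeback)))     # 0x2a  (load data)
```

For ordinary ALU instructions $rf_wr_data equals $result, so both versions behave the same there; they only differ when the written value comes from the DMEM.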
