Back to TOC
Prev: Day8
The RISC-V core from the previous day is still incomplete: only a subset of the instructions is implemented, and the core still needs to be pipelined with the resulting pipeline hazards handled.
To complete the CPU design, we need to:
- Pipeline the CPU, taking care of the data dependency & control flow hazards
- Complete the implementation of the remaining ALU instructions
- Implement DMEM & Load, Store instructions
- Implement the Unconditional Jump (JAL, JALR) instructions
Figure: Pipelining the RISC-V CPU Core
First, we implement a simplified 3-stage pipeline using a 3-cycle valid signal, with the logic partitioned as follows:
- PC
- Instruction Fetch + Decode
- RF Read, ALU
- RF Write, Branch Instrn. logic
This implementation has an IPC of only ~1/3, because the valid signal is asserted once every 3 cycles (HLLHLL...), so there is only one valid instruction in the pipe at any time. This step lets us partition the core logic into its pipeline stages first, without having to worry about pipeline hazards yet.
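As a rough illustration of the valid pattern described above, the following Python sketch models a 3-cycle valid signal cycle by cycle. It is not the TL-V source: the function name `valid_pattern`, the `reset_cycles` parameter, and the start-pulse scheme (valid fires on the first cycle after reset and then repeats every third cycle) are assumptions made for this example.

```python
def valid_pattern(num_cycles, reset_cycles=1):
    """Per-cycle valid bits for a 3-cycle valid scheme (behavioural model)."""
    valid = []
    for cycle in range(num_cycles):
        in_reset = cycle < reset_cycles
        # Assumed start pulse: first cycle after reset is released.
        start = (not in_reset) and cycle >= 1 and (cycle - 1) < reset_cycles
        if in_reset:
            bit = False
        elif start:
            bit = True
        else:
            # Otherwise repeat whatever valid was three cycles ago.
            bit = valid[cycle - 3] if cycle >= 3 else False
        valid.append(bit)
    return valid


if __name__ == "__main__":
    bits = valid_pattern(10)
    print("".join("H" if b else "L" for b in bits))
```

With one reset cycle and 10 simulated cycles, the pattern is LHLLHLLHLL: 3 valid cycles out of the 9 after reset, which matches the ~1/3 IPC.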
Figure: Waterfall Logic Diagram with 3-Cycle Valid
Figure: TL-V Logic Implementation Diagram
Figure: Makerchip-generated Block Diagram for 3-Cycle Valid design
- There is a 2-cycle delay (by design) between RF Read and Write operations.
- Hence we have a Read-After-Write (RAW) data hazard when the current instruction in the pipe reads a Register File (RF) index that the previous instruction is writing, because that write is not yet visible in the RF.
- To solve this, we need to add a Register File Bypass Mux at the input of the ALU and select the previous ALU output if the previous instruction was writing to the RF index accessed in the current instruction.
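The hazard and the bypass described above can be sketched with a small Python model. This is a deliberately simplified illustration, not the TL-V design: it assumes each result becomes visible in the register file only after the next instruction has already read its operands, it handles ADD-style instructions only, and it ignores the special case of register x0. The names `read_operand` and `run_adds` are hypothetical.

```python
def read_operand(index, regs, pending, use_bypass):
    """Return the operand value for register `index`."""
    if use_bypass and pending is not None and pending[0] == index:
        return pending[1]          # forward the previous ALU result
    return regs.get(index, 0)      # otherwise use the (possibly stale) RF value


def run_adds(program, init_regs, use_bypass):
    """Run back-to-back ADDs given as (rd, rs1, rs2); return the ALU results."""
    regs = dict(init_regs)
    pending = None                 # (rd, result) of the previous instruction
    results = []
    for rd, rs1, rs2 in program:
        result = (read_operand(rs1, regs, pending, use_bypass)
                  + read_operand(rs2, regs, pending, use_bypass))
        if pending is not None:
            regs[pending[0]] = pending[1]   # previous write lands only now
        pending = (rd, result)
        results.append(result)
    return results


if __name__ == "__main__":
    program = [(3, 1, 2), (4, 3, 1)]        # r3 = r1 + r2 ; r4 = r3 + r1
    init = {1: 1, 2: 2}
    print(run_adds(program, init, use_bypass=False))
    print(run_adds(program, init, use_bypass=True))
```

Without the bypass, the second ADD reads a stale r3 and computes 1 instead of 4; with the bypass it picks up the previous ALU output and computes 4.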
Figure: Register File Bypass Waterfall Logic Diagram
Figure: Register File Bypass TL-V Implementation
- We have control flow hazards when a branch is taken.
- The PC logic is updated to handle the case when a branch is taken or not.
Figure: Branch Instruction Control Hazard
The Instruction Decoder is updated to decode all the instructions, and the complete ALU is implemented. Note: All load instructions are treated the same as the LW instruction.
- The DMEM is a single-port R/W memory with 16 entries, 32-bit wide.
- The DMEM is placed in the 4th pipeline stage.
DMEM
- LOAD rd, imm(rs1)
- Loads the data from the DMEM address given by (rs1 + imm) to destination register provided by rd.
i.e., rd <= DMEM(rs1 + imm)
- STORE rs2, imm(rs1)
- Stores the data from rs2 to the DMEM address given by (rs1 + imm).
i.e., DMEM(rs1 + imm) <= rs2
The $dmem_addr[3:0] is generated by the ALU, which treats load and store instructions as equivalent to the ADDI instruction.
i.e., the ALU performs the following:

    ADDI        : $is_addi ? ($src1_value + $imm) :
    LOAD/STORE  : ($is_load || $is_s_instr) ? ($src1_value + $imm) :

Since the DMEM is 32 bits wide and word-addressed (not byte- or half-word-addressable), the two least significant bits of the ALU result are dropped:

    $dmem_addr[3:0] = $result[5:2];
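To make the address mapping concrete, here is a small Python sketch of a 16-entry, word-addressed DMEM. The helper names `dmem_index`, `store`, and `load` are hypothetical and exist only for this illustration; the bit slicing follows the $result[5:2] selection above.

```python
DMEM_ENTRIES = 16  # 16 words of 32 bits, as described above


def dmem_index(src1_value, imm):
    """Word index into the DMEM for the byte address (src1_value + imm)."""
    result = (src1_value + imm) & 0xFFFFFFFF   # 32-bit ALU result
    return (result >> 2) & 0xF                 # equivalent of $result[5:2]


def store(dmem, src1_value, imm, src2_value):
    """DMEM(rs1 + imm) <= rs2"""
    dmem[dmem_index(src1_value, imm)] = src2_value & 0xFFFFFFFF


def load(dmem, src1_value, imm):
    """rd <= DMEM(rs1 + imm)"""
    return dmem[dmem_index(src1_value, imm)]


if __name__ == "__main__":
    dmem = [0] * DMEM_ENTRIES
    store(dmem, 0, 0b00100, 15)        # byte address 4 -> word index 1
    print(dmem_index(0, 0b00100), load(dmem, 0, 0b00100))
```

Byte addresses 4 through 7 all map to word index 1, and because only four index bits are kept, byte address 64 wraps around to index 0.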
Muxes need to be placed at the inputs of RF write index ($rf_wr_index) and RF write data ($rf_wr_data) ports to select the appropriate values depending on the validity of the load instruction.
Figure: DMEM Load/Store
Additionally, the Program Counter logic has to be updated for load redirects.
- JAL : Jump to (PC + IMM), equivalent to an unconditional branch w.r.t the calculation of the target address.
- JALR: Jump to (SRC1 + IMM)
The logic to calculate the branch target for JALR needs to be implemented.
The Program Counter logic also needs to be modified to handle the jumps.
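One possible shape of the resulting next-PC selection is sketched below in Python. The priority order, the argument names, and the choice to redirect a load to the load's own incremented PC are assumptions made for this sketch, not details confirmed by the text. Note also that the RISC-V ISA additionally clears bit 0 of the JALR target; that step is omitted here to match the (SRC1 + IMM) description above.

```python
MASK32 = 0xFFFFFFFF


def branch_or_jal_target(pc, imm):
    """Target for taken branches and JAL: PC + IMM (32-bit wrap)."""
    return (pc + imm) & MASK32


def jalr_target(src1_value, imm):
    """Target for JALR: SRC1 + IMM (32-bit wrap, bit 0 not cleared here)."""
    return (src1_value + imm) & MASK32


def next_pc(inc_pc, reset=False, taken_br=False, br_tgt_pc=0,
            load=False, load_inc_pc=0,
            jal=False, jalr=False, jalr_tgt_pc=0):
    """Select the next PC; the priority order is an assumption of this sketch."""
    if reset:
        return 0
    if taken_br:
        return br_tgt_pc
    if load:
        return load_inc_pc      # assumed: resume after the load
    if jal:
        return br_tgt_pc        # JAL shares the PC + IMM target
    if jalr:
        return jalr_tgt_pc
    return inc_pc               # default: sequential fetch


if __name__ == "__main__":
    print(hex(next_pc(0x8, jalr=True, jalr_tgt_pc=jalr_target(0x100, -4))))
```

Keeping the target calculation separate from the selection mirrors the text: JAL reuses the branch target, while JALR needs its own adder fed by the source register.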
Click on the image below to open up the interactive svg file:
Original Code: riscv_pipelined_with_LW_Bug.tlv
While running functional simulation in the Makerchip IDE on the RISC-V CPU core designed by following the lecture videos and slides, I noticed two issues:
During the execution of the LW instruction, the DMEM address gets written to the destination register in the first cycle.
(NOTE: This is a benign issue and not a concern)
- Since LW is an I-type (immediate-type) instruction, $rd (the destination register) is valid during this phase, and thus $rf_wr_en (Register File write enable) is asserted.

      // Immediate
      $is_i_instr = ($instr[6:2] == 5'b00000) ||
                    ($instr[6:2] == 5'b00001) ||
                    ($instr[6:2] == 5'b00100) ||
                    ($instr[6:2] == 5'b00110) ||
                    ($instr[6:2] == 5'b11001);
      ...
      $is_load = ($opcode == 7'b0000011);
      ...
      $rd_valid = $is_r_instr | $is_i_instr | $is_u_instr | $is_j_instr;
      ...
      $rf_wr_en = ($rd_valid && $valid && $rd != 5'b0) || >>2$valid_load;
- Take the following example: m4_asm(LW, r15, r0, 00100). This instruction is only supposed to do:

      r15[31:0] <= DMEM[(r0 + 00100)][31:0]

  Due to our design, the DMEM address is generated by the ALU as DMEM_addr = (rs1 + imm). Hence the ALU output (i.e., the DMEM address) is written to the destination register first, and two cycles later the actual data from that DMEM address is written to the same register.
- In our implementation, it takes two cycles for valid data to be fetched from the DMEM and written to the destination register, so we squash the 2 instructions already in the pipe in the "shadow" of the load instruction. Hence writing this intermediate value to the destination register is not a concern.
  Nevertheless, to avoid this unnecessary RF write and keep the implementation cleaner, we can deassert $rf_wr_en for these two cycles of a valid load instruction:

      $rf_wr_en = (!$valid_load && !>>1$valid_load) && ($rd_valid && ($rd != 5'b0) && $valid) || >>2$valid_load;
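The effect of gating the write enable can be checked with a short Python sketch that evaluates both forms of the $rf_wr_en expression over a few cycles. The function name `rf_wr_en_trace`, the list-per-signal representation, and the example sequence (an LW to r15, two squashed cycles, then an instruction writing r1) are assumptions made for illustration.

```python
def rf_wr_en_trace(valid, rd_valid, rd, valid_load, gate_load_shadow):
    """Per-cycle $rf_wr_en, given lists of per-cycle signal values."""
    def ago(signal, cycle, n):
        # Value of `signal` n cycles earlier (False before the trace starts).
        return signal[cycle - n] if cycle >= n else False

    trace = []
    for t in range(len(valid)):
        normal_write = rd_valid[t] and valid[t] and rd[t] != 0
        if gate_load_shadow:
            # Modified form: suppress the write in the load cycle and the next.
            normal_write = (normal_write and not valid_load[t]
                            and not ago(valid_load, t, 1))
        trace.append(bool(normal_write or ago(valid_load, t, 2)))
    return trace


if __name__ == "__main__":
    valid      = [True, False, False, True]   # LW, two squashed cycles, next instr
    rd_valid   = [True, True, True, True]
    rd         = [15, 0, 0, 1]
    valid_load = [True, False, False, False]
    print(rff := rf_wr_en_trace(valid, rd_valid, rd, valid_load, False))
    print(rf_wr_en_trace(valid, rd_valid, rd, valid_load, True))
```

In this example the original expression asserts the enable in cycle 0 (writing the address into r15), while the gated form stays low until the load data returns in cycle 2.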
The instruction immediately following the LW instruction gets the wrong $src1_value and $src2_value
(NOTE: This is an actual BUG and breaks functionality)
- This bug was found while checking whether the above issue causes a RAW hazard when the instruction immediately following the LW reads the LW's destination register.
- This happens because of an incorrect RF Read Bypass in the original implementation:

      $src1_value[31:0] = (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en ? >>1$result : $rf_rd_data1;
      $src2_value[31:0] = (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en ? >>1$result : $rf_rd_data2;
- In this original code, the instruction immediately in the shadow of the LW instruction gets the wrong values for $src1_value and $src2_value, which are the inputs to the ALU.
- This is because we are not accounting for the fact that the data written to the RF can come from either the ALU ($result) or the DMEM ($ld_data):

      $rf_wr_data[31:0] = >>2$valid_load ? >>2$ld_data : $result;

  The bypass, however, only considers the ALU output for the RF read during a RAW hazard.
- Figure: RF Read Bypass Bug
- FIX 1: During initial debug, I came up with the following solution to the bug, based on the simulation waveforms and the VIZ_JS debug prints.
- riscv_pipelined_withBugFix_1.tlv
  This explicitly considers the case of the instruction immediately succeeding LW.

      // Handling Read-After-Write Hazard
      $src1_value[31:0] = >>3$valid_load && (>>3$rf_wr_index == $rf_rd_index1) ? >>3$ld_data :
                          (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en ? >>1$result :
                          $rf_rd_data1;
      $src2_value[31:0] = >>3$valid_load && (>>3$rf_wr_index == $rf_rd_index2) ? >>3$ld_data :
                          (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en ? >>1$result :
                          $rf_rd_data2;
- FIX 2: Talking to Steve H. gave me a better understanding of the issue, and he suggested the following code change, which bypasses >>1$rf_wr_data (already selected between the ALU result and the load data) instead of >>1$result:

      // Handling Read-After-Write Hazard
      $src1_value[31:0] = (>>1$rf_wr_index == $rf_rd_index1) && >>1$rf_wr_en ? >>1$rf_wr_data : $rf_rd_data1;
      $src2_value[31:0] = (>>1$rf_wr_index == $rf_rd_index2) && >>1$rf_wr_en ? >>1$rf_wr_data : $rf_rd_data2;
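The difference between bypassing the ALU result and bypassing the RF write data can be illustrated with a minimal Python sketch. The `PrevCycle` record, the `src_value` function, and the example values are hypothetical; the sketch only models the mux selection for one source operand, not the pipeline timing.

```python
from dataclasses import dataclass


@dataclass
class PrevCycle:
    """Signals of the previous cycle that feed the bypass mux."""
    rf_wr_en: bool
    rf_wr_index: int
    result: int       # ALU output in that cycle
    rf_wr_data: int   # value actually written to the RF (ALU result or load data)


def src_value(prev, rf_rd_index, rf_rd_data, bypass_wr_data):
    """Bypass mux for one ALU source operand."""
    if prev.rf_wr_en and prev.rf_wr_index == rf_rd_index:
        # Original code forwards the ALU result; the fix forwards the RF write data.
        return prev.rf_wr_data if bypass_wr_data else prev.result
    return rf_rd_data


if __name__ == "__main__":
    # Previous cycle: load data 0xCAFE returns for r15 while the ALU holds 0x4.
    load_return = PrevCycle(rf_wr_en=True, rf_wr_index=15,
                            result=0x4, rf_wr_data=0xCAFE)
    print(hex(src_value(load_return, 15, 0x0, bypass_wr_data=False)))
    print(hex(src_value(load_return, 15, 0x0, bypass_wr_data=True)))
```

For an ordinary ALU write the two versions agree, since the written data equals the ALU result; they differ only in the cycle where load data is being written back, which is exactly the failing case.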