
[dv] Deep dive regression triage #2187

Open
GregAC opened this issue Jul 4, 2024 · 4 comments
GregAC commented Jul 4, 2024

regr-second.log
regr-first.log

Overview

A detailed investigation was carried out of all failures observed in a local regression run of commit 1449ed5 running the opentitan configuration. Every observed failure was examined in full (as opposed to assuming that the same signature means the same failure). A second regression was run against 3384bf4, which included some fixes for issues found in the first regression. The regression log files (regr-first.log, regr-second.log) are attached to this issue.

This investigation has found one RTL issue, but it is of low concern (and it could be argued it doesn't need fixing, as the behaviour is already reasonable; see #2186). Otherwise all failures are related to environment/testbench issues and raise no concerns around the RTL.

This report first lists the signatures seen, with an explanation for each, then it lists the specific test failures seen and which explanation applies to each. For some of the signatures, fixes were made to address them, either because there were a significant number of them (which would have taken a long time to fully investigate) or because there was some concern they could be hiding a bug. The regression was re-run with these fixes applied to ensure the specific tests being addressed had been fixed. Any failures remaining for those specific tests were then investigated in detail once again. The full regression log was also reviewed for the other tests, but just to check the signatures for anything obviously new; there was no detailed investigation. All of the details of this (signatures fixed, PRs for the fixes and the specific tests that got re-investigated) are given below.

First Regression

Run against commit 1449ed5. See the attached regr-first.log for the detailed regression report.

Failure explanations and associated signatures

This provides explanations for every failure seen in the first regression and an example signature that was seen with that failure.

  1. [E] 107: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/tests/core_ibex_base_test.sv(326) @ 32002046: uvm_test_top [uvm_test_top] Test failed due to wall-clock timeout. [1800s]

This is an issue with RISC-V DV test generation. Either the test has entered an infinite loop and will never terminate, or it will eventually terminate but is taking too long. In both scenarios the logs have been examined and there is no lock-up in the design: the Ibex core continues to execute instructions in a timely way up to the point of test failure, with cosim matching in all cases.

  2. [E] 111: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/tests/core_ibex_test_lib.sv(843) @ 87608: reporter [uvm_test_top] Check failed dut_vif.dut_cb.priv_mode == mode (0 [0x0] vs 3 [0x3]) Incorrect privilege mode

This is a testbench issue related to the old checkers that are still present in some tests. These checks incorrectly predict a privilege mode change. The log of the failing test has been checked: the Ibex core matches the co-simulation up to the point of test failure and there are no other errors reported.

  3. [E] 118: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 121679: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch Register write data mismatch to x6 DUT: 8001286d expected: 0

This is an issue with co-simulation not correctly predicting core behaviour. The core itself is behaving reasonably, but this is a complex edge case that isn't yet correctly modelled. We see a load response with a data integrity error, which indicates a bus error. This results in a precise exception for the bus error that is immediately interrupted by the internal interrupt for the integrity error. After the internal interrupt handler is done we continue with the first instruction of the exception handler for the bus error (with mepc being restored correctly due to the recoverable NMI mechanism). The mismatch happens because Ibex writes mtval with the faulting address for the internal interrupt, but this write is not carried over to spike. So spike executes the internal interrupt handler (with no prior mtval write), which doesn't read the mtval value, and then executes the exception handler, which does read mtval; this then mismatches. To fix this, the mtval write that occurs due to the internal interrupt needs to be carried over to the co-simulation environment.
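A minimal Python sketch of the divergence described above (all names and the address value are illustrative, not the actual testbench or spike interfaces): the DUT-side model carries the mtval write for the internal interrupt, the ISS-side model does not, so the later mtval read in the bus-error exception handler mismatches.

```python
# Illustrative-only model of the mtval carry-over gap described above.

def take_internal_intg_interrupt(csrs, faulting_addr, model_mtval_write):
    # On the internal integrity interrupt Ibex writes mtval with the
    # faulting address; the cosim bridge currently does not carry this
    # write over to spike (model_mtval_write=False).
    if model_mtval_write:
        csrs["mtval"] = faulting_addr

dut = {"mtval": 0}  # tracks what Ibex holds in mtval
iss = {"mtval": 0}  # tracks what spike holds in mtval

# Illustrative faulting address, not taken from the actual failure.
take_internal_intg_interrupt(dut, 0x8001286C, model_mtval_write=True)
take_internal_intg_interrupt(iss, 0x8001286C, model_mtval_write=False)

# The bus-error exception handler then reads mtval into a GPR, and the
# register write mismatches between DUT and ISS.
print(hex(dut["mtval"]), hex(iss["mtval"]))  # 0x8001286c 0x0
```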

  4. [E] 162: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 3437882: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch DUT didn't write to register x8, but a write was expected

This is a mismatch between the memory model used by the co-simulation and the memory model in the UVM environment. It is also an issue with the RISC-V DV test generation (it jumps to an area of memory that hasn't been initialized). It relates to the differing handling of uninitialized memory between co-simulation and UVM. Earlier in the test a few bytes are written to a previously uninitialized area of memory. The test then jumps to an address one byte before these written bytes. In the co-simulation the first (uninitialized) byte reads as 0x0 and the second (and following) bytes read as the values that were previously written. The UVM environment sees an instruction fetch involving an uninitialized memory byte and forces a value of 0x0 for all fetched bytes. This results in the co-simulation and the Ibex core fetching different instructions from the same address, and a mismatch occurs. The behaviour of the two memory models should be unified.
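The two uninitialized-memory policies can be sketched as follows (function names, addresses and instruction bytes are all illustrative, not the actual memory model classes):

```python
# Hypothetical sketch of the two memory-model policies described above.

def fetch_cosim(mem, addr, nbytes=4):
    # Cosim policy: each uninitialized byte individually reads as 0x0;
    # initialized bytes keep their previously written values.
    return bytes(mem[addr + i] if (addr + i) in mem else 0
                 for i in range(nbytes))

def fetch_uvm(mem, addr, nbytes=4):
    # UVM policy: if ANY fetched byte is uninitialized, force the
    # whole fetched word to 0x0.
    data = [mem.get(addr + i) for i in range(nbytes)]
    if any(b is None for b in data):
        return bytes(nbytes)
    return bytes(data)

mem = {}
# The test writes a few bytes, then jumps one byte BEFORE them.
for i, b in enumerate(b"\x13\x05\x10\x00"):  # illustrative instruction bytes
    mem[0x101 + i] = b

print(fetch_cosim(mem, 0x100))  # mixes a 0x0 byte with the written bytes
print(fetch_uvm(mem, 0x100))    # whole word forced to 0x0
```

The two fetches return different instruction words for the same address, which is exactly the divergence that produces the mismatch.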

  5. [FAILED]: Processing the ibex trace failed: Logfile /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/out/run/tests/riscv_pc_intg_test.14611/trace_core_00000000.log not found

Occurs in the riscv_pc_intg_test test. This is an issue in the regression reporting setup. Due to the way this test works it doesn't generate a trace file in some cases (which is flagged as an error in the report) but there is no failure. The test tries to inject a corrupt PC very early in the test, before execution has begun. The corrupt PC does not trigger any alert, as this particular hardening mechanism only performs checks when the Ibex core is doing straight-line execution (in particular, the PC integrity check isn't carried out for branches/jumps). Before execution has started no check is occurring, so no alert is triggered, which is expected behaviour.

  6. [E] 116: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 148168: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch Synchronous trap was expected at ISS PC: 80000000 but the DUT didn't report one at PC 800029ee

An access is made to the address region spike reserves for debug memory. This causes a fault in spike, which generates a bus error, but not in Ibex. The spike debug module is not present in the co-simulation environment so there's no need for the bus error here. The cause of this fault has been removed (see lowRISC/riscv-isa-sim@1f89c52).

  7. [FAILURE] Simulation process killed due to timeout [1860s].

This is an infrastructure issue; the test passed successfully when rerun.

  8. 131: xmsim: *SE,EILLCT: (File: ./fcov/core_ibex_fcov_if.sv, Line: 762):(Time: 730828 NS + 4) Illegal cross tuple (<0, 1, InstrCategoryFetchError, IdStallTypeMem, 0, 0, 0>) occurred corresponding to illegal cross bin (mem_stall_illegal) of cross (core_ibex_tb_top.dut.u_ibex_top.u_ibex_core.u_fcov_bind.uarch_cg_inst@11461_629.exception_stall_instr_cross).
     [E] 132: xmsim: *SE,EILLCT: (File: ./fcov/core_ibex_fcov_if.sv, Line: 762):(Time: 730868 NS + 4) Illegal cross tuple (<0, 1, InstrCategoryFetchError, IdStallTypeMem, 0, 0, 0>) occurred corresponding to illegal cross bin (mem_stall_illegal) of cross (core_ibex_tb_top.dut.u_ibex_top.gen_lockstep.u_ibex_lockstep.u_shadow_core.u_fcov_bind.uarch_cg_inst@11478_1.exception_stall_instr_cross).

This cross coverage bin should not be marked as illegal; it has been fixed in 89f4d86. In this scenario we see a fetch error in the ID stage as well as a memory stall that we should only see if there's a memory instruction in the ID stage. Upon closer examination the instruction bits in the ID stage decode to a memory instruction, which results in the stall. Because of the fetch error the instruction has no effect on architectural state or external memory activity, so the stall is entirely harmless (strictly it doesn't need to occur, but practically it is of little consequence).

  9. [E] 200: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/tests/core_ibex_base_test.sv(326) @ 25002027: uvm_test_top [uvm_test_top] Test failed due to wall-clock timeout. [1800s]

Reset test specific issue. Execution continues until timeout; however, the cosim log doesn't exist because reset causes a total cosim reset and the log is written as a new file rather than appended to the existing one. Rerunning the failed test with wave generation enabled somehow breaks the cosim memory model post reset: it immediately sees a bus error on attempting to execute from the reset vector. This is suspected to be a memory corruption issue caused by the cosim not being cleanly shut down and restarted.

Failures

This lists every failing test from the first regression (in regr-first.log) and links it to a signature and explanation from the list above.

  • riscv_rand_jump_test.14607 - Signature 7
  • riscv_rand_jump_test.14610 - Signature 1
  • riscv_rand_jump_test.14611 - Signature 1
  • riscv_mmu_stress_test.14610 - Signature 7
  • riscv_debug_instr_test.14614 - Signature 1
  • riscv_debug_instr_test.14621 - Signature 1
  • riscv_dret_test.14605 - Signature 2
  • riscv_dret_test.14606 - Signature 2
  • riscv_dret_test.14608 - Signature 2
  • riscv_debug_ebreak_test.14611 - Signature 1
  • riscv_debug_ebreakmu_test.14614 - Signature 1
  • riscv_irq_in_debug_mode_test.14611 - Signature 1
  • riscv_irq_in_debug_mode_test.14614 - Signature 1
  • riscv_debug_in_irq_test.14611 - Signature 1
  • riscv_mem_error_test.14605 - Signature 8
  • riscv_mem_intg_error_test.14608 - Signature 3
  • riscv_mem_intg_error_test.14616 - Signature 3
  • riscv_mem_intg_error_test.14617 - Signature 3
  • riscv_mem_intg_error_test.14635 - Signature 3
  • riscv_mem_intg_error_test.14644 - Signature 3
  • riscv_mem_intg_error_test.14649 - Signature 3
  • riscv_mem_intg_error_test.14652 - Signature 3
  • riscv_mem_intg_error_test.14654 - Signature 3
  • riscv_debug_single_step_test.14605 - Signature 6
  • riscv_debug_single_step_test.14607 - Signature 4
  • riscv_debug_single_step_test.14609 - Signature 1
  • riscv_debug_single_step_test.14614 - Signature 1
  • riscv_reset_test.14614 - Signature 9
  • riscv_reset_test.14616 - Signature 9
  • riscv_pc_intg_test.14611 - Signature 5
  • riscv_pc_intg_test.14618 - Signature 5
  • riscv_epmp_mmwp_test.14617 - Signature 1

Failures in riscv_assorted_traps_interrupts_debug_test, riscv_pmp_basic_test, riscv_pmp_out_of_bounds_test and riscv_pmp_full_random_test are excluded from the list above, as fixes have been made (see PR #2183) to address the issues seen. A new full regression has been run with the fixes applied and all failures in these tests investigated in detail. See details below.

Second Regression

Run against commit 3384bf4. See the attached regr-second.log for the detailed regression report.

Only the failures in riscv_assorted_traps_interrupts_debug_test, riscv_pmp_basic_test, riscv_pmp_out_of_bounds_test and riscv_pmp_full_random_test have been investigated in detail and discussed below. Failures in the other tests have been compared against the signatures identified above, with some assumptions made to speed up the triage process. All failures in the other tests are existing failures explained in the First Regression section above, with one exception: a new signature was seen in riscv_debug_single_step_test, which is examined in detail below. The rest of the details here only concern signatures in the tests investigated in detail.

Signatures

  1. [E] 7061: UVM_ERROR /home/greg/work/ibex_area/ibex/rtl/ibex_controller.sv(995) @ 116357750: reporter [ASSERT FAILED] IbexSetExceptionPCOnSpecialReqIfExpected

This is a flawed assertion. When MIP changes in the cycle the controller enters IRQ_TAKEN from DECODE, such that no interrupt is actually taken, the assertion incorrectly fires. This scenario was confirmed with the warning added in 470b39a.

  2. [E] 9876: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/tests/core_ibex_base_test.sv(326) @ 116002037: uvm_test_top [uvm_test_top] Test failed due to wall-clock timeout. [1800s]

This is an issue with RISC-V DV test generation. Either the test has entered an infinite loop and will never terminate, or it will eventually terminate but is taking too long. In both scenarios the logs have been examined and there is no lock-up in the design: the Ibex core continues to execute instructions in a timely way up to the point of test failure, with cosim matching in all cases.

  3. [E] 112: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 1272021: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch DUT generated load at address 73661800 with data 0 but data f0 was expected with byte mask 1

Earlier in the test a store occurred which crossed a PMP region boundary (at 0x73661800); the first half of the store failed the PMP check but the second half passed. For Ibex this means the second half of the store produces a memory request which updates the memory. However, no matching update is made in the cosim memory. Later a load accesses the same bytes that were stored, so cosim and Ibex mismatch, as they hold different values for the same bytes due to the differing store behaviour.
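A rough sketch of the scenario, under the behaviour the report describes (the PMP rule, function names and store data are illustrative, not the actual RTL or cosim code):

```python
# Illustrative model: a misaligned store crossing a PMP region boundary,
# where the DUT performs the passing half but the cosim drops the whole store.

BOUNDARY = 0x73661800

def pmp_allows(addr):
    # Assumed illustrative PMP rule: writes below the boundary are
    # denied, writes at or above it are allowed.
    return addr >= BOUNDARY

def dut_store(mem, addr, data):
    # Per the report, Ibex splits the misaligned store into two halves
    # and issues the second half even though the first half faulted.
    half = len(data) // 2
    for off, b in enumerate(data):
        part_base = addr if off < half else addr + half
        if pmp_allows(part_base):
            mem[addr + off] = b

def cosim_store(mem, addr, data):
    # The cosim treats the faulting store as having no memory side effects.
    if all(pmp_allows(addr + off) for off in range(len(data))):
        for off, b in enumerate(data):
            mem[addr + off] = b

dut_mem, iss_mem = {}, {}
dut_store(dut_mem, BOUNDARY - 2, bytes([0xAA, 0xBB, 0xF0, 0xCC]))
cosim_store(iss_mem, BOUNDARY - 2, bytes([0xAA, 0xBB, 0xF0, 0xCC]))

# A later load of the byte at the boundary sees 0xF0 on the DUT side but
# nothing (reads back as 0) on the cosim side, producing the mismatch.
print(dut_mem.get(BOUNDARY), iss_mem.get(BOUNDARY))
```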

  4. [E] 127: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 569493: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch Register write data mismatch to x5 DUT: 400000c3 expected: 400000c0

In this scenario the mismatch is caused by a difference in dcsr values, specifically the prv field: Ibex indicates M-mode in this field where spike indicates U-mode. Earlier, debug mode is entered just after an interrupt is raised but before the first instruction of the interrupt handler can be executed. Ibex aborts the fetch of the first instruction of the handler to instead enter debug mode; it correctly sets depc to the first instruction of the interrupt handler, which would be executed in M-mode, so M-mode is also the correct value of prv. The value of depc that spike has isn't known from the traces, but presumably it entered debug mode before taking the interrupt, so its depc points to a U-mode instruction that would have been executed had there been no interrupt or debug request. Both are behaving correctly; the issue comes down to how the co-simulation environment signals interrupts and debug requests to spike and ensuring spike prioritises them in the same way as Ibex in all cases.

  5. [E] 422: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 5555918: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch PC mismatch, DUT retired : 80000000 , but the ISS retired: ffffffff80002580

A possible RTL bug, but of limited concern; it could be argued the behaviour is reasonable here, see #2186.

Failures

  • riscv_debug_single_step_test.6250 - Signature 5
  • riscv_assorted_traps_interrupts_debug_test.6250 - Signature 2
  • riscv_assorted_traps_interrupts_debug_test.6256 - Signature 4
  • riscv_assorted_traps_interrupts_debug_test.6257 - Signature 1
  • riscv_assorted_traps_interrupts_debug_test.6258 - Signature 1
  • riscv_pmp_basic_test.6255 - Signature 1
  • riscv_pmp_basic_test.6271 - Signature 1
  • riscv_pmp_basic_test.6288 - Signature 1
  • riscv_pmp_out_of_bounds_test.6258 - Signature 1
  • riscv_pmp_out_of_bounds_test.6261 - Signature 1
  • riscv_pmp_out_of_bounds_test.6272 - Signature 1
  • riscv_pmp_out_of_bounds_test.6282 - Signature 1
  • riscv_pmp_full_random_test.6473 - Signature 3
GregAC commented Jul 4, 2024

Unfortunately I cannot attach html files to a github issue so it's just the regression logs available.

vogelpi commented Jul 5, 2024

Thanks for your analysis and the report @GregAC. This all makes sense to me and I agree that the identified possible RTL bug isn't critical.

I compared the latest nightly regression report against your analysis and the results match. I observed one new signature:

riscv_debug_ebreakmu_test.11102
-------------------------------
binary:          test.bin
rtl_log:         rtl_sim.log
rtl_trace:       trace_core_00000000.log
iss_cosim_trace: spike_cosim_trace_core_00000000.log

[FAILED]: error seen in 'rtl_sim.log'
---------------*LOG-EXTRACT*----------------
    105: UVM_INFO /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/tests/core_ibex_base_test.sv(326) @ 2025: uvm_test_top [uvm_test_top] Test wall-clock timeout is set to : 1800s
    106: UVM_INFO /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/common/ibex_mem_intf_agent/ibex_mem_intf_response_seq_lib.sv(44) @ 2025: uvm_test_top.env.instr_if_response_agent.sequencer@@instr_intf_seq [uvm_test_top.env.instr_if_response_agent.sequencer.instr_intf_seq] is_dmem_seq: 0x0
    107: UVM_INFO /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/common/ibex_mem_intf_agent/ibex_mem_intf_response_seq_lib.sv(44) @ 2025: uvm_test_top.env.data_if_response_agent.sequencer@@data_intf_seq [uvm_test_top.env.data_if_response_agent.sequencer.data_intf_seq] is_dmem_seq: 0x1
    108: UVM_INFO /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/tests/core_ibex_new_seq_lib.sv(65) @ 2025: reporter@@fetch_enable_seq_h [fetch_enable_seq_h] Running the "InfiniteRuns" schedule for stimulus generation
    109: core_ibex_tb_top.dut.u_ibex_tracer.unmblk2.unmblk1: Writing execution trace to /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/out/run/tests/riscv_debug_ebreakmu_test.11102/trace_core_00000000.log
[E] 110: UVM_FATAL /home/runner/_work/lowrisc-private-ci/lowrisc-private-ci/ibex/dv/uvm/core_ibex/tests/core_ibex_test_lib.sv(1740) @ 33765: uvm_test_top [uvm_test_top] EBreak seen whilst doing initial debug initialization, KNOWN FAILURE SEE https://github.com/lowRISC/ibex/issues/1313
    111: 
    112: --- RISC-V UVM TEST FAILED ---
    113: 
    114: UVM_INFO /nas/lowrisc/tools/cadence/xcelium/21.09-s006/tools/methodology/UVM/CDNS-1.2/sv/src/base/uvm_report_catcher.svh(705) @ 33765: reporter [UVM/REPORT/CATCHER]
--------------------------------------------

It looks like an old, previously fixed testbench/co-simulation issue is showing up again. In any case, this issue isn't a show stopper either.

GregAC commented Aug 23, 2024

Some more signatures investigated whilst looking at other regressions. I won't give details of precise commits/regressions etc., as this is just useful information for later triages rather than the point-in-time investigation the posts above were:

Signature:

riscv_assorted_traps_interrupts_debug_test.10313
------------------------------------------------
binary:          test.bin
rtl_log:         rtl_sim.log
rtl_trace:       trace_core_00000000.log
iss_cosim_trace: spike_cosim_trace_core_00000000.log

[FAILED]: error seen in 'rtl_sim.log'
---------------*LOG-EXTRACT*----------------
    294: 1261075: Illegal instruction (hart 0) at PC 0x8000ba0c: 0x00010413
    295: 1261095: Illegal instruction (hart 0) at PC 0x8000ba0c: 0x00010413
    296: UVM_INFO /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/tests/core_ibex_new_seq_lib.sv(77) @ 1264765: uvm_test_top.env.irq_agent.sequencer@@debug_new_seq_h [uvm_test_top.env.irq_agent.sequencer.debug_new_seq_h] Running 6/9
    297: 1265235: Illegal instruction (hart 0) at PC 0x8000ba0c: 0x00010413
    298: 1265275: Illegal instruction (hart 0) at PC 0x8000ba0c: 0x00010413
[E] 299: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 1265285: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch Synchronous trap was expected at ISS PC: 80000004 but the DUT didn't report one at PC 8000ba08
    300: 
    301: 
    302: --- RISC-V UVM TEST FAILED ---
    303: 
--------------------------------------------

This is an issue with test generation. A run of store instructions writes to memory just ahead of the PC (i.e. they modify the code we're about to execute). When the first of these modified instructions is executed, spike sees the new value and Ibex sees the old value. Both are reasonable behaviours (a FENCE.I is needed to guarantee the changed instructions are observed by fetch) and a mismatch results.
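The divergence can be sketched with a toy prefetch-buffer model (class name, addresses and instruction encodings are illustrative; real Ibex fetch behaviour is more involved):

```python
# Rough sketch: with a prefetch buffer, stores near the PC are not
# necessarily observed by fetch until a FENCE.I.

class IbexLikeFetch:
    def __init__(self, mem, pc, depth=2):
        # Prefetch a few instruction words ahead of the PC.
        self.buf = {pc + 4 * i: mem[pc + 4 * i] for i in range(depth)}

    def fetch(self, mem, addr):
        # Serve from the prefetch buffer if present, else from memory.
        return self.buf.get(addr, mem[addr])

mem = {0x8000: 0x00010413, 0x8004: 0x00010413}
fe = IbexLikeFetch(mem, 0x8000)

# A store overwrites the next instruction, with no FENCE.I afterwards.
mem[0x8004] = 0x00000013

spike_sees = mem[0x8004]           # spike fetches the newly written value
ibex_sees = fe.fetch(mem, 0x8004)  # Ibex still sees the prefetched old value
print(hex(spike_sees), hex(ibex_sees))  # 0x13 0x10413
```

Both values are architecturally permitted here, which is why this is a test-generation problem rather than an RTL or spike bug.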

GregAC commented Aug 23, 2024

riscv_assorted_traps_interrupts_debug_test.10320
------------------------------------------------
binary:          test.bin
rtl_log:         rtl_sim.log
rtl_trace:       trace_core_00000000.log
iss_cosim_trace: spike_cosim_trace_core_00000000.log

[FAILED]: error seen in 'rtl_sim.log'
---------------*LOG-EXTRACT*----------------
    686: 8581206: Illegal instruction (hart 0) at PC 0x952ade74: 0x00010413
    687: 8608426: Illegal instruction (hart 0) at PC 0x952ade78: 0x00010413
    688: 8608466: Illegal instruction (hart 0) at PC 0x952ade78: 0x00010413
    689: 8619966: Illegal instruction (hart 0) at PC 0x952ade7c: 0x00010413
    690: 8620006: Illegal instruction (hart 0) at PC 0x952ade7c: 0x00010413
[E] 691: UVM_FATAL /home/greg/work/ibex_area/ibex/dv/uvm/core_ibex/common/ibex_cosim_agent/ibex_cosim_scoreboard.sv(168) @ 8650356: uvm_test_top.env.cosim_agent.scoreboard [uvm_test_top.env.cosim_agent.scoreboard] Cosim mismatch Register write data mismatch to x14 DUT: 1800 expected: 1880
    692: 
    693: 
    694: --- RISC-V UVM TEST FAILED ---
    695: 
--------------------------------------------

This is a co-simulation issue. In this scenario an interrupt is taken and, before the first instruction of its handler can execute, an NMI occurs. The NMI takes priority and we start executing the NMI handler. Ibex and spike agree up to this point (in particular both read an mcause value showing the NMI cause) but disagree on the mstatus.mpie value: Ibex has this as 0 and spike as 1. When the first interrupt is taken, mstatus.mpie is set to 1 and mstatus.mie to 0 (interrupts were enabled before the interrupt and are disabled for the handler). When Ibex then takes the NMI, mstatus.mpie is set to 0 (the value of mstatus.mie when the NMI is taken), but spike does not do this (which is incorrect here), producing the mismatch.
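The mstatus update sequence above can be sketched as follows (a minimal model of standard RISC-V trap entry; the dict-based CSR representation is illustrative):

```python
# Sketch of the mstatus.mie/mpie updates described above.

def take_trap(mstatus):
    # Standard RISC-V trap entry: MPIE <= MIE, then MIE <= 0.
    mstatus["mpie"] = mstatus["mie"]
    mstatus["mie"] = 0

ibex = {"mie": 1, "mpie": 0}
spike = {"mie": 1, "mpie": 0}

# First interrupt taken by both: mpie becomes 1, mie becomes 0.
take_trap(ibex)
take_trap(spike)

# NMI arrives before the handler's first instruction. Ibex performs the
# trap-entry update again (mpie <= mie, i.e. 0); spike (incorrectly here)
# skips the mpie update for the NMI, so the two diverge.
take_trap(ibex)

print(ibex["mpie"], spike["mpie"])  # 0 1
```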
