cpu-o3: Transform the lsqunit #265
Conversation
Force-pushed from 0894beb to cf13024 (compare)
Transform the load/store execution logic into a multi-stage pipeline. Change-Id: Iaf7558ad75ed8fe2bbf4a776359db113b6126453
Originally, the fence instruction was dispatched to mem's dispatchQueue, but its opType was No_OpClass, which caused it to wait for the integer issue queue IQ2 (IntMisc) to have free entries before it could continue execution. If the instructions following the fence occupy intIQ2, the fence can never execute and the CPU gets stuck. Therefore, change the opType of the fence instruction to MemReadOp to prevent this situation (in fact, the fence will not be dispatched to an IQ). Change-Id: Ie38a901e038db9906c43f78675e69391e847c88b
Now initiateAcc only does the TLB access and is located at s0 of the load/store pipeline. A load accesses the cache and checks for violations at s1, receives the cache response at s2, and writes back at s3. A store updates the SQ and checks for violations at s1, and writes back at s4. AMO operations are now executed using `executeAmo`. Change-Id: Iac678b7de3a690329f279c70fdcd22be4ed22715
This commit covers only normal loads; uncached/AMO loads keep the original process. Change-Id: Idc98ee18a6e94a39774ebba0f772820699b834de
Add a fence before and after the LRSC instruction. Change-Id: I66021d0a5a653d2a7e30cd262166363a84184ed6
Change-Id: Ifc1a586df8beab65772d48a75106155f9e723cba
Adjust cache miss load replay logic: replay all loads that cannot get data at load s2; the cache no longer needs to send `sendCustomSignal` on a miss. Add RAW nuke replay at load s1 and s2. Move most of the writeback logic to load s2 and actually write back at s3. Change-Id: Idfd3480969958826f4820349168f17c9522f791e
Set `EnableLdMissReplay` to True to enable replaying missed loads from the replayQueue. Set `EnablePipeNukeCheck` to True to detect RAW nuke replay in the load pipe. NOTE: if `EnableLdMissReplay` is False, `EnablePipeNukeCheck` can't be set to True. Change-Id: Ic4235bffba01d5dc4c39cec8ae92f2d27b28d98a
Store writes back at S4 by default; when using --ideal-kmhv3, store writes back at S2. Change-Id: I6a318ff6c182daca0ab041840d76575a16e45d82
Change-Id: I5829589df8ca01724ffa4369d23d7e4693e0aea1
Previously, the delay of the write-packet operation did not take into account whether the block was ready. In fact, if the block is not ready, the TimingResp for the write should actually be returned later. Change-Id: I65de8d47e2f24ad4be867e1867cddee06092f22f
Currently, at the xbar, besides sending the actual TimingResp, a Hint signal is sent N cycles in advance (N is set by hint_wakeup_ahead_cycles in Caches.py). This Hint signal first queries the MSHR, finds all associated load instructions, and issues a custom wake-up. Once all custom wake-ups are received, the corresponding load instructions in the replayQueue are woken up. When these awakened instructions reach stage s1 or s2 of the load pipeline, data is forwarded from the bus. The actual TimingResp keeps the data on the bus until the DCache finally writes the data into itself and then clears it. Change-Id: I8960acc14e95c06d8b1a86220f36a181588ff7f4
Force-pushed from cf13024 to 343e77e (compare)
If this instruction is cancelled, the associated wake event should be descheduled. Change-Id: I595541aa5f96163350aa5f6e3825f78520a0e660
@@ -121,9 +135,10 @@ LSQ::LSQ(CPU *cpu_ptr, IEW *iew_ptr, const BaseO3CPUParams &params)
     }

     thread.reserve(numThreads);
+    // TODO: Parameterize the load/store pipeline stages
parameterize this
OK, I will fix this
@@ -1282,7 +1365,9 @@ LSQ::LSQRequest::~LSQRequest()
         std::raise(SIGINT);
     }
     assert(!isAnyOutstandingRequest());
-    _inst->savedRequest = nullptr;
+    if (_inst->savedRequest == this) {
Why is this condition needed?
In the current design, after a load instruction experiences a miss for the first time, it does not continue to track the corresponding request.
Once the miss request retrieves the data, refills the dcache, and returns a response to the LSU, it will discard itself.
When the missed load is replayed to the pipeline, a new request will be generated (the goal is to ensure it re-accesses the DTLB and dcache when necessary). This approach indeed breaks the rule that each instruction corresponds to only one request. If request management is not handled properly, it could potentially lead to functional issues.
However, in order to implement early wake-up and align with the RTL, I feel this approach is actually more friendly in terms of implementation.
src/cpu/o3/lsq_unit.cc
Outdated
}

Fault
LSQUnit::storePipeS2(const DynInstPtr &inst, std::bitset<LdStFlagNum> &flag)
Why are there so many identical functions? They can be simplified.
Yes. These `storePipeS2`, `S3`, and `S4` do not perform any operations and are indeed redundant. I will replace them with a new function.
src/cpu/o3/lsq_unit.cc
Outdated
void
LSQUnit::dumpLoadPipe()
{
    DPRINTF(LSQUnit, "Dumping LoadPipe:\n");
use:
if (debug::LSQUnit)
Yes, using `debug::LSQUnit` is much better for simulation speed; I will fix it.
src/cpu/o3/lsq_unit.cc
Outdated
void
LSQUnit::dumpStorePipe()
{
    DPRINTF(LSQUnit, "Dumping StorePipe:\n");
Same as above
int size;

DynInstPtr insts[MaxWidth];
std::bitset<LdStFlagNum> flags[MaxWidth];
I suggest putting the ld/st flags into the DynInst.
In the current design, where a single instruction may appear in both s0 and s3 of the pipeline (fast replay), I feel storing the state in the TimeBuffer would be easier to manage.
int size;

DynInstPtr insts[MaxWidth];
std::bitset<LdStFlagNum> flags[MaxWidth];
same as above
+ add LdPipeStages and StPipeStages parameters
+ remove redundant storePipeSx code
+ fix dumpLoadStorePipe
Change-Id: Ie8cb7865c3a53265520f11f016dd467c25a3e2a5
Background
In order to align the behavior of the LSU module with the RTL, this PR makes the following modifications:
1. Pipeline Construction
Related commits:
Original design: Load/store operations are delayed by a certain number of cycles, and then all operations (TLB, cache lookup, exception check, etc.) are completed in one cycle using `executeLoad`/`executeStore`.
New design: TimeBuffers are set up according to the corresponding pipeline stages. Instructions are first dispatched to TimeBuffer `s0`, and over time they pass through the stages of the pipeline to complete the corresponding operations.
Code Details
Original design:
- Instructions are delayed `op_latency - 1` cycles before being sent to the corresponding function unit.
  (GEM5/src/cpu/o3/inst_queue.cc, lines 700 to 708 in 2d1995a)
- `executeLoad`/`executeStore` are called to complete the corresponding operations.
  (GEM5/src/cpu/o3/iew.cc, lines 1407 to 1437 in 2d1995a)
- `executeLoad`/`executeStore` completes all operations.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 906 to 979 in 2d1995a)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1003 to 1067 in 2d1995a)
New design:
- `issueToLoadPipe`/`issueToStorePipe` send the instructions to the `s0` of the load/store pipeline.
  (GEM5/src/cpu/o3/inst_queue.cc, lines 711 to 719 in cf13024)
- `loadPipeSx`/`storePipeSx` are the corresponding pipeline TimeBuffers; `loadPipeSx[0]` means the s0 stage of the load pipeline.
  (GEM5/src/cpu/o3/lsq_unit.hh, lines 658 to 682 in cf13024)
  (GEM5/src/cpu/o3/iew.cc, lines 1473 to 1503 in cf13024)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1053 to 1081 in cf13024)
- The pipeline advances each cycle (`s1`, `s2`, etc.). The corresponding instructions are fetched from the load/store pipeline at each stage and the operations are executed.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1314 to 1367 in cf13024)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1493 to 1559 in cf13024)
2. Nuke Replay
Related commit:
Original design: If the load executes first, then since the store in the same cycle hasn't finished (and hasn't updated the SQ), the load can't detect the dependency and thus can't forward data from the store. When the store in the same cycle then executes, a RAW violation is detected, causing the pipeline to be flushed.
RTL behavior: This situation does not occur in the RTL, because a store instruction in pipeline stage s1 (the cycle before the SQ is updated) checks whether there is a load instruction on the load pipeline with a matching address that has not forwarded data from it, and causes that load instruction to be replayed.
New design: When the load reaches `s1` or `s2`, it checks whether there are stores still in `s1` that haven't executed and match the address. If so, the load will be replayed. This is called the pipeline nuke replay. After the store in `s1` completes, the address and data are updated to the SQ, allowing the replayed load to correctly forward data without causing a RAW violation.
Code Details
RTL design: Stores in `s1` query load pipeline `s1`/`s2`, causing the matched load to be replayed.
https://github.com/OpenXiangShan/XiangShan/blob/0051450372ae5a03ce9d36afdbdd34b9a19f4785/src/main/scala/xiangshan/mem/pipeline/LoadUnit.scala#L971-L995
https://github.com/OpenXiangShan/XiangShan/blob/0051450372ae5a03ce9d36afdbdd34b9a19f4785/src/main/scala/xiangshan/mem/pipeline/LoadUnit.scala#L1228-L1248
https://github.com/OpenXiangShan/XiangShan/blob/0051450372ae5a03ce9d36afdbdd34b9a19f4785/src/main/scala/xiangshan/mem/pipeline/LoadUnit.scala#L1357-L1381
Gem5 new design:
- `pipeLineNukeCheck` is where a load in `s1` checks for a pipeline nuke.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1159 to 1179 in cf13024)
- `pipeLineNukeCheck` for a load in `s2` checks for a pipeline nuke.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1250 to 1258 in cf13024)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 743 to 762 in cf13024)
- If a load in `s1` or `s2` triggers a pipeline nuke, a fast load replay is performed in `s2`.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1273 to 1282 in cf13024)
- The replayed load is flagged as `Nuke`, and the RAW violation check will skip this load.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 955 to 961 in cf13024)
3. Miss Load Replay
Commits related to this modification:
Original design: A miss load directly writes back after receiving the `TimingResp`, with no limit on the number and no re-entering-pipeline behavior.
New design: A miss load receives a custom `Hint` signal two cycles before receiving the `TimingResp`, which wakes up the relevant instructions and reissues them into the pipeline. The relevant data becomes stable in the `bus` after the `TimingResp` reaches the LSU, and the reissued load will forward data from the `bus`. If data cannot be forwarded, it will access the cache again. After the cache has completely refilled the data into itself, a `Bus_Clear` request is sent to the LSU to clear the corresponding data in the `bus`.
Code details
Original design: `completeDataAccess` performs the write-back after receiving the `TimingResp`.
(GEM5/src/cpu/o3/lsq.cc, lines 1349 to 1362 in 2d1995a)
New design:
- Missed loads enter the cache-miss load replayQueue.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1259 to 1271 in cf13024)
- The `Hint` signal is returned a few cycles ahead of the `TimingResp`.
  (GEM5/src/mem/coherent_xbar.cc, lines 518 to 522 in cf13024)
- When the `Hint` reaches L1, it queries the MSHR and sends a wake-up signal to the relevant load requests in the cache block. The loads that receive the wake-up signal are awakened from the replayQueue.
  (GEM5/src/mem/cache/cache.cc, lines 963 to 999 in cf13024)
  (GEM5/src/cpu/o3/lsq.cc, lines 1524 to 1537 in cf13024)
- The `TimingResp` arrives a few cycles after the `Hint` and stabilizes the data on the bus.
  (GEM5/src/cpu/o3/lsq.cc, lines 1434 to 1475 in cf13024)
- The reissued load forwards data from the bus (`forwardFrmBus(load_inst, request)`), and accesses the cache only if no data is found on the bus.
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 2995 to 3023 in cf13024)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 2995 to 3002 in cf13024)
  (GEM5/src/cpu/o3/lsq_unit.cc, lines 1221 to 1244 in cf13024)
- After the refill completes, a `Bus_Clear` is sent to the LSU to clear the relevant data from the bus.
  (GEM5/src/mem/cache/base.cc, lines 2041 to 2052 in cf13024)
  (GEM5/src/cpu/o3/lsq.cc, lines 586 to 602 in cf13024)
4. Misc
During the alignment process, some functional correctness issues and minor misalignments were also discovered; they are all documented in this section.
4.1 fence opType
Commits related to this modification:
Original design: The fence instruction is dispatched to mem's dispatchQueue, but its opType is `No_OpClass`, which causes it to wait for the integer issue queue IQ2 (IntMisc) to have free entries before it can continue execution. If the instructions following the fence occupy intIQ2, the fence cannot be executed and the CPU deadlocks.
New design: Change the opType of the fence instruction to `MemReadOp` to prevent this situation (in fact, the fence will not be dispatched to an IQ).
4.2 LRSC
Commits related to this modification:
The LR instruction can be executed speculatively, which causes RAW violations in some corner cases.
To reduce complexity and ensure consistency with the RTL design, avoid speculative execution for LR and strictly maintain ordering.
4.3 store writeback
Commits related to this modification:
Previously, the store was written back in s2, but in the RTL design it is written back in s4, so this change aligns the timing of that operation. Write-back timing has a significant impact on high-IPC programs like hmmer (~13%), with earlier write-back being more beneficial for performance.