Skip to content

Latest commit

 

History

History
343 lines (305 loc) · 19.9 KB

reri_intro.adoc

File metadata and controls

343 lines (305 loc) · 19.9 KB

Introduction

The RAS Error Record Register Interface (RERI) specification augments Reliability, Availability, and Serviceability (RAS) features in the SoC with a standard mechanism for reporting errors by means of a memory-mapped register interface to enable error reporting, provide the facility to log the detected errors (including their severity, nature, and location), and configuring means to signal the error to a RAS handler component. The RAS handler may use this information to determine suitable recovery actions that may include terminating the computation (e.g., terminating a process), restarting parts or all of the system, etc. to recover from the errors. Additionally, this specification shall support software-initiated error logging, reporting, and testing of RAS handlers. Lastly, this specification shall provide maximal flexibility to implement error handling and coexists with RAS frameworks defined by other standards such as PCIe cite:[PCI] and CXL cite:[CXL].

A system is an entity that interacts with other entities such as other systems, software, operators, etc. to deliver one or more services in its role as a service provider. A system may itself be a consumer of one or more services provided by one or more other systems. A system thus is a collection of interacting components that implement one or more functions to provide a service.

A service is the behavior as perceived by the consumers of the service. A system may implement the service as one or more functions in the system. The functions used to compose the service may be implemented by one or more components in the system.

A service is described as a set of states that can be observed by the consumer of the service. The set of states observed by the consumer of the service may be further dependent on a set of internal states of the functions that implement the service.

A service is said to be correct if the set of states observed by the consumer of the service match the specification of that service. The specifications of a service may include its functional behavior, performance goals, security objectives, and RAS requirements.

Reliability of a system as a function of time is the probability it continues to provide correct service and may be characterized by metrics such as mean time between failures (MTBF). The services provided by a reliable system fail on faults instead of silently producing incorrect results. Reliable systems incorporate methods to detect occurrence of errors and to signal the errors to the consumers of the service.

Availability of a system as a function of time is the probability that the system provides the expected service and is a measure of tolerance to errors. Systems may increase their availability by minimizing the impact of the errors in one part of the system to the rest of the system. This may be achieved by means such as error correction, redundancy, state checkpoints and rollbacks, error prediction, and error containment.

Serviceability is a measure of time to restore the service to correct operation with minimal disruption to the consumers of the service. These may be achieved by means such as identifying and reporting failures and supporting mechanisms to repair and bring the system back online.

Faults and Errors

A fault is an incorrect state resulting from failures of components or due to interference from the environment in which the system operates. A fault is permanent if it reflects an irreversible change to the observable system state and is transient otherwise. A permanent fault may occur due to a physical defect or due to a flaw in the design of the functions implementing the service itself. A transient fault may occur due to temporary environmental conditions (cosmic rays, voltage glitches, etc.) or due to instability (e.g. marginal hardware).

Some faults that occur in a component may be dormant and only affect the internal state of the component. Such dormant faults however may turn into active faults when that internal state is used by the computation process in that component and produce an error. An error is detected when its presence is indicated by an error message or signal.

Software faults may similarly cause errors that cause the service provided by the system to deviate from its specification. Well known software engineering and reliability techniques may be employed to prevent, detect and recover from software errors. Software errors are not in the scope of this specification. Software should not have the ability to induce hardware errors.

A service failure occurs when the service deviates from its specification due to errors.

A reliable system deals with errors through one or more of the following techniques cite:[RAS_TAX] cite:[EXA_RAS]:

  • Fault prevention

  • Error detection and correction

  • Error prediction

Fault Prevention

Fault prevention involves use of techniques that reduce or prevent errors that may occur after the product has been shipped. These may be accomplished through the use of high quality in product design, technology selection, materials selection, and manufacturing time screening for defects. Through the use of systematic design, technology selection, and manufacturing tests many errors such as those induced by electric fields, temperature stress, switching/coupling noise (e.g. DRAM RowHammer cite:[RHAM] effect), incorrect V/F operating points, insufficient guard bands, meta-stability, etc. can be prevented.

Faults that are not prevented may manifest as errors during operation of the system. Errors that are not detected may still lead to a service failure. For example, an undetected error in an adder used to produce the address of a load may produce a bad address which causes the load to incur an exception and lead to a service failure. Some undetected errors however may not manifest as exceptions and cause a service failure due to silent data corruption. For example, a circuit performing encryption of a database may silently cause an error in the ciphertext produced leading to the entire database being left in a state where it cannot be decrypted. Such undetected errors that do not lead to a service failure are called Silent Data Errors (SDE). The impact of SDE is generally much higher than errors that lead to a service failure. A resilient system attempts to minimize the probability of SDE to the largest extent possible by implementing error detection capabilities.

Error Detection and Correction

Error detection involves the use of coding and protocols to detect errors cite:[EU_HPC] cite:[EXA_2014]. For example, caches with error correcting codes, TLB entries with parity protection, buses with parity protection on transaction fields, circuitry to detect unexpected and/or illegal encodings, gray codes, voltage sensors, clock/PLL monitors, timing margin sensors, etc. Some components such as memory controllers may actively attempt to detect errors using techniques such as periodic background scrubbing or on-demand scrubbing.

Error correction involves the use of techniques to correct the detected errors. Error correction may be performed by employing error correcting codes and protocols. For example, a processor cache may employ error correcting codes (ECC) to detect and correct errors. Some components may recover from errors by using protocols that involve a retry. For example, a TLB that detects an error may invalidate the entry and attempt to refill it from the page tables, a receiver on a bus that detects an error may request the transmitter to retransmit the transaction, etc. Error correction is thus complete when the error is either corrected or it does not recur on retry. Such errors that were corrected by the hardware are called Corrected Errors (CE).

Errors that could not be corrected are called uncorrected errors. A component that detects an uncorrected error may allow possibly corrupted data to propagate to the requester of the data but associate an indicator (e.g., poison) with the data. Such errors are said to be Uncorrected Errors Deferred (UED) as they allow the component to continue operation and defer dealing with the error to a later point in time if the data corrupted by the error is consumed. Deferring errors allows deferring the error handling to an ultimate consumer of the corrupted data that may be able to provide more precise information to a RAS handler about the contexts affected by the corruption and thus enable more precise error recovery actions by the RAS handler. The component that detected and deferred the error may notify a RAS handler by reporting the UED but such a UED does not need an immediate remedial action to be performed by the RAS handler. For example, a memory controller may detect an uncorrectable ECC error on data in memory but since there is no immediate consumer of the data the memory controller may just mark the data as poisoned and defer the error handling to a component that requests the data. If the poisoned data is never consumed then deferred errors are benign. If the poisoned data is completely overwritten with new data then the associated poison is cleared. If the poisoned data is only partially written then the data continues to be marked as poisoned.

A component that detects an uncorrected error may be unable to defer the handling of the error by techniques such as poisoning. Such errors are said to be Uncorrected Errors Critical (UEC) and a RAS handler is invoked as immediate remedial actions are required. For example, a cache controller may detect an uncorrectable ECC error on the memory used to hold cache tags and since such errors cannot be attributed to any particular data element these errors may be classified as UEC. If poisoned data is attempted to be consumed by a component (e.g. a hart, an IOMMU, a device, etc.) then an UEC occurs as immediate remedial actions are required and further deferral of the error is not possible.

A component that signals a request for execution of an RAS handler for an UEC may indicate that the error has not propagated beyond the boundaries of the component that detected the error and thus may be containable through recovery actions (e.g., terminating the computation, etc.) carried out by the RAS handler.

Some components act as an intermediary through which the data passes through. For example, a PCIe/CXL port is an intermediary component that by itself does not consume the data it receives from memory but forwards the data to the endpoint. In such cases the component may receive the data with a deferred error. Such a component may propagate the error and not log an error by itself. However, if the component to which the data is being propagated (e.g. a PCIe endpoint) is not capable of handling poison then the former component must signal a UEC instead of propagating the corrupted data, as the act of propagation breaks containment of the error.

An error detected by a component may lead to a failure mode where the component may not be able to service requests anymore (e.g. colloquially called jammed, wedged, etc.). For example, an error in the hart pipeline may cause the hart to stop committing instructions, a fabric may be in a state where it cannot process any further requests, the link connecting the memory module to the host may have failed, etc. In such cases invoking a RAS handler may not be useful as the RAS handler itself may need to generate requests to the failed component to perform the recovery actions. Components in such failed states may use an implementation-defined signal to a system recovery controller (e.g., a Baseboard Management Controller (BMC), an on-chip service controller, etc.) to initiate a RAS-handling reset to restart the component, sub-system, or the system itself to restore correct service operations.

Error Prediction

Error prediction involves the use of corrected errors as a predictor of future uncorrectable permanent failures or other systemic issues, such as marginality due to aging. Monitoring corrected errors may facilitate the avoidance of future service failures.

Studies indicate that the probability of an uncorrected DRAM error is elevated if the DIMM previously experienced corrected errors cite:[DRAM_WILD] cite:[SRI_2012] cite:[ZIV_2019]. Such reasoning is used by system protection mechanisms, which utilize simple heuristics for offlining potentially failing memory pages cite:[HWA_2012] cite:[MEZ_2015] cite:[TANG_2006] cite:[DU_2021] or for replacing compromised DIMMs cite:[MAR_2014] cite:[DRAM_WILD] cite:[DU_2020].

Reporting of detected and corrected hardware errors is requisite for any quantitative analysis of system resilience and for the prediction of future uncorrected errors cite:[EU_HPC]. This prediction capability facilitates the deployment of preventive mechanisms, such as pre-failure alerts in High-Performance Computing (HPC) cluster management software, thus mitigating the costs associated with unscheduled outages and system repairs.

Components of a resilient system may also include corrected error counters to count the corrections performed. Such components may additionally include a fixed or programmable threshold to notify a RAS handler when the number of corrected errors surpasses the threshold.

RERI Features

Version 1.0 of the RISC-V RERI specification supports the following features:

  • Error severity classes and standard error codes.

  • Standard register format and addressing for memory-mapped error-record registers and error-record banks.

  • Rules for prioritized overwriting of valid error records with new error records.

  • Corrected error counting.

  • Error record injection for RAS handler testing.

This specification is intended to accommodate a wide variety of systems designs and needs - from high-end server-class systems to low-end embedded systems. This is accomplished through providing implementation flexibility and options - both within the registers of an error record and the number of error records in an error bank, and with respect to the association between hardware components and error errors/banks.

Glossary

Table 1. Terms and definitions
Term Definition

AER

Advanced Error Reporting. A PCIe capability to support advanced error control and reporting.

BMC

Baseboard Management Controller.

CE

Corrected Error.

Custom

A register or data structure field designated for custom use. Software that is not aware of the custom use must ignore custom fields and preserve value held in these fields when writing values to other fields in the same register.

CXL

Compute Express Link bus standard.

Data

In this specification data refers broadly to all forms of information being stored or transferred in a computing system. In the case of a CPU, for example, this encompasses information that may be treated as instructions that are fetched and executed, as well as data that is loaded and stored.

DIMM

Dual-In-line Memory Module. A packaging arrangement of memory devices on a socketable substrate.

DRAM

Dynamic random-access memory. Devices made using Dynamic RAM circuit configurations that have data storage that must be refreshed periodically.

ECC

Error Correcting Code.

Error Reporting

Error reporting is the process of logging information (including their severity, nature, and location) about a detected error in an error record and signaling, if required, the occurrence of the error to an appropriate RAS handler.

FSM

Finite-State Machine. An abstract machine that can be in exactly one of a finite number of states at any time.

GPA

Guest Physical Address. See Priv. specification.

HPC

High-performance Computing. High-Performance Computing (HPC) refers to the use of parallel processing techniques to solve complex computational problems. It enables faster data processing and simulation by leveraging multiple processors or servers.

ID

Identifier.

IOMMU

Input-Output Memory Management Unit. A system-level Memory Management Unit (MMU) that connects direct-memory-access capable Input/Output (I/O) devices to system memory.

NMI

Non-Maskable interrupt. See Priv. specification.

OS

Operating System.

PLL

Phase-Locked Loop. A control system that generates an output signal whose phase is related to the phase of an input signal. PLLs are commonly used to perform clock synthesis.

PCIe

Peripheral Component Interconnect Express bus standard.

RAS

Reliability, Availability, and Serviceability.

RERI

RAS error record register interface.

Reserved

A register or data structure field reserved for future use. Reserved fields in data structures must be set to 0 by software. Software must ignore reserved fields in registers and preserve the value held in these fields when writing values to other fields in the same register.

RO

Read-Only - Register bits are read-only and cannot be altered by software. Where explicitly defined, these bits are used to reflect changing hardware state, and as a result bit values can be observed to change at run time.
If the optional feature that would Set the bits is not implemented, the bits must be hardwired to Zero

RW

Read-Write - Register bits are read-write and are permitted to be either Set or Cleared by software to the desired state.
If the optional feature that is associated with the bits is not implemented, the bits are permitted to be hardwired to Zero.

RW1C

Write-1-to-Clear status - Register bits indicate status when read. A Set bit indicates a status event which is Cleared by writing a 1b. Writing a 0b to RW1C bits has no effect.
If the optional feature that would Set the bit is not implemented, the bit must be read-only and hardwired to Zero

RW1S

Read-Write-1-to-Set - register bits indicate status when read. The bit may be Set by writing 1b. Writing a 0b to RW1S bits has no effect.
If the optional feature that introduces the bit is not implemented, the bit must be read-only and hardwired to Zero

SDE

Silent Data Error.

SOC

System On a Chip, also referred as System-On-a-Chip and System-On-Chip.

SPA

Supervisor Physical Address. See Priv. specification.

TLB

Translation Lookaside Buffer.

VA

Virtual Address. See Priv. specification.

UED

Uncorrected Error Deferred.

UEC

Uncorrected Error Critical.

WARL

Write Any values, Reads Legal values: Attribute of a register field that is only defined for a subset of bit encodings, but allow any value to be written while guaranteeing to return a legal value whenever read.

WPRI

Writes Preserve values, Reads Ignore values: Attribute of a register field that is reserved for future standard use.