Commit
Merge pull request #248 from serco425/iss2-adding-BNN-info
iss2-adding-BNN-info
profvjreddi authored Jun 4, 2024
2 parents 0185e65 + 3481e96 commit afbd50a
Showing 2 changed files with 30 additions and 2 deletions.
24 changes: 24 additions & 0 deletions contents/robust_ai/robust_ai.bib
@@ -1,6 +1,30 @@
%comment{This file was created with betterbib v5.0.11.}
@article{Aygun2021BSBNN,
author = {Aygun, Sercan and Gunes, Ece Olcay and De Vleeschouwer, Christophe},
title = {Efficient and robust bitstream processing in binarised neural networks},
journal = {Electron. Lett.},
volume = {57},
number = {5},
pages = {219--222},
keywords = {Logic circuits, Neural nets (circuit implementations), Logic elements, Neural net devices},
doi = {10.1049/ell2.12045},
year = {2021},
source = {Crossref},
url = {https://doi.org/10.1049/ell2.12045},
publisher = {Institution of Engineering and Technology (IET)},
issn = {0013-5194, 1350-911X},
month = jan,
}

@article{courbariaux2016binarized,
author = {Courbariaux, Matthieu and Hubara, Itay and Soudry, Daniel and El-Yaniv, Ran and Bengio, Yoshua},
  title = {Binarized neural networks: {Training} deep neural networks with weights and activations constrained to +1 or -1},
journal = {arXiv preprint arXiv:1602.02830},
year = {2016},
}

@book{reddi2013resilient,
author = {Reddi, Vijay Janapa and Gupta, Meeta Sharma},
title = {Resilient Architecture Design for Voltage Variation},
8 changes: 6 additions & 2 deletions contents/robust_ai/robust_ai.qmd
@@ -153,10 +153,14 @@ In ML systems, transient faults can have significant implications during the tra

For example, a bit flip in the weight matrix of a neural network can cause the model to learn incorrect patterns or associations, leading to degraded performance [@wan2021analyzing]. Transient faults in the data pipeline, such as corruption of training samples or labels, can also introduce noise and affect the quality of the learned model.
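To make the weight-level effect concrete, the minimal NumPy sketch below (an illustrative example with made-up values, not code from the cited studies) flips a single bit in a float32 weight; whether the perturbation is negligible or catastrophic depends on which bit is hit.

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit (0 = mantissa LSB, 31 = sign bit) of a float32 value."""
    arr = np.array([value], dtype=np.float32)
    arr.view(np.uint32)[0] ^= np.uint32(1 << bit)   # XOR the chosen bit in place
    return float(arr[0])

w = 0.05                        # a typical small network weight (illustrative)
print(flip_bit(w, 3))           # low mantissa bit: perturbation is negligible
print(flip_bit(w, 30))          # high exponent bit: the weight blows up
```

A flip in a low-order mantissa bit barely changes the weight, whereas a flip in a high exponent bit can push it to an astronomically large value, which is why the same physical fault can be either silent or catastrophic.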

During the inference phase, transient faults can impact the reliability and trustworthiness of ML predictions. If a transient fault occurs in the memory storing the trained model parameters or in the computation of the inference results, it can lead to incorrect or inconsistent predictions. For instance, a bit flip in the activation values of a neural network can alter the final classification or regression output [@mahmoud2020pytorchfi].
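Tools such as PyTorchFI automate this kind of fault injection; the hypothetical NumPy sketch below (random weights and activations, purely for illustration) shows how corrupting a single activation can change the predicted class of a toy classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
final_layer = rng.normal(size=(10, 16))   # hypothetical 10-class output layer
activations = rng.normal(size=16)         # activations feeding that layer

clean_pred = int(np.argmax(final_layer @ activations))

# Simulate a transient fault that corrupts one activation value.
faulty = activations.copy()
faulty[7] = 1e4                           # e.g., a high exponent bit flipped
faulty_pred = int(np.argmax(final_layer @ faulty))

print(clean_pred, faulty_pred)            # the two predictions can disagree
```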

In safety-critical applications, such as autonomous vehicles or medical diagnosis, transient faults during inference can have severe consequences, leading to incorrect decisions or actions [@li2017understanding;@jha2019ml]. Ensuring the resilience of ML systems against transient faults is crucial to maintaining the integrity and reliability of the predictions.

At the other extreme, in resource-constrained environments like TinyML, Binarized Neural Networks (BNNs) [@courbariaux2016binarized] have emerged as a promising solution. BNNs represent network weights in single-bit precision, offering computational efficiency and faster inference times. However, this binary representation makes BNNs especially fragile to bit-flip errors in the network weights: flipping a single bit inverts the sign of a weight entirely. For instance, prior work [@Aygun2021BSBNN] has shown that a two-hidden-layer BNN for a task as simple as MNIST classification degrades from 98% to 70% test accuracy when random bit-flip soft errors are injected into the model weights with 10% probability.
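The toy simulation below (a sketch with random ±1 weights and inputs, not a reproduction of the cited MNIST experiment) illustrates the mechanism: flipping each binarized weight with 10% probability changes the sign of a noticeable fraction of neuron outputs in a single layer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy binarized layer: weights and inputs constrained to {-1, +1}.
weights = rng.choice([-1, 1], size=(256, 784))
x = rng.choice([-1, 1], size=784)

clean_out = np.sign(weights @ x)

# Bit-flip soft errors: each single-bit weight flips sign with 10% probability.
flip_mask = rng.random(weights.shape) < 0.10
faulty_out = np.sign(np.where(flip_mask, -weights, weights) @ x)

changed = np.mean(clean_out != faulty_out)
print(f"Fraction of neuron outputs whose sign changed: {changed:.2f}")
```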

Addressing such issues requires flip-aware training techniques or emerging computing paradigms (e.g., [stochastic computing](https://en.wikipedia.org/wiki/Stochastic_computing)) that enhance fault tolerance and robustness, which we discuss in @sec-hw-intermittent-detect-mitigate. Future research aims to develop hybrid architectures, novel activation functions, and loss functions tailored to BNNs that bridge the accuracy gap with full-precision models while preserving their computational efficiency.

### Permanent Faults

Permanent faults are hardware defects that persist and cause irreversible damage to the affected components. These faults are characterized by their persistent nature and require repair or replacement of the faulty hardware to restore normal system functionality.
@@ -243,7 +247,7 @@ Mitigating the impact of intermittent faults in ML systems requires a multifacet

Designing ML systems resilient to intermittent faults is crucial to ensuring their reliability and robustness. This involves incorporating fault-tolerant techniques, runtime monitoring, and adaptive mechanisms into the system architecture. By proactively addressing the challenges of intermittent faults, ML systems can maintain their accuracy, consistency, and trustworthiness even in the presence of sporadic hardware failures. Regular testing, monitoring, and maintenance of ML systems can help identify and mitigate intermittent faults before they cause significant disruptions or performance degradation.

### Detection and Mitigation
### Detection and Mitigation {#sec-hw-intermittent-detect-mitigate}

This section explores various fault detection techniques, including hardware-level and software-level approaches, and discusses effective mitigation strategies to enhance the resilience of ML systems. Additionally, we will look into resilient ML system design considerations, present case studies and examples, and highlight future research directions in fault-tolerant ML systems.

