DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
Automatic crash bucketing is a critical step in the software development process: it allows bug reports to be analyzed and triaged efficiently. In this work, we aim to detect, for a given crash report, its candidate near-duplicates (i.e., similar crashes that are likely induced by the same software bug) in a large database of historical crashes, given any similarity measure dedicated to comparing stack traces. To this end, we propose DeepLSH, a deep Siamese hash-coding neural network based on the Locality-Sensitive Hashing (LSH) property, which produces binary hash codes that place the most similar stack traces into the same hash buckets. DeepLSH has been evaluated on a large stack-trace dataset with state-of-the-art similarity measures proposed to tackle the crash deduplication problem:
- Jaccard coefficient [Ref]
- Cosine similarity [Ref]
- Lucene TF-IDF [Ref]
- Edit distance [Ref]
- Brodie et al. [Paper]
- PDM-Rebucket [Paper]
- DURFEX [Paper]
- Lerch and Mezini [Paper]
- Moroo et al. [Paper]
- TraceSIM [Paper]
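To illustrate what these measures compute, here is a minimal sketch (not the repository's code) of the Jaccard coefficient applied to stack traces: each trace is reduced to its set of frames (function names in this hypothetical example), and similarity is the fraction of shared frames.

```python
def jaccard(trace_a, trace_b):
    """Jaccard coefficient between two stack traces viewed as frame sets."""
    a, b = set(trace_a), set(trace_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical crashes: crash_1 and crash_2 share the same failure path
# (likely the same bug); crash_3 is an unrelated crash.
crash_1 = ["main", "parse_config", "read_file", "buffer_copy"]
crash_2 = ["main", "parse_config", "read_file", "buffer_copy", "memcpy"]
crash_3 = ["main", "render_ui", "draw_widget"]

print(jaccard(crash_1, crash_2))  # 0.8 -> near-duplicates
print(jaccard(crash_1, crash_3))  # ~0.17 -> dissimilar
```

The other measures listed above differ in how they weight frames (e.g., TF-IDF, positional weights), but they all reduce to a pairwise score over stack traces.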
Our contributions are three-fold.
- To overcome the problem of deriving LSH functions for arbitrary stack-trace similarity measures, we propose a generic approach, dubbed DeepLSH, that learns a family of binary hash functions closely approximating the locality-sensitive property, so that near-duplicate stack traces can be retrieved efficiently and rapidly.
- Technically, we design a deep Siamese neural network architecture to perform end-to-end hashing, with an original objective loss function that preserves the locality-sensitive property and appropriate regularizations to cope with the binarization problem of optimizing non-smooth loss functions.
- We demonstrate through our experimental study the effectiveness and scalability of DeepLSH in yielding near-duplicate crash reports under a dozen similarity metrics. We compare favorably against standard LSH techniques (MinHash and SimHash) and the most relevant deep hashing baseline on a large real-world dataset that we make available.
- Clone this repository
- Install the required python packages:
pip install -r ./code/requirements.txt
- Run the notebooks in code/notebooks/measure:
- measure-DeepLSH.ipynb: the end-to-end procedure to train/test and validate DeepLSH
- measure-Baseline.ipynb: the end-to-end procedure to train/test and validate the CNNH+LSH baseline
- measure-MinHash.ipynb (for Jaccard [BoW / N-grams]): the end-to-end procedure to run MinHash
- measure-SimHash.ipynb (for Cosine [BoW / N-grams / TF-IDF]): the end-to-end procedure to run SimHash
- measure-Runtime.ipynb: runtime comparison of DeepLSH vs. CNNH+LSH vs. a KNN-based approach
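The runtime advantage measured in the last notebook comes down to retrieval structure: a hash-bucket index answers a query with a single lookup, whereas a KNN-style approach must compare the query against every stored trace. A minimal sketch with hypothetical 4-bit codes and crash identifiers:

```python
# Pretend these binary codes were produced by a locality-sensitive hasher;
# all codes and crash names below are illustrative.
index = {
    (0, 1, 1, 0): ["crash_17", "crash_42"],  # near-duplicates share a code
    (1, 0, 0, 1): ["crash_3"],
}

def candidates(code):
    """O(1) bucket lookup: only traces in the query's bucket are candidates."""
    return index.get(code, [])

print(candidates((0, 1, 1, 0)))  # ['crash_17', 'crash_42']
print(candidates((1, 1, 1, 1)))  # [] -> no candidate near-duplicates
```

The expensive pairwise similarity measure then only needs to be evaluated on the handful of candidates in the bucket, instead of on the whole historical database.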