update readme and function doc

Cyrilvallez committed May 16, 2022
1 parent f1e00b5 commit 602f445
Showing 2 changed files with 23 additions and 7 deletions.
README.md (26 changes: 21 additions & 5 deletions)
@@ -29,9 +29,9 @@ The specific goal here is more to detect crude near duplicate image manipulation

# Usage

- This library was created to benchmark all these different methods. The easiest way for this is to choose a dataset, randomly split it in 2 parts (experimental and control groups), and sample a given number of images in both groups on which you can perform artificial attacks defined is `generator/generate_attacks.py`. The scripts `create_groups.py` and `create_attacks.py` perform those tasks, and save the images with correct name format for later matching.
+ This library was created to benchmark all these different methods. The easiest way for this is to choose a dataset, randomly split it in 2 parts (experimental and control groups), and sample a given number of images in both groups on which you can perform artificial attacks defined in `generator/generate_attacks.py`. The scripts `create_groups.py` and `create_attacks.py` perform those tasks, and save the images with correct name format for later matching.

- Then given a database (all experimental group) of images to check for manipulations, an experimental group of images that are manipulations of some images in the database (all attacks on the images sampled from experimental group) and a control group containing images not present in the database (all attacks on the images sampled from control group), it can be declared as :
+ Then given a database of images, an experimental group of images that are manipulations of some images in the database (all attacks on the images sampled from experimental group) and a control group containing images not present in the database (all attacks on the images sampled from control group), datasets can be declared in the following way :

```
import hashing
# [diff collapsed here: @@ -45,6 +45,16 @@]
positive_dataset = hashing.create_dataset(path_experimental, existing_attacks=True)
negative_dataset = hashing.create_dataset(path_control, existing_attacks=True)
```

Additionally, if one wants to perform attacks at experiment time without saving them to disk (the experiment will take more time, but this saves storage space), it can be done as

```
path_dataset = 'Datasets/...'
dataset = hashing.create_dataset(path_dataset, fraction=0.3, existing_attacks=False)
```

where `fraction` is the fraction of the dataset on which attacks will be performed (pass 1 to attack every image in the dataset).
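
For example, to perform attacks on every image at experiment time (reusing the call shown just above):

```
# fraction=1 performs attacks on every image in the dataset at experiment time
dataset = hashing.create_dataset(path_dataset, fraction=1, existing_attacks=False)
```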

Then declare the methods and algorithms you wish to use, along with thresholds for the matching logic, e.g.:

```
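# The concrete declarations are collapsed in this diff view. The lines below
# are only a hypothetical sketch of the pattern described in the text, not
# the repository's exact API (see main.py in the repository for the real code).
algorithms = [...]   # the hashing methods/algorithms to benchmark
thresholds = [...]   # matching thresholds for each algorithm
```
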
@@ -75,12 +85,18 @@ All this is contained in `main.py`.

The final digest is composed of 6 files : `general.json` with general metrics for the whole experiment, `attacks.json` containing the metrics for each type of attack, `images_pos.json` and `images_neg.json` containing the number of correct and incorrect detections for each image in the database respectively, and `match_time.json` and `db_time.json` respectively containing the time (in seconds) for the matching phase and the database creation phase.
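
As an aside, a minimal sketch for inspecting the digest once it has been produced (the `Results/` folder name is an assumption; adjust it to wherever the files were saved):

```
import json

# Load the six digest files listed above (folder name is hypothetical)
digest = {}
for name in ('general', 'attacks', 'images_pos', 'images_neg',
             'match_time', 'db_time'):
    with open(f'Results/{name}.json') as file:
        digest[name] = json.load(file)

print(digest['general'])
```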

- To process and create figures from the digest, one can look into `process.py`.
+ # Figure generation
+
+ To process and create figures from the digest, one can look into `process.py`. Figure generation is contained in `helpers/create_plot.py`. Note that by default this will require a LaTeX installation on the machine running the process. This can be disabled in `helpers/configs_plot.py`.
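
For reference, disabling LaTeX text rendering typically amounts to a single matplotlib setting; a minimal sketch of that mechanism (the actual switch lives in `helpers/configs_plot.py` and may be named differently):

```
import matplotlib.pyplot as plt

# Render figure text with matplotlib's own engine instead of LaTeX
plt.rcParams['text.usetex'] = False
```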

# Datasets

- We personally used 3 datasets that can be found at <INSERT LINK !!!> and for which the splitting has already been done. They are the BSDS500 dataset, ImageNet validation set and a dataset from Kaggle containing memes from reddit.
+ We personally used 3 datasets that can be found online, and for which we performed the splitting. They are the [BSDS500 dataset](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html), [ImageNet validation set (ILSVRC2012)](https://www.image-net.org/) and the [Kaggle memes dataset](https://www.kaggle.com/datasets/gmorinan/most-viewed-memes-templates-of-2018).

# Pre-trained SimCLR models

- The pre-trained SimCLR models are not available in this repository due to their large size. They can be downloaded (along with their architecture definition) directly in the [github of the authors ](https://github.com/google-research/simclr) or downloaded directly at <INSERT LINK !!!> (folders SimCLRv1 and SimCLRv2) and then added to the `hashing` folder of the library.
+ The pre-trained SimCLR models are not available in this repository due to their large size. They can be downloaded (along with their architecture definition) directly from the [GitHub of the authors](https://github.com/google-research/simclr). Note that we used the PyTorch version of these models. They should then be added to `hashing/SimCLRv1/` and `hashing/SimCLRv2/` (folders SimCLRv1 and SimCLRv2). If unsure, see how these models are loaded directly in `hashing/neuralhash.py`.

# Computational setup

For neural methods, use of a GPU is almost essential for computational efficiency. Other classes of methods do not rely on it, and their computations are performed exclusively on CPU.
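
Since the neural methods run on PyTorch (see the SimCLR section above), a standard device check is enough to pick the GPU when one is available; a minimal sketch:

```
import torch

# Use the GPU for neural methods when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```
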
hashing/general_hash.py (4 changes: 2 additions & 2 deletions)
@@ -443,7 +443,7 @@ def hashing(algorithms, thresholds, databases, dataset, general_batch_size=512,
default is 512.
artificial_attacks : Bool, optional
Whether the attacks are artificial and we need to record the attack-wise
- digest or not. The default is True.
+ digest or not. Set to False for the memes dataset. The default is True.
Returns
-------
@@ -625,7 +625,7 @@ def total_hashing(algorithms, thresholds, path_to_db, positive_dataset,
default is 512.
artificial_attacks : Bool, optional
Whether the attacks are artificial and we need to record the attack-wise
- digest or not. The default is True.
+ digest or not. Set to False for the memes dataset. The default is True.
Returns
-------
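
For reference, a hedged sketch of a call using this flag; only the parameter names visible in this diff are used, and the remaining arguments of `total_hashing` are truncated here, so this is an assumption rather than the exact signature:

```
# Hypothetical call: for the memes dataset the attacks are real, not
# artificial, so the attack-wise digest is disabled. Remaining arguments
# are truncated in this diff and omitted here.
digest = hashing.total_hashing(algorithms, thresholds, path_to_db,
                               positive_dataset, artificial_attacks=False)
```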
