update readme and function doc

Cyrilvallez committed May 16, 2022
1 parent f1e00b5 commit 602f445
Showing 2 changed files with 23 additions and 7 deletions.
README.md (26 changes: 21 additions & 5 deletions)
@@ -29,9 +29,9 @@ The specific goal here is more to detect crude near duplicate image manipulation

# Usage

- This library was created to benchmark all these different methods. The easiest way for this is to choose a dataset, randomly split it in 2 parts (experimental and control groups), and sample a given number of images in both groups on which you can perform artificial attacks defined is `generator/generate_attacks.py`. The scripts `create_groups.py` and `create_attacks.py` perform those tasks, and save the images with correct name format for later matching.
+ This library was created to benchmark all these different methods. The easiest way for this is to choose a dataset, randomly split it in 2 parts (experimental and control groups), and sample a given number of images in both groups on which you can perform artificial attacks defined in `generator/generate_attacks.py`. The scripts `create_groups.py` and `create_attacks.py` perform those tasks, and save the images with correct name format for later matching.

- Then given a database (all experimental group) of images to check for manipulations, an experimental group of images that are manipulations of some images in the database (all attacks on the images sampled from experimental group) and a control group containing images not present in the database (all attacks on the images sampled from control group), it can be declared as :
+ Then given a database of images, an experimental group of images that are manipulations of some images in the database (all attacks on the images sampled from experimental group) and a control group containing images not present in the database (all attacks on the images sampled from control group), datasets can be declared in the following way :

```
import hashing
# [diff collapsed here: @@ -45,6 +45,16 @@]
positive_dataset = hashing.create_dataset(path_experimental, existing_attacks=True)
negative_dataset = hashing.create_dataset(path_control, existing_attacks=True)
```

Additionally, if one wants to perform attacks at experiment time without saving them to disk (the experiment will take more time, but this saves storage space), it can be done as

```
path_dataset = 'Datasets/...'
dataset = hashing.create_dataset(path_dataset, fraction=0.3, existing_attacks=False)
```

where `fraction` is the fraction of the dataset on which attacks will be performed (pass 1 to attack every image in the dataset).
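
For example, to perform attacks on every image at experiment time (reusing the call shown just above):

```
# fraction=1 performs attacks on every image in the dataset at experiment time
dataset = hashing.create_dataset(path_dataset, fraction=1, existing_attacks=False)
```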

Then declare the methods and algorithms you wish to use, along with thresholds for the matching logic, e.g.:

```
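# The concrete declarations are collapsed in this diff view. The lines below
# are only a hypothetical sketch of the pattern described in the text, not
# the repository's exact API (see main.py in the repository for the real code).
algorithms = [...]   # the hashing methods/algorithms to benchmark
thresholds = [...]   # matching thresholds for each algorithm
```
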
@@ -75,12 +85,18 @@ All this is contained in `main.py`.

The final digest is composed of 6 files : `general.json` with general metrics for the whole experiment, `attacks.json` containing the metrics for each type of attack, `images_pos.json` and `images_neg.json` containing the number of correct and incorrect detections for each image in the database respectively, and `match_time.json` and `db_time.json` respectively containing the time (in seconds) for the matching phase and the database creation phase.
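
As an aside, a minimal sketch for inspecting the digest once it has been produced (the `Results/` folder name is an assumption; adjust it to wherever the files were saved):

```
import json

# Load the six digest files listed above (folder name is hypothetical)
digest = {}
for name in ('general', 'attacks', 'images_pos', 'images_neg',
             'match_time', 'db_time'):
    with open(f'Results/{name}.json') as file:
        digest[name] = json.load(file)

print(digest['general'])
```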

- To process and create figures from the digest, one can look into `process.py`.
+ # Figure generation
+
+ To process and create figures from the digest, one can look into `process.py`. Figure generation is contained in `helpers/create_plot.py`. Note that by default this will require a LaTeX installation on the machine running the process. This can be disabled in `helpers/configs_plot.py`.
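
For reference, disabling LaTeX text rendering typically amounts to a single matplotlib setting; a minimal sketch of that mechanism (the actual switch lives in `helpers/configs_plot.py` and may be named differently):

```
import matplotlib.pyplot as plt

# Render figure text with matplotlib's own engine instead of LaTeX
plt.rcParams['text.usetex'] = False
```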

# Datasets

- We personally used 3 datasets that can be found at <INSERT LINK !!!> and for which the splitting has already been done. They are the BSDS500 dataset, ImageNet validation set and a dataset from Kaggle containing memes from reddit.
+ We personally used 3 datasets that can be found online, and for which we performed the splitting. They are the [BSDS500 dataset](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html), [ImageNet validation set (ILSVRC2012)](https://www.image-net.org/) and the [Kaggle memes dataset](https://www.kaggle.com/datasets/gmorinan/most-viewed-memes-templates-of-2018).

# Pre-trained SimCLR models

- The pre-trained SimCLR models are not available in this repository due to their large size. They can be downloaded (along with their architecture definition) directly in the [github of the authors ](https://github.com/google-research/simclr) or downloaded directly at <INSERT LINK !!!> (folders SimCLRv1 and SimCLRv2) and then added to the `hashing` folder of the library.
+ The pre-trained SimCLR models are not available in this repository due to their large size. They can be downloaded (along with their architecture definition) directly from the [GitHub of the authors](https://github.com/google-research/simclr). Note that we used the PyTorch version of these models. They should then be added to `hashing/SimCLRv1/` and `hashing/SimCLRv2/` (folders SimCLRv1 and SimCLRv2). If unsure, see how these models are loaded directly in `hashing/neuralhash.py`.

# Computational setup

For neural methods, use of a GPU is almost essential for computational efficiency. Other classes of methods do not rely on it, and their computations are performed exclusively on CPU.
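
Since the neural methods run on PyTorch (see the SimCLR section above), a standard device check is enough to pick the GPU when one is available; a minimal sketch:

```
import torch

# Use the GPU for neural methods when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```
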
hashing/general_hash.py (4 changes: 2 additions & 2 deletions)
@@ -443,7 +443,7 @@ def hashing(algorithms, thresholds, databases, dataset, general_batch_size=512,
default is 512.
artificial_attacks : Bool, optional
Whether the attacks are artificial and we need to record the attack-wise
- digest or not. The default is True.
+ digest or not. Set to False for the memes dataset. The default is True.
Returns
-------
@@ -625,7 +625,7 @@ def total_hashing(algorithms, thresholds, path_to_db, positive_dataset,
default is 512.
artificial_attacks : Bool, optional
Whether the attacks are artificial and we need to record the attack-wise
- digest or not. The default is True.
+ digest or not. Set to False for the memes dataset. The default is True.
Returns
-------
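
For reference, a hedged sketch of a call using this flag; only the parameter names visible in this diff are used, and the remaining arguments of `total_hashing` are truncated here, so this is an assumption rather than the exact signature:

```
# Hypothetical call: for the memes dataset the attacks are real, not
# artificial, so the attack-wise digest is disabled. Remaining arguments
# are truncated in this diff and omitted here.
digest = hashing.total_hashing(algorithms, thresholds, path_to_db,
                               positive_dataset, artificial_attacks=False)
```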
