diff --git a/README.md b/README.md
index b402fb4..359b99b 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 A library for image manipulation detection. This supports 3 classes of algorithms :
 
-- Perceptual hashing methods (fast and simple methods designed for image forensics). The following algorithms are implemented in `hashing/imagehash.py`:
+- Perceptual hashing methods (fast and simple methods designed for image forensics). The following algorithms are implemented in `hashing/imagehash.py` (taken and modified from [here](https://github.com/JohannesBuchner/imagehash)):
     - Average Hash
     - Perceptual hash
     - Difference hash
 
diff --git a/Results/Threshold_0_005_bsds500/Experiment.yml b/Results/Threshold_0_005_bsds500/Experiment.yml
new file mode 100644
index 0000000..63d0d5f
--- /dev/null
+++ b/Results/Threshold_0_005_bsds500/Experiment.yml
@@ -0,0 +1,53 @@
+---
+name : Benchmark_memes_new
+date generated: Wednesday 04/05/2022
+GPU(s): 1 a100
+CPUs: 16
+dataset used: BSDS500 with attacks on disk
+thresholds: thresholds = [
+    [0.052],
+    [0.224],
+    [0.159],
+    [0.072],
+    [0.069],
+    [67.7778],
+    [0.0906],
+    [0.0414],
+    [0.1606],
+    [0.2611],
+    [0.3683],
+    [0.2996],
+    [0.3168],
+    [0.5197],
+    [0.5133],
+    [0.5208],
+    ]
+algorithms: [
+    hashing.ClassicalAlgorithm('Ahash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Phash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Dhash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Whash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Crop resistant hash', hash_size=8, batch_size=512, cutoff=1),
+    hashing.FeatureAlgorithm('SIFT', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('ORB', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('FAST + DAISY', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('FAST + LATCH', batch_size=512, n_features=30, cutoff=1),
+    hashing.NeuralAlgorithm('Inception v3', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('EfficientNet B7', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('ResNet101 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v1 ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v2 ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v2 ResNet101 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    ]
+general batch size: 64
+---
+purpose: |
+Check that those thresholds correspond to 0.005 fpr on BSDS500.
\ No newline at end of file
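The experiment above checks that the listed per-algorithm thresholds yield a false positive rate of 0.005 on BSDS500 (there are 16 threshold values, apparently one per entry in the algorithms list). As a rough, hedged sketch of what such a check involves, and not the repository's actual benchmark code, the empirical fpr at a threshold is simply the fraction of distances between unrelated image pairs that fall below it; the array `distances_unrelated` and the toy data below are hypothetical:

```python
import numpy as np

def empirical_fpr(distances_unrelated: np.ndarray, threshold: float) -> float:
    # Fraction of unrelated-pair distances below the threshold, i.e. the rate
    # at which unrelated images would be wrongly flagged as matches.
    return float(np.mean(distances_unrelated < threshold))

def threshold_for_fpr(distances_unrelated: np.ndarray, target_fpr: float = 0.005) -> float:
    # The distance value below which a fraction `target_fpr` of the
    # unrelated-pair distances fall (the empirical quantile).
    return float(np.quantile(distances_unrelated, target_fpr))

# Toy usage with random numbers standing in for real inter-image distances.
rng = np.random.default_rng(0)
fake_distances = rng.random(100_000)
t = threshold_for_fpr(fake_distances, 0.005)
print(t, empirical_fpr(fake_distances, t))  # the second value should be close to 0.005
```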
diff --git a/Results/Threshold_0_005_memes/Experiment.yml b/Results/Threshold_0_005_memes/Experiment.yml
new file mode 100644
index 0000000..f243e7f
--- /dev/null
+++ b/Results/Threshold_0_005_memes/Experiment.yml
@@ -0,0 +1,53 @@
+---
+name : Benchmark_memes_new
+date generated: Wednesday 04/05/2022
+GPU(s): 1 a100
+CPUs: 16
+dataset used: Kaggle memes dataset split
+thresholds: thresholds = [
+    [0.052],
+    [0.224],
+    [0.159],
+    [0.072],
+    [0.069],
+    [67.7778],
+    [0.0906],
+    [0.0414],
+    [0.1606],
+    [0.2611],
+    [0.3683],
+    [0.2996],
+    [0.3168],
+    [0.5197],
+    [0.5133],
+    [0.5208],
+    ]
+algorithms: [
+    hashing.ClassicalAlgorithm('Ahash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Phash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Dhash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Whash', hash_size=8, batch_size=512),
+    hashing.ClassicalAlgorithm('Crop resistant hash', hash_size=8, batch_size=512, cutoff=1),
+    hashing.FeatureAlgorithm('SIFT', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('ORB', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('FAST + DAISY', batch_size=512, n_features=30, cutoff=1),
+    hashing.FeatureAlgorithm('FAST + LATCH', batch_size=512, n_features=30, cutoff=1),
+    hashing.NeuralAlgorithm('Inception v3', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('EfficientNet B7', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('ResNet101 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v1 ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v2 ResNet50 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    hashing.NeuralAlgorithm('SimCLR v2 ResNet101 2x', raw_features=True, batch_size=32,
+        device='cuda', distance='Jensen-Shannon'),
+    ]
+general batch size: 64
+---
+purpose: |
+Check the fpr obtained with the thresholds giving 0.005 on BSDS500.
\ No newline at end of file
diff --git a/Weekly_logs.md b/Weekly_logs.md
index 923fb9d..4a51d87 100644
--- a/Weekly_logs.md
+++ b/Weekly_logs.md
@@ -14,7 +14,9 @@ These are logs that I will update every week in order to keep track of my work (
 9. [Week 9 : 04/04](#week9)
 10. [Week 10 : 11/04](#week10)
 11. [Week 11 : 18/04](#week11)
-11. [Week 12 : 25/04](#week12)
+12. [Week 12 : 25/04](#week12)
+13. [Week 13 : 02/05](#week13)
+14. [Week 14 : 09/05](#week14)
 
 
 
@@ -263,4 +265,18 @@ Apart from that, I read about DINO which is a self-supervised training technique
 - Visual Instance Retrieval with Deep Convolutional Networks
 - Large-Scale Image Retrieval with Compressed Fisher Vectors
 
-and some other that I just quickly looked at.
\ No newline at end of file
+and some others that I just quickly looked at.
+
+## Week 13 : 02/05
+
+This week, Monday and Tuesday were spent making my slides and preparing for the student exchange presentation. Then, on Wednesday, I cleaned up most of the main Github repository, wrote the Readme, etc., in preparation for the paper submission. Finally, the end of the week was devoted to reading about how ElasticSearch and Faiss work for similarity search. After some time, it appeared that Faiss would be easier to use and integrate into Python code, and it is thus the preferred direction for now. There is no clear documentation, but there are some examples on their Github.
+
+## Week 14 : 09/05
+
+This week, the first trials with Faiss were performed. I first downloaded half of the [Flickr1M dataset](https://press.liacs.nl/mirflickr/), resulting in 500K distractor images to emulate large-scale image search, following what we discussed during the meeting the week before (we agreed that millions of images were not needed, and that we would go to the hundreds of thousands). Then I created a new Github repository for this new part of the project, to which I added code to extract the features on the different datasets and save them to file for fast future access. I initially had a lot of memory issues when dealing with such large arrays. At first, the problem seemed to come from the Pytorch dataloader and its worker subprocesses, but it finally turned out to be due to numpy's copying behavior when creating new arrays. After this trouble, I was able to really start testing Faiss on the data.
+
+Of course, I first tested brute-force matching, to get a baseline both for the search time and for the accuracy we could hope to obtain. As the accuracy metric, I decided to use recall@k (the proportion of target images present among the k nearest neighbors of the query images), which in my opinion makes sense and seems to be standard in this kind of context. I first tested recall@1, meaning we only look for the single nearest neighbor of each query.
+
+The first observation is that Faiss is incredibly fast. When looking for the single nearest neighbor, it takes only about 10-15s on GPU to get the results for ~40K queries against a database of about ~500K images. This is at least the case for the L2 and cosine distances (L1 is a bit slower, and Jensen-Shannon much slower --> about 800s). The second observation is that we get pretty good results from the neural descriptors (SimCLR v2 ResNet50 2x): the recall@1 for brute-force cosine similarity is 0.806 on the Kaggle memes dataset (but remember: this dataset is pretty dirty!) and 0.95 on the artificially attacked BSDS500 dataset! Of course, since this is brute force, we won't be able to improve on those recall numbers, only on the time needed to get the results (and even that may not hold -> sometimes PCA improves on the baseline brute-force results). However, checking more neighbors (recall@5, recall@10, ...) should improve the results. In the end, it seems that the last months' efforts were not for nothing.
+
+I also compared the brute-force baseline to clustering the database and searching only the nearest clusters. Of course this reduces the time needed for the search, but it does affect performance. However, when searching a relatively high number of clusters (50 to 100), we can get faster results without losing much performance, which is a good start. My next priority is to experiment with PCA for dimensionality reduction, since this may drastically improve the search time while not affecting the recall too much. Then, an analysis of the clusters would be nice, to get a sense of how the data is partitioned. Those are the next objectives, along with refining the current framework for easier benchmarking of techniques.
\ No newline at end of file
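As a concrete companion to the Week 14 entry above, the following is a minimal Faiss sketch of the experiments it describes: a brute-force (exact) search, the recall@k metric, an IVF index that clusters the database and probes only some clusters, and a PCA reduction before searching. It assumes float32 descriptors; the array names (`db_features`, `query_features`, `targets`), the sizes, and the parameter values (2048 dimensions, `nlist=100`, `nprobe=50`, PCA to 256 dimensions) are illustrative stand-ins, not the project's actual code.

```python
import numpy as np
import faiss  # requires the faiss-cpu or faiss-gpu package

# Hypothetical stand-ins for the real descriptors: in the log above these would be
# ~500K database vectors and ~40K query vectors of SimCLR v2 features.
d = 2048
rng = np.random.default_rng(0)
db_features = rng.standard_normal((5000, d)).astype('float32')
query_features = rng.standard_normal((400, d)).astype('float32')
# Index (into db_features) of the true match of each query, from the dataset's ground truth.
targets = rng.integers(0, len(db_features), size=len(query_features))

# Cosine similarity is the inner product of L2-normalised vectors.
faiss.normalize_L2(db_features)
faiss.normalize_L2(query_features)

def recall_at_k(neighbors: np.ndarray, targets: np.ndarray, k: int) -> float:
    # Proportion of queries whose target image appears among their k nearest neighbors.
    return float(np.mean([t in row[:k] for row, t in zip(neighbors, targets)]))

# 1) Brute-force (exact) baseline.
flat = faiss.IndexFlatIP(d)
flat.add(db_features)
_, nn = flat.search(query_features, 1)
print('brute-force recall@1:', recall_at_k(nn, targets, 1))

# 2) IVF index: cluster the database, then probe only the nprobe nearest clusters.
nlist = 100
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(db_features)
ivf.add(db_features)
ivf.nprobe = 50  # probing a fairly large share of the clusters should cost little recall
_, nn_ivf = ivf.search(query_features, 1)
print('IVF recall@1:', recall_at_k(nn_ivf, targets, 1))

# 3) PCA reduction before an exact search, to cut dimensionality and search time.
pca = faiss.PCAMatrix(d, 256)
pca.train(db_features)
db_red = pca.apply_py(db_features)
q_red = pca.apply_py(query_features)
faiss.normalize_L2(db_red)
faiss.normalize_L2(q_red)
flat_red = faiss.IndexFlatIP(256)
flat_red.add(db_red)
_, nn_red = flat_red.search(q_red, 1)
print('PCA-256 recall@1:', recall_at_k(nn_red, targets, 1))
```

With a GPU build of Faiss, the same indexes can typically be moved to the GPU (for example via `faiss.index_cpu_to_all_gpus`), which is presumably how the 10-15s timings mentioned above were obtained.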
diff --git a/helpers/create_plot.py b/helpers/create_plot.py
index 959bb83..eaa9f89 100644
--- a/helpers/create_plot.py
+++ b/helpers/create_plot.py
@@ -411,7 +411,6 @@ def _find_lowest_biggest_frequencies(image_wise_digest, kind, N):
 
     # The number of the images as randomly attributed ints
     img_numbers = np.arange(len(keys))
 
-    # REORDER ??
     ID = np.zeros((N_rows, N_cols, N))
     value = np.zeros((N_rows, N_cols, N))