Add `PublicBenchmarkDataset` & `SecretDataset` #747

RasmusOrsoe · 2024-09-13T12:29:13Z

This PR adds extensions of ERDAHostedDataset that allow us to build and share public benchmarking datasets, as well as secret ones. It also adds functionality to ParquetDataset that removes chunk ids that do not exist from the selection.
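To illustrate the chunk-id filtering idea, here is a minimal sketch. It is not the ParquetDataset code from this PR; the function name `drop_missing_chunk_ids` and its arguments are hypothetical and only show the intended behaviour of dropping selected ids that have no matching chunk.

```python
from typing import Iterable, List


def drop_missing_chunk_ids(
    selection: Iterable[int], available: Iterable[int]
) -> List[int]:
    """Keep only the selected chunk ids that are actually available.

    Hypothetical helper: the order of `selection` is preserved.
    """
    available_set = set(available)  # set lookup keeps the filter linear
    return [chunk_id for chunk_id in selection if chunk_id in available_set]


# The selection asks for chunks 1 and 5, which do not exist on disk.
selection = [0, 1, 2, 5]
available = [0, 2, 3]
filtered = drop_missing_chunk_ids(selection, available)
```

Preserving the order of the selection is a design choice in this sketch, so that any downstream index bookkeeping stays aligned with what the user passed in.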

Below is an example of the syntax of SecretDataset - a way for us to share datasets using ERDA sharelinks:

from graphnet.data import SecretDataset

dm = SecretDataset(
    secret="secret-erda-sharelink",
    graph_definition=...,
    download_dir="/home/cool-datasets/",
    backend="parquet",
    mode="train",
)

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader

The idea is that we can distribute datasets "secretly" to colleagues. Once the data is ready to be made public, it can be released by subclassing PublicBenchmarkDataset, which gives a similar syntax:

from graphnet.datasets import ABenchmarkDataset

dm = ABenchmarkDataset(
    graph_definition=...,
    download_dir="/home/cool-datasets/",
    backend="parquet",
    mode="train",
)

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader

Aske-Rosted

A few comments. It also seems like some of the unit tests are failing.

Aske-Rosted · 2024-09-16T00:31:23Z

src/graphnet/models/graphs/graph_definition.py

+                    isinstance(value, (int, float))
+                    for value in values_to_merge
+                ):
+                    # alculate the mean for all attributes except charge


Aske-Rosted · 2024-09-16T00:37:55Z

tests/models/test_graph_definition.py

+    assert pulses[0] == input_features.shape[0]
+    # Merging window (2 ns) is large enough to merge two of the pulses
+    assert pulses[1] == (input_features.shape[0] - 1)
+    # Merging window (2 ns) is large enough to merge four of the pulses


Should this be 8 ns? Also, maybe the test should check that merging more than two pulses into one is handled correctly.
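A rough sketch of what such a check could look like, using a stand-alone merging function rather than the graph definition from this PR. Everything here is an assumption for illustration: the name `merge_pulses`, the `(sensor_id, time, charge)` tuple layout, summing the charge while averaging the time, and the chaining rule (a pulse joins the current group if its gap to the previous pulse on the same sensor is within the window).

```python
def _collapse(group):
    """Collapse a group of same-sensor pulses into a single pulse."""
    sensor_id = group[0][0]
    mean_time = sum(p[1] for p in group) / len(group)  # mean of the times
    total_charge = sum(p[2] for p in group)  # assumption: charge is summed
    return (sensor_id, mean_time, total_charge)


def merge_pulses(pulses, window):
    """Merge pulses given as (sensor_id, time, charge) tuples.

    Pulses on the same sensor are chained into one group as long as the
    gap to the previous pulse is at most `window`.
    """
    ordered = sorted(pulses, key=lambda p: (p[0], p[1]))
    merged, group = [], []
    for pulse in ordered:
        new_sensor = bool(group) and pulse[0] != group[-1][0]
        too_late = bool(group) and pulse[1] - group[-1][1] > window
        if new_sensor or too_late:
            merged.append(_collapse(group))
            group = []
        group.append(pulse)
    if group:
        merged.append(_collapse(group))
    return merged


# Four pulses on sensor 0 spaced 1 ns apart, plus one pulse on sensor 1.
pulses = [(0, 0.0, 1.0), (0, 1.0, 1.0), (0, 2.0, 1.0), (0, 3.0, 1.0), (1, 0.0, 1.0)]

unmerged = merge_pulses(pulses, window=0.5)  # window too small to merge
merged = merge_pulses(pulses, window=1.0)  # all four sensor-0 pulses merge
```

With the larger window the four pulses on sensor 0 collapse into a single pulse, which is the "more than two pulses" case; the pulse on sensor 1 is left untouched even though it shares a time with one of them.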

Aske-Rosted · 2024-09-16T00:50:07Z

src/graphnet/data/curated_datamodule.py

+        graph_definition: GraphDefinition,
+        download_dir: str,
+        backend: str = "parquet",
+        mode: str = "train",


I do not quite understand why the different modes are named "train/test/test-no-noise".

Aske-Rosted · 2024-09-16T00:53:44Z

src/graphnet/data/curated_datamodule.py

+                    os.path.join(
+                        self.dataset_dir,
+                        "selections",
+                        "test_noise_selection.parquet",


Calling this one "test_noise", instead of calling the no-noise mode "test-no-noise", is a bit confusing after having specified "no-noise" throughout the code.
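One way to keep mode names and selection file names in step is a single lookup table. This is only a sketch: the file names in `SELECTION_FILES` and the helper `selection_path` are hypothetical and do not reflect the files this PR actually ships; only the "selections" sub-directory and the mode names are taken from the diff and the discussion above.

```python
import os

# Hypothetical file names, chosen so that every mode maps to a file
# named after the mode itself.
SELECTION_FILES = {
    "train": "train_selection.parquet",
    "test": "test_selection.parquet",
    "test-no-noise": "test_no_noise_selection.parquet",
}


def selection_path(dataset_dir: str, mode: str) -> str:
    """Return the selection file path for `mode`, or raise on unknown modes."""
    if mode not in SELECTION_FILES:
        valid = ", ".join(sorted(SELECTION_FILES))
        raise ValueError(f"Unknown mode '{mode}'. Expected one of: {valid}")
    return os.path.join(dataset_dir, "selections", SELECTION_FILES[mode])


path = selection_path("/home/cool-datasets", "test-no-noise")
```

Failing loudly on an unknown mode avoids silently falling back to a default selection.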

RasmusOrsoe · 2024-09-16T10:15:08Z

@Aske-Rosted thanks for taking a look. It looks like I mistakenly merged another branch into this one, causing the checks to fail. I think your comments on the toggles between "test", "train" and "no-noise" are fair; they are, granted, quite specific to what I intend to use them for. I'll close this PR and make a new one in the future.

RasmusOrsoe added 24 commits May 20, 2024 16:37

revert changes on main

6a06d65

adjust download logic

534a529

Merge branch 'main' of https://github.com/RasmusOrsoe/graphnet

fbafb46

Merge branch 'main' of https://github.com/RasmusOrsoe/graphnet

cbc4228

add merging functionality to graph_definition

810f6c7

generalize temp ids to xyz

76c8b83

reference time column in Detector

b9cf465

add sensor_time_name as Detector property

7d487f4

add sensor_time_column to all Detectors

6779ee0

pass new args through specific graph implementations

e9e3a68

add charge_name as Detector property

6f993ce

add charge_column to all Detectors

fac18e6

add member variable for charge in graph def

72de10e

add unit test for merging functionality

8227d74

remove stray print statement

6c5cf10

adjust logic for path finding

2899067

grab chunk ids instead of inferring them in ParquetDatset _get_all_indices

c521d27

remove non-existing ids froms indices in parquet_dataset

a386817

adjust pathing for secret dataset

3593366

add z flag for extraction with tar for speedup

074ebdb

toggle z-flag off for tar extraction for parquet backend

b683831

add PublicBenchmarkDataset and SecretDataset

50f9a35

add imports to init

bf3fc6e

adjust doc string

3c3b962

RasmusOrsoe requested a review from Aske-Rosted September 13, 2024 12:46

RasmusOrsoe added 5 commits September 13, 2024 14:48

black

40aee1f

overwrite previous changes to DataConverter

2c1d202

fix _get_all_indices_ in parquetdataset

ea39d1c

remove changes to DataConverter

f35a04e

remove unintended comment

2f778a9

RasmusOrsoe and others added 5 commits September 14, 2024 15:55

cast list to str

17d3d44

Only infer train/val selection in DataModule if test selection is not given

4012e77

grammar

0db2229

Merge pull request #30 from RasmusOrsoe/pulse_merging_graph_definition

31b99c5

Pulse merging graph definition

Merge branch 'paper-test-branch' into new_dataset

03c5935

Aske-Rosted reviewed Sep 16, 2024

RasmusOrsoe closed this Sep 16, 2024
