Ptc refactor #65

Merged · 42 commits · Nov 8, 2024
Commits:
bcc151b · added updated frontier code · Apr 4, 2024
8ccdd90 · rmd tar pieces · Apr 4, 2024
7221ae0 · updated pre-training code · Jun 7, 2024
b84c587 · added frontier code · Jun 7, 2024
4f1f2d2 · added latest code, runners, and notebooks · Jun 10, 2024
79548c5 · added note · Jun 10, 2024
3b3d10c · per step checkpointing · Jul 3, 2024
8c99c72 · fixed bug with checkpointing · Jul 3, 2024
422b7bd · updated data path · Jul 3, 2024
30a5734 · added config changes for hackathon · Aug 26, 2024
f433e11 · made more config changes · Aug 26, 2024
7e72fd7 · changed batch size for example script · Aug 27, 2024
5b435b6 · added tensorboard notebook · Aug 27, 2024
add1cb1 · add tensorboard · Aug 27, 2024
45ee11e · Merge branch 'ht24_jl' into hackathon_2024 · Aug 27, 2024
84ac000 · Merge branch 'hackathon_2024' of https://github.com/nasa-nccs-hpda/py… · Aug 27, 2024
aa846b0 · added tensorboard updates · Aug 28, 2024
80ec8a2 · made tensorboard dir arg · Aug 28, 2024
11342e2 · made config changes · Aug 28, 2024
2de0b35 · Adding optimizer module and lamb with fusedlamb with deepspeed · Aug 28, 2024
265f252 · Merge pull request #57 from nasa-nccs-hpda/hackaton_2024_jacaraba · jordancaraballo, Aug 28, 2024
5c6ad7c · merged jordans changes · Aug 28, 2024
d01a940 · added lamb gradient logging · Aug 28, 2024
2211008 · added lamb gradient logging · Aug 28, 2024
f0597d7 · rmd address for slurm · Aug 28, 2024
089c81c · Merge pull request #58 from nasa-nccs-hpda/hackathon_2024_cssprad1 · cssprad1, Aug 28, 2024
2dbbc9b · add DeepSpeed Flops Profiler to tensorboard · Sep 6, 2024
b370537 · add DeepSpeed Flops Profiler to tensorboard · Sep 6, 2024
a9df1a6 · new TAG · Sep 6, 2024
341b414 · Merge pull request #60 from nasa-nccs-hpda/hackaton_2024_jl · cssprad1, Sep 9, 2024
5a7c5d5 · test updates from matched pair code assesment · Sep 16, 2024
a6cc223 · Adding code coverage through match pair trial · jordancaraballo, Sep 16, 2024
d66072b · Merge remote-tracking branch 'caleb/matched-pair-cssprad1' into hacka… · jordancaraballo, Sep 16, 2024
75ca214 · Some README documentation for regression test locally · jordancaraballo, Sep 16, 2024
a851f86 · removed files, changed structure, initial commit · Oct 22, 2024
d1591c9 · init files for all subdirs · Oct 22, 2024
fa0aa69 · initial pretraining commit · Oct 23, 2024
7e6430e · added model handling and 3dcloud task · Oct 30, 2024
8f8d9b1 · updated pipelines · Nov 6, 2024
72f0197 · deleted other examples · Nov 6, 2024
5dc47bf · merge conflicts · Nov 8, 2024
9a7cd6f · merge conflicts · Nov 8, 2024
25 changes: 25 additions & 0 deletions .coveragerc
@@ -0,0 +1,25 @@
[run]
source = pytorch_caney
omit =
    */site-packages/*
    */dist-packages/*
    */tests/*
    setup.py

[report]
exclude_lines =
    pragma: no cover
    def __repr__
    if self.debug:
    if __name__ == .__main__.:
    raise NotImplementedError
    pass
    except ImportError:

show_missing = True

[html]
directory = htmlcov

[xml]
output = coverage.xml
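These are the settings the `coverage` tool picks up when the tests are run from the repository root, as in the README example further down. Below is a minimal sketch of the equivalent programmatic run; the calls are standard `coverage.py` API rather than code from this PR, and the test directory follows the README:

```python
import unittest
import coverage

# Pick up the [run] and [report] settings from the new .coveragerc.
cov = coverage.Coverage(config_file=".coveragerc")
cov.start()

# Discover and run the package tests, mirroring `unittest discover` in the README.
suite = unittest.defaultTestLoader.discover("pytorch_caney/tests")
unittest.TextTestRunner().run(suite)

cov.stop()
cov.save()
cov.report()       # per-file terminal report; show_missing comes from [report]
cov.html_report()  # writes htmlcov/ per the [html] section
cov.xml_report()   # writes coverage.xml per the [xml] section
```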
12 changes: 12 additions & 0 deletions README.md
@@ -125,6 +125,18 @@ cd pytorch-caney; conda env create -f requirements/environment_gpu.yml;
conda activate pytorch-caney
python -m unittest discover pytorch_caney/tests
```

Another example uses the `singularity exec` command on the Explore system:

```bash
singularity exec --env PYTHONPATH="$NOBACKUP/development/pytorch-caney" --nv -B /path/to/mount /path/to/container/pytorch-caney-container coverage run -m unittest discover pytorch_caney/tests
```

This command outputs the coverage report per file:

```bash
singularity exec --env PYTHONPATH="$NOBACKUP/development/pytorch-caney" --nv -B /path/to/mount /path/to/container/pytorch-caney-container coverage report
```

## References

- [Pytorch Lightning](https://github.com/Lightning-AI/lightning)
32 changes: 32 additions & 0 deletions configs/3dcloudtask_fcn_baseline_test.yaml
@@ -0,0 +1,32 @@
PIPELINE: '3dcloud'
DATAMODULE: 'abitoa3dcloud'
MODEL:
  ENCODER: 'fcn'
  DECODER: 'fcn'
  NAME: 3dcloud-fcn-baseline
  IN_CHANS: 14
  DROP_PATH_RATE: 0.1
DATA:
  BATCH_SIZE: 32
  DATA_PATHS: [/explore/nobackup/projects/ilab/data/satvision-toa/3dcloud.data/abiChipsNew/]
  TEST_DATA_PATHS: [/explore/nobackup/projects/ilab/data/satvision-toa/3dcloud.data/abiChipsNew/]
  IMG_SIZE: 128
TRAIN:
  ACCELERATOR: 'gpu'
  STRATEGY: 'auto'
  EPOCHS: 50
  WARMUP_EPOCHS: 10
  BASE_LR: 3e-4
  MIN_LR: 2e-4
  WARMUP_LR: 1e-4
  WEIGHT_DECAY: 0.05
  LR_SCHEDULER:
    NAME: 'multistep'
    GAMMA: 0.1
    MULTISTEPS: [700,]
LOSS:
  NAME: 'bce'
PRINT_FREQ: 10
SAVE_FREQ: 50
VALIDATION_FREQ: 20
TAG: 3dcloud_task_fcn_baseline_128_scaled_bt_minmax
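The PIPELINE and DATAMODULE keys at the top of this file select the fine-tuning task; the rest is plain nested YAML. A minimal sketch of reading a few of those values, assuming the file path from the diff header above; plain PyYAML is used only for illustration, since the PR's own config loader is not part of this diff:

```python
import yaml

# Load the baseline 3D-cloud fine-tuning config and inspect a few keys.
with open("configs/3dcloudtask_fcn_baseline_test.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["PIPELINE"] == "3dcloud"
assert cfg["DATAMODULE"] == "abitoa3dcloud"
print(cfg["MODEL"]["ENCODER"], cfg["MODEL"]["DECODER"])                    # fcn fcn
print(cfg["DATA"]["BATCH_SIZE"], cfg["DATA"]["IMG_SIZE"])                  # 32 128
print(cfg["TRAIN"]["EPOCHS"], cfg["TRAIN"]["LR_SCHEDULER"]["MULTISTEPS"])  # 50 [700]
```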
41 changes: 41 additions & 0 deletions configs/3dcloudtask_swinv2_satvision_giant_test.yaml
@@ -0,0 +1,41 @@
PIPELINE: '3dcloud'
DATAMODULE: 'abitoa3dcloud'
MODEL:
  ENCODER: 'satvision'
  DECODER: 'fcn'
  PRETRAINED: /panfs/ccds02/nobackup/projects/ilab/projects/3DClouds/models/SV-TOA/3B_2M/mp_rank_00_model_states.pt
  TYPE: swinv2
  NAME: 3dcloud-svtoa-finetune-giant
  IN_CHANS: 14
  DROP_PATH_RATE: 0.1
  SWINV2:
    IN_CHANS: 14
    EMBED_DIM: 512
    DEPTHS: [ 2, 2, 42, 2 ]
    NUM_HEADS: [ 16, 32, 64, 128 ]
    WINDOW_SIZE: 8
    NORM_PERIOD: 6
DATA:
  BATCH_SIZE: 32
  DATA_PATHS: [/explore/nobackup/projects/ilab/data/satvision-toa/3dcloud.data/abiChipsNew/]
  TEST_DATA_PATHS: [/explore/nobackup/projects/ilab/data/satvision-toa/3dcloud.data/abiChipsNew/]
  IMG_SIZE: 128
TRAIN:
  USE_CHECKPOINT: True
  EPOCHS: 50
  WARMUP_EPOCHS: 10
  BASE_LR: 3e-4
  MIN_LR: 2e-4
  WARMUP_LR: 1e-4
  WEIGHT_DECAY: 0.05
  LR_SCHEDULER:
    NAME: 'multistep'
    GAMMA: 0.1
    MULTISTEPS: [700,]
LOSS:
  NAME: 'bce'
PRECISION: 'bf16'
PRINT_FREQ: 10
SAVE_FREQ: 50
VALIDATION_FREQ: 20
TAG: 3dcloud_task_swinv2_g_satvision_128_scaled_bt_minmax
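The SWINV2 block above is what makes this the giant variant: four stages of depths [2, 2, 42, 2] on a 512-dimensional patch embedding. A short sketch of the per-stage widths this implies, assuming the usual SwinV2 convention that the feature dimension doubles at each patch-merging step; the variable names are illustrative, not from the PR:

```python
# Per-stage widths implied by the giant SWINV2 settings above, assuming the
# standard SwinV2 rule that the embedding dimension doubles after each stage.
embed_dim = 512
depths = [2, 2, 42, 2]
num_heads = [16, 32, 64, 128]

stage_dims = [embed_dim * 2 ** i for i in range(len(depths))]
head_dims = [d // h for d, h in zip(stage_dims, num_heads)]

print(stage_dims)   # [512, 1024, 2048, 4096]
print(sum(depths))  # 48 transformer blocks in total
print(head_dims)    # [32, 32, 32, 32] channels per attention head
```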
@@ -0,0 +1,48 @@
PIPELINE: 'satvisiontoapretrain'

MODEL:
  TYPE: swinv2
  NAME: mim_satvision_pretrain-giant
  DROP_PATH_RATE: 0.1
  SWINV2:
    IN_CHANS: 14
    EMBED_DIM: 512
    DEPTHS: [ 2, 2, 42, 2 ]
    NUM_HEADS: [ 16, 32, 64, 128 ]
    WINDOW_SIZE: 8
    NORM_PERIOD: 6

DATA:
  DATAMODULE: False
  BATCH_SIZE: 64
  LENGTH: 1_920_000
  PIN_MEMORY: True
  NUM_WORKERS: 4
  DATA_PATHS: [/explore/nobackup/projects/ilab/projects/3DClouds/data/mosaic-v3/webdatasets/shards]
  IMG_SIZE: 128
  MASK_PATCH_SIZE: 8
  MASK_RATIO: 0.6

TRAIN:
  ACCELERATOR: 'gpu'
  STRATEGY: 'deepspeed'
  USE_CHECKPOINT: True
  EPOCHS: 50
  WARMUP_EPOCHS: 10
  BASE_LR: 3e-4
  MIN_LR: 2e-4
  WARMUP_LR: 1e-4
  WEIGHT_DECAY: 0.05
  LR_SCHEDULER:
    NAME: 'multistep'
    GAMMA: 0.1
    MULTISTEPS: [700,]

DEEPSPEED:
  STAGE: 2

PRECISION: 'bf16'

PRINT_FREQ: 10
SAVE_FREQ: 50
TAG: mim_pretrain_giant_satvision_128_scaled_bt_minmax_50ep
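This pre-training config swaps the fine-tuning setup for DeepSpeed ZeRO stage 2 with bf16 precision. A minimal sketch of how the TRAIN, DEEPSPEED, and PRECISION settings could be handed to a PyTorch Lightning Trainer (Lightning appears in the README references; the model and datamodule are placeholders, and the exact wiring in the PR's pipeline code may differ):

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

# Map the config values above onto a Trainer. Only the accelerator, strategy,
# precision, and epoch settings are taken from the YAML; everything else is
# left at defaults for the sake of the sketch.
trainer = pl.Trainer(
    accelerator="gpu",                    # TRAIN.ACCELERATOR
    strategy=DeepSpeedStrategy(stage=2),  # TRAIN.STRATEGY + DEEPSPEED.STAGE
    precision="bf16-mixed",               # PRECISION: 'bf16'
    max_epochs=50,                        # TRAIN.EPOCHS
)
# trainer.fit(model, datamodule=datamodule)  # placeholders, not defined here
```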
@@ -0,0 +1,49 @@
PIPELINE: 'satvisiontoapretrain'

MODEL:
  TYPE: swinv2
  NAME: mim_satvision_pretrain-giant
  DROP_PATH_RATE: 0.1
  PRETRAINED: /panfs/ccds02/nobackup/projects/ilab/projects/3DClouds/models/SV-TOA/3B_2M/mp_rank_00_model_states.pt
  SWINV2:
    IN_CHANS: 14
    EMBED_DIM: 512
    DEPTHS: [ 2, 2, 42, 2 ]
    NUM_HEADS: [ 16, 32, 64, 128 ]
    WINDOW_SIZE: 8
    NORM_PERIOD: 6

DATA:
  DATAMODULE: False
  BATCH_SIZE: 64
  LENGTH: 1_920_000
  PIN_MEMORY: True
  NUM_WORKERS: 4
  DATA_PATHS: [/explore/nobackup/projects/ilab/projects/3DClouds/data/mosaic-v3/webdatasets/shards]
  IMG_SIZE: 128
  MASK_PATCH_SIZE: 8
  MASK_RATIO: 0.6

TRAIN:
  ACCELERATOR: 'gpu'
  STRATEGY: 'deepspeed'
  USE_CHECKPOINT: True
  EPOCHS: 50
  WARMUP_EPOCHS: 10
  BASE_LR: 3e-4
  MIN_LR: 2e-4
  WARMUP_LR: 1e-4
  WEIGHT_DECAY: 0.05
  LR_SCHEDULER:
    NAME: 'multistep'
    GAMMA: 0.1
    MULTISTEPS: [700,]

DEEPSPEED:
  STAGE: 2

PRECISION: 'bf16'

PRINT_FREQ: 10
SAVE_FREQ: 50
TAG: mim_pretrain_giant_satvision_128_scaled_bt_minmax_50ep_resume
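Both giant configs point MODEL.PRETRAINED at the same DeepSpeed checkpoint. A hedged sketch of pulling weights out of such an mp_rank_00_model_states.pt file; DeepSpeed normally stores the model weights under a 'module' key, but the layout of this particular checkpoint is an assumption, as is the hypothetical `encoder` object in the trailing comment:

```python
import torch

ckpt_path = (
    "/panfs/ccds02/nobackup/projects/ilab/projects/3DClouds/models/"
    "SV-TOA/3B_2M/mp_rank_00_model_states.pt"
)

# DeepSpeed model-states files usually wrap the weights under 'module';
# fall back to the raw dict if this checkpoint is laid out differently.
checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("module", checkpoint)
print(f"{len(state_dict)} entries in the pretrained state dict")

# With a SwinV2 encoder built from the SWINV2 section (hypothetical `encoder`),
# strict=False would tolerate decoder/head keys the encoder does not have:
# missing, unexpected = encoder.load_state_dict(state_dict, strict=False)
```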
3 changes: 0 additions & 3 deletions examples/satvision-giant/README.md

This file was deleted.

24 changes: 0 additions & 24 deletions examples/satvision-giant/run_satvision_pretrain.sh

This file was deleted.

3 changes: 0 additions & 3 deletions examples/satvision-huge/README.md

This file was deleted.

22 changes: 0 additions & 22 deletions examples/satvision-huge/run_satvision_pretrain.sh

This file was deleted.
