This repository contains a small empirical comparison of algorithm implementations in cblearn, both with each other and with implementations in other libraries.
At the moment, only ordinal embedding algorithms are evaluated.
Some results are presented below.
```sh
conda create -n cblearn python==3.10
conda activate cblearn
conda install h5py seaborn tqdm pandas
conda install -c conda-forge adjusttext
pip install "git+https://github.com/dekuenstle/cblearn.git#egg=cblearn[torch]"
pip install jupyterlab
```

(The pip URL is quoted so that the `[torch]` extra is not interpreted as a glob pattern by shells such as zsh.)
The data will be stored in `./datasets`; the path can be customized with the environment variable `CBLEARN_DATA`.
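For example, to keep the datasets on a larger partition, the variable can be exported before downloading; the path below is only an illustration, not a recommended location:

```shell
# Illustrative path only; cblearn reads CBLEARN_DATA to locate its dataset directory.
export CBLEARN_DATA=/tmp/cblearn-datasets
mkdir -p "$CBLEARN_DATA"
echo "datasets will be stored in $CBLEARN_DATA"
```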
Download the datasets (this might take a few minutes):

```sh
conda activate cblearn
python scripts/datasets.py
```
Submit the Python jobs to the cluster:

```sh
conda activate cblearn
cat runs/py.sh | xargs -L1 sbatch slurm/batchjob.sh
```
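The `xargs -L1` pipeline submits one batch job per line of the run file. The same dispatch pattern can be previewed locally by substituting `echo` for `sbatch`; the run file below is a made-up example:

```shell
# Create a toy run file with two job lines (hypothetical contents).
cat > /tmp/runs-demo.sh <<'EOF'
python scripts/embedding.py SOE car
python scripts/embedding.py STE car
EOF
# xargs -L1 invokes the command once per input line, just like the sbatch call above.
cat /tmp/runs-demo.sh | xargs -L1 echo would-submit:
```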
- Download van der Maaten's STE scripts and extract them to `lib/vanderMaaten_STE`.
- Adjust the MATLAB license file/server in `slurm/mat-batchjob.sh`.
Submit all MATLAB jobs to the cluster:

```sh
cat runs/mat.sh | tr '\n' '\0' | xargs -0n1 sbatch slurm/mat-batchjob.sh
```

or submit a single job:

```sh
sbatch slurm/mat-batchjob.sh matlab -sd scripts/ -batch "embedding('STE', 'material');"
```

or run with singularity:

```sh
singularity run --bind ${PWD}:/home/docker --pwd /home/docker --env MLM_LICENSE_FILE=[email protected] docker://mathworks/matlab:r2022a matlab -sd scripts/ -batch "embedding('STE', 'car');"
```

or (if MATLAB is available on your system):

```sh
sh runs/mat.sh
```
- Start R and install the dependencies. If you are asked whether you want to use a personal library, respond "yes".

```
R
> install.packages(c('docopt', 'jsonlite', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.r-project.org/')
> q()
```
- Submit the R jobs to the cluster:

```sh
cat runs/r.sh | xargs -L1 sbatch slurm/batchjob.sh
```

or run all jobs locally:

```sh
sh runs/r.sh
```

or a single job:

```sh
Rscript scripts/embedding.R SOE car
```
Workaround on our HPC:

```
echo $SCRATCH          # /scratch_local/<foo>
mkdir $SCRATCH/r-lib
R
> install.packages(c('docopt', 'jsonlite', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.r-project.org/', lib='/scratch_local/<foo>/r-lib')
> q()
cp -a $SCRATCH/r-lib/* ~/R/x86_64-redhat-linux-gnu-library/3.6/
```
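An alternative to copying the library back afterwards would be to point R at the scratch library via the standard `R_LIBS_USER` environment variable. This is only a sketch of that idea, not something we tested on the cluster:

```shell
# Sketch: make R search the scratch library first instead of copying it back.
# ${SCRATCH:-/tmp} falls back to /tmp when $SCRATCH is unset (for illustration only).
export R_LIBS_USER="${SCRATCH:-/tmp}/r-lib"
mkdir -p "$R_LIBS_USER"
echo "R will search $R_LIBS_USER before the system library"
```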
- Install the Python environment as described above and activate it:

```sh
conda activate cblearn
```

- Run a single model, e.g.

```sh
python scripts/embedding.py SOE car
```

or all models:

```sh
sh runs/py.sh
```
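Instead of running every model in `runs/py.sh`, a subset can be scripted with a small loop; the algorithm and dataset names below are examples taken from this README:

```shell
# Print the commands for a subset of algorithm/dataset pairs;
# drop the `echo` to actually run them.
for algo in SOE STE; do
  for data in car material; do
    echo python scripts/embedding.py "$algo" "$data"
  done
done
```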
- Install R (tested with 4.2).
- Install the dependencies in R:

```
> install.packages(c('docopt', 'rjson', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.rstudio.com/')
```

- Run a single model, e.g.

```sh
Rscript scripts/embedding.R SOE car
```

or all models:

```sh
sh runs/r.sh
```
```sh
# singularity:
singularity run --env MLM_LICENSE_FILE=[email protected] docker://mathworks/matlab:r2022a
# or docker:
docker run -it --rm -p 8888:8888 -e MLM_LICENSE_FILE=[email protected] --shm-size=512M mathworks/matlab:r2022a
```
Plots that visualize the datasets and the comparison's results, like the ones in the paper, are generated with Jupyter notebooks. Start Jupyter with

```sh
jupyter lab .
```

and then run the following notebooks:
- R: `R embedding.R <algo> <dataset> <result>`
- MATLAB: `matlab -r "embedding <algo> <dataset> <result>"` (see `embedding.m`)
- STE: CKL[-K], GNMDS[-K], STE[-K], and tSTE algorithms.
Python
- cblearn: MLDS, CKL-X, GNMDS-X, SOE, STE-X, tSTE, CKL-GPU[-K], FORTE-GPU[-K], GNMDS-GPU[-K], SOE-GPU, STE-GPU, tSTE-GPU
If you don't run the scripts with containers, you can manually install these dependencies into your local R installation with `install.packages(...)`.
We ran each algorithm and dataset on a separate cluster node with 96 GB RAM and a maximum runtime of 24 h. Runs that exceeded these limits failed intentionally. For example, the FORTE-GPU algorithm requires too much memory and thus fails on the large imagenet-v2 dataset. Similarly, van der Maaten's tSTE implementation timed out on the things and imagenet-v2 datasets. The R implementation of SOE crashed on imagenet-v2 because "long vectors" are not supported by some internal function.
The scripts in this repository are free to use under the conditions of the MIT License. The plots are shared under CC BY-SA 2.0 and require attribution.