This repository contains a small empirical comparison of algorithm implementations in cblearn, both with each other and with implementations in other libraries.
At the moment, only ordinal embedding algorithms are evaluated.
Some results are presented below.
```sh
conda create -n cblearn python==3.10
conda activate cblearn
conda install h5py seaborn tqdm pandas
conda install -c conda-forge adjusttext
pip install "git+https://github.com/dekuenstle/cblearn.git#egg=cblearn[torch]"
pip install jupyterlab
```

(The pip URL is quoted so that the `[torch]` extra is not interpreted as a glob pattern by shells such as zsh.)
The data will be stored in `./datasets`; the path can be customized with the environment variable `CBLEARN_DATA`.
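For example, to keep the datasets on a larger partition, the variable can be exported before downloading; the path below is only an illustration, not a recommended location:

```shell
# Illustrative path only; cblearn reads CBLEARN_DATA to locate its dataset directory.
export CBLEARN_DATA=/tmp/cblearn-datasets
mkdir -p "$CBLEARN_DATA"
echo "datasets will be stored in $CBLEARN_DATA"
```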
Download the datasets (this might take a few minutes):

```sh
conda activate cblearn
python scripts/datasets.py
```
Submit the Python jobs to the cluster:

```sh
conda activate cblearn
cat runs/py.sh | xargs -L1 sbatch slurm/batchjob.sh
```
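The `xargs -L1` pipeline submits one batch job per line of the run file. The same dispatch pattern can be previewed locally by substituting `echo` for `sbatch`; the run file below is a made-up example:

```shell
# Create a toy run file with two job lines (hypothetical contents).
cat > /tmp/runs-demo.sh <<'EOF'
python scripts/embedding.py SOE car
python scripts/embedding.py STE car
EOF
# xargs -L1 invokes the command once per input line, just like the sbatch call above.
cat /tmp/runs-demo.sh | xargs -L1 echo would-submit:
```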
- Download van der Maaten's STE scripts and extract them to `lib/vanderMaaten_STE`.
- Adjust the MATLAB license file/server in `slurm/mat-batchjob.sh`.
Submit all MATLAB jobs to the cluster:

```sh
cat runs/mat.sh | tr '\n' '\0' | xargs -0n1 sbatch slurm/mat-batchjob.sh
```

or submit a single job:

```sh
sbatch slurm/mat-batchjob.sh matlab -sd scripts/ -batch "embedding('STE', 'material');"
```

or run with singularity:

```sh
singularity run --bind ${PWD}:/home/docker --pwd /home/docker --env MLM_LICENSE_FILE=[email protected] docker://mathworks/matlab:r2022a matlab -sd scripts/ -batch "embedding('STE', 'car');"
```

or (if MATLAB is available on your system):

```sh
sh runs/mat.sh
```
- Start R and install the dependencies. If you are asked whether you want to use a personal library, respond "yes".

```
R
> install.packages(c('docopt', 'jsonlite', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.r-project.org/')
> q()
```
- Submit the R jobs to the cluster:

```sh
cat runs/r.sh | xargs -L1 sbatch slurm/batchjob.sh
```

or run all jobs locally:

```sh
sh runs/r.sh
```

or a single job:

```sh
Rscript scripts/embedding.R SOE car
```
Workaround on our HPC:

```
echo $SCRATCH          # /scratch_local/<foo>
mkdir $SCRATCH/r-lib
R
> install.packages(c('docopt', 'jsonlite', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.r-project.org/', lib='/scratch_local/<foo>/r-lib')
> q()
cp -a $SCRATCH/r-lib/* ~/R/x86_64-redhat-linux-gnu-library/3.6/
```
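An alternative to copying the library back afterwards would be to point R at the scratch library via the standard `R_LIBS_USER` environment variable. This is only a sketch of that idea, not something we tested on the cluster:

```shell
# Sketch: make R search the scratch library first instead of copying it back.
# ${SCRATCH:-/tmp} falls back to /tmp when $SCRATCH is unset (for illustration only).
export R_LIBS_USER="${SCRATCH:-/tmp}/r-lib"
mkdir -p "$R_LIBS_USER"
echo "R will search $R_LIBS_USER before the system library"
```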
- Install the Python environment as described above and activate it:

```sh
conda activate cblearn
```

- Run a single model, e.g.

```sh
python scripts/embedding.py SOE car
```

or all models:

```sh
sh runs/py.sh
```
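Instead of running every model in `runs/py.sh`, a subset can be scripted with a small loop; the algorithm and dataset names below are examples taken from this README:

```shell
# Print the commands for a subset of algorithm/dataset pairs;
# drop the `echo` to actually run them.
for algo in SOE STE; do
  for data in car material; do
    echo python scripts/embedding.py "$algo" "$data"
  done
done
```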
- Install R (tested with 4.2).
- Install the dependencies in R:

```
> install.packages(c('docopt', 'rjson', 'MLDS', 'loe'), dependencies=TRUE, repos='http://cran.rstudio.com/')
```

- Run a single model, e.g.

```sh
Rscript scripts/embedding.R SOE car
```

or all models:

```sh
sh runs/r.sh
```
```sh
# singularity:
singularity run --env MLM_LICENSE_FILE=[email protected] docker://mathworks/matlab:r2022a
# or docker:
docker run -it --rm -p 8888:8888 -e MLM_LICENSE_FILE=[email protected] --shm-size=512M mathworks/matlab:r2022a
```
Plots that visualize the datasets and the comparison's results, like the ones in the paper, are generated with Jupyter notebooks. Start Jupyter with

```sh
jupyter lab .
```

and then run the following notebooks:
- R: `R embedding.R <algo> <dataset> <result>`
- MATLAB: `matlab -r "embedding <algo> <dataset> <result>"` (see `embedding.m`)
- STE: CKL[-K], GNMDS[-K], STE[-K], and tSTE algorithms.
Python
- cblearn: MLDS, CKL-X, GNMDS-X, SOE, STE-X, tSTE, CKL-GPU[-K], FORTE-GPU[-K], GNMDS-GPU[-K], SOE-GPU, STE-GPU, tSTE-GPU
If you don't run the scripts with containers, you can manually install these dependencies into your local R installation with `install.packages(...)`.
We ran each algorithm and dataset on a separate cluster node with 96 GB RAM and a maximum runtime of 24 h. Runs that exceeded these limits failed intentionally. For example, the FORTE-GPU algorithm requires too much memory and thus fails on the large imagenet-v2 dataset. Similarly, van der Maaten's tSTE implementation timed out on the things and imagenet-v2 datasets. The R implementation of SOE crashed on imagenet-v2 because "long vectors" are not supported by some internal function.
The scripts in this repository are free to use under the conditions of the MIT License. The plots are shared under CC BY-SA 2.0 and require attribution.