Skip to content

[VLDB'22] Cardinality Estimation of Approximate Substring Queries using Deep Learning.

License

Notifications You must be signed in to change notification settings

sykwon/teddy-dream

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

93 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

(SODDY, TEDDY) & DREAM

license ubuntu gcc 7.5 python 3.7 cuda 11.6 size Hits

This repository implements training data generation algorithms (SODDY & TEDDY) and deep cardinality estimators (DREAM) proposed in our paper "Cardinality Estimation of Approximate Substring Queries using Deep Learning". It is created by Suyong Kwon, Woohwan Jung and Kyuseok Shim.

Repository Overview

It consists of four folders each of which contains its own README file and script.

Folder Description
gen_train_data training data generation algorithms
dream deep cardinality estimators for approximate substring queries
astrid the modified version of Astrid starting from the astrid model downloaded from [github]
plot example notebook files

Installation and Requirements

It is recommended to run our code with the CUDA environment. However, the non-CUDA version of our code is also working when the pytorch library does not supper GPU. (You may set CUDA_VISIBLE_DEVICES as -1 to enforce CPU mode.)

Method 1: Use the Docker Image

To run the image needs the NVIDIA Container Toolkit. If you do not have the toolkit, refer to the installation guide

git clone https://github.com/sykwon/teddy-dream.git

# run docker image
docker run -it --gpus all --name dream -v ${PWD}:/workspace -u 1000:1000 sykwon/dream /bin/bash

# after starting docker
redis-server --daemonize yes
cd gen_train_data/
make clean && make && make info
cd ..

Method 2: Create a Virtual Python Environment

This code needs Python-3.7 or higher.

sudo apt-get install -y redis-server git
sudo apt-get install -y binutils
sudo apt-get install -y texlive texlive-latex-extra texlive-fonts-recommended dvipng cm-super

conda create -n py37 python=3.7
source activate py37
conda install -y pytorch=1.7.1 torchvision=0.8.2 cudatoolkit=11.0 -c pytorch -c nvidia

pip install -r requirements.txt

Datasets

  • DBLP
  • GENE
  • WIKI
  • IMDB

Examples

These commands produces experimental results.

cd gen_train_data
./run.sh DBLP     # to generate training data from the DBLP dataset
# ./run.sh GENE   # to generate training data from the GENE dataset
# ./run.sh WIKI   # to generate training data from the WIKI dataset
# ./run.sh IMDB   # to generate training data from the IMDB dataset
# ./run.sh all    # to generate training data from all datasets
cd ..

cd dream
./run.sh DBLP    # to train all models except Astrid with the DBLP dataset
# ./run.sh GENE  # to train all models except Astrid with the GENE dataset
# ./run.sh WIKI  # to train all models except Astrid with the WIKI dataset
# ./run.sh IMDB  # to train all models except Astrid with the IMDB dataset
# ./run.sh all   # to train all models except Astrid with all datasets
cd ..

cd astrid
./run.sh DBLP    # to train the Astrid model with the DBLP dataset
# ./run.sh GENE  # to train the Astrid model with the GENE dataset
# ./run.sh WIKI  # to train the Astrid model with the WIKI dataset
# ./run.sh IMDB  # to train the Astrid model with the IMDB dataset
# ./run.sh all   # to train the Astrid model with all datasets
cd ..

Please refer to [notebook] to see the experimental results.

Citation

Please consider to cite our paper if you find this code useful:

@article{kwon2022cardinality,
    title={Cardinality estimation of approximate substring queries using deep learning},
    author={Kwon, Suyong and Jung, Woohwan and Shim, Kyuseok},
    journal={Proceddings of the VLDB Endowment},
    volume={15},
    number={11},
    year={2022}
}