feat: init luigi, add raw data extraction (#3)
* feat: init luigi, add extract raw data
* feat: update readme
leomaurodesenv authored Nov 2, 2023
1 parent 382de19 commit 0624be9
Showing 5 changed files with 24 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -1,5 +1,6 @@
# dvc-luigi
This is a learning repository about DVC Data Version Control and Luigi Pipelines

- luigi, dvc, pre-commit
- setup https://github.com/Kaggle/kaggle-api
- `kaggle competitions download -c sentiment-analysis-on-movie-reviews -p data`
1 change: 1 addition & 0 deletions data/.gitignore
@@ -1,2 +1,3 @@
/output/
/data.xml
/sentiment-analysis-on-movie-reviews.zip
1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
kaggle==1.5.16
dvc==3.28.0
luigi==3.4.0
Empty file added source/__init__.py
Empty file.
20 changes: 20 additions & 0 deletions source/get_raw_data.py
@@ -0,0 +1,20 @@
import os
import luigi
import zipfile

class ExtractRawData(luigi.Task):
    data_path = luigi.Parameter(default="../data/sentiment-analysis-on-movie-reviews.zip")

    def output(self):
        return {
            "test": luigi.LocalTarget('../data/output/test.tsv.zip'),
            "train": luigi.LocalTarget('../data/output/train.tsv.zip'),
        }

    def run(self):
        # Check if data file exists
        assert os.path.exists(self.data_path)

        # Unzip data file
        with zipfile.ZipFile(self.data_path, 'r') as zip_ref:
            zip_ref.extractall("../data/output/")
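
Review note: the extraction step in `run()` can be exercised without a Luigi scheduler. The sketch below is not part of this commit. It is a standard-library-only illustration, and the helper names `extract_raw_data` and `demo` are hypothetical. It assumes the archive contains `train.tsv.zip` and `test.tsv.zip` at its top level, as the targets in `output()` suggest, and it raises explicit exceptions instead of using `assert`, since assertions are skipped when Python runs with `-O`.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_raw_data(archive_path, output_dir,
                     expected=("train.tsv.zip", "test.tsv.zip")):
    """Hypothetical helper: unzip archive_path into output_dir.

    Returns a dict mapping each expected member name to its extracted path.
    """
    archive = Path(archive_path)
    out = Path(output_dir)

    # Fail loudly if the raw archive is missing (mirrors the assert in run()).
    if not archive.is_file():
        raise FileNotFoundError(f"Raw data archive not found: {archive}")

    # Create the output directory if needed, then unzip everything into it.
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive, "r") as zip_ref:
        zip_ref.extractall(out)

    # Confirm that the files the task declares as outputs were produced.
    paths = {name: out / name for name in expected}
    missing = [name for name, path in paths.items() if not path.exists()]
    if missing:
        raise RuntimeError(f"Archive did not contain: {missing}")
    return paths


def demo():
    """Build a tiny stand-in archive in a temp directory and extract it."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "raw.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            # Placeholder members; the real Kaggle archive is assumed
            # to hold files with these names.
            zf.writestr("train.tsv.zip", b"placeholder")
            zf.writestr("test.tsv.zip", b"placeholder")
        paths = extract_raw_data(archive, Path(tmp) / "output")
        return sorted(paths)
```

Calling `demo()` builds a throwaway archive and extracts it, so the check runs without downloading the Kaggle data. Checking for the expected members after extraction corresponds to what Luigi does when it tests whether the targets returned by `output()` exist.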
