Kaggle competition: https://www.kaggle.com/t/c88b47a9221b43dc95f73cd696b5695f
This is a classification task with 7 categories, involving two files:
train.npz
test.x.npz
The training file contains 8,000 samples, each consisting of 100 features and 7 labels. The testing file comprises 2,000 samples, each having 100 features.
To load the files, you can use the following code:
import numpy as np
train = np.load('train.npz')
test = np.load('test.x.npz')
train_x, train_y = train['x'], train['y']
test_x = test['x']
Your objective is to classify and predict the 2,000 test data samples and obtain 2,000 rows of labels ranging from 0 to 6.
For example:
pred = model(test_x) # 2000 x 1 matrix
print(pred)
# [[0],
# [6],
# [2],
# ...,
# [1],
# [3]]
np.savetxt(
"pred.csv",
np.asarray(list(enumerate(pred.reshape(-1)))),
fmt="%d",
delimiter=",",
header="id,label",
comments="",
)
You could see a baseline model in src/simp/baseline.py
. It is a simple MLP model with 3 hidden layers, dropout and batch normalization. The accuracy is about 0.8.
To run it:
You need to install the following packages:
dependencies = [
"jax>=0.4.6",
"jaxlib>=0.4.6",
"dm-haiku>=0.0.9",
"optax @ git+https://github.com/deepmind/optax.git",
"numpy>=1.24.2",
"rich>=13.3.2",
]
Or using pdm
https://pdm.fming.dev/latest/ to install dependencies:
pdm install
You could run the baseline model by:
python src/simp/baseline.py
# or if you use pdm
pdm run python src/simp/baseline.py
We evaluate the predition txt using accuracy. The accuracy is defined as:
def accuracy(pred, label):
return np.mean(pred == label)
You could see an example in src/simp/test_results.py
.