how to use PML for learning distance matrix of tabular data #654
harshaamzn
started this conversation in
General
Replies: 1 comment 5 replies
-
How about randomly sampling the billion rows to create the dataset for PML training?
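If the rows cannot all be held in memory, one way to draw such a sample is reservoir sampling, which makes a single pass over the data and keeps a fixed-size uniform sample. The following is only a minimal sketch, assuming the rows can be streamed as a Python iterable; `reservoir_sample` and the `range(...)` stand-in for the real rows are hypothetical names for illustration, not part of PML or submodlib.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows drawn uniformly at random from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Stand-in for the real data: a stream of row indices (assumption).
sample = reservoir_sample(range(1_000_000), k=1000)
```

The sample size `k` is whatever fits the training budget. With binary labels that are heavily imbalanced, it may be worth keeping one reservoir per label instead.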
-
I have a huge labeled tabular dataset: about a billion rows, each with ~500 features and a binary label. I want to use submodlib to find a diverse, representative subset of this dataset, so that a model trained on that subset reaches a higher AUC than a model trained on a random subsample of the same billion rows. submodlib accepts a distance matrix that specifies how similar the items are, and then applies submodular maximization to output a diverse subset of rows. I am interested in using PML to learn this distance matrix from the tabular data. I would appreciate any pointers on how to go about this.
Elsewhere I have used the SoftTripleLoss function to train a neural network that generates embeddings such that data points with the same label are closer to each other. However, to train such a network we first have to select a training dataset, and the learned embeddings will depend on the training dataset we start with. The input PML needs to produce a distance matrix is embeddings and labels, so the distance matrix I want to pass to submodlib depends on the embeddings, and the embeddings in turn depend on the training dataset. How do we break this cycle? Any pointers are appreciated.
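To make the last step concrete: once embeddings exist for a set of rows, the pairwise distance matrix follows directly from them. Here is a minimal sketch with NumPy, assuming Euclidean distance and random placeholder embeddings in place of the network's output; `pairwise_euclidean` is a hypothetical helper, not a PML or submodlib function. A dense n x n matrix is only feasible for a subsample, since a billion rows would need 10^18 entries.

```python
import numpy as np


def pairwise_euclidean(emb: np.ndarray) -> np.ndarray:
    """Return the dense (n, n) Euclidean distance matrix for (n, d) embeddings."""
    sq = np.sum(emb ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (emb @ emb.T)
    # Clamp small negative values caused by floating-point rounding.
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(d2)


# Placeholder embeddings for 1000 sampled rows (assumption: 64 dimensions).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
dist = pairwise_euclidean(embeddings)
```

Whether the subset selection expects distances or similarities depends on the function used, so the matrix may need converting before it is handed over.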