how to use PML for learning distance matrix of tabular data #654

harshaamzn · 2023-08-08T17:43:05Z


I have a huge labeled tabular dataset: about a billion rows, each with ~500 features and a binary label. I want to use submodlib to find a diverse, representative subset of this dataset, so that a model trained on the subset reaches a higher AUC than a model trained on a random subsample of the billion rows. submodlib accepts a distance matrix that specifies how similar the items are, then applies submodular maximization to output a diverse subset of rows. I am interested in using PML to learn that distance matrix from the tabular data. I'd appreciate any pointers on how to go about this.

Elsewhere I have used the SoftTripleLoss function to train a neural network that generates embeddings such that data points with the same label are closer to each other. However, to train such a network we have to select a training dataset, and the learned embeddings will depend on the training set we start with. The input PML needs to produce a distance matrix is embeddings and labels. So the distance matrix I need for submodlib depends on the embeddings and labels, and the embeddings in turn depend on the training dataset we choose. How do we break this cycle? I'd appreciate any pointers.

KevinMusgrave · 2023-08-09T21:56:53Z

Maintainer

How about randomly sampling the billion rows to create the dataset for PML training?

5 replies

harshaamzn Aug 10, 2023
Author

Our data is highly class-imbalanced. Random sampling will not give us representative data from both classes. Even if I sample each of the two classes individually, I will end up changing the class balance in the sampled set.

Doesn't the quality of the model output from PML training (the embeddings of the tabular records) depend on what data is fed to PML? I want to use PML to produce a distance matrix learned from the data. However, if I use sampled data for PML training, I suspect the learned distance matrix may not be representative of the original dataset. Can you confirm whether I am thinking about this cyclic dependency correctly?

KevinMusgrave Aug 11, 2023
Maintainer

Yes, so it seems like you have to either use the entire dataset with PML, or do some sort of sampling.

I assume that training on the entire dataset will take too long, so you have to sample.

Could you make it so that your random sample has the same class imbalance as the original dataset?
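One way to read the suggestion above is stratified sampling: draw the same fraction from each class so the sample keeps the original imbalance. The following is only a minimal sketch on toy in-memory labels; the function name `stratified_sample`, the 1% positive rate, and the 10% sampling fraction are assumptions made for illustration, and a billion-row dataset would need a streaming or distributed version of the same idea.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices sampled at the same rate from every class."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class so a rare class never disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data (hypothetical): 1% positives, mimicking a heavily imbalanced binary label.
labels = [1] * 100 + [0] * 9900
sample = stratified_sample(labels, fraction=0.1)
positives = sum(labels[i] for i in sample)
print(len(sample), positives)
```

Sampling per class at a common rate preserves the class ratio, but it does nothing for diversity within a class, which is the concern raised later in the thread.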

KevinMusgrave Aug 11, 2023
Maintainer

A bigger issue is that a distance matrix of size (1 billion, 1 billion) won't fit in memory. That's 10^18 entries, so I think it's on the order of exabytes of data (about 4 exabytes at 4 bytes per entry).
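The memory estimate can be checked with a few lines of arithmetic. This sketch assumes a dense square matrix with 4-byte (float32) entries; the helper names and the 16 GiB budget are hypothetical and chosen only to illustrate the scale.

```python
import math


def dense_matrix_bytes(n_rows, bytes_per_entry=4):
    """Bytes needed for a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_entry


def max_rows_for_budget(budget_bytes, bytes_per_entry=4):
    """Largest n such that a dense n x n matrix fits in budget_bytes."""
    return math.isqrt(budget_bytes // bytes_per_entry)


full_bytes = dense_matrix_bytes(1_000_000_000)
print(full_bytes / 1e18, "exabytes for the full matrix")

# Assumed budget of 16 GiB of RAM for the matrix alone.
rows_that_fit = max_rows_for_budget(16 * 2**30)
print(rows_that_fit, "rows fit in 16 GiB")
```

Because the cost grows with the square of the row count, halving the rows cuts the matrix to a quarter of its size, which is why any dense-matrix approach has to work on a small subset or on small chunks.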

KevinMusgrave Aug 11, 2023
Maintainer

But yes, you're right that the model quality obtained by PML depends on the data fed into it. You'll obtain the best model if you can train on the entire dataset, and the quality will decrease as you make the sample size smaller and smaller.

harshaamzn Aug 11, 2023
Author

Yes, we can downsample the classes individually to maintain the same ratio. However, random downsampling will not give diverse records, which is the final objective: downsample the billion-record dataset so that it retains diverse, informative samples rather than simply more data.

Perhaps I can divide the dataset into chunks (of 10 million rows or smaller) and work on each smaller dataset to find its distance matrix, a sort of divide-and-conquer approach.
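The divide-and-conquer idea could be sketched as a two-round selection: pick representatives inside each chunk, pool them, then select again from the pool. The code below is an illustration only. It uses a plain-Python greedy facility-location objective rather than submodlib's API, raw 2-D points stand in for learned embeddings, and the similarity `1 / (1 + distance)`, the function names, and all sizes are assumptions.

```python
import math
import random


def greedy_facility_location(points, budget):
    """Greedily pick indices that maximize sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    # Dense similarity matrix; only feasible because each chunk is small.
    sim = [[1.0 / (1.0 + math.dist(p, q)) for q in points] for p in points]
    best = [0.0] * n  # best[i]: similarity of row i to its closest selected row
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of adding each unselected candidate j.
        gains = [
            -1.0 if j in selected
            else sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            for j in range(n)
        ]
        j_star = max(range(n), key=gains.__getitem__)
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]
    return selected


def two_round_selection(points, chunk_size, per_chunk, final_budget):
    """Select per chunk, then select again among the pooled candidates."""
    candidates = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        picks = greedy_facility_location(chunk, per_chunk)
        candidates.extend(start + j for j in picks)  # map back to global indices
    pooled = [points[i] for i in candidates]
    final = greedy_facility_location(pooled, final_budget)
    return [candidates[j] for j in final]


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]  # toy stand-in data
subset = two_round_selection(points, chunk_size=50, per_chunk=5, final_budget=8)
print(subset)
```

The two-round result is an approximation and need not match a greedy selection over the whole dataset. Note also that a dense float32 matrix for a 10-million-row chunk would still take about 400 TB (10^14 entries at 4 bytes each), so chunks would have to be far smaller, or the similarities kept sparse, for this to be practical.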
