-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Release 0.2.0 #25
Release 0.2.0 #25
Conversation
Added |
Separation of schemas from BEDMSAvailable schemas for standardization have been moved to the HuggingFace repository :https://huggingface.co/databio/attribute-standardizer-model6 from bedms import AttrStandardizer
model = AttrStandardizer("ENCODE") BEDMS had mapped schema name from bedms import AttrStandardizer
model = AttrStandardizer(
repo_id="databio/attribute-standardizer-model6", model_name="encode"
) This also makes it easier for the user to provide their chosen schemas ( as long as they have models on their HuggingFace repository ). |
Is it worth updating
to return a dictionary that includes the repo_id value for the 3 schemas we provide? Or should we just hardcode the 3 repo IDs from PEPhub? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good for the most part. Small comments everywhere. There was an overarching theme I just wanted to bring up... curious what others think:
Using classes as state-storage
I brought this up in a recent review for @ClaudeHu. I'm noticing a pattern where people define and call functions on classes that set state inside that class. So for example in the AttributeStandardizerTrainer
, we call load_enode_data
...
This doesn't return anything, instead if sets state on the trainer itself. In essence, it does this:
def load_encode_data(self):
self.encode_data = ...
Instead of:
def load_encode_data(self):
encode_data = ...
return encode_data
I don't know if one is better than the other, but the former feels like an anti-pattern to me for some reason. I suppose my argument for the second version is that it gives control of the data to the user and reduces ambiguity about whats going on; it hides less functionality.
Last comments
Lastly, is it worth considering lightning
? I know the gains are smaller when there's no deep learning or multi-GPU training going on, but it really does clean up a lot of stuff and handle boring nuances that don't matter. Its a nice paradigm.
Moreover, if possible, we could integrate ClearML logging so the user can view the training in real-time instead of waiting for the plot to be saved to disk only to realize their model didn't learn anything...
Again this is if training times are long, if not then we can forget it.
README.md
Outdated
To train the custom model: | ||
|
||
```python | ||
trainer.training() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be .train()
?
README.md
Outdated
To load the datasets and encode them: | ||
|
||
```python | ||
trainer.load_encode_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this return anything? I'd expect this to return something
README.md
Outdated
|
||
To see the available schemas, you can run: | ||
```python | ||
trainer.testing() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again here I like .test()
over .testing()
README.md
Outdated
) | ||
results = model.standardize(pep="geo/gse228634:default") | ||
|
||
assert results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of assert
could we show how maybe they could iterate over results? Then maybe include comments on what the expected output of a print
might be? Something like:
for result in results:
print(result) # {'attr': 'genome', 'score': 0.8 }
bedms/train.py
Outdated
self.label_encoder = None | ||
self.vectorizer = None | ||
self.train_loader = None | ||
self.val_loader = None | ||
self.test_loader = None | ||
self.output_size = None | ||
self.criterion = None | ||
self.train_accuracies = None | ||
self.val_accuracies = None | ||
self.train_losses = None | ||
self.val_losses = None | ||
self.model = None | ||
self.fpr = None | ||
self.tpr = None | ||
self.roc_auc = None | ||
self.all_labels = None | ||
self.all_preds = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, could we type these?
bedms/utils_train.py
Outdated
return glob(os.path.join(dir, "*.csv")) | ||
|
||
|
||
def load_and_preprocess(file_path: str) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More descriptive name? What are we loading and preprocessing?
bedms/utils_train.py
Outdated
) | ||
|
||
|
||
def load_from_dir(dir: str) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here can this be a bit more descriptive?
bedms/utils_train.py
Outdated
:return torch.Tensor: A tensor representing the | ||
average of embeddings in the most common cluster. | ||
""" | ||
flattened_embeddings = [embedding.tolist() for embedding in embeddings] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think this is flattened, its just converting the list of np.array
s to a list of lists, right?
y_tensor, | ||
) | ||
# Create DataLoader | ||
return DataLoader(dataset, batch_size=batch_size, shuffle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify num_workers?
model_filename = f"model_{self.model_name}.pth" | ||
label_encoder_filename = f"label_encoder_{self.model_name}.pkl" | ||
vectorizer_filename = f"vectorizer_{self.model_name}.pkl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! Very nice generic solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes are really good, and I like that code is not duplicated and simple.
One thing before merging, can we clean .git history/folder, because it is 100mb?
Changes:
TrainStandardizer
module for custom model creation