[Feat] OP-wise Insight Mining #516

Merged
merged 24 commits into from
Dec 20, 2024
Commits
083b665
+ add auto mode for analyzer: load all filters that produce stats to …
HYLcool Dec 12, 2024
662df5e
+ add default mem_required for those model-based OPs
HYLcool Dec 13, 2024
926c3da
- support wordcloud drawing for str or str list fields in stats
HYLcool Dec 13, 2024
27347c0
- take the minimum one of dataset length and auto num
HYLcool Dec 13, 2024
d19f92f
* update default export path
HYLcool Dec 13, 2024
fbd6726
* set version limit for wandb to avoid exception
HYLcool Dec 13, 2024
9f9f85b
+ add docs for auto mode
HYLcool Dec 13, 2024
566eb5b
+ support t-test for Measure
HYLcool Dec 16, 2024
7b8ee5c
* fix some bugs
HYLcool Dec 16, 2024
601d9a2
- support analyze a dataset object
HYLcool Dec 17, 2024
34f2ab6
- support analysis on tags in meta
HYLcool Dec 17, 2024
8531a01
- support analysis with tagging OPs
HYLcool Dec 17, 2024
4d6b701
- move tags into the meta field
HYLcool Dec 18, 2024
35aa6bd
- do not tell tags using their suffix
HYLcool Dec 18, 2024
85e1392
- add insight mining
HYLcool Dec 18, 2024
e3d7b8b
* resolve the bugs when running insight mining in multiprocessing mode
HYLcool Dec 19, 2024
3ca9994
Merge branch 'main' into feat/insight_mining
HYLcool Dec 19, 2024
16ca358
* update unittests
HYLcool Dec 20, 2024
dfb0bca
* update unittests
HYLcool Dec 20, 2024
f8b9539
* update unittests
HYLcool Dec 20, 2024
45259e5
* update readme for analyzer
HYLcool Dec 20, 2024
174ee05
Merge branch 'main' into feat/insight_mining
HYLcool Dec 20, 2024
51f53dc
* use more detailed key
HYLcool Dec 20, 2024
58001ca
+ add reference
HYLcool Dec 20, 2024
Merge branch 'main' into feat/insight_mining
# Conflicts:
#	data_juicer/ops/__init__.py
#	data_juicer/ops/base_op.py
HYLcool committed Dec 19, 2024
commit 3ca999480adc286907983135a07a6c627e5cc107
5 changes: 3 additions & 2 deletions data_juicer/ops/__init__.py
@@ -1,6 +1,7 @@
-from . import deduplicator, filter, mapper, selector
+from . import aggregator, deduplicator, filter, grouper, mapper, selector
 from .base_op import (NON_STATS_FILTERS, OPERATORS, TAGGING_OPS, UNFORKABLE,
-                      Deduplicator, Filter, Mapper, Selector)
+                      Aggregator, Deduplicator, Filter, Grouper, Mapper,
+                      Selector)
 from .load import load_ops

 __all__ = [
8 changes: 8 additions & 0 deletions data_juicer/ops/base_op.py
@@ -237,6 +237,14 @@ def run(self, dataset):
                                   num_proc=self.runtime_np(),
                                   batch_size=self.batch_size,
                                   desc='Adding new column for meta')
+        if self.index_key is not None:
+
+            def add_index(sample, idx):
+                sample[self.index_key] = idx
+                return sample
+
+            dataset = dataset.map(add_index, with_indices=True)
+
         return dataset

     def empty_history(self):
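
Note on the `base_op.py` hunk: when `index_key` is set, `run` writes each sample's row position into the sample under that key by calling `map(..., with_indices=True)`. The sketch below mimics that behavior in plain Python so it runs without Data-Juicer or the `datasets` library. `map_with_indices`, `attach_index`, and the key name `'__index__'` are hypothetical stand-ins for illustration, not names taken from this PR.

```python
from typing import Any, Callable, Dict, List, Optional

Sample = Dict[str, Any]


def map_with_indices(samples: List[Sample],
                     fn: Callable[[Sample, int], Sample]) -> List[Sample]:
    """Hypothetical stand-in for `dataset.map(fn, with_indices=True)`:
    calls fn(sample, idx) on a copy of every sample, in order."""
    return [fn(dict(sample), idx) for idx, sample in enumerate(samples)]


def attach_index(samples: List[Sample],
                 index_key: Optional[str] = None) -> List[Sample]:
    """Mirror the new branch in `run`: do nothing unless a key is set."""
    if index_key is None:
        return samples

    def add_index(sample: Sample, idx: int) -> Sample:
        # Same body as the helper added in the diff, with the key passed in
        # explicitly instead of read from `self.index_key`.
        sample[index_key] = idx
        return sample

    return map_with_indices(samples, add_index)


rows = [{'text': 'a'}, {'text': 'b'}, {'text': 'c'}]
indexed = attach_index(rows, index_key='__index__')
print(indexed[2])  # {'text': 'c', '__index__': 2}
```

The stored index is the sample's position at the moment `run` is called, so any later filtering or reordering leaves it pointing at the original position. How downstream OPs use the index is not visible in this condensed diff.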