Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/export dataset for research #3511

Merged
merged 6 commits into from
Oct 18, 2024

Conversation

L-M-Sherlock
Copy link
Contributor

proto/anki/scheduler.proto Show resolved Hide resolved
self, min_entries: int = 0, target_path: str | None = None
) -> None:
if target_path is None:
target_path = os.path.join(os.path.dirname(self.path), "dataset.pb")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer target_path was required instead of falling back to dataset.pb. Otherwise looks good 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice. I update it.

.filter_map(|deck| {
let parent_id = self
.storage
.parent_decks(&deck)
Copy link
Member

@dae dae Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very efficient, but I don't know if we have an existing routine you can reuse to look up the parent from a hashmap, so I won't ask you to change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find an existing routine. What about that:

    pub(crate) fn immediate_parent_deck(&self, child: &Deck) -> Result<Option<Deck>> {
        if let Some(parent_name) = immediate_parent_name(child.name.as_native_str()) {
            if let Some(parent_did) = self.get_deck_id(parent_name)? {
                return self.get_deck(parent_did);
            }
        }
        Ok(None)
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a function named deck_tree:

pub fn deck_tree(&mut self, timestamp: Option<TimestampSecs>) -> Result<DeckTreeNode> {
let names = self.storage.get_all_deck_names()?;
let mut tree = deck_names_to_tree(names.into_iter());
let decks_map = self.storage.get_decks_map()?;
add_collapsed_and_filtered(&mut tree, &decks_map, timestamp.is_none());
if self.default_deck_is_empty()? {
hide_default_deck(&mut tree);
}

Do you mean this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd need immediate_parent_deck() and then looking up the parent in a hashset of all deck names. Alternatively, you could use deck_tree() and then walk the tree once to determine the parent of each deck. But this is code that will run infrequently and the delay won't be huge, so I'm ok with not making this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant immediate_parent_name() - glad to see you figured that out :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactor the code. Now it looks up the parent from the hashmap.

469f286

@@ -1,3 +1,4 @@
use std::collections::HashMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing that Jarrett. Just one little tweak - please move this down

@dae
Copy link
Member

dae commented Oct 18, 2024

Thanks :-) It may take a couple of weeks for me to get this onto AnkiWeb - I still need to catch up with the recent FSRS changes as well. Will let you know.

@dae dae merged commit b09326c into ankitects:main Oct 18, 2024
1 check passed
@L-M-Sherlock L-M-Sherlock deleted the Feat/export-dataset-for-research branch October 18, 2024 08:58
@dae
Copy link
Member

dae commented Oct 21, 2024

Finally got updates working again. Here's a small test export. Please confirm it has everything you want, then I can do a bigger run.

https://drive.google.com/file/d/1JmUvu6SELAy9jSXejksjfTBTKmYEQ3mI/view?usp=sharing

Is 20k a good number, or do you want more or less?

@L-M-Sherlock
Copy link
Contributor Author

Is 20k a good number, or do you want more or less?

20k feels too much. My device will run out of its disk. 10k is enough. Thanks! I will check the test export and give you feedback later.

@L-M-Sherlock
Copy link
Contributor Author

Please confirm it has everything you want, then I can do a bigger run.

Checked! It's really what I want.

@dae
Copy link
Member

dae commented Oct 22, 2024

Last time, I think I gathered revlogs with 5k+ reviews. Do you have a preference for min number?

@L-M-Sherlock
Copy link
Contributor Author

Is it possible to restrict the limit to exclude users who have thousands of revlog entries generated by rescheduling on change?

@dae
Copy link
Member

dae commented Oct 22, 2024

Not currently - export_dataset() only takes min_entries. If you need this, you can submit a PR to change it, and then I will need to deploy a new update to AnkiWeb.

Please let me know if you'd like me to generate the entries as-is, or wait for the PR.

@dae
Copy link
Member

dae commented Oct 22, 2024

(sorry, just another post in case you didn't see my edit)

@L-M-Sherlock
Copy link
Contributor Author

Please let me know if you'd like me to generate the entries as-is, or wait for the PR.

Please wait for that, thanks! I'm afraid that this problem is worse for those users registered in last year because the reschedule on change feature was introduced in last year.

@dae
Copy link
Member

dae commented Oct 22, 2024

Is your goal to exclude users who have rescheduled on change? Or exclude the rescheduling revlogs for those users? If your goal is the former, won't that bias the results away from early adopters?

@L-M-Sherlock
Copy link
Contributor Author

My goal is the latter, because these entries are useless in optimization.

@Expertium
Copy link
Contributor

@dae any chance we will get the new dataset before the release of Anki 24.11?

@L-M-Sherlock
Copy link
Contributor Author

@Expertium, I have received the dataset. The benchmark will be updated in next two weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants