Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

281 vejret shouldnt be classify as technology science #283

Merged

Conversation

tfnribeiro
Copy link
Collaborator

As described in the issue, when labeling I thought vejret would have news that are similar to what's associated with the keyword 'klima', which are more focused on larger climate events rather than description of the weather in a daily / weekly basis that we see in 'vejret'.

This means we should remove this to avoid these types of news to be inferred as 'Science & Technology' in the future.

I added a script (tools/update_es_based_on_url_keyword.py), where we can give a keyword to delete the current mappings topics with that keyword, and re-index the articles accordingly. This can also be used to simply update a keyword which had no mapping and has been mapped a topic.

- This will then re-index the affected documents in ES accordingly to the resulting changes.
- Reflect the action of the method.
- This tool can be used to re-update any articles that might get new mappings in the future, so it's not just related with migration.
@tfnribeiro tfnribeiro linked an issue Nov 5, 2024 that may be closed by this pull request
Copy link

github-actions bot commented Nov 5, 2024

Module view diffs:
diffdiffdiff

- Removed redundant code in mysql -> es.
- Renamed set_new_topics_url_keywords -> add_new_topics_from_url_keyword , to better reflect the purpose of the method
- Added a method recalculate_topics_from_url_keywords to allow to recalculate the topics from the url_keywords.
- Improved the indexing process by checking if the document should compute embeddings. This should only happen for new articles, or articles that have updated their content. This will allow for faster updates in the future.
- Remove unnecessary code
- Improved the script to only delete the topic mappings that are associated with the keyword that is being updated.
- If the document is found in ES, then check if we need to re-compute the embeddings, this should speed up the process of updating the DB in case of errors.
- Renamed variables for clarity
@tfnribeiro
Copy link
Collaborator Author

After a discussion with Mircea, we identified some edge-cases that needed to be covered correctly.

I have made a series of improvements, and I will exemplify an example here, currently I have "vejret" associated with the topic "news", and I want to remove this topic.

image

I will run the script with the following options:

URL_KEYWORD_TO_UPDATE = "vejret"
DELETE_ARTICLE_NEW_TOPICS = True
RECALCULATE_TOPICS = True
RE_INDEX_ONLY_ARTICLES_IN_ES = True
ITERATION_STEP = 1000

This means, I want to delete the mapping "vejret" -> "News", and recalculate the topics for those articles.

This results in the following output:

Started at: 2024-11-08 13:06:32.807006
Got articles with url_keyword 'vejret', total: 650
Deleting new_topics 'News' for articles which have the keyword: 'vejret'
Found '650' topic mappings to delete.
MySQL deletion completed.
Re-indexing only existing articles in ES...
Starting re-indexing process...
  0%|                                                                                 | 0/1 [00:00<?, ?it/s]
Batch finished. ADDED/UPDATED:650 | ERRORS: 0
100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.68s/it]
[]
Total articles added/updated: 650
Ended at: 2024-11-08 13:07:26.641374
Process took: 0:00:53.834368

Now, going back to the home page:

image

The articles are no longer stored as "News", but can still be found by searching:

image

@mircealungu Take a look when you have time and let me know if it makes more sense now!

Copy link
Member

@mircealungu mircealungu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few questions for you @tfnribeiro

def gen_docs(articles_w_topics):
for article in articles_w_topics:
try:
yield create_or_update_bulk_docs(article, db_session)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this called bulk_docs? it seems to work on a single article.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an iterator that works for each article and creates a json to work for the bulk function, so the bulk here means that it's to be used by the bulk functions, not necessary that is taking a bulk of articles. I renamed it to: 'create_or_update_doc_for_bulk' maybe that's a bit more clear?

if not article:
print(f"Skipped for: '{i}', article not in DB.")
continue
if recalculate_topic:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not clean code. a function should do one thing and one thing only. here the function is fetch, a getter, but then it also does some other commits to the DB. Could we separate these? I wonder if that would help me better understand the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the recalculate topics to the function: 'recalculate_article_url_keyword_topics', which is run before the indexing step.

db_session.commit()
print("MySQL deletion completed.")

if len(target_ids) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be moved above, immediately after the getting of the target_ids? the print('Got articles... is on the other branch of this if I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This has to do with re-indexing, not the data integrity. If I moved this above, then there would be data that wouldn't be updated in the database, which I think is more confusing.

The way I see it, is that the DB should always be consistent regarding the mappings, but the ES can be incomplete, e.g. some articles might not be in ES and we do not necessarily need to index everything in ES.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some testing, I realized that originally, I was only updating the articles in the DB that were in ES. I still think we should keep the DB data correct with the updates, even if it takes somewhat longer - but ES can be a bit more flexible. This was related with the 'fetch_articles_by_id' not being clean, and re-calcualting the topics when fetching.

I think the flow now makes more sense: Delete Mappings and the Associating of the Keyword -> Recalculate the Topics for Articles -> Index them in ES.

# Updating url_keyword new_topic mapping
# And the topics that were added based on that keyword.
if DELETE_ARTICLE_NEW_TOPICS:
new_topics_ids_to_delete = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this precisely named? do we ever delete a topic_id?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't delete topic_ids, but we delete any articles that contain the topics_ids that were associated with the keyword.

I tried to name it 'topics_ids_to_delete_mappings', hopefully that makes more sense.

@tfnribeiro
Copy link
Collaborator Author

I have reviewed the code, and I hope now it's more understandable. I have renamed some variables / methods, and tried to make the steps more clear.

Let me know if it is more clear?

- Fixed a case, if the Index doesn't exist an error is thrown.
- Use the configuration constant rather than hardcoded value
- By default, we are doing deletion.
@tfnribeiro
Copy link
Collaborator Author

I also fixed a bug introduced by the quicker method of filtering articles in ES. This only happens when the user doesn't have the ES index zeeguu. Overall, it should be more robust this way.

@tfnribeiro tfnribeiro force-pushed the 281-vejret-shouldnt-be-classify-as-technology-science branch from 923752f to 1d30360 Compare November 21, 2024 12:10
- Before we had a "add_topic" which just attempted to add a topic to the article. However, this would fail if there was an existing article. To avoid situations where we replace topics when we don't expect, I separated the methods into 'add_topic_if_doesnt_exist' and 'add_or_replace_topic'. These names should be clear to their function.
- With the methods above, whenever we add keyword topics or hardcoded these should be prioritized over the inferred topics. This means whenever a keyword or hardset is topic we would use the 'add_or_replace_topic' and for inferred topics 'add_topic_if_doesnt_exist'
- The table naming was still under the New_ terminology and that has been migrated.
@mircealungu mircealungu merged commit 6ea514e into master Nov 26, 2024
2 of 3 checks passed
@mircealungu mircealungu deleted the 281-vejret-shouldnt-be-classify-as-technology-science branch November 26, 2024 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vejret shouldn't be classify as Technology & Science
2 participants