
Added Report Generator Functionality #166

Merged: 17 commits, Jul 10, 2024

Conversation

tfnribeiro (Collaborator):

I am making this PR to make it easier to review and discuss some of the functionality of the report generator.

Main Feature:

We can now generate reports based on the SQL database and on data collected during the crawls. This is done by keeping a dictionary that tracks the number of "skipped" articles and the reason why they were skipped, as well as the content that was removed (e.g. repetitive sentences or short paragraphs).

Implementation Details:

The report shown here can be found in the following ZIP:

reports.zip

(screenshot of the generated report)

tools/crawl_summary

Contains CrawlReport, which is responsible for keeping track of each feed's crawl time and the number of articles retrieved from the database, or the reason why they were not crawled. It also collects the sentences that were removed based on pre-defined "noisy" patterns. Essentially, it tracks the information that is logged to the command line during crawling.

This object is created every time article_crawler.py performs a crawl, and a JSON file is written at the end of the process.

The class can also load all generated files from the last N days relative to the day the script is run, e.g. the files generated in the last 7 days. The report uses this to show which sentences were removed and why articles weren't crawled.

All data is stored in crawl_data, separated by language.
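
For illustration, here is a minimal sketch of the kind of bookkeeping described above (the class name, attributes, and method names are illustrative assumptions, not the actual implementation):

import datetime
import json
import os


class CrawlReportSketch:
    """Illustrative only: tracks skipped articles and removed content per language and feed."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self.crawl_date = datetime.datetime.now()
        self.data = {"lang": {}}

    def add_feed(self, lang_code, feed_id):
        self.data["lang"].setdefault(lang_code, {})[feed_id] = {
            "skipped": {},            # reason -> count of skipped articles
            "removed_sentences": [],  # sentences filtered out as "noisy"
        }

    def track_skip(self, lang_code, feed_id, reason):
        skipped = self.data["lang"][lang_code][feed_id]["skipped"]
        skipped[reason] = skipped.get(reason, 0) + 1

    def save(self):
        # One JSON file per crawl, written to the crawl_data folder.
        folder = os.path.join(self.save_dir, "crawl_data")
        os.makedirs(folder, exist_ok=True)
        filename = f"{self.crawl_date:%Y-%m-%d_%H%M%S}.json"
        with open(os.path.join(folder, filename), "w") as f:
            json.dump(self.data, f)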

tools/report_generator

Contains three main pieces of code:

  • data_extractor.py: contains the logic to extract the data from the database. Essentially, each method runs a defined query and loads the result into a DataFrame, which is then used to shape the data shown in the plots. It can also be used in the notebook, which works as a playground of sorts to test the different queries. (A rough sketch of this pattern follows after this list.)
  • generate_report.py: contains all the plotting logic to generate the HTML report, which is placed in the reports folder. For each plot, an image is produced specifically for that report. The HTML expects all the images to be located in an img folder relative to the HTML file. Currently the HTML doesn't use much formatting, essentially just placing the images one after the other. Eventually, we could add some divs with some CSS to make it slightly nicer to read. I have used an open-source CSS library to format the tables (pure-min.css).
  • explore_report_notebook.ipynb: contains examples of how the data is processed and makes it possible to develop new plots or quickly iterate on the styling of the plots currently in use.
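
To illustrate the query-to-DataFrame-to-plot pattern described above, here is a rough sketch (the connection string, table/column names, and the query itself are hypothetical, not the ones actually used):

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Placeholder connection string; the real extractor reads its DB settings elsewhere.
engine = create_engine("mysql+pymysql://user:password@localhost/zeeguu")


def get_articles_per_language(days):
    # Hypothetical query: number of crawled articles per language over the period.
    query = f"""
        SELECT l.code AS language, COUNT(*) AS article_count
        FROM article a
        JOIN language l ON a.language_id = l.id
        WHERE a.published_time >= DATE_SUB(NOW(), INTERVAL {int(days)} DAY)
        GROUP BY l.code
    """
    return pd.read_sql(query, engine)


def plot_articles_per_language(df, img_folder="img"):
    # The HTML report expects every image to live in an img folder next to the HTML file.
    ax = df.plot.bar(x="language", y="article_count", legend=False)
    ax.set_ylabel("articles crawled")
    plt.savefig(f"{img_folder}/articles_per_language.png", bbox_inches="tight")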

Notes:

  • When developing the CrawlReport, I noticed that we do not keep track of the articles that are rejected (not stored in the DB). This means that, at every crawl, we re-attempt to crawl articles that are noisy. We should add a caching mechanism based on the URLs we do not want to parse. Could be through a table in the DB maybe?
  • I want to create some tests to ensure the queries return the data we expect. What would be the best way to do this? Should I create a test folder with some "fake" data in the same format as the database, plus specific test cases?
  • The DataExtractor cannot run queries for periods longer than 200 days in my testing. I haven't been able to find the timeout setting related to read timeouts.
  • To fully deploy this, we will need some cron jobs that clean up the logs to avoid storing too much data. None of the generated data is tracked (it is added to .gitignore).
  • For the active users, I count only users that have at least 1 minute of activity in either reading or exercises during the period being evaluated, which by default is a week (7 days). (A rough sketch of this filter follows after this list.)
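
A rough sketch of the active-user filter from the last note (the DataFrame columns are hypothetical):

import pandas as pd

ONE_MINUTE = 60  # seconds


def filter_active_users(user_activity: pd.DataFrame) -> pd.DataFrame:
    # user_activity is assumed to have one row per user with the total reading and
    # exercise time (in seconds) accumulated over the evaluated period (default: 7 days).
    is_active = (user_activity["reading_seconds"] >= ONE_MINUTE) | (
        user_activity["exercise_seconds"] >= ONE_MINUTE
    )
    return user_activity[is_active]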

@mircealungu (Member) left a comment:
Cool stuff @tfnribeiro. A few comments for you.


class CrawlReport:
    def __init__(self) -> None:
        path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
@mircealungu (Member):
it would be good if we parameterized this with two envvars maybe?
FOLDER_WITH_CRAWL_SUMMARIES and FOLDER_FOR_REPORT_OUTPUT

Alternatively, take them as script parameters in the main script and pass them here?

That way we have more control at deployment time.

@tfnribeiro (Collaborator, Author):
Yes, I am not sure. This is initialized in article_crawler.py, so I think the parameter would be passed there rather than to this file specifically. Maybe I can add a setter for the dir path, though generate_report also expects this path to be in a specific place.

I could make both article_crawler and generate_report take a CrawlReportPath in case we want to change it.

@tfnribeiro (Collaborator, Author):
I decided to go with the environment variables, I think it's working!
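
For reference, a minimal sketch of how the two environment variables could be read (the fallback values are assumptions):

import os

# Fall back to the development defaults when the variables are not set.
CRAWL_SUMMARY_DIR = os.environ.get(
    "FOLDER_WITH_CRAWL_SUMMARIES", "tools/crawl_summary/crawl_data"
)
REPORT_OUTPUT_DIR = os.environ.get(
    "FOLDER_FOR_REPORT_OUTPUT", "tools/report_generator/reports"
)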

        dt_parsed = datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)
        return dt_parsed

    def __convert_dt_to_str(self, datetime):
@mircealungu (Member):
move this as sub-method where it's used?

@tfnribeiro (Collaborator, Author):
I decided to make it a private method of the class, but maybe I will make it a class method to reflect this. It is just there to ensure the datetime is converted to the format expected for the file output.

Let me know if you agree; otherwise I can make it part of the methods that save and load. In my mind, dt_to_str and str_to_dt work as a pair.
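
For illustration, the pair could look something like this (a sketch: the strptime half is taken from the diff above, while the strftime direction and the value of STR_DATETIME_FORMAT are assumptions):

import datetime

STR_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format string


class CrawlReport:
    def __convert_str_to_dt(self, str_datetime):
        dt_parsed = datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)
        return dt_parsed

    def __convert_dt_to_str(self, dt):
        return dt.strftime(STR_DATETIME_FORMAT)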

path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
self.default_save_dir = os.path.join(path_to_dir, "crawl_data")
self.data = {"lang": {}}
self.crawl_date = datetime.datetime.now()
@mircealungu (Member):
this confused me.
maybe:

  • initialize it with None and add a comment about how it will be initialized later
  • rename it to something more intention-revealing? maybe something like: start_of_crawl_period. unless I misunderstood its role.

@tfnribeiro (Collaborator, Author) commented Jun 27, 2024:
So the intention is to use it as the way to save the files; I actually forgot to update it and it was still using datetime.now().

My idea is: you start an article_crawler, this object stores the time it started, and that time is used as a sort of key for that crawl, i.e. the crawl done for Danish at that starting time. I don't think this should be a value that is manipulated; it works as an almost private property.

It also allows us to load files based on their date, so if I want the last 7 days, I can use this date to work out which crawls fall in that window relative to the current date. Does that make sense?
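
A rough sketch of that loading-by-date logic (the file naming scheme and function name are assumptions):

import datetime
import glob
import os


def load_recent_crawl_files(save_dir, lang_code, day_period=7):
    # Assumes each crawl was saved as <lang_code>/<crawl start time>.json, so the
    # crawl date can be recovered from the filename and compared with the cutoff.
    cutoff = datetime.datetime.now() - datetime.timedelta(days=day_period)
    recent_files = []
    for path in glob.glob(os.path.join(save_dir, lang_code, "*.json")):
        crawl_time = datetime.datetime.strptime(
            os.path.basename(path).replace(".json", ""), "%Y-%m-%d_%H%M%S"
        )
        if crawl_time >= cutoff:
            recent_files.append(path)
    return recent_files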

traceback.print_exc()
crawl_report.add_feed_error(feed.language.code, feed.id, str(e))
@mircealungu (Member):
this is elegant.

@@ -22,7 +26,9 @@ def setUp(self):

def testDifficultyOfFeedItems(self):
    feed = FeedRule().feed1
    download_from_feed(feed, zeeguu.core.model.db.session, 3, False)
    crawl_report = CrawlReport()
    crawl_report.add_feed(feed.language.code, feed.id)
@mircealungu (Member):
a Feed object knows its language.code and id, so in principle you could just expect a Feed object in the add_feed method?

@tfnribeiro (Collaborator, Author):
That's a good point, I will make that change!
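
A minimal sketch of that change (assuming Feed exposes language.code and id, as in the test above; the dictionary layout is illustrative):

# Before: the caller unpacks the Feed itself
# crawl_report.add_feed(feed.language.code, feed.id)

# After: add_feed takes the Feed object and does the unpacking internally
def add_feed(self, feed):
    lang_code = feed.language.code
    feed_id = feed.id
    self.data["lang"].setdefault(lang_code, {})[feed_id] = {
        "feed_errors": [],
        "skipped": {},
    }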

def test_ml_classification(self):
    db_content = mock_readability_call(URL_ML_JP_PAYWALL)

-   is_quality, reason = sufficient_quality(db_content)
+   is_quality, reason, code = sufficient_quality(db_content)
    assert not is_quality
    assert reason == "ML Prediction was 'Paywalled'."
@mircealungu (Member):
do we still need the reason now that we have the code?

@tfnribeiro (Collaborator, Author):
I wanted to discuss this with you. I was thinking about having a little class that has a Code + Reason property so we can keep both. I think some codes might not be enough on their own to fully understand what happened, so they could come with a short description. What do you think?
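
For illustration, such a class could be as small as this (names are hypothetical):

class RejectionOutcome:
    """Pairs a machine-friendly code with a human-readable reason."""

    def __init__(self, code, reason):
        self.code = code
        self.reason = reason

    def __repr__(self):
        return f"{self.code}: {self.reason}"


# e.g. sufficient_quality(...) could then return something like:
# RejectionOutcome("ML_PAYWALL", "ML Prediction was 'Paywalled'.")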

@mircealungu (Member) commented Jun 27, 2024:

  • When developing the CrawlReport, I noticed that we do not keep track of the articles that are rejected (not stored in the DB). Meaning that, at every crawl, we re-attempt to crawl articles that are noisy. We should enable a caching mechanism based on the URLs we do not want to parse. Could be through a table in the DB maybe?

Yes. That's a good observation. The simplest thing could be an extra table for broken articles? maybe with attributes date attempted and reason for it being not downloaded?

  • I want to create some tests to ensure the queries are returning the data we expect. What would be the best way to do this? Should I create a test folder, with some "fake" data in the same format as the database, with specific test cases?

This would probably be the way I think.

  • The DataExtractor cannot run queries for periods longer than 200 days in my testing. I haven't been able to find the timeout setting relating to read_timeouts.

I don't think we'll need that for now.

  • To fully deploy this, we will need some cron jobs that ensure that the logs are cleaned to avoid storing too much data. All the data that is generated is not tracked (added to .gitignore)

Once we merge, I'll take care of that.

  • For the active users, I consider only users that have at least 1 minute of activity in either reading or exercises during the period being evaluated - which by default is a week (7 days)

Sounds good!

@tfnribeiro (Collaborator, Author) commented Jun 27, 2024:

I will also look into the creation of the rejected_article table.

The way I see it, it will need the URL + Feed Id + Reason for rejection. This would essentially mean that the CrawlReport only needs to keep track of the content filtering, which maybe ends up being more elegant.

Alternatively, we could just add them and mark them as broken, and not index them in ES, which I guess would do the same and use functionality already in the system?

What do you think?
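
For reference, a minimal sketch of what such a table could look like as an SQLAlchemy model (the table name, column names, and sizes are assumptions based on the discussion above):

from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String

from zeeguu.core.model import db  # assumed location of the shared db object


class ArticleRejected(db.Model):
    __tablename__ = "article_rejected"

    id = Column(Integer, primary_key=True)
    url = Column(String(512), unique=True)
    feed_id = Column(Integer, ForeignKey("feed.id"))
    reason = Column(String(255))
    date_attempted = Column(DateTime, default=datetime.now)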

@mircealungu (Member) left a comment:
cool stuff!

@mircealungu mircealungu merged commit b415994 into zeeguu:master Jul 10, 2024
1 of 2 checks passed