
Added Report Generator Functionality #166

Merged: 17 commits, Jul 10, 2024

Conversation

tfnribeiro (Collaborator):

I am making this PR to make it easier to review and discuss some of the functionality of the report generator.

Main Feature:

We can now generate reports based on the SQL database and on data collected during the crawls. This is done by keeping a dictionary that tracks the number of "skipped" articles and the reason why they were skipped, as well as the content that was removed (e.g. repetitive sentences or short paragraphs).

Implementation Details:

The report shown here can be found in the following ZIP:

reports.zip

(screenshot of the generated report)

tools/crawl_summary

Contains CrawlReport, which is responsible for keeping track of each feed's crawl time and the number of articles retrieved from the database, or the reason why they were not crawled. It also collects the sentences that were removed based on pre-defined "noisy" patterns. Essentially, it tracks the information that is logged to the command line during crawling.

This object is created every time article_crawler.py performs a crawl, and a JSON file is written at the end of the process.

The class can also load all generated files from the last N days relative to the day the script is run, e.g. the files generated in the last 7 days. The report uses this to show which sentences were removed and why articles weren't crawled.

All data is stored in crawl_data, separated by language.
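
For illustration, here is a minimal sketch of the kind of bookkeeping described above (the class name, attributes, and method names are illustrative assumptions, not the actual implementation):

import datetime
import json
import os


class CrawlReportSketch:
    """Illustrative only: tracks skipped articles and removed content per language and feed."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self.crawl_date = datetime.datetime.now()
        self.data = {"lang": {}}

    def add_feed(self, lang_code, feed_id):
        self.data["lang"].setdefault(lang_code, {})[feed_id] = {
            "skipped": {},            # reason -> count of skipped articles
            "removed_sentences": [],  # sentences filtered out as "noisy"
        }

    def track_skip(self, lang_code, feed_id, reason):
        skipped = self.data["lang"][lang_code][feed_id]["skipped"]
        skipped[reason] = skipped.get(reason, 0) + 1

    def save(self):
        # One JSON file per crawl, written to the crawl_data folder.
        folder = os.path.join(self.save_dir, "crawl_data")
        os.makedirs(folder, exist_ok=True)
        filename = f"{self.crawl_date:%Y-%m-%d_%H%M%S}.json"
        with open(os.path.join(folder, filename), "w") as f:
            json.dump(self.data, f)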

tools/report_generator

Contains three main pieces of code:

  • data_extractor.py: contains the logic to extract the data from the database. Essentially, each method runs a defined query and loads the result into a DataFrame, which is then used to shape the data shown in the plots. It can also be used in the notebook, which works as a playground of sorts to test the different queries. (A rough sketch of this pattern follows after this list.)
  • generate_report.py: contains all the plotting logic to generate the HTML report, which is placed in the reports folder. For each plot, an image is produced specifically for that report. The HTML expects all the images to be located in an img folder relative to the HTML file. Currently the HTML doesn't use much formatting, essentially just placing the images one after the other. Eventually, we could add some divs with some CSS to make it slightly nicer to read. I have used an open-source CSS library to format the tables (pure-min.css).
  • explore_report_notebook.ipynb: contains examples of how the data is processed and makes it possible to develop new plots or quickly iterate on the styling of the plots currently in use.
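
To illustrate the query-to-DataFrame-to-plot pattern described above, here is a rough sketch (the connection string, table/column names, and the query itself are hypothetical, not the ones actually used):

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Placeholder connection string; the real extractor reads its DB settings elsewhere.
engine = create_engine("mysql+pymysql://user:password@localhost/zeeguu")


def get_articles_per_language(days):
    # Hypothetical query: number of crawled articles per language over the period.
    query = f"""
        SELECT l.code AS language, COUNT(*) AS article_count
        FROM article a
        JOIN language l ON a.language_id = l.id
        WHERE a.published_time >= DATE_SUB(NOW(), INTERVAL {int(days)} DAY)
        GROUP BY l.code
    """
    return pd.read_sql(query, engine)


def plot_articles_per_language(df, img_folder="img"):
    # The HTML report expects every image to live in an img folder next to the HTML file.
    ax = df.plot.bar(x="language", y="article_count", legend=False)
    ax.set_ylabel("articles crawled")
    plt.savefig(f"{img_folder}/articles_per_language.png", bbox_inches="tight")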

Notes:

  • When developing the CrawlReport, I noticed that we do not keep track of the articles that are rejected (not stored in the DB). This means that, at every crawl, we re-attempt to crawl articles that are noisy. We should add a caching mechanism based on the URLs we do not want to parse. Could be through a table in the DB maybe?
  • I want to create some tests to ensure the queries return the data we expect. What would be the best way to do this? Should I create a test folder with some "fake" data in the same format as the database, plus specific test cases?
  • The DataExtractor cannot run queries for periods longer than 200 days in my testing. I haven't been able to find the timeout setting related to read timeouts.
  • To fully deploy this, we will need some cron jobs that clean up the logs to avoid storing too much data. None of the generated data is tracked (it is added to .gitignore).
  • For the active users, I count only users that have at least 1 minute of activity in either reading or exercises during the period being evaluated, which by default is a week (7 days). (A rough sketch of this filter follows after this list.)
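
A rough sketch of the active-user filter from the last note (the DataFrame columns are hypothetical):

import pandas as pd

ONE_MINUTE = 60  # seconds


def filter_active_users(user_activity: pd.DataFrame) -> pd.DataFrame:
    # user_activity is assumed to have one row per user with the total reading and
    # exercise time (in seconds) accumulated over the evaluated period (default: 7 days).
    is_active = (user_activity["reading_seconds"] >= ONE_MINUTE) | (
        user_activity["exercise_seconds"] >= ONE_MINUTE
    )
    return user_activity[is_active]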

@mircealungu (Member) left a comment:
Cool stuff @tfnribeiro. A few comments for you.


class CrawlReport:
    def __init__(self) -> None:
        path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
@mircealungu (Member):
it would be good if we parameterized this with two envvars maybe?
FOLDER_WITH_CRAWL_SUMMARIES and FOLDER_FOR_REPORT_OUTPUT

Alternatively, take them as script parameters in the main script and pass them here?

That way we have more control at deployment time.

@tfnribeiro (Collaborator, Author):
Yes, I am not sure. This is initialized in article_crawler.py, so I think the parameter would be passed there rather than to this file specifically. Maybe I can add a setter for the dir path, though generate_report also expects this path to be in a specific place.

I could make both article_crawler and generate_report take a CrawlReportPath in case we want to change it.

@tfnribeiro (Collaborator, Author):
I decided to go with the environment variables, I think it's working!
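
For reference, a minimal sketch of how the two environment variables could be read (the fallback values are assumptions):

import os

# Fall back to the development defaults when the variables are not set.
CRAWL_SUMMARY_DIR = os.environ.get(
    "FOLDER_WITH_CRAWL_SUMMARIES", "tools/crawl_summary/crawl_data"
)
REPORT_OUTPUT_DIR = os.environ.get(
    "FOLDER_FOR_REPORT_OUTPUT", "tools/report_generator/reports"
)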

        dt_parsed = datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)
        return dt_parsed

    def __convert_dt_to_str(self, datetime):
@mircealungu (Member):
move this as sub-method where it's used?

@tfnribeiro (Collaborator, Author):
I decided to make it a private method of the class, but maybe I will make it a class method to reflect this. It is just there to ensure the datetime is converted to the format expected for the file output.

Let me know if you agree; otherwise I can make it part of the methods that save and load. In my mind, dt_to_str and str_to_dt work as a pair.
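
For illustration, the pair could look something like this (a sketch: the strptime half is taken from the diff above, while the strftime direction and the value of STR_DATETIME_FORMAT are assumptions):

import datetime

STR_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format string


class CrawlReport:
    def __convert_str_to_dt(self, str_datetime):
        dt_parsed = datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)
        return dt_parsed

    def __convert_dt_to_str(self, dt):
        return dt.strftime(STR_DATETIME_FORMAT)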

path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
self.default_save_dir = os.path.join(path_to_dir, "crawl_data")
self.data = {"lang": {}}
self.crawl_date = datetime.datetime.now()
@mircealungu (Member):
this confused me.
maybe:

  • initialize it with None and add a comment about how it will be initialized later
  • rename it to something more intention-revealing? maybe something like: start_of_crawl_period. unless I misunderstood its role.

@tfnribeiro (Collaborator, Author) commented Jun 27, 2024:
So the intention is to use it as the way to save the files; I actually forgot to update it and it was still using datetime.now().

My idea is: you start an article_crawler, this object stores the time it started, and that time is used as a sort of key for that crawl, i.e. the crawl done for Danish at that starting time. I don't think this should be a value that is manipulated; it works as an almost private property.

It also allows us to load files based on their date, so if I want the last 7 days, I can use this date to work out which crawls fall in that window relative to the current date. Does that make sense?
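
A rough sketch of that loading-by-date logic (the file naming scheme and function name are assumptions):

import datetime
import glob
import os


def load_recent_crawl_files(save_dir, lang_code, day_period=7):
    # Assumes each crawl was saved as <lang_code>/<crawl start time>.json, so the
    # crawl date can be recovered from the filename and compared with the cutoff.
    cutoff = datetime.datetime.now() - datetime.timedelta(days=day_period)
    recent_files = []
    for path in glob.glob(os.path.join(save_dir, lang_code, "*.json")):
        crawl_time = datetime.datetime.strptime(
            os.path.basename(path).replace(".json", ""), "%Y-%m-%d_%H%M%S"
        )
        if crawl_time >= cutoff:
            recent_files.append(path)
    return recent_files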

traceback.print_exc()
crawl_report.add_feed_error(feed.language.code, feed.id, str(e))
@mircealungu (Member):
this is elegant.

@@ -22,7 +26,9 @@ def setUp(self):

def testDifficultyOfFeedItems(self):
    feed = FeedRule().feed1
    download_from_feed(feed, zeeguu.core.model.db.session, 3, False)
    crawl_report = CrawlReport()
    crawl_report.add_feed(feed.language.code, feed.id)
@mircealungu (Member):
a Feed object knows its language.code and id, so in principle you could just expect a Feed object in the add_feed method?

@tfnribeiro (Collaborator, Author):
That's a good point, I will make that change!
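
A minimal sketch of that change (assuming Feed exposes language.code and id, as in the test above; the dictionary layout is illustrative):

# Before: the caller unpacks the Feed itself
# crawl_report.add_feed(feed.language.code, feed.id)

# After: add_feed takes the Feed object and does the unpacking internally
def add_feed(self, feed):
    lang_code = feed.language.code
    feed_id = feed.id
    self.data["lang"].setdefault(lang_code, {})[feed_id] = {
        "feed_errors": [],
        "skipped": {},
    }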

def test_ml_classification(self):
    db_content = mock_readability_call(URL_ML_JP_PAYWALL)

-   is_quality, reason = sufficient_quality(db_content)
+   is_quality, reason, code = sufficient_quality(db_content)
    assert not is_quality
    assert reason == "ML Prediction was 'Paywalled'."
@mircealungu (Member):
do we still need the reason now that we have the code?

@tfnribeiro (Collaborator, Author):
I wanted to discuss this with you. I was thinking about having a little class that has a Code + Reason property so we can keep both. I think some codes might not be enough on their own to fully understand what happened, so they could come with a short description. What do you think?
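
For illustration, such a class could be as small as this (names are hypothetical):

class RejectionOutcome:
    """Pairs a machine-friendly code with a human-readable reason."""

    def __init__(self, code, reason):
        self.code = code
        self.reason = reason

    def __repr__(self):
        return f"{self.code}: {self.reason}"


# e.g. sufficient_quality(...) could then return something like:
# RejectionOutcome("ML_PAYWALL", "ML Prediction was 'Paywalled'.")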

@mircealungu (Member) commented Jun 27, 2024:

  • When developing the CrawlReport, I noticed that we do not keep track of the articles that are rejected (not stored in the DB). Meaning that, at every crawl, we re-attempt to crawl articles that are noisy. We should enable a caching mechanism based on the URLs we do not want to parse. Could be through a table in the DB maybe?

Yes. That's a good observation. The simplest thing could be an extra table for broken articles? maybe with attributes date attempted and reason for it being not downloaded?

  • I want to create some tests to ensure the queries are returning the data we expect. What would be the best way to do this? Should I create a test folder, with some "fake" data in the same format as the database, with specific test cases?

This would probably be the way I think.

  • The DataExtractor cannot run queries for periods longer than 200 days in my testing. I haven't been able to find the timeout setting relating to read_timeouts.

I don't think we'll need that for now.

  • To fully deploy this, we will need some cron jobs that ensure that the logs are cleaned to avoid storing too much data. All the data that is generated is not tracked (added to .gitignore)

Once we merge, I'll take care of that.

  • For the active users, I consider only users that have at least 1 minute of activity in either reading or exercises during the period being evaluated - which by default is a week (7 days)

Sounds good!

@tfnribeiro (Collaborator, Author) commented Jun 27, 2024:

I will also look into the creation of the rejected_article table.

The way I see it, it will need the URL + Feed Id + Reason for rejection. This would essentially mean that the CrawlReport only needs to keep track of the content filtering, which maybe ends up being more elegant.

Alternatively, we could just add them and mark them as broken, and not index them in ES, which I guess would do the same and use functionality already in the system?

What do you think?
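
For reference, a minimal sketch of what such a table could look like as an SQLAlchemy model (the table name, column names, and sizes are assumptions based on the discussion above):

from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String

from zeeguu.core.model import db  # assumed location of the shared db object


class ArticleRejected(db.Model):
    __tablename__ = "article_rejected"

    id = Column(Integer, primary_key=True)
    url = Column(String(512), unique=True)
    feed_id = Column(Integer, ForeignKey("feed.id"))
    reason = Column(String(255))
    date_attempted = Column(DateTime, default=datetime.now)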

@mircealungu (Member) left a comment:
cool stuff!

@mircealungu mircealungu merged commit b415994 into zeeguu:master Jul 10, 2024
1 of 2 checks passed