Added Report Generator Functionality #166
Conversation
Cool stuff @tfnribeiro. A few comments for you.
tools/crawl_summary/crawl_report.py
Outdated

class CrawlReport:
    def __init__(self) -> None:
        path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
it would be good if we parameterized this with two env vars maybe? FOLDER_WITH_CRAWL_SUMMARIES and FOLDER_FOR_REPORT_OUTPUT.
Alternatively, take them as script parameters in the main script and pass them here? That way we have more control at deployment time.
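As a minimal sketch of the env-var idea (the variable names come from the comment above; the fallback defaults are assumptions, not the PR's code):

```python
import os

# Read the two folders from the environment; fall back to paths relative to
# this file when the variables are not set (the fallbacks are assumptions).
FOLDER_WITH_CRAWL_SUMMARIES = os.environ.get(
    "FOLDER_WITH_CRAWL_SUMMARIES",
    os.path.join(os.path.dirname(__file__), "crawl_data"),
)
FOLDER_FOR_REPORT_OUTPUT = os.environ.get(
    "FOLDER_FOR_REPORT_OUTPUT",
    os.path.join(os.path.dirname(__file__), "reports"),
)
```

Either approach (env vars or CLI parameters) keeps the paths configurable at deployment time without touching the class itself.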
Yes, I am not sure. This is initialized in article_crawler.py, so I think the parameter would be passed there instead for this file specifically. Maybe I can make a setter for the dir path, though generate_report also expects this path to be in a specific place.
I could make both article_crawler and generate_report take a CrawlReportPath in case we want to change it.
I decided to go with the environment variables, I think it's working!
dt_parsed = datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)
return dt_parsed

def __convert_dt_to_str(self, datetime):
move this to be a sub-method where it's used?
I decided to make it a private method for the class, but maybe I will make it a class method to reflect this. This is just to ensure the datetime is converted into the expected format for the file output.
Let me know if you agree; otherwise I can make it part of the methods that save and load. In my mind I see dt_to_str and str_to_dt as a pair.
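For reference, a minimal sketch of the conversion pair being discussed (only __convert_dt_to_str and STR_DATETIME_FORMAT appear in the diff; the str-to-dt method name and the format value are assumptions):

```python
import datetime

STR_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed value, for illustration only


class CrawlReport:
    def __convert_str_to_dt(self, str_datetime):
        # Parse a date string read from a saved report file back into a datetime.
        return datetime.datetime.strptime(str_datetime, STR_DATETIME_FORMAT)

    def __convert_dt_to_str(self, dt):
        # Serialize a datetime in the same format used when saving the file.
        return dt.strftime(STR_DATETIME_FORMAT)
```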
tools/crawl_summary/crawl_report.py
Outdated

path_to_dir = os.sep.join(inspect.getfile(self.__class__).split(os.sep)[:-1])
self.default_save_dir = os.path.join(path_to_dir, "crawl_data")
self.data = {"lang": {}}
self.crawl_date = datetime.datetime.now()
this confused me. maybe:
- initialize it with None and add a comment about how it will be initialized later
- rename it to something more intention-revealing? maybe something like start_of_crawl_period, unless I misunderstood its role.
So the intention is to use it as the way to save the files; I actually forgot to update it, and it was still using datetime.now().
So my idea is: you start an article_crawler, this object stores the time it started, and that time is used as sort of the key for that crawl, i.e. the crawl done for Danish at that starting time. I don't think this should be a value that is manipulated, and it works as an almost-private property.
It also allows us to load files based on their date, so if I want the last 7 days, I can use this date together with the current date to sort out when the crawl was done. Does that make sense?
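A rough sketch of that "last 7 days" loading idea, assuming the crawl date is encoded in the file name (the file-name layout, the format string, and the function name are all assumptions):

```python
import datetime
import os

FILENAME_DATETIME_FORMAT = "%Y_%m_%d_%H_%M_%S"  # assumed file-name format


def report_files_from_last_days(save_dir, days=7):
    # Keep only report files whose crawl date falls within the last `days` days.
    cutoff = datetime.datetime.now() - datetime.timedelta(days=days)
    selected = []
    for filename in os.listdir(save_dir):
        date_part = filename.rsplit(".", 1)[0]  # assumed "<crawl date>.json" naming
        try:
            crawl_date = datetime.datetime.strptime(date_part, FILENAME_DATETIME_FORMAT)
        except ValueError:
            continue  # not a report file
        if crawl_date >= cutoff:
            selected.append(os.path.join(save_dir, filename))
    return selected
```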
tools/feed_retrieval.py
Outdated

traceback.print_exc()
crawl_report.add_feed_error(feed.language.code, feed.id, str(e))
this is elegant.
@@ -22,7 +26,9 @@ def setUp(self):

def testDifficultyOfFeedItems(self):
    feed = FeedRule().feed1
    download_from_feed(feed, zeeguu.core.model.db.session, 3, False)
    crawl_report = CrawlReport()
    crawl_report.add_feed(feed.language.code, feed.id)
a Feed object knows its language.code and id, so in principle, you could just expect a Feed object in the add_feed method?
That's a good point, I will make that change!
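A minimal sketch of that change, with add_feed taking the Feed object itself (the internal bookkeeping shown here is an assumption, not the PR's actual structure):

```python
class CrawlReport:
    # ... (rest of the class as in the PR)

    def add_feed(self, feed):
        # A Feed already knows its language code and id, so callers pass the object.
        lang_code = feed.language.code
        self.data["lang"].setdefault(lang_code, {"feeds": {}})
        self.data["lang"][lang_code]["feeds"][feed.id] = {
            "article_report": {},  # placeholder structure (assumption)
            "errors": [],
        }
```

The test above would then call crawl_report.add_feed(feed) instead of crawl_report.add_feed(feed.language.code, feed.id).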
def test_ml_classification(self):
    db_content = mock_readability_call(URL_ML_JP_PAYWALL)

-   is_quality, reason = sufficient_quality(db_content)
+   is_quality, reason, code = sufficient_quality(db_content)
    assert not is_quality
    assert reason == "ML Prediction was 'Paywalled'."
do we still need the reason now that we have the code?
I wanted to discuss this with you: I was thinking about having a little class that has a Code + Reason property so we can keep both. Some codes might not be enough to fully understand on their own, so they could have a short description with them. What do you think?
Yes. That's a good observation. The simplest thing could be an extra table for broken articles? Maybe with attributes like the date attempted and the reason it was not downloaded?
This would probably be the way, I think.
I don't think we'll need that for now.
Once we merge, I'll take care of that.
Sounds good!
I will also look into the creation of that table. The way I see it, it will need the URL + Feed Id + Reason for rejection, which would essentially mean that rejected articles are not added as regular articles. Alternatively, we can just add them and mark them as broken, and not index them in ES, which I guess would do the same and uses functionality that is already in the system. What do you think?
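A minimal sketch of the "little class with a Code + Reason property" idea (the class name and the example codes are purely illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectionReason:
    # A stable machine-readable code plus a short human-readable description.
    code: int
    reason: str


# Illustrative examples of what such pairs could look like:
ML_PAYWALL = RejectionReason(1, "ML Prediction was 'Paywalled'.")
TOO_SHORT = RejectionReason(2, "Article text was too short.")
```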
…id from the feed object.
- Allows DB to keep track of why articles were broken
Fixed access when loading a dictionary.
- There are some languages which have no active users, so many plots would be empty. Instead, the report states that there are no active users.
- To be considered active, a user must read for at least a minute or do exercises for at least a minute.
cool stuff!
I am making this PR in order to make it easier to review and discuss some of the functionality of the report generator.
Main Feature:
We can now generate reports based on the SQL database and some data collected during the crawls. This is done by keeping a dictionary that tracks the number of "skipped" articles, the reason why they were skipped, and the content that was removed (i.e. repetitive sentences or short paragraphs).
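The exact layout of that tracking dictionary isn't spelled out here; purely as an illustrative assumption, it could look roughly like this, grouped by language and feed:

```python
# Illustrative shape of the per-crawl tracking data (assumption, not the PR's exact schema):
crawl_data = {
    "lang": {
        "da": {
            "feeds": {
                42: {
                    "articles_downloaded": 12,
                    "skipped": {"ML Prediction was 'Paywalled'.": 3, "Too short": 1},
                    "sents_removed": {"Share this article": 5},
                    "errors": [],
                },
            },
        },
    },
}
```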
Implementation Details:
The report shown here can be found in the following ZIP: reports.zip
tools/crawl_summary
Contains CrawlReport, which is responsible for keeping track of each feed's crawling time and the number of articles retrieved from the database, or the reason why they were not crawled. It also collects the sentences that were removed based on pre-defined "noisy" patterns. Essentially, it tracks the information that is logged to the command line during crawling. This object is created every time a crawl is performed by article_crawler.py, and a JSON file is written at the end of the process. The class can also load all the files generated within a given number of days from the day the script runs, e.g. the files from the last 7 days. This is used in the report to show which sentences were removed and why articles weren't crawled. All data is stored in crawl_data, separated by language.
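A small self-contained sketch of the "one JSON file per crawl" behaviour described above (the helper name, file-name scheme, and per-language directory layout are assumptions):

```python
import datetime
import json
import os


def save_crawl_report(data, save_dir, crawl_date):
    # Hypothetical helper: write one JSON file per crawl, named after the crawl's start time.
    os.makedirs(save_dir, exist_ok=True)
    filename = crawl_date.strftime("%Y_%m_%d_%H_%M_%S") + ".json"
    with open(os.path.join(save_dir, filename), "w") as f:
        json.dump(data, f)


# Example: store a (here empty) report for Danish under crawl_data/.
save_crawl_report({"lang": {}}, os.path.join("crawl_data", "da"), datetime.datetime.now())
```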
tools/report_generator
Contains three main pieces of code:
- data_extractor.py: Contains the logic to extract the data from the database. Essentially, each method runs a defined query and loads it into a dataframe, which is used to manipulate the data shown in the plots. This can also be used in the notebook, which serves as a playground of sorts to test the different queries (see the sketch after this list).
- generate_report.py: Contains all the plotting logic to generate the HTML report, which is placed in the reports folder. For each plot, an image is produced specifically for that report. The HTML expects all the images to be located in an img folder relative to the HTML file. Currently the HTML doesn't use much formatting, essentially just placing the images one after the other; eventually, we could add some divs with CSS to make it slightly nicer to read. I have used an open-source CSS library to format the tables (pure-min.css).
- explore_report_notebook.ipynb: Will contain examples of how the data was processed, and allows developing new plots or quickly iterating on the styling of the plots currently in use.
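A minimal sketch of the query-to-dataframe pattern that data_extractor.py follows, as described above (the query, table name, connection string, and method name are illustrative assumptions):

```python
import pandas as pd
import sqlalchemy


class DataExtractor:
    def __init__(self, connection_string):
        self.engine = sqlalchemy.create_engine(connection_string)

    def get_article_counts_per_language(self, days=30):
        # Each extractor method wraps one SQL query and returns a dataframe for plotting.
        query = sqlalchemy.text(
            """
            SELECT language_id, COUNT(*) AS article_count
            FROM article
            WHERE published_time > NOW() - INTERVAL :days DAY
            GROUP BY language_id
            """
        )
        return pd.read_sql(query, self.engine, params={"days": days})
```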
Notes:
- DataExtractor cannot run queries for periods longer than 200 days in my testing. I haven't been able to find the timeout setting related to read_timeouts.
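Regarding the timeout: if the connection goes through PyMySQL (an assumption), the read timeout can usually be raised via connect_args when creating the SQLAlchemy engine, e.g.:

```python
from sqlalchemy import create_engine

# Placeholder DSN; read_timeout is a PyMySQL connection option, in seconds.
engine = create_engine(
    "mysql+pymysql://user:password@localhost/zeeguu",
    connect_args={"read_timeout": 600},
)
```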