-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply proper CSS for proper page display - step 1 #29
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #29 +/- ##
==========================================
+ Coverage 47.22% 51.48% +4.25%
==========================================
Files 7 9 +2
Lines 432 540 +108
Branches 45 61 +16
==========================================
+ Hits 204 278 +74
- Misses 226 258 +32
- Partials 2 4 +2 ☔ View full report in Codecov by Sentry. |
c817b49
to
5ef373a
Compare
832e7cb
to
c3b6266
Compare
5ef373a
to
b6d1d52
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
How much of the CSS Processor is based on warc2zim?
Once this has matured enough, I think it should move to scraperlib.
scraper/src/libretexts2zim/css.py
Outdated
*[ | ||
parent.name | ||
for parent in reversed(original_path.parents) | ||
if parent.name and parent.name != ".." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this doing exactly? What's the consequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows to compute a "clean" target path to use inside the ZIM, even if original path was a relative one. E.g. for an asset located at ../folder/file.ext
, it will place the file in the ZIM at folder/file.ext
. This is meant to create a folder structure "similar" to the online one, but simplifying the file path to get rid of query strings for instance. I will isolate this logic in a dedicated function to add a docstring and make it clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that we are removing those walking paths instead of resolving them so something like folder/sub1/sub2/sub3/sub4/../../../file.txt
will end up at folder/sub1/sub2/sub3/sub4/file.ext
and not at folder/file.ext
.
Given all resources in-file are rewritten with full path, it should not have an impact but, as you suggested, this should all be properly documented as this would lead a future dev into a false trail on some random css issue.
) | ||
if not relative_css_path: | ||
return | ||
url_node.value = str(relative_css_path) # pyright: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for those wide ignores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that there is 2 or 3 rules to ignore, I don't recall exactly, but when you specify it then it doesn't fit on one line, so black want to move things to match the 88 chars limit, and then it becomes a mess because hint is not a proper line anymore, ... Not ideal obviously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I hate those. The problem is that those will never be fixed because the next developer will have no clue what was being ignored and this will lead to more stuff ignored, eventually weakening what we want to rely on.
I propose we keep them and disable black for those specific lines.
Here's what it would look like on that function:
def _process_node(self, css_original_url: str, node: ast.Node):
"""Process one single CSS node"""
if isinstance(
node,
ast.QualifiedRule
| ast.SquareBracketsBlock
| ast.ParenthesesBlock
| ast.CurlyBracketsBlock,
):
self._process_list(
css_original_url,
node.content, # pyright: ignore[reportUnknownArgumentType, reportUnknownMemberType]
)
elif isinstance(node, ast.FunctionBlock):
if node.lower_name == "url": # pyright: ignore[reportUnknownMemberType]
url_node: ast.Node = node.arguments[0] # pyright: ignore
relative_css_path = self._process_url(
css_original_url,
url_node.value, # pyright: ignore [reportAttributeAccessIssue]
)
if not relative_css_path:
return
url_node.value = str(relative_css_path) # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip
url_node.representation = f'"{serialize_url(str(relative_css_path))}"' # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip
else:
self._process_list(
css_original_url,
node.arguments,
)
elif isinstance(node, ast.AtRule):
self._process_list(
css_original_url,
node.prelude,
)
self._process_list(
css_original_url,
node.content,
)
elif isinstance(node, ast.Declaration):
self._process_list(
css_original_url,
node.value,
)
elif isinstance(node, ast.URLToken):
relative_css_path = self._process_url(
css_original_url,
node.value,
)
if not relative_css_path:
return
node.value = str(relative_css_path)
node.representation = f"url({serialize_url(str(relative_css_path))})"
Interestingly, I simply removed three of those ignores 🙃
This is my plan indeed. So far it is a bit of copy-and-paste indeed, but with modifications due to the specificities of libretexts. The same will happen with HTML rewriting which will be needed for images, videos, links, ... Both features are indeed not specific to warc2zim at all and primitives should definitely be shared in python-scraperlib, I'm sure we will need them again in other scrapers. |
This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print). It does not consider inline CSS which is needed and will be handled in a step 2.
b6d1d52
to
4749161
Compare
This is part of #8
This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print).
It handles fetching all assets (images, fonts) referenced in the CSS.
It handles rewriting of the CSS to fix URLs.
It does not consider inline CSS which is needed and will be handled in a step 2.