Handle libretexts.org glossary rewriting #66

Merged
merged 3 commits into from
Nov 13, 2024
Changes from 1 commit
19 changes: 7 additions & 12 deletions scraper/src/mindtouch2zim/libretexts/glossary.py
Codecov / codecov/patch: added lines #L12-L14, #L29, #L34, #L36, #L40-L41, #L43, #L45, #L47 and #L56-L57 were not covered by tests.

@@ -9,9 +9,9 @@

 def _get_formatted_glossary_row(row) -> str:
     """Format one row as HTML"""
     word = row.find("td", attrs={"data-th": "Word(s)"}).text
     definition = row.find("td", attrs={"data-th": "Definition"}).text
     return (
         '<p class="glossaryElement">\n'
         f' <span class="glossaryTerm">{word}</span>\n'
         " |\n"
@@ -20,36 +20,31 @@
     )


-def rewrite_glossary(original_content: str) -> str:
+def rewrite_glossary(original_content: str) -> str | None:
     """Statically rewrite the glossary of libretexts.org

     Only word and description columns are supported.
     """

     soup = BeautifulSoup(
         original_content,
         "html.parser", # prefer html.parser to not add <html><body> tags
     )

     glossary_table = None

-    for table in soup.find_all("table"):
-        if not table.caption:
-            continue
-        if table.caption and table.caption.text.strip() == "Example and Directions":
-            continue
-        if glossary_table:
-            raise GlossaryRewriteError("Too many glossary tables")
-        glossary_table = table
-
-    if not glossary_table:
-        raise GlossaryRewriteError("Glossary table not found")
+    tables = soup.find_all("table")
+    if len(tables) == 0:
+        # looks like this glossary is not using default template ; let's rewrite as
+        # a normal page
+        return None
+    glossary_table = tables[-1]

     tbody = glossary_table.find("tbody")
     if not tbody:
         raise GlossaryRewriteError("Glossary table body not found")

     glossary_table.insert_after(
         BeautifulSoup(
             "".join([_get_formatted_glossary_row(row) for row in tbody.find_all("tr")]),
             "html.parser", # prefer html.parser to not add <html><body> tags
@@ -58,5 +53,5 @@

     # remove all tables and scripts
     for item in soup.find_all("table") + soup.find_all("script"):
         item.decompose()
     return soup.prettify()
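
The change in how the glossary table is selected can be illustrated without BeautifulSoup. The following is a minimal sketch, not the scraper's code: `Table`, `select_table_before` and `select_table_after` are hypothetical names, and a table is reduced to its caption text (`None` when it has no caption). It only shows how the old and new rules behave on the same input.

```python
from dataclasses import dataclass
from typing import List, Optional


class GlossaryRewriteError(Exception):
    """Stand-in for the scraper's exception of the same name."""


@dataclass
class Table:
    """Hypothetical stand-in for a parsed <table> element."""

    caption: Optional[str]  # caption text, or None when the table has no caption


def select_table_before(tables: List[Table]) -> Table:
    """Old rule: exactly one captioned table other than the example table."""
    glossary_table = None
    for table in tables:
        if not table.caption:
            continue
        if table.caption.strip() == "Example and Directions":
            continue
        if glossary_table:
            raise GlossaryRewriteError("Too many glossary tables")
        glossary_table = table
    if not glossary_table:
        raise GlossaryRewriteError("Glossary table not found")
    return glossary_table


def select_table_after(tables: List[Table]) -> Optional[Table]:
    """New rule: take the last table; None means "rewrite as a normal page"."""
    if len(tables) == 0:
        return None
    return tables[-1]


if __name__ == "__main__":
    # Made-up page: an example table followed by an uncaptioned glossary table.
    page = [Table("Example and Directions"), Table(None)]
    print(select_table_after(page) is page[-1])
    try:
        select_table_before(page)
    except GlossaryRewriteError as exc:
        print("old rule raised:", exc)
    # A page without any table no longer raises; the caller is expected to fall back.
    print(select_table_after([]))
```

On this made-up page the old rule raises because no captioned glossary table exists, while the new rule simply picks the last table. Whether real libretexts.org glossaries look like this is an assumption of the sketch.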
7 changes: 4 additions & 3 deletions scraper/src/mindtouch2zim/processor.py
Codecov / codecov/patch: added lines #L481, #L483 and #L485 were not covered by tests.

@@ -475,13 +475,14 @@
     post_head_insert=None,
     notify_js_module=None,
 )
-if (
-    self.mindtouch_client.library_url.endswith(".libretexts.org")
-    and page.title == "Glossary"
+if self.mindtouch_client.library_url.endswith(".libretexts.org") and re.match(
+    r"^.*\/zz:_[^\/]*?\/20:_[^\/]*$", page.path
 ):
     rewriten = rewrite_glossary(page_content.html_body)
+    if not rewriten:
+        rewriten = rewriter.rewrite(page_content.html_body).content
 else:
     rewriten = rewriter.rewrite(page_content.html_body).content
 for path, urls in url_rewriter.items_to_download.items():
     if path in self.items_to_download:
         self.items_to_download[path].urls.update(urls)
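
To see which pages the new condition selects, the pattern from the diff can be tried on a few paths. This is a minimal sketch under stated assumptions: the sample paths are made up for illustration and are not taken from the PR, `is_glossary_path` and `rewrite_page` are hypothetical helper names, and the `.libretexts.org` host check is left out for brevity. The second helper mirrors the fallback visible in the diff, where a falsy result from the glossary rewrite sends the page through the generic rewriter.

```python
import re
from typing import Callable, Optional

# Pattern copied from the diff above.
GLOSSARY_PATH = re.compile(r"^.*\/zz:_[^\/]*?\/20:_[^\/]*$")


def is_glossary_path(path: str) -> bool:
    """True when the last segment starts with '20:_' and its parent with 'zz:_'."""
    return GLOSSARY_PATH.match(path) is not None


def rewrite_page(
    path: str,
    html: str,
    rewrite_glossary: Callable[[str], Optional[str]],
    rewrite_generic: Callable[[str], str],
) -> str:
    """Try the glossary rewrite on matching paths, else use the generic rewrite."""
    if is_glossary_path(path):
        rewritten = rewrite_glossary(html)
        if not rewritten:
            # Glossary rewrite gave up (returned None): treat as a normal page.
            rewritten = rewrite_generic(html)
    else:
        rewritten = rewrite_generic(html)
    return rewritten


if __name__ == "__main__":
    samples = [
        "Bookshelves/Sample_Book/zz:_Back_Matter/20:_Glossary",
        "Bookshelves/Sample_Book/zz:_Back_Matter/20:_Glosario",
        "Bookshelves/Sample_Book/zz:_Back_Matter/10:_Index",
        "Bookshelves/Sample_Book/01:_Chapter/Glossary",
        "Bookshelves/Sample_Book/zz:_Back_Matter/20:_Glossary/Sub",
    ]
    for sample in samples:
        print(is_glossary_path(sample), sample)
```

One visible consequence of matching on `page.path` instead of `page.title` is that the check no longer depends on the page being titled exactly "Glossary"; only the position and prefix of the last two path segments matter.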