
implementing the UNIMPLEMENTED_PARSERS #97

Merged (40 commits into main), Sep 13, 2023
Conversation

Thomas-Lemoine (Collaborator)

No description provided.

@Thomas-Lemoine (Collaborator Author)

I'm pushing this branch before adding anything meaningful to it, to note down some points of confusion.

@Thomas-Lemoine (Collaborator Author)

I don't quite understand why row.set_status sets the status to '' at some point in the code, because I don't see how such a situation can occur.

@Thomas-Lemoine (Collaborator Author) commented Jul 21, 2023

I just figured out why fetching 'pdfs', 'html', etc. using the AlignmentDataset class seemed not to depend on the alignment-research-dataset-sources spreadsheet, https://docs.google.com/spreadsheets/d/1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI/edit?pli=1#gid=980957638. It's actually just getting the items from items-metadata, https://docs.google.com/spreadsheets/d/1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4/edit#gid=759210636, as far as I can tell.

edit: turns out it was all here:
https://drive.google.com/drive/u/0/folders/1QBJPWXa9PcGhSslReLy3xHx1FwUKNr3h

@Thomas-Lemoine (Collaborator Author)

@mruwnik GPT-4 recommended I create a "logger_config.py" file that configures a logger shared by every file, so that any file can just do from logger_config import logger and then use the logger as before, with logger.info and whatever else, and everything shows up in a gitignored log file while still printing the relevant info. Thoughts? Maybe that clutters things a bit more, or is more error-prone or less readable.
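For reference, the suggestion roughly amounts to a module like the sketch below (the file name comes from the comment above; the logger name and log file name are just assumptions):

    # logger_config.py -- minimal sketch of a shared logger module
    import logging

    logger = logging.getLogger("align_data")
    logger.setLevel(logging.INFO)

    # write to a gitignored file (name is illustrative) and also echo to stdout,
    # so GitHub Actions output still shows the messages
    _file_handler = logging.FileHandler("align_data.log")
    _file_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
    logger.addHandler(_file_handler)
    logger.addHandler(logging.StreamHandler())

Any module could then do from logger_config import logger and call logger.info(...) as usual.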

@mruwnik (Collaborator) commented Jul 29, 2023

meh? Are you actually going to use the file? Currently this gets run on GitHub Actions, so all you get is what is printed to stdout (unless you specifically tell it to save files), so it doesn't seem that useful. This kind of thing is very useful when you have a long-running process (like a web server), as it's a LOT easier to search through a file than a console window. But in this case it doesn't seem to add that much.
https://docs.python.org/3/howto/logging.html is the canonical source of logging info, but is a bit hard to read.

@Thomas-Lemoine (Collaborator Author)

I see. I guess in that case it's more of a print thing. I set it to write to the log file because I was printing a lot and the file was easier to read, but I could just print things while I'm testing and remove them afterwards.


with patch('requests.get', return_value=Mock(content=response)):
    article = dataset.process_entry(item)
    assert article.status == 'Withdrawn'
Collaborator Author

This kept causing an error, and then I thought it was a merging issue where at some point this was deleted from main but the merge didn't remove it, but I'm not so sure anymore. It appears that article.status is None but article.to_dict["status"] is "Withdrawn"?

Collaborator

alignment_dataset adds the status to the metadata rather than to the column. You'll need to update the list of columns in settings.py

Collaborator Author

Alright, adding this back in and seeing if the new elements of the ARTICLE_MAIN_KEYS list fix it


it passed, great


@property
def _query_items(self):
    return select(Article).where(Article.source == self.name)

@property
def _query_items(self):
Collaborator

you've got this twice

@@ -1,29 +1,36 @@
# %%
from collections import defaultdict, Counter
from dataclasses import dataclass
Collaborator

this whole file should be totally overhauled - a lot of it doesn't make sense anymore

return f"[{title}]({url})"


def flatten(val):
def flatten(val): # val is Union[List, Tuple, str] and returns List[str]
Collaborator

why not just add type hints? wouldn't that be better than a comment?

Collaborator Author

Yeah I did, my bad. I did the same thing you did yesterday night, looking through the changes and seeing what didn't make sense; I added the type hints, removed the _query_items duplicate, and a few other things. Just pushed that right now.
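(For reference, the type-hinted version presumably looks something like the sketch below; the body here is illustrative rather than the repo's exact implementation.)

    from typing import List, Tuple, Union

    def flatten(val: Union[List, Tuple, str]) -> List[str]:
        # recursively flatten nested lists/tuples into a flat list of strings
        if isinstance(val, (list, tuple)):
            return [leaf for item in val for leaf in flatten(item)]
        return [val]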

for row in tqdm(iterate_rows(source_sheet)):
    title = row.get("title")
    if not row.get("source_url"):
        row["source_url"] = row["url"]
    if row.get("source_url") in seen:
        logger.info(f'skipping "{title}", as it has already been seen')
    elif row.get('status'):
        logger.info(f'skipping "{title}", as it has a status set - remove it for this row to be processed')
    # elif row.get('status'): #TODO: Should we remove this
Collaborator

nah, leave it there. The idea is to change as few rows as possible to avoid rate limits. This function shouldn't be used that much anyway - it'll hopefully be deleted in the near future

return (
    item
    for item in df.itertuples()
    if not pd.isna(self.get_item_key(item))
Collaborator

pd.isna isn't needed - self.maybe will handle it for you. I'd even go so far as to say that the pd.isna will break this code, as self.get_item_key(item) should never be isna

Collaborator Author

Ah, self.get_item_key(item) uses self.maybe, which uses pd.isna anyway, right? In tests I just ran they have the same output; it seems that self.get_item_key(item) will sometimes be None (i.e. when the url is missing), and in that case it counts as isna. In any case I removed it; it now filters on "self.get_item_key(item) is not None", which makes more sense.
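So the generator presumably now reads roughly:

    return (
        item
        for item in df.itertuples()
        if self.get_item_key(item) is not None
    )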

@@ -231,5 +232,7 @@ def google_doc(url: str) -> Optional[str]:

    doc_id = res.group(1)
    body = fetch_element(f'https://docs.google.com/document/d/{doc_id}/export?format=html', 'body')
    if body:
        return MarkdownConverter().convert_soup(body).strip()
    if not body:
Collaborator

return body and MarkdownConverter().convert_soup(body).strip()?

Collaborator Author

good idea

Collaborator Author

mypy complains about this, since it's possible for body (Tag | None) to be a Tag that is falsy, in which case google_doc outputs str | None | Tag. I guess I can just ignore mypy, though; that is pretty minor.
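(One way to keep it compact while satisfying mypy would be an explicit None check, which keeps the declared Optional[str] return type; just a sketch:)

    # explicit check narrows body from Tag | None to Tag, so the return stays Optional[str]
    if body is None:
        return None
    return MarkdownConverter().convert_soup(body).strip()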

requirements.txt
python-dateutil
bs4
pytz
epub_meta
Collaborator

[NIT] these shouldn't be needed, as they're requirements of other things

Collaborator Author

Oh, I see. That's hard to tell; do you know of a way of checking this in general, or do you only know it for those ones in particular?
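One general way to check is to ask which installed packages declare a given project as a requirement, e.g. with pip show <package> (its Requires / Required-by fields), or with a small standard-library script like the sketch below (the package name is just an example):

    import re
    from importlib import metadata

    def reverse_dependencies(package: str) -> list[str]:
        """Names of installed distributions that declare `package` as a requirement."""
        dependents = set()
        for dist in metadata.distributions():
            for req in dist.requires or []:
                # a requirement string starts with the project name, e.g. "pytz>=2020.1; extra == 'tz'"
                match = re.match(r"[A-Za-z0-9._-]+", req)
                if match and match.group(0).lower() == package.lower():
                    dependents.add(dist.metadata["Name"])
        return sorted(dependents)

    # e.g. if pandas is installed, reverse_dependencies("pytz") should include it,
    # suggesting pytz doesn't need its own line in requirements.txt
    print(reverse_dependencies("pytz"))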

@@ -34,6 +34,7 @@
port = os.environ.get("ARD_DB_PORT", "3306")
db_name = os.environ.get("ARD_DB_NAME", "alignment_research_dataset")
DB_CONNECTION_URI = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
ARTICLE_MAIN_KEYS = ["id", "source", "title", "authors", "text", "url", "date_published"]
Collaborator

change this to ARTICLE_MAIN_KEYS = ["id", "source", 'source_type', "title", "authors", "text", "url", "date_published", "status", "comments"] for your failing test to start working

Collaborator Author

Ah I see, nice catch



@mruwnik (Collaborator) left a comment
looking good!

"arxiv-vanity.com": parse_vanity,
"ar5iv.labs.arxiv.org": parse_vanity,
# TODO: arxiv-vanity.com does not output the same type as the other parsers: Dict[str, str] instead of str
# ar5iv.labs.arxiv.org too. Are these pdf parsers? not rly, but they don't output the same type as the other html parsers
Collaborator

check #138 for a fix. This sort of evolved over time, from a simple "return the contents of this link" to "if it's basic html then just return the text, otherwise return a dict with additional metadata". The aforementioned PR should move everything over to dicts, which means that these parsers can also extract e.g. title, authors, etc. if it makes sense for them to do so

Collaborator Author

ah yeah that makes a lot of sense, I will review it now

return url and urlparse(url).netloc.lstrip('www.')
def remove_www(net_loc: str) -> str:
    return net_loc[4:] if net_loc.startswith("www.") else net_loc
return remove_www(urlparse(url).netloc)
Collaborator

why add the extra function here? How is it better than just having the code?

Collaborator Author

good point, not sure why I did all that

net_loc = urlparse(url).netloc
return net_loc[4:] if net_loc.startswith("www.") else net_loc

makes more sense

if contents := parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}"):
    return contents
if contents := parse_vanity(f"https://ar5iv.org/abs/{paper_id}"):
    return contents
return fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
Collaborator

you could even do

return (
    parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}") or
    parse_vanity(f"https://ar5iv.org/abs/{paper_id}") or
    fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
)

Collaborator Author

nice


authors = data.get('authors') or paper.get("authors")
if not authors: data['authors'] = []
elif not isinstance(authors, list): data['authors'] = [str(authors).strip()]
Collaborator

[NIT] I'm not a fan of these one-liner clauses
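(For comparison, spelled out on separate lines the same logic would read:)

    authors = data.get('authors') or paper.get("authors")
    if not authors:
        data['authors'] = []
    elif not isinstance(authors, list):
        data['authors'] = [str(authors).strip()]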

mruwnik previously approved these changes Aug 21, 2023
@Thomas-Lemoine Thomas-Lemoine mentioned this pull request Aug 27, 2023
@Thomas-Lemoine (Collaborator Author)

Not sure if I'm doing this wrong: I keep fetching origin/main to merge onto this branch and then pushing that merge, but it makes the whole commit history very messy, and I have to fix some merge conflicts, which get drowned in the merge and are hard to find afterwards.

@@ -189,7 +189,7 @@ def unprocessed_items(self, items=None) -> list | filter:

        return items_to_process

    def fetch_entries(self) -> Generator[Article, None, None]:
Collaborator

why are you removing this?

Collaborator Author

Not much of a reason; the IDE could already infer the type signature without it being explicitly written.

Collaborator Author

I brought it back

@mruwnik (Collaborator) left a comment
go go go!

Thomas-Lemoine merged commit 16e4c84 into main on Sep 13, 2023
1 check passed
Thomas-Lemoine deleted the implement_more_parsers branch on September 13, 2023 at 17:58