implementing the UNIMPLEMENTED_PARSERS #97
Conversation
I'm pushing this branch before adding meaningful things to it, to note down confusions
I don't quite understand why row.set_status set the status to '' at some point in the code, because I don't see how such an issue can occur
I just figured out why fetching 'pdfs', 'html', etc. using the AlignmentDataset class seemed not to depend on the alignment-research-dataset-sources spreadsheet, https://docs.google.com/spreadsheets/d/1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI/edit?pli=1#gid=980957638. It's actually just getting the items from https://docs.google.com/spreadsheets/d/1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4/edit#gid=759210636, items-metadata, afaict. Edit: turns out it was all here:
@mruwnik gpt4 recommended I create a "logger_config.py" file that configures a logger for every file, so that any file can just do from logger_config import logger and then use the logger as before, with logger.info and whatever else, and it'll all show up in a log file that is gitignored but will print relevant info etc. Thoughts? Maybe that clutters stuff a bit more, or is more error-prone or less readable.
meh? Are you actually going to use the file? Currently this gets run on github actions, so all you get is what is printed to stdout (unless you specify it to save files), so it doesn't seem that useful. This kind of thing is very useful when you have a long running process (like a web server), as it's a LOT easier to search through a file than a console window. But in this case it doesn't seem to add that much.
I see. I guess in that case it's more of a print thing. I did set it to write to the log file because I was printing a lot and that file was easier to read, but I could just print things while I'm testing and remove them afterwards.
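For reference, a minimal sketch of the logger_config.py idea floated above (the file name, logger name, and log file path are assumptions from this thread, not code from the repo):

# logger_config.py -- hypothetical sketch of the shared-logger idea discussed above
import logging

logger = logging.getLogger("align_data")  # assumed logger name
logger.setLevel(logging.INFO)

if not logger.handlers:  # guard against adding duplicate handlers on repeated imports
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")

    file_handler = logging.FileHandler("run.log")  # the gitignored log file
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    stream_handler = logging.StreamHandler()  # still writes to the console for CI visibility
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

Any module would then do from logger_config import logger and call logger.info(...) as usual.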
with patch('requests.get', return_value=Mock(content=response)):
    article = dataset.process_entry(item)
    assert article.status == 'Withdrawn'
This kept causing an error, and then I thought it was a merging issue where at some point this was deleted from main but the merge didn't remove it, but I'm not so sure anymore. It appears that article.status is None but article.to_dict["status"] is "Withdrawn"?
alignment_dataset adds the status to the metadata rather than to the column. You'll need to update the list of columns in settings.py
Alright, adding this back in and seeing if the new elements of the ARTICLE_MAIN_KEYS list fix it
it passed, great
@property
def _query_items(self):
    return select(Article).where(Article.source == self.name)

@property
def _query_items(self):
you've got this twice
@@ -1,29 +1,36 @@
# %%
from collections import defaultdict, Counter
from dataclasses import dataclass
this whole file should be totally overhauled - a lot of it doesn't make sense anymore
return f"[{title}]({url})"

def flatten(val):
def flatten(val): # val is Union[List, Tuple, str] and returns List[str]
why not just add type hints? wouldn't that be better than a comment?
Yeah I did, my bad. I did the same thing you did yesterday night, looking through the changes and seeing what didn't make sense, and I did add that and removed the _query_items duplicate, and a few other things. Just pushed that right now.
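(For illustration, the type-hinted signature might look something like the following; the body is only a guess at flatten's behaviour based on the comment above, not code from the repo.)

from typing import List, Tuple, Union

def flatten(val: Union[List, Tuple, str]) -> List[str]:
    """Recursively flatten nested lists/tuples into a flat list of strings."""
    if isinstance(val, (list, tuple)):
        return [item for sublist in val for item in flatten(sublist)]
    return [val]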
for row in tqdm(iterate_rows(source_sheet)):
    title = row.get("title")
    if not row.get("source_url"):
        row["source_url"] = row["url"]
    if row.get("source_url") in seen:
        logger.info(f'skipping "{title}", as it has already been seen')
    elif row.get('status'):
        logger.info(f'skipping "{title}", as it has a status set - remove it for this row to be processed')
    # elif row.get('status'): #TODO: Should we remove this
nah, leave it there. The idea is to change as few rows as possible to avoid rate limits. This function shouldn't be used that much anyway - it'll hopefully be deleted in the near future
return (
    item
    for item in df.itertuples()
    if not pd.isna(self.get_item_key(item))
pd.isna isn't needed - self.maybe will handle it for you. I'd even go so far as to say that the pd.isna will break this code, as self.get_item_key(item) should never be isna
ah, self.get_item_key(item) uses self.maybe, which uses pd.isna anyway, right? In tests I just ran, they have the same output; it seems like self.get_item_key(item) will be None sometimes (i.e. when the url is missing), and in that case it counts as isna. In any case I removed it; it filters on "self.get_item_key(item) is not None" now, which makes more sense.
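(i.e. per the description above, the generator now reads roughly like this:)

return (
    item
    for item in df.itertuples()
    if self.get_item_key(item) is not None
)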
@@ -231,5 +232,7 @@ def google_doc(url: str) -> Optional[str]:

    doc_id = res.group(1)
    body = fetch_element(f'https://docs.google.com/document/d/{doc_id}/export?format=html', 'body')
    if body:
        return MarkdownConverter().convert_soup(body).strip()
    if not body:
return body and MarkdownConverter().convert_soup(body).strip()
?
good idea
mypy complains about this, since it's possible for body (Tag | None) to be a Tag that is falsy, in which case google_doc outputs str | None | Tag. I guess I can just ignore mypy, though that is pretty minor.
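(One mypy-friendly middle ground, sketched here with the names from the diff above: keep an explicit conditional so the function only ever returns str or None, never the falsy Tag itself.)

body = fetch_element(f'https://docs.google.com/document/d/{doc_id}/export?format=html', 'body')
# explicit conditional keeps the return type Optional[str] rather than str | None | Tag
return MarkdownConverter().convert_soup(body).strip() if body else None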
requirements.txt (Outdated)
python-dateutil
bs4
pytz
epub_meta
[NIT] these shouldn't be needed, as they're requirements of other things
Oh I see. That's hard to tell, though; do you know of a way of checking this in general, or do you only know it for those ones in particular?
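One general way to check this (a sketch using only the standard library, assuming the packages are installed in the current environment) is to ask which installed distributions declare a given package in their requirements:

import re
from importlib.metadata import distributions

def required_by(name: str) -> list[str]:
    """Return the names of installed distributions that list `name` as a requirement."""
    target = name.lower().replace("_", "-")
    dependents = []
    for dist in distributions():
        for req in dist.requires or []:
            # take just the leading package name from strings like "bs4>=0.0.1; extra == 'x'"
            match = re.match(r"[A-Za-z0-9._-]+", req)
            if match and match.group(0).lower().replace("_", "-") == target:
                dependents.append(dist.metadata["Name"])
    return dependents

print(required_by("pytz"))  # non-empty if e.g. pandas already pulls it in

(pip show pytz prints a Required-by: line with the same information.)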
align_data/settings.py (Outdated)
@@ -34,6 +34,7 @@
port = os.environ.get("ARD_DB_PORT", "3306")
db_name = os.environ.get("ARD_DB_NAME", "alignment_research_dataset")
DB_CONNECTION_URI = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
ARTICLE_MAIN_KEYS = ["id", "source", "title", "authors", "text", "url", "date_published"]
change this to ARTICLE_MAIN_KEYS = ["id", "source", 'source_type', "title", "authors", "text", "url", "date_published", "status", "comments"]
for your failing test to start working
Ah I see, nice catch
with patch('requests.get', return_value=Mock(content=response)):
    article = dataset.process_entry(item)
    assert article.status == 'Withdrawn'
alignment_dataset adds the status to the metadata rather than to the column. You'll need to update the list of columns in settings.py
looking good!
"arxiv-vanity.com": parse_vanity, | ||
"ar5iv.labs.arxiv.org": parse_vanity, | ||
# TODO: arxiv-vanity.com does not output the same type as the other parsers: Dict[str, str] instead of str | ||
# ar5iv.labs.arxiv.org too. Are these pdf parsers? not rly, but they don't output the same type as the other html parsers |
check #138 for a fix. This sort of evolved over time from a simple "return the contents of this link" to "if basic html then just return the text, otherwise return a dict with additional metadata". The aforementioned PR should move everything over to dicts, which means that these parsers can also extract e.g. title, authors etc. if it makes sense for them to do so.
ah yeah that makes a lot of sense, I will review it now
return url and urlparse(url).netloc.lstrip('www.')
def remove_www(net_loc: str) -> str:
    return net_loc[4:] if net_loc.startswith("www.") else net_loc
return remove_www(urlparse(url).netloc)
why add the extra function here? How is it better than just having the code?
good point, not sure why I did all that.

net_loc = urlparse(url).netloc
return net_loc[4:] if net_loc.startswith("www.") else net_loc

makes more sense
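(Worth noting, and presumably why the startswith check exists at all: str.lstrip('www.') strips any leading run of the characters 'w' and '.', not the literal prefix, so it can eat into the hostname itself.)

"www.example.com".lstrip("www.")            # 'example.com' -- happens to work
"weightagnostic.github.io".lstrip("www.")   # 'eightagnostic.github.io' -- oops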
if contents := parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}"):
    return contents
if contents := parse_vanity(f"https://ar5iv.org/abs/{paper_id}"):
    return contents
return fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
you could even do

return (
    parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}") or
    parse_vanity(f"https://ar5iv.org/abs/{paper_id}") or
    fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
)
nice
authors = data.get('authors') or paper.get("authors")
if not authors: data['authors'] = []
elif not isinstance(authors, list): data['authors'] = [str(authors).strip()]
[NIT] I'm not a fan of these oneliner clauses
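(The same logic expanded into conventional blocks, purely as a formatting sketch of the lines quoted above:)

authors = data.get('authors') or paper.get("authors")
if not authors:
    data['authors'] = []
elif not isinstance(authors, list):
    data['authors'] = [str(authors).strip()]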
not sure if I'm doing this wrong; I keep fetching origin/main to merge onto this branch and then pushing that merge, but it makes the whole commit history very messy, and I have to fix some merge conflicts, which get drowned in the merge and are hard to find afterwards.
@@ -189,7 +189,7 @@ def unprocessed_items(self, items=None) -> list | filter:

    return items_to_process

def fetch_entries(self) -> Generator[Article, None, None]:
why are you removing this?
Not much of a reason, but the reason was that the IDE could already infer the type signature without it being explicitly written.
I brought it back
go go go!