
implementing the UNIMPLEMENTED_PARSERS #97

Merged (40 commits into main), Sep 13, 2023
Conversation

Thomas-Lemoine (Collaborator)

No description provided.

@Thomas-Lemoine (Collaborator Author)

I'm pushing this branch before adding anything meaningful to it, to note down some points of confusion.

@Thomas-Lemoine (Collaborator Author)

I don't quite understand why row.set_status sets the status to '' at some point in the code, because I don't see how such a situation can occur.

@Thomas-Lemoine (Collaborator Author) commented Jul 21, 2023

I just figured out why fetching 'pdfs', 'html', etc. using the AlignmentDataset class seemed not to depend on the alignment-research-dataset-sources spreadsheet, https://docs.google.com/spreadsheets/d/1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI/edit?pli=1#gid=980957638. It's actually just getting the items from items-metadata, https://docs.google.com/spreadsheets/d/1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4/edit#gid=759210636, as far as I can tell.

edit: turns out it was all here:
https://drive.google.com/drive/u/0/folders/1QBJPWXa9PcGhSslReLy3xHx1FwUKNr3h

@Thomas-Lemoine (Collaborator Author)

@mruwnik GPT-4 recommended I create a "logger_config.py" file that configures a logger shared by every file, so that any file can just do from logger_config import logger and then use the logger as before, with logger.info and whatever else, and everything shows up in a gitignored log file while still printing the relevant info. Thoughts? Maybe that clutters things a bit more, or is more error-prone or less readable.
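For reference, the suggestion roughly amounts to a module like the sketch below (the file name comes from the comment above; the logger name and log file name are just assumptions):

    # logger_config.py -- minimal sketch of a shared logger module
    import logging

    logger = logging.getLogger("align_data")
    logger.setLevel(logging.INFO)

    # write to a gitignored file (name is illustrative) and also echo to stdout,
    # so GitHub Actions output still shows the messages
    _file_handler = logging.FileHandler("align_data.log")
    _file_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
    logger.addHandler(_file_handler)
    logger.addHandler(logging.StreamHandler())

Any module could then do from logger_config import logger and call logger.info(...) as usual.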

@mruwnik (Collaborator) commented Jul 29, 2023

meh? Are you actually going to use the file? Currently this gets run on GitHub Actions, so all you get is what is printed to stdout (unless you specifically tell it to save files), so it doesn't seem that useful. This kind of thing is very useful when you have a long-running process (like a web server), as it's a LOT easier to search through a file than a console window. But in this case it doesn't seem to add that much.
https://docs.python.org/3/howto/logging.html is the canonical source of logging info, but is a bit hard to read.

@Thomas-Lemoine (Collaborator Author)

I see. I guess in that case it's more of a print thing. I set it to write to the log file because I was printing a lot and the file was easier to read, but I could just print things while I'm testing and remove them afterwards.


with patch('requests.get', return_value=Mock(content=response)):
    article = dataset.process_entry(item)
    assert article.status == 'Withdrawn'
Collaborator Author

This kept causing an error, and then I thought it was a merging issue where at some point this was deleted from main but the merge didn't remove it, but I'm not so sure anymore. It appears that article.status is None but article.to_dict["status"] is "Withdrawn"?

Collaborator

alignment_dataset adds the status to the metadata rather than to the column. You'll need to update the list of columns in settings.py

Collaborator Author

Alright, adding this back in and seeing if the new elements of the ARTICLE_MAIN_KEYS list fix it


it passed, great


@property
def _query_items(self):
    return select(Article).where(Article.source == self.name)

@property
def _query_items(self):
Collaborator

you've got this twice

@@ -1,29 +1,36 @@
# %%
from collections import defaultdict, Counter
from dataclasses import dataclass
Collaborator

this whole file should be totally overhauled - a lot of it doesn't make sense anymore

return f"[{title}]({url})"


def flatten(val):
def flatten(val): # val is Union[List, Tuple, str] and returns List[str]
Collaborator

why not just add type hints? wouldn't that be better than a comment?

Collaborator Author

Yeah I did, my bad. I did the same thing you did yesterday night, looking through the changes and seeing what didn't make sense; I added the type hints, removed the _query_items duplicate, and a few other things. Just pushed that right now.
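(For reference, the type-hinted version presumably looks something like the sketch below; the body here is illustrative rather than the repo's exact implementation.)

    from typing import List, Tuple, Union

    def flatten(val: Union[List, Tuple, str]) -> List[str]:
        # recursively flatten nested lists/tuples into a flat list of strings
        if isinstance(val, (list, tuple)):
            return [leaf for item in val for leaf in flatten(item)]
        return [val]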

for row in tqdm(iterate_rows(source_sheet)):
    title = row.get("title")
    if not row.get("source_url"):
        row["source_url"] = row["url"]
    if row.get("source_url") in seen:
        logger.info(f'skipping "{title}", as it has already been seen')
    elif row.get('status'):
        logger.info(f'skipping "{title}", as it has a status set - remove it for this row to be processed')
    # elif row.get('status'): #TODO: Should we remove this
Collaborator

nah, leave it there. The idea is to change as few rows as possible to avoid rate limits. This function shouldn't be used that much anyway - it'll hopefully be deleted in the near future

return (
    item
    for item in df.itertuples()
    if not pd.isna(self.get_item_key(item))
Collaborator

pd.isna isn't needed - self.maybe will handle it for you. I'd even go so far as to say that the pd.isna will break this code, as self.get_item_key(item) should never be isna

Collaborator Author

Ah, self.get_item_key(item) uses self.maybe, which uses pd.isna anyway, right? In tests I just ran they have the same output; it seems that self.get_item_key(item) will sometimes be None (i.e. when the url is missing), and in that case it counts as isna. In any case I removed it; it now filters on "self.get_item_key(item) is not None", which makes more sense.
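So the generator presumably now reads roughly:

    return (
        item
        for item in df.itertuples()
        if self.get_item_key(item) is not None
    )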

@@ -231,5 +232,7 @@ def google_doc(url: str) -> Optional[str]:

    doc_id = res.group(1)
    body = fetch_element(f'https://docs.google.com/document/d/{doc_id}/export?format=html', 'body')
    if body:
        return MarkdownConverter().convert_soup(body).strip()
    if not body:
Collaborator

return body and MarkdownConverter().convert_soup(body).strip()?

Collaborator Author

good idea

Collaborator Author

mypy complains about this, since it's possible for body (Tag | None) to be a Tag that is falsy, in which case google_doc outputs str | None | Tag. I guess I can just ignore mypy, though; that is pretty minor.
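(One way to keep it compact while satisfying mypy would be an explicit None check, which keeps the declared Optional[str] return type; just a sketch:)

    # explicit check narrows body from Tag | None to Tag, so the return stays Optional[str]
    if body is None:
        return None
    return MarkdownConverter().convert_soup(body).strip()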

requirements.txt
python-dateutil
bs4
pytz
epub_meta
Collaborator

[NIT] these shouldn't be needed, as they're requirements of other things

Collaborator Author

Oh, I see. That's hard to tell; do you know of a way of checking this in general, or do you only know it for those ones in particular?
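One general way to check is to ask which installed packages declare a given project as a requirement, e.g. with pip show <package> (its Requires / Required-by fields), or with a small standard-library script like the sketch below (the package name is just an example):

    import re
    from importlib import metadata

    def reverse_dependencies(package: str) -> list[str]:
        """Names of installed distributions that declare `package` as a requirement."""
        dependents = set()
        for dist in metadata.distributions():
            for req in dist.requires or []:
                # a requirement string starts with the project name, e.g. "pytz>=2020.1; extra == 'tz'"
                match = re.match(r"[A-Za-z0-9._-]+", req)
                if match and match.group(0).lower() == package.lower():
                    dependents.add(dist.metadata["Name"])
        return sorted(dependents)

    # e.g. if pandas is installed, reverse_dependencies("pytz") should include it,
    # suggesting pytz doesn't need its own line in requirements.txt
    print(reverse_dependencies("pytz"))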

@@ -34,6 +34,7 @@
port = os.environ.get("ARD_DB_PORT", "3306")
db_name = os.environ.get("ARD_DB_NAME", "alignment_research_dataset")
DB_CONNECTION_URI = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
ARTICLE_MAIN_KEYS = ["id", "source", "title", "authors", "text", "url", "date_published"]
Collaborator

change this to ARTICLE_MAIN_KEYS = ["id", "source", 'source_type', "title", "authors", "text", "url", "date_published", "status", "comments"] for your failing test to start working

Collaborator Author

Ah I see, nice catch



@mruwnik (Collaborator) left a comment
looking good!

"arxiv-vanity.com": parse_vanity,
"ar5iv.labs.arxiv.org": parse_vanity,
# TODO: arxiv-vanity.com does not output the same type as the other parsers: Dict[str, str] instead of str
# ar5iv.labs.arxiv.org too. Are these pdf parsers? not rly, but they don't output the same type as the other html parsers
Collaborator

check #138 for a fix. This sort of evolved over time, from a simple "return the contents of this link" to "if it's basic html then just return the text, otherwise return a dict with additional metadata". The aforementioned PR should move everything over to dicts, which means that these parsers can also extract e.g. title, authors, etc. if it makes sense for them to do so

Collaborator Author

ah yeah that makes a lot of sense, I will review it now

return url and urlparse(url).netloc.lstrip('www.')
def remove_www(net_loc: str) -> str:
    return net_loc[4:] if net_loc.startswith("www.") else net_loc
return remove_www(urlparse(url).netloc)
Collaborator

why add the extra function here? How is it better than just having the code?

Collaborator Author

good point, not sure why I did all that

net_loc = urlparse(url).netloc
return net_loc[4:] if net_loc.startswith("www.") else net_loc

makes more sense

if contents := parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}"):
    return contents
if contents := parse_vanity(f"https://ar5iv.org/abs/{paper_id}"):
    return contents
return fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
Collaborator

you could even do

return (
    parse_vanity(f"https://www.arxiv-vanity.com/papers/{paper_id}") or
    parse_vanity(f"https://ar5iv.org/abs/{paper_id}") or
    fetch_pdf(f"https://arxiv.org/pdf/{paper_id}.pdf")
)

Collaborator Author

nice


authors = data.get('authors') or paper.get("authors")
if not authors: data['authors'] = []
elif not isinstance(authors, list): data['authors'] = [str(authors).strip()]
Collaborator

[NIT] I'm not a fan of these one-liner clauses
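(For comparison, spelled out on separate lines the same logic would read:)

    authors = data.get('authors') or paper.get("authors")
    if not authors:
        data['authors'] = []
    elif not isinstance(authors, list):
        data['authors'] = [str(authors).strip()]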

mruwnik previously approved these changes Aug 21, 2023
@Thomas-Lemoine Thomas-Lemoine mentioned this pull request Aug 27, 2023
@Thomas-Lemoine (Collaborator Author)

Not sure if I'm doing this wrong: I keep fetching origin/main to merge onto this branch and then pushing that merge, but it makes the whole commit history very messy, and I have to fix some merge conflicts, which get drowned in the merge and are hard to find afterwards.

@@ -189,7 +189,7 @@ def unprocessed_items(self, items=None) -> list | filter:

        return items_to_process

    def fetch_entries(self) -> Generator[Article, None, None]:
Collaborator

why are you removing this?

Collaborator Author

Not much of a reason; the IDE could already infer the type signature without it being explicitly written.

Collaborator Author

I brought it back

@mruwnik (Collaborator) left a comment
go go go!

Thomas-Lemoine merged commit 16e4c84 into main on Sep 13, 2023
1 check passed
Thomas-Lemoine deleted the implement_more_parsers branch on September 13, 2023 at 17:58