
implementing the UNIMPLEMENTED_PARSERS #97

Merged
merged 40 commits on Sep 13, 2023
Changes from 1 commit
Commits
c8f77aa
to start the pr to add comments
Thomas-Lemoine Jul 19, 2023
90517c4
removed spaces
Thomas-Lemoine Jul 19, 2023
3617c9e
Merge remote-tracking branch 'origin/main' into implement_more_parsers
Thomas-Lemoine Jul 19, 2023
049fb56
Merge remote-tracking branch 'origin/main' into implement_more_parsers
Thomas-Lemoine Jul 23, 2023
6848dee
create logger_config and reorder the imports
Thomas-Lemoine Jul 28, 2023
682e96e
main's logger
Thomas-Lemoine Jul 28, 2023
3b38600
ignore the log files
Thomas-Lemoine Jul 28, 2023
367f5df
postprocess notes
Thomas-Lemoine Jul 28, 2023
932561a
fix test with new download order for pdfarticles
Thomas-Lemoine Jul 28, 2023
728b124
Handle special docs
mruwnik Aug 4, 2023
b9999b4
Fetch new items from indices
mruwnik Aug 4, 2023
7ee7f9a
fixed domain getter from network location
Thomas-Lemoine Aug 4, 2023
042fc67
logger and minor fixes
Thomas-Lemoine Aug 4, 2023
f6b0afc
comment: add www2. and www6. handling
Thomas-Lemoine Aug 4, 2023
3381f1b
Merge branch 'special_docs' into special_docs_with_parsers
Thomas-Lemoine Aug 6, 2023
e85b04c
removed logger_config
Thomas-Lemoine Aug 6, 2023
cad2749
Merge remote-tracking branch 'origin/main' into implement_more_parsers
Thomas-Lemoine Aug 16, 2023
43905ef
merge with main and minor changes
Thomas-Lemoine Aug 16, 2023
720ec97
Merge remote-tracking branch 'origin/implement_more_parsers' into imp…
Thomas-Lemoine Aug 16, 2023
654d76a
rm logger_config.py
Thomas-Lemoine Aug 16, 2023
e41ad00
minor fixes
Thomas-Lemoine Aug 16, 2023
40cc96c
minor fixes 2
Thomas-Lemoine Aug 16, 2023
d36687d
parsers type signature
Thomas-Lemoine Aug 17, 2023
34ccba9
test_arxiv_process_entry_retracted fixed
Thomas-Lemoine Aug 17, 2023
a5115cd
Refactor of special_indices
Thomas-Lemoine Aug 17, 2023
f2a3b96
1239283019481293043902
Thomas-Lemoine Aug 17, 2023
3cff71b
Merge remote-tracking branch 'origin/main' into implement_more_parsers
Thomas-Lemoine Aug 21, 2023
7c5c4ab
Merge branch 'special_indices_refactor' into implement_more_parsers
Thomas-Lemoine Aug 21, 2023
00b70be
alignmentdataset class removed some init fields
Thomas-Lemoine Aug 21, 2023
6ef15f3
removed the wrong arxivpapers file
Thomas-Lemoine Aug 21, 2023
ad89b44
minor changes
Thomas-Lemoine Aug 21, 2023
70a9757
Merge branch 'special_docs_with_parsers' into implement_more_parsers
Thomas-Lemoine Aug 21, 2023
cf0bdf4
Merge branch 'main' into implement_more_parsers
Thomas-Lemoine Aug 27, 2023
8add5de
pdf date_published is a datetime
Thomas-Lemoine Aug 27, 2023
057015b
revert some useless changes
Thomas-Lemoine Aug 31, 2023
789a9c8
revert type annotation change
Thomas-Lemoine Aug 31, 2023
662db51
Merge remote-tracking branch 'origin/main' into implement_more_parsers
Thomas-Lemoine Sep 8, 2023
e58292f
nits
henri123lemoine Sep 8, 2023
15efdb8
nits 2
henri123lemoine Sep 9, 2023
f05c4a9
nits 2
henri123lemoine Sep 9, 2023
4 changes: 2 additions & 2 deletions align_data/common/alignment_dataset.py
@@ -4,7 +4,7 @@
import time
from dataclasses import dataclass, field, KW_ONLY
from pathlib import Path
from typing import List, Optional, Dict, Any, Set, Iterable, Tuple, Generator
from typing import List, Optional, Set, Iterable, Tuple, Generator

import pytz
from sqlalchemy import select, Select, JSON
@@ -189,7 +189,7 @@ def unprocessed_items(self, items=None) -> list | filter:

return items_to_process

def fetch_entries(self) -> Generator[Article, None, None]:
Collaborator


why are you removing this?

Collaborator Author


Not much of a reason, but the IDE could already infer the type signature without it being explicitly written.

Collaborator Author


I brought it back

def fetch_entries(self):
"""Get all entries to be written to the file."""
for item in tqdm(self.unprocessed_items(), desc=f"Processing {self.name}"):
entry = self.process_entry(item)
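As a side note on the thread above: the annotation under discussion is a generator return type. Below is a minimal, self-contained sketch, not the repository's code, of what the restored `-> Generator[Article, None, None]` signature documents; the `Article` and `Dataset` classes here are hypothetical stand-ins, and the tqdm progress bar is omitted.

```python
from typing import Generator, Iterable, Optional


class Article:
    """Hypothetical stand-in for align_data's Article model."""

    def __init__(self, title: str):
        self.title = title


class Dataset:
    """Hypothetical stand-in for an AlignmentDataset subclass."""

    name = "example"

    def unprocessed_items(self) -> Iterable[str]:
        return ["item-1", "item-2"]

    def process_entry(self, item: str) -> Optional[Article]:
        return Article(title=item)

    def fetch_entries(self) -> Generator[Article, None, None]:
        """Yield each processed entry lazily; the explicit annotation makes
        the yielded type visible to readers even when an IDE could infer it."""
        for item in self.unprocessed_items():
            entry = self.process_entry(item)
            if entry is not None:
                yield entry


if __name__ == "__main__":
    for article in Dataset().fetch_entries():
        print(article.title)
```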
6 changes: 3 additions & 3 deletions align_data/common/html_dataset.py
@@ -1,11 +1,11 @@
import pytz
import logging
from datetime import datetime
from dataclasses import dataclass, field
from urllib.parse import urljoin
from typing import List, Dict, Any
import re

import pytz
import requests
import feedparser
from bs4 import BeautifulSoup
@@ -17,7 +17,7 @@
logger = logging.getLogger(__name__)


@dataclass()
@dataclass
class HTMLDataset(AlignmentDataset):
"""
Fetches articles from a different blog by collecting links to articles from an index page.
@@ -34,7 +34,7 @@ class HTMLDataset(AlignmentDataset):
source_type = "blog"
ignored_selectors = []

def extract_authors(self, article): #TODO: make this work
def extract_authors(self, article):
return self.authors


4 changes: 0 additions & 4 deletions align_data/db/models.py
@@ -113,8 +113,6 @@ def update(self, other: "Article"):
if field not in ["id", "hash_id", "metadata"] and getattr(other, field):
setattr(self, field, getattr(other, field))
self.meta = dict((self.meta or {}), **{k: v for k, v in other.meta.items() if k and v})
# TODO: verify that this actually updates the meta column;
# https://amercader.net/blog/beware-of-json-fields-in-sqlalchemy/

if other._id:
self._id = other._id
@@ -129,8 +127,6 @@ def add_meta(self, key: str, val):
if self.meta is None:
self.meta = {}
self.meta[key] = val
# TODO: verify that this actually updates the meta column;
# https://amercader.net/blog/beware-of-json-fields-in-sqlalchemy/

def append_comment(self, comment: str):
"""Appends a comment to the article.comments field. You must run session.commit() to save the comment to the database."""
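The TODO removed in this hunk pointed at a real SQLAlchemy pitfall: in-place mutation of a plain JSON column (as in `self.meta[key] = val`) is not tracked automatically, so the change may never be written back. A rough illustration of the concern, under assumed SQLAlchemy 2.0-style declarative models rather than the project's actual schema:

```python
from sqlalchemy import JSON, Integer, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column
from sqlalchemy.orm.attributes import flag_modified


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "items"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    meta: Mapped[dict] = mapped_column(JSON, default=dict)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    item = Item(meta={"a": 1})
    session.add(item)
    session.commit()

    # In-place mutation of a plain JSON column is not detected on its own...
    item.meta["b"] = 2
    # ...so the column must be explicitly flagged (or reassigned, or wrapped
    # in sqlalchemy.ext.mutable.MutableDict) for the commit to persist it.
    flag_modified(item, "meta")
    session.commit()
```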
2 changes: 1 addition & 1 deletion align_data/db/session.py
@@ -1,4 +1,4 @@
from typing import List, Generator
from typing import List
import logging

from contextlib import contextmanager
3 changes: 2 additions & 1 deletion align_data/postprocess/postprocess.py
@@ -3,9 +3,10 @@
from dataclasses import dataclass, field
from typing import List, DefaultDict
import logging
from pathlib import Path

import jsonlines
from tqdm import tqdm
from pathlib import Path
import pylab as plt
from nltk.tokenize import sent_tokenize, word_tokenize
import seaborn as sns #TODO: install seaborn or fix this file