GreaterWrong: Fix stuck scraper iteration (#198)
* GreaterWrong: Fix setting cooldown on AF/EAF/LW scraper.

* GreaterWrong: Fix iteration to not get stuck on the last item.

It is unclear what circumstances cause this,
but AF sometimes returns the item whose date the "after" cursor was advanced to,
so iteration loops without the date advancing until the job is killed.

This adds a check for that exact case, which simply returns,
as well as a more general check for failure to progress, which raises.
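
The two checks described above can be sketched in isolation. This is a minimal illustration rather than the scraper's actual code: `iterate_posts`, `make_quirky_fetch`, `stuck_fetch` and `POSTS` are hypothetical names, and the fake fetchers only imitate the behaviour the commit message describes, assuming ISO date strings that compare correctly as text.

```python
from typing import Callable, Dict, Iterator, List, Optional

Post = Dict[str, str]

# Hypothetical sample data; dates are ISO strings so string comparison orders them.
POSTS: List[Post] = [
    {"pageUrl": "a", "postedAt": "2024-01-01"},
    {"pageUrl": "b", "postedAt": "2024-01-02"},
    {"pageUrl": "c", "postedAt": "2024-01-03"},
]


def iterate_posts(
    fetch_after: Callable[[str], List[Post]], start_date: str
) -> Iterator[Post]:
    """Page through posts by date, guarding against a cursor that stops advancing."""
    next_date: Optional[str] = start_date
    last_item: Optional[Post] = None
    while next_date:
        results = fetch_after(next_date)
        if not results:
            return

        # Exact case: the only result is the item the cursor was advanced to.
        if (
            len(results) == 1
            and last_item
            and results[0]["pageUrl"] == last_item["pageUrl"]
        ):
            return

        yield from results

        last_item = results[-1]
        new_next_date = results[-1]["postedAt"]
        # General case: any page that fails to move the cursor is an error.
        if new_next_date == next_date:
            raise ValueError(f"next date did not advance after {next_date}")
        next_date = new_next_date


def make_quirky_fetch(posts: List[Post], limit: int = 2) -> Callable[[str], List[Post]]:
    """Fake backend: normally exclusive, but echoes the boundary item when nothing is newer."""

    def fetch_after(after: str) -> List[Post]:
        newer = [p for p in posts if p["postedAt"] > after]
        if newer:
            return newer[:limit]
        return [p for p in posts if p["postedAt"] == after][:1]

    return fetch_after


def stuck_fetch(after: str) -> List[Post]:
    """Fake backend that ignores the cursor and always returns the same page."""
    return POSTS[:2]


if __name__ == "__main__":
    for post in iterate_posts(make_quirky_fetch(POSTS), "2023-12-31"):
        print(post["pageUrl"], post["postedAt"])
```

With the quirky backend the generator ends cleanly once the boundary item is echoed back. With `stuck_fetch` it raises `ValueError` on the second page instead of looping until the job is killed.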
jbeshir authored Apr 22, 2024
1 parent 03f1d9b commit 7c0f8c6
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions align_data/sources/greaterwrong/greaterwrong.py
@@ -67,7 +67,7 @@ class GreaterWrong(AlignmentDataset):
     """Whether alignment forum posts should be returned"""

     limit = 50
-    COOLDOWN_TIME: float = 0.5
+    COOLDOWN = 0.5
     done_key = "url"
     lazy_eval = True
     source_type = 'GreaterWrong'
@@ -182,16 +182,26 @@ def last_date_published(self) -> str:
     def items_list(self):
         next_date = self.last_date_published
         logger.info("Starting from %s", next_date)
+        last_item = None
         while next_date:
             posts = self.fetch_posts(self.make_query(next_date))
             if not posts["results"]:
                 return

+            # If the only item we find was the one we advanced our iterator to, we're done
+            if len(posts["results"]) == 1 and last_item and posts["results"][0]["pageUrl"] == last_item["pageUrl"]:
+                return
+
             for post in posts["results"]:
                 if post["htmlBody"] and self.tags_ok(post):
                     yield post

-            next_date = posts["results"][-1]["postedAt"]
+            last_item = posts["results"][-1]
+            new_next_date = posts["results"][-1]["postedAt"]
+            if next_date == new_next_date:
+                raise ValueError(f'could not advance through dataset, next date did not advance after {next_date}')
+
+            next_date = new_next_date
             time.sleep(self.COOLDOWN)

     def extract_authors(self, item):
