-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cambridge CLI (test version) #2
base: main
Are you sure you want to change the base?
Conversation
File containing functions that will be called in other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdm010 thank you for the PR, I have few comments. I'm happy to discuss if you have any questions. Also, could you please include tests for this provider?
|
||
subjectfiles = [] | ||
for subject in subjects: | ||
subjectfiles.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), subject) + '.tsv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make the download location a env var, so we can easily configure it
@@ -0,0 +1,53 @@ | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please rename it to utils.py
?
@@ -0,0 +1,39 @@ | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between this and src/providers/cambridge/cambridge_cli.py? Would you like to just test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just for testing.
def get_page_content(url): | ||
response = requests.get(url) | ||
if response.status_code == 200: | ||
return response.text | ||
else: | ||
print(f"Error: Failed to fetch the webpage ({response.status_code})") | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_page_content(url): | |
response = requests.get(url) | |
if response.status_code == 200: | |
return response.text | |
else: | |
print(f"Error: Failed to fetch the webpage ({response.status_code})") | |
return None | |
def get_page_content(url): | |
response = requests.get(url) | |
try: | |
r = requests.get('http://www.google.com/nothere') | |
r.raise_for_status() | |
return r.text | |
except requests.exceptions.HTTPError as err: | |
print(f"Error: Failed to fetch the webpage ({response.status_code})") | |
return None |
Also, since it's a cli it's better to use click's echo
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this code, I get "Error: Failed to fetch the webpage (200)" and the rest of the code does not run.
with open(input_file, 'r') as f: | ||
lines = f.readlines() | ||
|
||
first_line = lines[0] # Preserve the first line | ||
|
||
lines_to_remove = random.randint(0, min(5, len(lines) - 1)) # Ensure at least one line is kept | ||
remaining_lines = random.sample(lines[1:], max(len(lines) - 1 - lines_to_remove, 0)) | ||
lines_to_keep = [first_line] + remaining_lines | ||
|
||
with open(output_file, 'w') as f: | ||
f.writelines(lines_to_keep) | ||
os.remove(input_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "random" lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is used for testing purposes. I'm removing a random assortment of lines from the files that I initially fetch from the publisher. I'm using this as the "old" data, so that when we fetch the new files using the cli, we have something to compare against. So essentially just for testing.
CLI for Cambridge University Press.
get_test_data.py must be run first to fetch and store files locally. These files are then randomly modified, so we can check for new releases.
cambridge_cli.py can be used to fetch new releases for the subject specified by the user.
Future versions of the code will be higher-level (where the user can specify the publisher in the CLI), and also be able to fetch releases for all subjects.