Implemented a script to extract orgchart data from gazettes #26
base: master
Conversation
Created a script to extract ministers and departments from the gazette PDFs
Made extract_orgchart_data automatically download gazette PDFs from the website and extract data from them
Generalized the script to extract data from all the gazettes whose data is in tabular format
Dataset updated (latest gazette: gazette-2023-01-19)
update_extracted_data.py can extract data from gazettes that are in text format
Implemented a script to extract orgchart data from the gazettes. The script works as follows:
1. Download gazette PDFs from the cabinetoffice website.
2. Convert them to DOCX.
3. Iterate through the tables and extract each ministry and its departments (which are in Column II of the tables).
4. Save the extracted data under scripts/orgchart in CSV format.
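A minimal sketch of steps 2 and 3, assuming pdf2docx is used for the conversion (the diff only shows docx2python on the reading side, so pdf2docx and the file paths here are assumptions):

from pdf2docx import Converter
from docx2python import docx2python

# Step 2: convert a downloaded gazette PDF to DOCX (pdf2docx is an assumption)
converter = Converter("pdfs/gazette-2023-01-19.pdf")
converter.convert("pdfs/gazette-2023-01-19.docx")
converter.close()

# Step 3: docx2python exposes the document body as nested
# [table][row][cell][paragraph] lists, so the tables can be iterated directly
content = docx2python("pdfs/gazette-2023-01-19.docx")
for table in content.body:
    for row in table:
        if len(row) >= 2:
            print(row[1])  # Column II holds the departments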
config.example.go
Outdated
ApiUrl: "http://localhost:9000/api/",
ApiKey: "$2a$12$dcKw7SVbheBUwWSupp1Pze7zOHJBlcgW2vuQGSEh0QVHC/KUeRgwW",
NerServerUrl: "http://localhost:8080/classify",
NormalizationServerUrl: "http://localhost:9000/api/",
OcrServerUrl: "http://localhost:8081/extract?url=",
We shouldn't commit the configs and keys in the Git repo
extract_orgchart_data/.gitignore
Outdated
@@ -0,0 +1,3 @@
venv
.idea
helpers/__pycache__/
Add a newline at the end of the file
@@ -0,0 +1 @@
python extract_orgchart_data.py
Add a newline at the end of the file
for item in sublist:
    if search_term == item.strip():
        return True
return False
Add a newline at the end of the file
docx_content = docx2python(docx_file)

return docx_content
Add a newline at the end of the file
for file_name in os.listdir(directory):
    if file_name.endswith('.pdf'):
        pdf_names.append(file_name)
return pdf_names
Add a newline at the end of the file
for department in extracted_data[ministry]:
    row = [ministry, department]
    writer.writerow(row)
Add a newline at the end of the file
We may not need to commit all the extracted files to the repo. Can you ask in the group chat whether we are planning to commit these files to the repo?
orgchart/import_orgchart_data.sh
Outdated
go run import_csv.go "extracted/2010-2020/gazette-2006-1-3.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-4.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-5.csv"
...
go run import_csv.go "extracted/2010-2020/gazette-2010-4-30.csv"
We may not need to manually add the filenames to this script. You can write a bash script that first gets the list of files and then runs the command on each of them in a for loop.
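A minimal sketch of that idea in Python (the reviewer suggests bash, where a shell for-loop over extracted/*/*.csv would work the same way; the glob pattern is an assumption based on the paths above):

import glob
import subprocess

# run the Go importer on every extracted CSV instead of hard-coding file names
for csv_path in sorted(glob.glob("extracted/*/*.csv")):
    subprocess.run(["go", "run", "import_csv.go", csv_path], check=True)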
extracted_data = extract_ministers_departments(pdf_location)

# writing to csv
write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
There are some linting issues here; check the other places too. You can install Python linters in your IDE to catch these.
Suggested change:
- write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
+ write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)
download_all_pdfs(WEBSITE_URL, PDF_DIRECTORY)
pdf_file_names = get_pdf_names(PDF_DIRECTORY)

for pdf_file_name in pdf_file_names:
    # extract ministers and corresponding departments
    pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
    extracted_data = extract_ministers_departments(pdf_location)

    # writing to csv
    write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
Add error handling using try/except
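For example, a minimal sketch of the main loop with error handling (names taken from the diff above; the broad Exception catch is a placeholder for the specific exceptions the helpers raise):

for pdf_file_name in pdf_file_names:
    pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
    try:
        # extract ministers and corresponding departments
        extracted_data = extract_ministers_departments(pdf_location)
        # writing to csv
        write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)
    except Exception as error:
        # skip the problematic gazette and keep processing the rest
        print(f"Skipping {pdf_file_name}: {error}")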
print("All PDFs downloaded successfully!")

Remove the extra blank line
def download_pdf(url, save_directory):
    response = requests.get(url)
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)
Add error handling for non-200 status codes
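For example (raise_for_status is the standard requests call that raises requests.HTTPError for 4xx/5xx responses; the timeout value is an assumption):

def download_pdf(url, save_directory):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on non-200 responses
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)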
save_directory = os.path.join(os.getcwd(), save_directory)

if not os.path.exists(save_directory):
    os.makedirs(save_directory)

domain_name = urlparse(url).scheme + "://" + urlparse(url).netloc

pdf_links = get_pdf_links(url)
print(f"Found {len(pdf_links)} PDFs to download.")

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    print(f"Downloading {pdf_url}...")
    download_pdf(pdf_url, save_directory)
    print("Download complete!")

print("All PDFs downloaded successfully!")
Add error handling when downloading
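For example, so that one failed download doesn't abort the whole run (requests.RequestException is the base class covering connection errors and, together with raise_for_status above, HTTP errors):

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    print(f"Downloading {pdf_url}...")
    try:
        download_pdf(pdf_url, save_directory)
        print("Download complete!")
    except requests.RequestException as error:
        # log the failure and continue with the next gazette
        print(f"Failed to download {pdf_url}: {error}")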
Add docstrings to provide code documentation for all the methods
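For example, for download_pdf (the wording here is only a suggestion):

def download_pdf(url, save_directory):
    """Download a single gazette PDF from `url` into `save_directory`.

    The file name is taken from the last path segment of the URL.
    """
    ...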
Updated gazette CSV files
Added gazette-2023-05-30.csv
Added docstrings
Created a Go file to batch import orgchart data
Kudos, SonarCloud Quality Gate passed!
This is a script to extract orgchart data from gazettes.
extract_orgchart_data.py extracts data (ministers and the corresponding departments) from gazette PDFs in which the data is in tabular format. The script downloads all the PDFs (English version) directly from the cabinetoffice website and extracts data from those gazettes.
The script works as follows:
1. Download gazette PDFs from the cabinetoffice website.
2. Convert them to DOCX.
3. Iterate through the tables and extract each ministry and its departments (Column II in the tables).
4. Save the extracted data under scripts/orgchart in CSV format.