
Implemented a script to extract orgchart data from gazettes #26

Open
wants to merge 14 commits into master

Conversation

LakinduOshadha

This is a script to extract orgchart data from gazettes.

extract_orgchart_data.py extracts data (ministers and their corresponding departments) from the gazette PDFs in which the data is in tabular format. The script downloads all the PDFs (English versions) directly from the Cabinet Office website and extracts the data from those gazettes.

The script works as follows (a sketch of the table-extraction step follows the list):

  1. Download the gazette PDFs from the Cabinet Office website
  2. Convert them to docx
  3. Iterate through the tables and extract each ministry and its departments (listed in Column II of the tables)
  4. Save the extracted data as CSV files in scripts/orgchart
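Step 3 is the heart of the script but is not shown in the review excerpts below, so here is a minimal sketch of how the table walk might look with docx2python. Only "departments are in Column II" comes from the description above; the function name, the assumption that Column I holds the ministry, and the dictionary shape are illustrative:

from docx2python import docx2python

def extract_tables_sketch(docx_file):
    # docx2python exposes the document body as nested lists:
    # body[table][row][cell] -> list of paragraph strings
    extracted = {}
    content = docx2python(docx_file)
    for table in content.body:
        for row in table:
            if len(row) < 2:
                continue  # skip rows that have no Column II
            ministry = " ".join(row[0]).strip()  # assumed: Column I holds the ministry
            departments = [d.strip() for d in row[1] if d.strip()]  # Column II
            if ministry:
                extracted.setdefault(ministry, []).extend(departments)
    return extracted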

LakinduOshadha and others added 12 commits May 4, 2023 17:52
created a script to extract ministers and departments from the gazette PDFs
Made extract_orgchart_data automatically download gazette PDFs from the website and extract data from them.
Generalized the script to extract data from all the gazettes whose data is in tabular format
Dataset is updated (latest gazette: gazette-2023-01-19)
update_extracted_data.py can extract data from gazettes in text format
Implemented a script to extract orgchart data from the gazettes.

Comment on lines 9 to 13
ApiUrl: "http://localhost:9000/api/",
ApiKey: "$2a$12$dcKw7SVbheBUwWSupp1Pze7zOHJBlcgW2vuQGSEh0QVHC/KUeRgwW",
NerServerUrl: "http://localhost:8080/classify",
NormalizationServerUrl: "http://localhost:9000/api/",
OcrServerUrl: "http://localhost:8081/extract?url=",

We shouldn't commit the configs and keys in the Git repo
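
For example, the key could be read from the environment (or a git-ignored local file) instead of being hard-coded. A minimal sketch of the idea in Python (this config may be consumed by another language, so treat the names as illustrative only):

import os

# Read secrets from the environment instead of committing them.
API_URL = os.environ.get("API_URL", "http://localhost:9000/api/")
API_KEY = os.environ["API_KEY"]  # fail fast if the secret is not set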

@@ -0,0 +1,3 @@
venv
.idea
helpers/__pycache__/

Add a newline at the end of the file.

@@ -0,0 +1 @@
python extract_orgchart_data.py

Add a newline at the end of the file.

for item in sublist:
    if search_term == item.strip():
        return True
return False

Add a newline at the end of the file.

docx_content = docx2python(docx_file)

return docx_content


Add a newline at the end of the file.

for file_name in os.listdir(directory):
    if file_name.endswith('.pdf'):
        pdf_names.append(file_name)
return pdf_names

Add a newline at the end of the file.

for department in extracted_data[ministry]:
    row = [ministry, department]
    writer.writerow(row)


Add new line


We may not need to commit all the extracted files to the repo. Can you ask in the group chat whether we are planning to commit these files to the repo?

go run import_csv.go "extracted/2010-2020/gazette-2006-1-3.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-4.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-5.csv"


go run import_csv.go "extracted/2010-2020/gazette-2010-4-30.csv"

We may not need to manually add the filenames to this script. You can write a bash script that first gets the list of files and then runs the command on each of them in a for loop.
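
Keeping the examples in one language, here is the same idea sketched in Python rather than bash (the glob pattern is an assumption based on the paths above):

import glob
import subprocess

# Discover the CSVs instead of listing them by hand, then run the importer on each.
for csv_file in sorted(glob.glob("extracted/**/*.csv", recursive=True)):
    subprocess.run(["go", "run", "import_csv.go", csv_file], check=True)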

extracted_data = extract_ministers_departments(pdf_location)

# writing to csv
write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)

There are some linting issues; check the other places too. You can install a Python linter in your IDE to catch these.

Suggested change
write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)

Comment on lines 13 to 22
download_all_pdfs(WEBSITE_URL, PDF_DIRECTORY)
pdf_file_names = get_pdf_names(PDF_DIRECTORY)

for pdf_file_name in pdf_file_names:
    # extract ministers and corresponding departments
    pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
    extracted_data = extract_ministers_departments(pdf_location)

    # writing to csv
    write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)

Add error handling using try/except.
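
A minimal sketch of the suggested structure, restructuring the loop shown above so one malformed gazette does not abort the whole run (the exception handling shown is illustrative):

for pdf_file_name in pdf_file_names:
    try:
        pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
        extracted_data = extract_ministers_departments(pdf_location)
        write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)
    except Exception as e:
        # Log and move on to the next gazette instead of crashing.
        print(f"Failed to process {pdf_file_name}: {e}")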


print("All PDFs downloaded successfully!")



Remove the extra blank line.

Comment on lines 7 to 11
def download_pdf(url, save_directory):
    response = requests.get(url)
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)

Add error handling for non-200 status codes.
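
One way to do this is to check response.status_code (or call response.raise_for_status()) before writing to disk; a sketch:

def download_pdf(url, save_directory):
    response = requests.get(url)
    if response.status_code != 200:
        # Don't write an HTML error page to disk as a .pdf.
        print(f"Download failed ({response.status_code}): {url}")
        return
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)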

Comment on lines 25 to 41
save_directory = os.path.join(os.getcwd(), save_directory)

if not os.path.exists(save_directory):
    os.makedirs(save_directory)

domain_name = urlparse(url).scheme + "://" + urlparse(url).netloc

pdf_links = get_pdf_links(url)
print(f"Found {len(pdf_links)} PDFs to download.")

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    print(f"Downloading {pdf_url}...")
    download_pdf(pdf_url, save_directory)
    print("Download complete!")

print("All PDFs downloaded successfully!")

Add error handling when downloading
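
For network failures, the loop could catch requests' base exception class and continue with the remaining links; a sketch (assuming download_pdf lets requests exceptions propagate):

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    try:
        download_pdf(pdf_url, save_directory)
    except requests.exceptions.RequestException as e:
        # Connection errors, timeouts, etc.: report and keep going.
        print(f"Could not download {pdf_url}: {e}")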


Add docstrings to document all the methods.
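
For example, get_pdf_names could document its contract like this (wording is illustrative):

def get_pdf_names(directory):
    """Return the filenames of all PDF files in `directory`.

    Args:
        directory: path of the folder to scan.

    Returns:
        A list of filenames ending in '.pdf'.
    """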

Updated gazette csv files
Added gazette-2023-05-30.csv
Added docstring
Created a go file to batch import orgchart data.
@sonarcloud

sonarcloud bot commented Jun 21, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information
