Implemented a script to extract orgchart data from gazettes #26
base: master
Conversation
Created a script to extract ministers and departments from the gazette PDFs
Made extract_orgchart_data automatically download gazette PDFs from the website and extract data from them
Generalized the script to extract data from all the gazettes whose data is in tabular format
Dataset updated (latest gazette: gazette-2023-01-19)
update_extracted_data.py can extract data from gazettes that are in text format
Implemented a script to extract orgchart data from the gazettes. The script works as follows:
1. Download gazette PDFs from the cabinetoffice website.
2. Convert them to DOCX.
3. Iterate through the tables and extract each ministry and its departments (which are in Column II of the tables).
4. Save the extracted data under scripts/orgchart in CSV format.
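A minimal sketch of steps 2 and 3, assuming pdf2docx is used for the conversion (the diff only shows docx2python on the reading side, so pdf2docx and the file paths here are assumptions):

from pdf2docx import Converter
from docx2python import docx2python

# Step 2: convert a downloaded gazette PDF to DOCX (pdf2docx is an assumption)
converter = Converter("pdfs/gazette-2023-01-19.pdf")
converter.convert("pdfs/gazette-2023-01-19.docx")
converter.close()

# Step 3: docx2python exposes the document body as nested
# [table][row][cell][paragraph] lists, so the tables can be iterated directly
content = docx2python("pdfs/gazette-2023-01-19.docx")
for table in content.body:
    for row in table:
        if len(row) >= 2:
            print(row[1])  # Column II holds the departments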
config.example.go
Outdated
ApiUrl: "http://localhost:9000/api/",
ApiKey: "$2a$12$dcKw7SVbheBUwWSupp1Pze7zOHJBlcgW2vuQGSEh0QVHC/KUeRgwW",
NerServerUrl: "http://localhost:8080/classify",
NormalizationServerUrl: "http://localhost:9000/api/",
OcrServerUrl: "http://localhost:8081/extract?url=",
We shouldn't commit the configs and keys in the Git repo
extract_orgchart_data/.gitignore
Outdated
@@ -0,0 +1,3 @@
venv
.idea
helpers/__pycache__/
Add a newline at the end of the file
@@ -0,0 +1 @@
python extract_orgchart_data.py
Add a newline at the end of the file
for item in sublist:
    if search_term == item.strip():
        return True
return False
Add a newline at the end of the file
docx_content = docx2python(docx_file)

return docx_content
Add a newline at the end of the file
for file_name in os.listdir(directory):
    if file_name.endswith('.pdf'):
        pdf_names.append(file_name)
return pdf_names
Add a newline at the end of the file
for department in extracted_data[ministry]:
    row = [ministry, department]
    writer.writerow(row)
Add a newline at the end of the file
We may not need to commit all the extracted files to the repo. Can you ask in the group chat whether we are planning to commit these files to the repo?
orgchart/import_orgchart_data.sh
Outdated
go run import_csv.go "extracted/2010-2020/gazette-2006-1-3.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-4.csv"
go run import_csv.go "extracted/2010-2020/gazette-2006-1-5.csv"
...
go run import_csv.go "extracted/2010-2020/gazette-2010-4-30.csv"
We may not need to manually add the filenames to this script. You can write a bash script that first gets the list of files and then runs the command on each of them in a for loop.
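A minimal sketch of that idea in Python (the reviewer suggests bash, where a shell for-loop over extracted/*/*.csv would work the same way; the glob pattern is an assumption based on the paths above):

import glob
import subprocess

# run the Go importer on every extracted CSV instead of hard-coding file names
for csv_path in sorted(glob.glob("extracted/*/*.csv")):
    subprocess.run(["go", "run", "import_csv.go", csv_path], check=True)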
extracted_data = extract_ministers_departments(pdf_location)

# writing to csv
write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
There are some linting issues here; check the other places too. You can install Python linters in your IDE to catch these.
Suggested change:
- write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
+ write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)
download_all_pdfs(WEBSITE_URL, PDF_DIRECTORY)
pdf_file_names = get_pdf_names(PDF_DIRECTORY)

for pdf_file_name in pdf_file_names:
    # extract ministers and corresponding departments
    pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
    extracted_data = extract_ministers_departments(pdf_location)

    # writing to csv
    write_to_csv(extracted_data,pdf_file_name,CSV_DIRECTORY)
Add error handling using try/except
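For example, a minimal sketch of the main loop with error handling (names taken from the diff above; the broad Exception catch is a placeholder for the specific exceptions the helpers raise):

for pdf_file_name in pdf_file_names:
    pdf_location = os.path.join(os.getcwd(), PDF_DIRECTORY, pdf_file_name)
    try:
        # extract ministers and corresponding departments
        extracted_data = extract_ministers_departments(pdf_location)
        # writing to csv
        write_to_csv(extracted_data, pdf_file_name, CSV_DIRECTORY)
    except Exception as error:
        # skip the problematic gazette and keep processing the rest
        print(f"Skipping {pdf_file_name}: {error}")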
print("All PDFs downloaded successfully!")

Remove the extra blank line
def download_pdf(url, save_directory):
    response = requests.get(url)
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)
Add error handling for non-200 status codes
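For example (raise_for_status is the standard requests call that raises requests.HTTPError for 4xx/5xx responses; the timeout value is an assumption):

def download_pdf(url, save_directory):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on non-200 responses
    file_name = os.path.join(save_directory, url.split("/")[-1])
    with open(file_name, 'wb') as file:
        file.write(response.content)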
save_directory = os.path.join(os.getcwd(), save_directory)

if not os.path.exists(save_directory):
    os.makedirs(save_directory)

domain_name = urlparse(url).scheme + "://" + urlparse(url).netloc

pdf_links = get_pdf_links(url)
print(f"Found {len(pdf_links)} PDFs to download.")

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    print(f"Downloading {pdf_url}...")
    download_pdf(pdf_url, save_directory)
    print("Download complete!")

print("All PDFs downloaded successfully!")
Add error handling when downloading
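For example, so that one failed download doesn't abort the whole run (requests.RequestException is the base class covering connection errors and, together with raise_for_status above, HTTP errors):

for link in pdf_links:
    pdf_url = link if link.startswith('http') else domain_name + link
    print(f"Downloading {pdf_url}...")
    try:
        download_pdf(pdf_url, save_directory)
        print("Download complete!")
    except requests.RequestException as error:
        # log the failure and continue with the next gazette
        print(f"Failed to download {pdf_url}: {error}")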
Add docstrings to provide code documentation for all the methods
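For example, for download_pdf (the wording here is only a suggestion):

def download_pdf(url, save_directory):
    """Download a single gazette PDF from `url` into `save_directory`.

    The file name is taken from the last path segment of the URL.
    """
    ...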
Updated gazette CSV files
Added gazette-2023-05-30.csv
Added docstrings
Created a Go file to batch import orgchart data
Kudos, SonarCloud Quality Gate passed!
This is a script to extract orgchart data from gazettes.
extract_orgchart_data.py extracts data (ministers and the corresponding departments) from gazette PDFs in which the data is in tabular format. The script downloads all the PDFs (English version) directly from the cabinetoffice website and extracts data from those gazettes.
The script works as follows:
1. Download gazette PDFs from the cabinetoffice website.
2. Convert them to DOCX.
3. Iterate through the tables and extract each ministry and its departments (Column II in the tables).
4. Save the extracted data under scripts/orgchart in CSV format.