Increasing Accessibility of Government Documents

This repository contains the data and notebooks for the short paper submission 'Making PDFs accessible for Visually Impaired Users (and Everybody Else)' to the TPDL conference.

Installation

To be able to run the code and experiments in this repository, follow these steps:

Install Anaconda:
- Visit the Anaconda website and download the installer for your operating system.
- Follow the installation instructions provided for your specific OS.
Clone this repository:

git clone https://github.com/RubenvanHeusden/TPDLAccessibilityofGovernmentDocuments.git

Navigate to the project directory:

cd TPDLAccessibilityofGovernmentDocuments

Create a new Anaconda environment:

Open a terminal (or Anaconda Prompt on Windows) and run the following command, which installs the requirements listed in the environment file we provide:
```
conda env create -f environment.yml
```
Activate the environment:

conda activate accessibilifier_env

Alternative using pip: If you prefer using pip, you can also install the dependencies using the requirements file we supplied:
```
pip install -r requirements.txt
```
To install the package, run the following command while in the root folder:
```
pip install -e .
```
Running Jupyter Notebook: If you haven't worked with Jupyter Notebook yet, you should register the environment as a Jupyter kernel so that you can select it in a notebook and work with the packages we just installed:
```
ipython kernel install --name "accessibilifier_env" --user
```
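To confirm that a notebook is actually using the kernel registered above, a small check like the following can be run in the first cell. This is a minimal sketch: it assumes the environment was created by conda under its default name `accessibilifier_env`, so that the interpreter prefix ends in that name. The helper `running_in_env` is a hypothetical name introduced here, not part of this repository.

```python
import sys
from pathlib import PurePath

# Assumption: the conda environment keeps the name used in this README.
ENV_NAME = "accessibilifier_env"


def running_in_env(env_name=ENV_NAME, prefix=sys.prefix):
    """Return True if the interpreter prefix ends in the environment name."""
    return PurePath(prefix).name == env_name


# Show which interpreter the current kernel runs on.
print("Interpreter:", sys.executable)
print("In expected environment:", running_in_env())
```

If the second line prints `False`, the notebook is probably running on a different kernel; select "accessibilifier_env" from the kernel menu and rerun the cell.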
Some additional packages
- Some packages cannot be installed through pip, such as verapdf, tesseract and pdftohtml. Depending on your system, you can install these by following the installation instructions on the packages' websites. (On Mac, all of these can also be installed via Homebrew.)
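Since these tools are installed outside of pip, it can help to verify that they are reachable before running the scripts. The sketch below assumes the executables are on the PATH under the names `verapdf`, `tesseract` and `pdftohtml`; these names may differ on your system. The function `find_missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumption: executable names as commonly installed; adjust if yours differ.
REQUIRED_TOOLS = ["verapdf", "tesseract", "pdftohtml"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools were found on the PATH.")
```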

Directory Structure

notebooks/: Contains Jupyter Notebook files.
- Experiments.ipynb: Notebook containing the main experiments and explanation of the algorithm.
data/: Contains the dataset and word lists used in this research.
- data.csv.gz: Dataframe containing the text of the individual pages.
- parlamint_wordlist.txt.gz: Word list containing all the unique words from the ParlaMint dataset.
- taalunie_wordlist.txt.gz: Word list containing all the unique words from the TaalUnie dataset.
- wordlist.txt.gz: Word list containing all the unique words from the ParlaMint and TaalUnie lists combined.
scripts/
- badsegmentdetector.py: file that contains the complete implementation of the bad segment detector as shown in the experiment notebook.
- run_bad_segment_detection.py: command line script that can be used to run the bad segments detector on an input file
- run_pdfconverter.py: command line script that can be used to run the pdf to markdown converter.
examples/
- test_file.pdf: An example PDF file that you can use to try the scripts.
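The compressed files in data/ can be inspected without unpacking them first. The following is a minimal sketch using only the standard library. It assumes that the word lists contain one word per line, that all files are UTF-8 encoded, and that the code runs from the repository root. Neither assumption about the file contents is confirmed by this README, and `load_wordlist` and `read_csv_header` are hypothetical helper names.

```python
import csv
import gzip
from pathlib import Path


def load_wordlist(path):
    """Read a gzip-compressed word list into a set (assumes one word per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def read_csv_header(path):
    """Return the column names from the first row of a gzip-compressed CSV."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    data_dir = Path("data")  # assumption: run from the repository root
    wordlist_path = data_dir / "wordlist.txt.gz"
    pages_path = data_dir / "data.csv.gz"
    if wordlist_path.exists():
        print("Unique words:", len(load_wordlist(wordlist_path)))
    if pages_path.exists():
        print("Columns:", read_csv_header(pages_path))
```

For actual analysis, `pandas.read_csv` can read `data.csv.gz` directly, since it infers gzip compression from the file extension by default.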
