2020 GRC Machine Learning Hackathon Scripts

This repository contains the scripts used to preprocess the provided data for the 2020 GRC ML Hackathon.

What do these do? Machine learning models require well-formatted and organized inputs. However, most datasheets and scientific publications are written to be read by humans. These scripts demonstrate one way of converting many human-readable documents into an organized dataset for training an ML model.

These diagrams describe the conversion process:

These scripts rely on Textricator, an open-source PDF text extraction tool. See the install directions below to get it running.

Prerequisites:

Python 3 https://www.python.org/downloads/

Installing Python is outside the scope of these instructions, but there are many guides online for nearly every system configuration.

One example, if you don't have Admin rights: https://stackoverflow.com/questions/33876657/how-to-install-python-any-version-in-windows-when-youve-no-admin-priviledges

Explanation of files here:

Scripts: Current version of datasheet conversion scripts, requiring minimal user input
Alternative_Scripts: Old version of the conversion scripts. Takes a significantly different approach to converting files. Tested and working on a subset of the dataset.
Images: Images embedded in the readme.md file

How to get Running:

Short Instructions:

Download and extract a copy of this repo and Textricator, then run the following commands (the paths are relative to the folder that contains Downloads):

robocopy ./Downloads/2020-grc-machine-learning-hackathon-master/2020-grc-machine-learning-hackathon-master/Scripts/ ./Downloads/textricator-9.2.57-bin/textricator-9.2.57/ /s
cd .\Downloads\textricator-9.2.57-bin\textricator-9.2.57\
0_run_all.bat
cd .\data_folder\
dir

If that doesn't make sense, no worries! Follow the directions below:

Step 1: Download a copy of Textricator

The Textricator zip is a bit hard to find, so I've provided screenshots of the process. Start here: https://github.com/measuresforjustice/

Extract the files

Step 2: Download this repository

Extract the files

Step 3: Copy scripts into textricator
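
If you would rather do this copy from Python than with robocopy, it can be sketched roughly as below. This is a minimal sketch and not part of this repository: copy_scripts is a hypothetical helper name, the paths in the usage comment simply mirror the ones in the Short Instructions, and dirs_exist_ok needs Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def copy_scripts(scripts_dir, textricator_dir):
    """Copy everything under scripts_dir into textricator_dir.

    Hypothetical helper: merges into the existing Textricator folder,
    including subfolders, and overwrites files with the same name.
    """
    scripts_dir = Path(scripts_dir)
    textricator_dir = Path(textricator_dir)
    if not scripts_dir.is_dir():
        raise FileNotFoundError(f"Scripts folder not found: {scripts_dir}")
    # dirs_exist_ok=True (Python 3.8+) lets the destination already exist.
    shutil.copytree(scripts_dir, textricator_dir, dirs_exist_ok=True)
    # Return the top-level names now present in the destination.
    return sorted(p.name for p in textricator_dir.iterdir())


# Example call (paths assumed from the Short Instructions above):
# copy_scripts(
#     "Downloads/2020-grc-machine-learning-hackathon-master/"
#     "2020-grc-machine-learning-hackathon-master/Scripts",
#     "Downloads/textricator-9.2.57-bin/textricator-9.2.57",
# )
```

After the copy, 0_run_all.bat should appear in the returned list of names if the Scripts folder was extracted correctly.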

Step 4: Run preprocessing scripts "0_run_all.bat"

The raw PDF files are converted and combined into one machine-readable CSV file.
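
To give a feel for the "combined into one CSV" part, here is a minimal sketch. It is not the code from this repository: it assumes each PDF has already been converted to its own CSV file in a single folder, and the function name combine_csv_files, the source_file column, and all file names are hypothetical.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Merge every *.csv in input_dir into one CSV file.

    Hypothetical helper: adds a source_file column so each row can be
    traced back to the document it came from. Returns the row count.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    rows = []
    fieldnames = ["source_file"]
    for csv_path in sorted(input_dir.glob("*.csv")):
        # Skip the output file if it lives in the same folder.
        if csv_path.resolve() == output_path.resolve():
            continue
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row.pop(None, None)  # drop cells beyond the header width
                row["source_file"] = csv_path.name
                rows.append(row)
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given source file did not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the source file name on every row is a design choice for this sketch: it makes it easier to find the original datasheet when a value in the combined dataset looks wrong.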

For details on how each step works, open the Python scripts (the ____.py files you copied) in a text editor (e.g. Notepad, Spyder IDE, or nano).
