This project is part of the Knox pipeline. It resides in the first layer and is centered on processing the data set of newspaper articles.
The project is divided into five packages:
- Crawling (traverses the folder structure of the data set; see the sketch after this list)
- Segmentation (based on the alto.xml files from the data set)
- Pre-processing (of the images; currently not used)
- OCR (extracts the text from the image files)
- Parsing of NITF files (converts .nitf files to the desired output)
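As a rough illustration, the crawling step boils down to a directory traversal such as the sketch below. The function name and the extension filter are assumptions made for illustration, not the actual package API:

import os

def crawl_data_set(root_dir):
    """Yield the path of every alto.xml and .jp2 file found under root_dir."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            # The extension filter is an assumption; adjust it to the data set.
            if filename.endswith((".xml", ".jp2")):
                yield os.path.join(dirpath, filename)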
In addition to these packages, the following repositories are also used:
https://github.com/Knox-AAU/PreprocessingLayer_Source_Data_IO
https://github.com/Knox-AAU/PreprocessingLayer_MongoDB_API
https://github.com/Knox-AAU/PreprocessingLayer_Alto_Segment_Lib
Along with the shared UI-repository:
https://github.com/Knox-AAU/UI_React
New features, such as pre- or post-processing, should be added to this project as new packages.
Please refer to the wiki for installation and usage guides.
The development team behind this project changes once a year, and for this reason support will be limited. However, if you have any questions or stumble upon something that has not been clearly described, you can contact one of the developers below, depending on the issue you are experiencing.
HOWEVER, please be sure that you have researched the issue extensively and documented it thoroughly along with your question.
For questions about the MongoDB API and the file watcher, please contact:
[email protected] (Fall 2021)
For questions about using and training Tesseract, please contact:
[email protected] (Fall 2021)
For general questions about the project as a whole (including all modules), please contact:
[email protected] (Fall 2020)
A general walkthrough of how the codebase is structured and which standards are followed can be seen below. Please ensure that you follow these conventions when contributing to the project.
Please also read the agreements on the wiki page. We follow the PEP 8 standard and use Black to format the code (Black is also run as part of the continuous integration when pushing to the repository).
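If you want to verify the formatting locally before pushing, Black can be run from the project root (a minimal example; the exact CI configuration may differ):

black --check .

Removing the --check flag makes Black reformat the files in place instead of only reporting violations.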
The naming of the different components follows the Python naming conventions, which can be seen in full here or boiled down here.
The unit tests for the project are structured according to the pytest documentation found here.
A test is defined as seen below:
def test_{method_name}_{what_should_happen_given_input}():
    assert method(param)
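As a concrete, hypothetical example (the word_count function is made up for illustration):

def word_count(text):
    # Hypothetical method under test, made up for illustration.
    return len(text.split())

def test_word_count_returns_zero_given_empty_string():
    assert word_count("") == 0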
To speed up the process of creating tests, some IDEs can generate tests for an entire class or a specific method.
The tests for a given file/class are grouped in a file named 'test_{name_of_file/class}.py'. The class within the file should be named Test{TestedClass}. Each of the test files can have specialized setup and teardown methods, as seen below:
def setup_method(self, method):
    """ setup any state tied to the execution of the given method in a
    class. setup_method is invoked for every test method of a class.
    """

...
{tests}
...

def teardown_method(self, method):
    """ teardown any state that was previously setup with a setup_method
    call.
    """
The setup_method is used to avoid duplicating set-up code for the tests in the file, as they often use similar or identical prerequisite data, such as initialized objects. The teardown_method is used to dispose of any objects or structures that would otherwise be left behind as garbage.
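As a concrete, hypothetical illustration of these conventions (the class and test names are made up), a test file could look like this:

class TestArticleBuffer:
    def setup_method(self, method):
        # Prerequisite data shared by the tests: a fresh, empty buffer.
        self.buffer = []

    def test_append_adds_one_element_given_single_call(self):
        self.buffer.append("article")
        assert len(self.buffer) == 1

    def teardown_method(self, method):
        # Dispose of the structure created in setup_method.
        self.buffer = None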
For some of the unit tests, mocking is used. We have utilized the unittest.mock library.
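As a small, hypothetical sketch of how unittest.mock can stand in for a dependency such as the OCR engine (the names are made up for illustration):

from unittest.mock import MagicMock

def test_ocr_engine_returns_extracted_text():
    # Replace the real OCR engine with a mock returning a fixed string.
    engine = MagicMock()
    engine.extract_text.return_value = "mocked text"
    assert engine.extract_text("page.tiff") == "mocked text"
    engine.extract_text.assert_called_once_with("page.tiff")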
It is also possible to set up test suites, allowing us to set the execution order of the tests and add conditions for which tests are executed; one way of doing this is sketched below.
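One way to achieve this with the standard library is unittest.TestSuite, shown in the sketch below (the test names are hypothetical, and the project may use a different mechanism):

import os
import unittest

class PipelineTests(unittest.TestCase):
    def test_crawling(self):
        self.assertTrue(True)  # placeholder body

    @unittest.skipIf(os.name == "nt", "conversion relies on Linux tooling")
    def test_conversion(self):
        self.assertTrue(True)  # placeholder body

def suite():
    # Adding the tests explicitly fixes their execution order.
    s = unittest.TestSuite()
    s.addTest(PipelineTests("test_crawling"))
    s.addTest(PipelineTests("test_conversion"))
    return s

if __name__ == "__main__":
    unittest.TextTestRunner().run(suite())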
If additional packages are desired, for example to structure new additions to the project, the following guides were used when the current packages were created:
https://packaging.python.org/tutorials/packaging-projects/
https://packaging.python.org/guides/hosting-your-own-index/
The data set consists of .jp2 files, but Tesseract requires .tiff files for training, so converting these files is necessary in certain situations. For Linux users, the ffmpeg package can be installed with the command:
sudo apt install ffmpeg
The conversion can then be performed with the command:
ffmpeg -i input.jp2 output.tiff
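If many files need converting, the command can be wrapped in a small Python sketch such as the one below (the directory layout is an assumption made for illustration):

import pathlib
import subprocess

def convert_jp2_to_tiff(root_dir):
    """Convert every .jp2 file under root_dir to a .tiff next to it."""
    for jp2_path in pathlib.Path(root_dir).rglob("*.jp2"):
        tiff_path = jp2_path.with_suffix(".tiff")
        # -n tells ffmpeg never to overwrite an existing output file.
        subprocess.run(["ffmpeg", "-n", "-i", str(jp2_path), str(tiff_path)], check=True)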
The teams that have taken part in the development can be seen below:
- Cs-21-sw-5-15 (Fall 2021)
  - Kristian Morsing Pedersen
  - Alex Farup Christensen
  - Cecilie Welling Fog
  - Jonas Noermark
  - Alexander Hansen
- SW517e20 (Fall 2020)
  - Alex Immerkær Kristensen
  - Ida Thoft Christiansen
  - Jakob Kjeldbjerg Lund
  - Lau Ernebjerg Josefsen
  - Lena Said
  - Niels Vistisen
  - Thomas Gjedsted Lorentzen
Special thanks to all supervisors, who have contributed their expertise to the development of the project:
- Theis Erik Jendal
The project is not done and is still a work in progress.
The functionality still missing from the system includes:
- Pre-processing
- Post-processing
- Enhanced extraction of text written in a Gothic font (improved in the fall of 2021)
- Segmentation and text extraction across multiple pages
None :-)