Article Separation

Description

Article separation (AS), also called article segmentation, is the process of dividing a newspaper page into its individual articles. Existing systems still require considerable input from human users, because they rely on handcrafted rules whose parameters must be tuned to the layout of each newspaper page. Within the NewsEye project, the aim is to create an automated workflow that is independent of any user input.

Workflow

The preceding tasks in the NewsEye project, 2.1 Layout Analysis (LA) and 2.2 Automated Text Recognition (ATR), provide geometric information in the form of text lines and baselines, together with the corresponding transcription of the text. We want to use this information to solve the AS task by combining traditional LA methods, semantic information, and machine-learning-based approaches.

Tasks 2.1 and 2.2 are processed and further developed in the Transkribus platform, which has its roots in the FP7 project tranScriptorium (2013-2015) and was further developed in the H2020 project READ (2016-2019).

The Transkribus GitHub repository can be found at https://github.com/transkribus/.

Used models and algorithms

Seq2Seq models for ATR: Paper
ARU-Net / Pixel labeling: GitHub Repository - Paper
Stroke Width Transform: Paper
DBSCAN: GitHub Repository - Paper
BERT: Paper
Graph Neural Networks: Paper
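
To illustrate how a density-based clustering step such as DBSCAN could group text lines into article candidates, here is a minimal sketch. It is not the CITlab implementation. It assumes each text line has been reduced to a hypothetical (x, y) baseline midpoint in pixels; the function name `dbscan`, the sample coordinates, and the values of `eps` and `min_pts` are invented for this example.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Cluster 2D points with DBSCAN; returns one label per point, -1 = noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical baseline midpoints: two text columns and one isolated line.
points = [
    (100, 100), (100, 130), (100, 160), (105, 190),  # left column
    (600, 100), (600, 130), (605, 160),              # right column
    (1200, 900),                                     # isolated line
]
labels = dbscan(points, eps=40, min_pts=2)
print(labels)
```

With these made-up values, the two columns end up in separate clusters and the isolated line is marked as noise. On real pages, a purely geometric grouping like this would only be one ingredient alongside the semantic and learned components listed above.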

Data

To our knowledge, there is no general AS dataset of newspaper pages that would allow a comparison with other existing workflows.

CITlab AS GitHub Repository:

The code for the article separation can be found in the following GitHub repository, which was updated for M45: https://github.com/CITlabRostock/citlab-article-separation-new
