Commit

piconti committed Sep 23, 2024
2 parents 8a3d232 + 8c8722e commit 2ab8682
Showing 11 changed files with 295 additions and 84 deletions.
5 changes: 5 additions & 0 deletions .flake8
@@ -0,0 +1,5 @@
# configuration for black compatibility

[flake8]
max-line-length = 88
extend-ignore = E203, W503
2 changes: 2 additions & 0 deletions README.md
@@ -22,6 +22,8 @@ We define schemas for:
- [Language Identification](docs/language_identification.md) (draft 06)
- Entities
- [Entities](docs/entities.md) (2020-12)
- [OCR Quality Assessment](docs/ocr_qa.md) (OCR-QA)


#### Processes
- Data processing manifests (todo)
3 changes: 3 additions & 0 deletions docs/ocr_qa-properties-ci_ref.md
@@ -0,0 +1,3 @@
## ci\_ref Type

`string`
3 changes: 3 additions & 0 deletions docs/ocr_qa-properties-ocrqa.md
@@ -0,0 +1,3 @@
## ocrqa Type

`number`
46 changes: 46 additions & 0 deletions docs/ocr_qa.md
@@ -0,0 +1,46 @@
## OCR-QA JSON Schema Type

`object` ([OCR-QA JSON Schema](ocr_qa.md))

# OCR-QA JSON Schema Properties

| Property | Type | Required | Nullable | Defined by |
| :----------------- | :------- | :------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------------------------ |
| [ci\_ref](#ci_ref) | `string` | Required | cannot be null | [OCR-QA JSON Schema](ocr_qa-properties-ci_ref.md "https://impresso.github.io/impresso-schemas/json/ocr_qa/ocr_qa.schema.json#/properties/ci_ref") |
| [ocrqa](#ocrqa) | `number` | Required | cannot be null | [OCR-QA JSON Schema](ocr_qa-properties-ocrqa.md "https://impresso.github.io/impresso-schemas/json/ocr_qa/ocr_qa.schema.json#/properties/ocrqa") |

## ci\_ref

Reference to canonical content item id, typically an article

`ci_ref`

* is required

* Type: `string`

* cannot be null

* defined in: [OCR-QA JSON Schema](ocr_qa-properties-ci_ref.md "https://impresso.github.io/impresso-schemas/json/ocr_qa/ocr_qa.schema.json#/properties/ci_ref")

### ci\_ref Type

`string`

## ocrqa

The estimated OCR quality, between 0 and 1

`ocrqa`

* is required

* Type: `number`

* cannot be null

* defined in: [OCR-QA JSON Schema](ocr_qa-properties-ocrqa.md "https://impresso.github.io/impresso-schemas/json/ocr_qa/ocr_qa.schema.json#/properties/ocrqa")

### ocrqa Type

`number`
62 changes: 62 additions & 0 deletions examples/language_identification/example0-invalid.json
@@ -0,0 +1,62 @@
{
"id": "luxzeit1858-1859-01-18-a-i0026",
"lg_decision": "voting",
"tp": "tb",
"len": 63,
"orig_lg": null,
"alphabetical_ratio": 0.254,
"impresso_language_identifier_version": {
"version": "v1.4.1",
"ts": "2020-12-28T10:27:11+00:00"
},
"language_identifier_version": {
"version": "v1.4.1",
"ts": "2020-12-28T10:15:45+00:00"
},
"impresso_ft": [
{
"lang": "fr",
"prob": 0.969
},
{
"lang": "de",
"prob": 0.03
}
],
"langdetect": [
{
"lang": "ro",
"prob": 0.667
},
{
"lang": "ca",
"prob": 0.333
}
],
"langid": [
{
"lang": "ro",
"prob": 0.655
}
],
"wp_ft": [
{
"lang": "es",
"prob": 0.305
},
{
"lang": "ca",
"prob": 0.121
},
{
"lang": "war",
"prob": 0.106
}
],
"votes": [
{
"lang": "fr",
"vote": 0.942
}
]
}
4 changes: 4 additions & 0 deletions examples/ocr_qa/ocr_qa_example.json
@@ -0,0 +1,4 @@
{
"ci_ref": "actionfem-1939-05-15-a-i0022",
"ocrqa": 0.86
}
Original file line number Diff line number Diff line change
@@ -45,13 +45,13 @@
"lg_decision": {
"enum": [
"all",
-"all-but-impresso-ft",
+"all-but-impresso_ft",
"voting",
"dominant-by-len",
"dominant-by-lowvote"
],
"type": "string",
-"description": "An identifier for the decision strategy applied to the content item: 'all' = all LID systems/info agree; 'all-but-impresso-ft' = all LID except impresso_ft agree on a language other than de/fr; 'dominant-by-len' = the most frequent language of the ensemble decisions is selected because there are too few characters; 'dominant-by-lowvote' = the most frequent language of the ensemble decisions is selected because there are too few votes; 'voting' = the language with the highest vote count is selected "
+"description": "An identifier for the decision strategy applied to the content item: 'all' = all LID systems/info agree; 'all-but-impresso_ft' = all LID except impresso_ft agree on a language other than de/fr; 'dominant-by-len' = the most frequent language of the ensemble decisions is selected because there are too few characters; 'dominant-by-lowvote' = the most frequent language of the ensemble decisions is selected because there are too few votes; 'voting' = the language with the highest vote count is selected "
},
"tp": {
"type": "string",
6 changes: 3 additions & 3 deletions json/linguistic_annotation/ling_spacy.schema.json
@@ -18,7 +18,7 @@
"examples": [
"2009-06-15T13:45:30"
],
-"pattern": "20\d{2}-((0[1-9])|(1[0-2]))-((0[1-9])|([1-2][0-9])|(3[0-1]))T(([0-1]|[0-1][0-9])|(2[0-3])):([0-5][0-9]):([0-5][0-9])"
+"pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
},
"properties": {
"id": {
@@ -73,7 +73,7 @@
"title": "The Items Schema",
"required": [
"t",
-"o",
+"o"
],
"properties": {
"t": {
@@ -124,7 +124,7 @@
"B-LOC"
],
"pattern": "^(.*)$"
-},
+}
}
}
}
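
The changes to `ling_spacy.schema.json` appear to address JSON validity (a trailing comma after `"o"` and after a closing brace, and a `\d` escape inside a string) and to tighten the timestamp pattern. The following sketch is illustrative and not part of the commit; the helper name `is_valid_json` and the shortened sample strings are assumptions made for the example. It uses Python's `json` and `re` modules, with `re` standing in as an approximation of the regex dialect a JSON Schema validator would apply. Note that the new pattern requires a trailing `Z`, which the example value listed in the schema (`2009-06-15T13:45:30`) does not have.

```python
import json
import re


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Trailing commas are not allowed in JSON arrays or objects.
with_trailing_comma = '{"required": ["t", "o",]}'
without_trailing_comma = '{"required": ["t", "o"]}'

# "\d" is not a legal escape sequence inside a JSON string.
with_backslash_d = r'{"pattern": "20\d{2}"}'
with_char_class = '{"pattern": "^[0-9]{4}$"}'

# The new pattern from the diff: anchored, and requiring a trailing "Z".
NEW_PATTERN = re.compile(
    r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
)

schema_example = "2009-06-15T13:45:30"  # example value listed in the schema
utc_example = "2009-06-15T13:45:30Z"    # same timestamp with a "Z" suffix

if __name__ == "__main__":
    print(is_valid_json(with_trailing_comma))       # False
    print(is_valid_json(without_trailing_comma))    # True
    print(is_valid_json(with_backslash_d))          # False
    print(is_valid_json(with_char_class))           # True
    print(bool(NEW_PATTERN.search(schema_example)))  # False
    print(bool(NEW_PATTERN.search(utc_example)))     # True
```

Since `examples` is an annotation keyword in JSON Schema, a validator would not normally reject the schema over this mismatch; it only means the listed example would fail validation as an instance.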
21 changes: 21 additions & 0 deletions json/ocr_qa/ocr_qa.schema.json
@@ -0,0 +1,21 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://impresso.github.io/impresso-schemas/json/ocr_qa/ocr_qa.schema.json",
"title": "OCR-QA JSON Schema",
"description": "A representation for the assessment of OCR quality of content items.",
"type": "object",
"properties": {
"ci_ref": {
"type": "string",
"description": "Reference to canonical content item id, typically an article"
},
"ocrqa": {
"type": "number",
"description": "The estimated OCR quality, between 0 and 1"
}
},
"required": [
"ocrqa",
"ci_ref"
]
}
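
As a rough illustration of what the new schema constrains, the sketch below checks a record by hand using only the Python standard library. It is not part of the commit, and the function name `check_ocr_qa` is hypothetical. The schema declares both properties as required and typed, but it states the 0 to 1 range only in the `description` (there is no `minimum` or `maximum`), so the range test here is an extra check beyond what a schema validator would enforce.

```python
import json

# Record copied from examples/ocr_qa/ocr_qa_example.json.
EXAMPLE = '{"ci_ref": "actionfem-1939-05-15-a-i0022", "ocrqa": 0.86}'


def check_ocr_qa(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Both properties are listed under "required" in the schema.
    for key in ("ci_ref", "ocrqa"):
        if key not in record:
            problems.append(f"missing required property: {key}")
    if "ci_ref" in record and not isinstance(record["ci_ref"], str):
        problems.append("ci_ref must be a string")
    if "ocrqa" in record:
        score = record["ocrqa"]
        # bool is a subclass of int in Python, but is not a JSON number.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append("ocrqa must be a number")
        elif not 0 <= score <= 1:
            # Extra check: the range appears only in the description text.
            problems.append("ocrqa should lie between 0 and 1")
    return problems


if __name__ == "__main__":
    print(check_ocr_qa(json.loads(EXAMPLE)))            # []
    print(check_ocr_qa({"ci_ref": "x", "ocrqa": 1.5}))  # range problem
```

If the range is meant to be binding, adding `"minimum": 0` and `"maximum": 1` to the `ocrqa` property would let any JSON Schema validator enforce it.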