Merge pull request #38 from impresso/text_reuse_schemas
Text reuse schemas
piconti authored Oct 7, 2024
2 parents 2ab8682 + 0c19b95 commit ec5e806
Showing 59 changed files with 1,184 additions and 81 deletions.
3 changes: 3 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -16,6 +16,8 @@ tests:
jsonschema -V Draft202012Validator -i examples/entities/example1.json json/entities/entities.schema.json && $(print-test-ok)|| $(print-test-failed)
jsonschema -V Draft202012Validator -i examples/entities/example2.json json/entities/entities.schema.json && $(print-test-ok)|| $(print-test-failed)
jsonschema -V Draft202012Validator -i examples/versioning_manifest/canonical_v0-0-1.json json/versioning/manifest.schema.json && $(print-test-ok)|| $(print-test-failed)
jsonschema -V Draft202012Validator -i examples/text_reuse/tr_cluster_example.json json/text_reuse/cluster.schema.json && $(print-test-ok)|| $(print-test-failed)
jsonschema -V Draft202012Validator -i examples/text_reuse/tr_passage_example.json json/text_reuse/passage.schema.json && $(print-test-ok)|| $(print-test-failed)


clean-documentation:
@@ -30,6 +32,7 @@ documentation:
jsonschema2md -d json/entities/ --header false -n -v 2020-12 -o docs -x - -s propTable
#jsonschema2md -d json/linguistic_annotation/ -n -v 06 -o docs
jsonschema2md -d json/versioning/ --header false -n -v 2024-02 -o docs -x - -s propTable
jsonschema2md -d json/text_reuse/ --header false -n -v 2024-09 -o docs -x - -s propTable

##########################################################################################
# Simple macros for tests
3 changes: 3 additions & 0 deletions docs/cluster-properties-cluster_size.md
@@ -0,0 +1,3 @@
## cluster\_size Type

`integer`
13 changes: 13 additions & 0 deletions docs/cluster-properties-doc_ids-items.md
@@ -0,0 +1,13 @@
## items Type

`string`

## items Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^[a-zA-Z0-9]+-\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-[a-z]-i\d{4}$
```

[try pattern](https://regexr.com/?expression=%5E%5Ba-zA-Z0-9%5D%2B-%5Cd%7B4%7D-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)-%5Ba-z%5D-i%5Cd%7B4%7D%24 "try regular expression with regexr.com")
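The pattern above can be checked outside a schema validator as well. The following is a minimal Python sketch, not part of this commit: the function name `is_valid_doc_id` and the sample ID are made up for illustration. It uses `re.ASCII` and `fullmatch` so that Python's `\d` and `$` behave closer to the ECMAScript-style regular expressions that JSON Schema assumes.

```python
import re

# Pattern copied verbatim from the doc_ids items constraint.
# re.ASCII restricts \d to 0-9, as in ECMAScript regular expressions.
DOC_ID_PATTERN = re.compile(
    r"^[a-zA-Z0-9]+-\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-[a-z]-i\d{4}$",
    re.ASCII,
)


def is_valid_doc_id(doc_id: str) -> bool:
    """Return True if doc_id matches the documented pattern (hypothetical helper)."""
    # fullmatch rejects a trailing newline, which a bare `$` in Python would allow.
    return DOC_ID_PATTERN.fullmatch(doc_id) is not None


# The IDs below are invented examples, not taken from the repository.
print(is_valid_doc_id("GDL-1900-01-02-a-i0001"))  # True
print(is_valid_doc_id("GDL-1900-13-02-a-i0001"))  # False: month 13 is rejected
```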
3 changes: 3 additions & 0 deletions docs/cluster-properties-doc_ids.md
@@ -0,0 +1,3 @@
## doc\_ids Type

`string[]`
3 changes: 3 additions & 0 deletions docs/cluster-properties-id.md
@@ -0,0 +1,3 @@
## id Type

`string`
3 changes: 3 additions & 0 deletions docs/cluster-properties-lexical_overlap.md
@@ -0,0 +1,3 @@
## lexical\_overlap Type

`number`
13 changes: 13 additions & 0 deletions docs/cluster-properties-max_date.md
@@ -0,0 +1,13 @@
## max\_date Type

`string`

## max\_date Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
```

[try pattern](https://regexr.com/?expression=%5E\(%5Cd%7B4%7D\)-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)%24 "try regular expression with regexr.com")
13 changes: 13 additions & 0 deletions docs/cluster-properties-min_date.md
@@ -0,0 +1,13 @@
## min\_date Type

`string`

## min\_date Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
```

[try pattern](https://regexr.com/?expression=%5E\(%5Cd%7B4%7D\)-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)%24 "try regular expression with regexr.com")
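Note that the date pattern only constrains the shape of the string: a value such as `1900-02-31` matches it even though it is not a real calendar date. A consumer that needs real dates could add a second check. The sketch below is one way to do that in Python; `parse_cluster_date` is a hypothetical helper name, not something defined in this repository.

```python
import re
from datetime import date

# Pattern copied verbatim from the min_date / max_date constraint.
DATE_PATTERN = re.compile(
    r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$",
    re.ASCII,
)


def parse_cluster_date(value: str) -> date:
    """Check the documented pattern, then check calendar validity (hypothetical helper)."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"does not match the schema pattern: {value!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() raises ValueError for impossible dates such as 1900-02-31.
    return date(year, month, day)


print(parse_cluster_date("1900-01-02"))  # 1900-01-02
```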
3 changes: 3 additions & 0 deletions docs/cluster-properties-newspapers-items.md
@@ -0,0 +1,3 @@
## items Type

`string`
3 changes: 3 additions & 0 deletions docs/cluster-properties-newspapers.md
@@ -0,0 +1,3 @@
## newspapers Type

`string[]`
13 changes: 13 additions & 0 deletions docs/cluster-properties-passages-items.md
@@ -0,0 +1,13 @@
## items Type

`string`

## items Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^[a-zA-Z0-9]+-\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-[a-z]-i\d{4}@\d+:\d+$
```

[try pattern](https://regexr.com/?expression=%5E%5Ba-zA-Z0-9%5D%2B-%5Cd%7B4%7D-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)-%5Ba-z%5D-i%5Cd%7B4%7D%40%5Cd%2B%3A%5Cd%2B%24 "try regular expression with regexr.com")
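Compared with the `doc_ids` items pattern, this pattern appends `@<integer>:<integer>` to a document ID. The sketch below shows one way to split such a passage ID in Python. The helper and field names are hypothetical, and the meaning of the two integers (presumably offsets within the document) is an assumption that the schema does not state. The regular expression is the documented one, with named and non-capturing groups added for readability.

```python
import re
from typing import NamedTuple

# Same language as the documented pattern; only the grouping differs.
PASSAGE_ID_PATTERN = re.compile(
    r"^(?P<doc_id>[a-zA-Z0-9]+-\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])-[a-z]-i\d{4})"
    r"@(?P<first>\d+):(?P<second>\d+)$",
    re.ASCII,
)


class PassageId(NamedTuple):
    doc_id: str  # the part before "@"
    first: int   # first integer after "@" (meaning assumed, not documented)
    second: int  # second integer after ":" (meaning assumed, not documented)


def split_passage_id(passage_id: str) -> PassageId:
    """Split a passage ID into its document ID and two integers (hypothetical helper)."""
    match = PASSAGE_ID_PATTERN.fullmatch(passage_id)
    if match is None:
        raise ValueError(f"not a valid passage ID: {passage_id!r}")
    return PassageId(match["doc_id"], int(match["first"]), int(match["second"]))


# Invented example ID, for illustration only.
print(split_passage_id("GDL-1900-01-02-a-i0001@10:250"))
```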
3 changes: 3 additions & 0 deletions docs/cluster-properties-passages.md
@@ -0,0 +1,3 @@
## passages Type

`string[]`
3 changes: 3 additions & 0 deletions docs/cluster-properties-time_delta.md
@@ -0,0 +1,3 @@
## time\_delta Type

`integer`
199 changes: 199 additions & 0 deletions docs/cluster.md
@@ -0,0 +1,199 @@
## Text-Reuse Cluster Type

`object` ([Text-Reuse Cluster](cluster.md))

# Text-Reuse Cluster Properties

| Property | Type | Required | Nullable | Defined by |
| :----------------------------------- | :-------- | :------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [id](#id) | `string` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-id.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/id") |
| [min\_date](#min_date) | `string` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-min_date.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/min_date") |
| [max\_date](#max_date) | `string` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-max_date.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/max_date") |
| [cluster\_size](#cluster_size) | `integer` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-cluster_size.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/cluster_size") |
| [time\_delta](#time_delta) | `integer` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-time_delta.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/time_delta") |
| [newspapers](#newspapers) | `array` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-newspapers.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/newspapers") |
| [passages](#passages) | `array` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-passages.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/passages") |
| [doc\_ids](#doc_ids) | `array` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-doc_ids.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/doc_ids") |
| [lexical\_overlap](#lexical_overlap) | `number` | Required | cannot be null | [Text-Reuse Cluster](cluster-properties-lexical_overlap.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/lexical_overlap") |

## id

Unique identifier for this cluster.

`id`

* is required

* Type: `string`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-id.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/id")

### id Type

`string`

## min\_date

Earliest date represented in the article passages of the cluster.

`min_date`

* is required

* Type: `string`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-min_date.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/min_date")

### min\_date Type

`string`

### min\_date Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
```

[try pattern](https://regexr.com/?expression=%5E\(%5Cd%7B4%7D\)-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)%24 "try regular expression with regexr.com")

## max\_date

Latest date represented in the article passages of the cluster.

`max_date`

* is required

* Type: `string`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-max_date.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/max_date")

### max\_date Type

`string`

### max\_date Constraints

**pattern**: the string must match the following regular expression: 

```regexp
^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
```

[try pattern](https://regexr.com/?expression=%5E\(%5Cd%7B4%7D\)-\(0%5B1-9%5D%7C1%5B0-2%5D\)-\(0%5B1-9%5D%7C%5B12%5D%5Cd%7C3%5B01%5D\)%24 "try regular expression with regexr.com")

## cluster\_size

Number of article passages present in the cluster.

`cluster_size`

* is required

* Type: `integer`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-cluster_size.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/cluster_size")

### cluster\_size Type

`integer`

## time\_delta

Number of days between min\_date and max\_date.

`time_delta`

* is required

* Type: `integer`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-time_delta.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/time_delta")

### time\_delta Type

`integer`
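As a rough illustration of how `time_delta` relates to `min_date` and `max_date`, the Python sketch below computes the difference in days between two dates in the documented `YYYY-MM-DD` format. It assumes a plain difference (the same day gives 0); the schema does not say whether the count is inclusive, and `time_delta_days` is a hypothetical helper name.

```python
from datetime import date


def time_delta_days(min_date: str, max_date: str) -> int:
    """Days between two YYYY-MM-DD dates, as a plain difference (assumed convention)."""
    start = date.fromisoformat(min_date)
    end = date.fromisoformat(max_date)
    return (end - start).days


# Invented dates, for illustration only.
print(time_delta_days("1900-01-02", "1900-01-12"))  # 10
```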

## newspapers

List of all newspapers represented in this cluster with article passages.

`newspapers`

* is required

* Type: `string[]`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-newspapers.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/newspapers")

### newspapers Type

`string[]`

## passages

List of all article passages composing this cluster.

`passages`

* is required

* Type: `string[]`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-passages.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/passages")

### passages Type

`string[]`

## doc\_ids

List of the IDs of all documents (content items) that contain the article passages of this cluster.

`doc_ids`

* is required

* Type: `string[]`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-doc_ids.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/doc_ids")

### doc\_ids Type

`string[]`

## lexical\_overlap

Computed average lexical overlap of the passages within the cluster.

`lexical_overlap`

* is required

* Type: `number`

* cannot be null

* defined in: [Text-Reuse Cluster](cluster-properties-lexical_overlap.md "https://impresso.github.io/impresso-schemas/json/text_reuse/cluster.schema.json#/properties/lexical_overlap")

### lexical\_overlap Type

`number`
3 changes: 3 additions & 0 deletions docs/contentitem-properties-ro.md
@@ -0,0 +1,3 @@
## ro Type

`integer`
19 changes: 19 additions & 0 deletions docs/contentitem.md
@@ -15,6 +15,7 @@
| [d](#d) | `string` | Optional | cannot be null | [Content Item](contentitem-properties-d.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/d") |
| [ts](#ts) | `string` | Required | cannot be null | [Content Item](contentitem-properties-ts.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/ts") |
| [lg](#lg) | `string` | Optional | cannot be null | [Content Item](contentitem-properties-lg.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/lg") |
| [ro](#ro) | `integer` | Optional | cannot be null | [Content Item](contentitem-properties-ro.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/ro") |
| [ft](#ft) | `string` | Optional | cannot be null | [Content Item](contentitem-properties-ft.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/ft") |
| [lb](#lb) | `array` | Optional | cannot be null | [Content Item](contentitem-properties-lb.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/lb") |
| [pb](#pb) | `array` | Optional | cannot be null | [Content Item](contentitem-properties-pb.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/pb") |
@@ -217,6 +218,24 @@ two letter language code

[try pattern](https://regexr.com/?expression=%5E%5Ba-z%5D%7B2%7D%24 "try regular expression with regexr.com")

## ro

Reading order index of the content item, used for the table of contents view in the interface. If not defined, the content item (CI) number (the digits after 'i' in the ID) should be used instead.

`ro`

* is optional

* Type: `integer`

* cannot be null

* defined in: [Content Item](contentitem-properties-ro.md "https://impresso.github.io/impresso-schemas/json/newspaper/contentitem.schema.json#/properties/ro")

### ro Type

`integer`
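The fallback described for `ro` could be implemented along the following lines. This is only a sketch: it assumes the content item is a Python dict whose ID is stored under an `id` key and ends in `-i` followed by digits (as in the `doc_ids` pattern of the text-reuse schema). The function name `reading_order` is hypothetical.

```python
def reading_order(content_item: dict) -> int:
    """Return `ro` if present, else the CI number taken from the ID (hypothetical helper)."""
    ro = content_item.get("ro")
    if ro is not None:
        return ro
    # Fallback: the digits after the last "-i" in the ID, e.g. "...-a-i0012" gives 12.
    # The "id" key is an assumption about how the content item is represented.
    return int(content_item["id"].rsplit("-i", 1)[1])


# Invented content items, for illustration only.
print(reading_order({"id": "GDL-1900-01-02-a-i0012"}))           # 12
print(reading_order({"id": "GDL-1900-01-02-a-i0012", "ro": 3}))  # 3
```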

## ft

the rebuilt fulltext
3 changes: 3 additions & 0 deletions docs/issue-definitions-metadata-properties-ro.md
@@ -0,0 +1,3 @@
## ro Type

`integer`