Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralized metadata constraints #339

Merged
merged 16 commits into from
Apr 26, 2023
Merged

Centralized metadata constraints #339

merged 16 commits into from
Apr 26, 2023

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Mar 22, 2023

This PR addresses #334 (comment)

Fixes #334
Fixes #336

src/metadata.cpp Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 70.63% and project coverage change: +1.79 🎉

Comparison is base (3dd2f59) 43.79% compared to head (d32037c) 45.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #339      +/-   ##
==========================================
+ Coverage   43.79%   45.58%   +1.79%     
==========================================
  Files          23       26       +3     
  Lines        2393     2481      +88     
  Branches     1326     1380      +54     
==========================================
+ Hits         1048     1131      +83     
- Misses       1329     1334       +5     
  Partials       16       16              
Impacted Files Coverage Δ
src/zimcheck/checks.h 100.00% <ø> (ø)
src/zimwriterfs/zimwriterfs.cpp 0.00% <0.00%> (ø)
src/metadata.h 50.00% <50.00%> (ø)
src/metadata.cpp 98.70% <98.70%> (ø)
src/metadata_constraints.cpp 100.00% <100.00%> (ø)
src/zimcheck/checks.cpp 96.56% <100.00%> (-0.01%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@veloman-yunkan
Copy link
Collaborator Author

Inviting @mgautierfr @kelson42 to review the overall approach for centralizing all metadata checks in one place. Please look at the file metadata_constraints.cpp (note that it is not compiled separately, but is rather #included in metadata.cpp.

Issues noticed so far:

  • regex may be a convenient tool as a lowest common denominator for validating various strings, however it falls short for some types of data if we need more rigorous checks. In our case, validating the date with a simple regex will allow entering invalid dates (e.g. 30th of February). This can be fixed by changing the last field in the metadata info table from a regexp to a special polymorphic validator object - a concrete implementation of thereof can be created from a regular expression. Do we want to do it now?
  • complex checks involving dependencies between different metadata entries (e.g. the long description, if present, should be longer than the short description; the text of description metadata should be in the language matching the value of the Language metadata, etc). A draft implementation for such checks is provided in the second commit.

@kelson42
Copy link
Contributor

regex may be a convenient tool as a lowest common denominator for validating various strings, however it falls short for some types of data if we need more rigorous checks. In our case, validating the date with a simple regex will allow entering invalid dates (e.g. 30th of February). This can be fixed by changing the last field in the metadata info table from a regexp to a special polymorphic validator object - a concrete implementation of thereof can be created from a regular expression. Do we want to do it now?

@veloman-yunkan Thank you very much for your PR which is, at a first sight, really much the kind of code I was expected.

Indeed regexs fall short in many cases. The case I have in mind is checking illustration entries (requires libpng).

I'm not in favour of extending much the validation scenarios because this can go endlessly and my goal is to have the basic ones.

But I would appreciate to have the appropriate code infrastructure so we can easily do that work later... without rearchitecturing effort.

@kelson42
Copy link
Contributor

On my local computer, it does not compile:

FAILED: test/metadata-test.p/.._src_metadata.cpp.o 
ccache c++ -Itest/metadata-test.p -Itest -I../test -I/usr/local/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -Werror -std=c++11 -g -Werror -Wall '-DVERSION="3.1.3"' -DGTEST_HAS_PTHREAD=1 -pthread -MD -MQ test/metadata-test.p/.._src_metadata.cpp.o -MF test/metadata-test.p/.._src_metadata.cpp.o.d -o test/metadata-test.p/.._src_metadata.cpp.o -c ../src/metadata.cpp
../src/metadata.cpp: In static member function ‘static const zim::Metadata::ReservedMetadataRecord& zim::Metadata::getReservedMetadataRecord(const string&)’:
../src/metadata.cpp:100:14: error: ‘out_of_range’ is not a member of ‘std’
  100 |   throw std::out_of_range(name + " is not a reserved metadata name");
      |              ^~~~~~~~~~~~
../src/metadata.cpp:101:1: error: control reaches end of non-void function [-Werror=return-type]
  101 | }
      | ^
cc1plus: all warnings being treated as errors
[41/48] Compiling C++ object test/tools-test.p/.._src_zimwriterfs_zimcreatorfs.cpp.o
ninja: build stopped: subcommand failed.

@rgaudin
Copy link
Member

rgaudin commented Mar 27, 2023

I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.

Couple of comments (already identified):

  • I don't think a language regexp makes sense. If you want to ensure validity, you can only work of a valid list IMO. Making sure that it's 3 lowercase letters (or proper list of such) would already be beneficial.
  • date regexp wont capture dates such as 2023-13-01. Going the extra mile (as suggested) would not guarantee a valid date with cases like Feb 29th… that would require a date dependency
  • min length for all those string metadata looks problematic. Max length as well as long as the spec is silent about it. We could raise a warning above a certain threshold though.
  • I am not a fan of the “LongDescription shouldn't be shorter than Description” because that's not in the spec.
  • “Description must be in the language of the ZIM file” is an interesting than deserves a discussion and decision. I am in favor of updating the spec on this to specify whatever we decide.
  • Illustrations are not checked despite it being the ticket's moto.

Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.”. Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?

I join @kelson42 in thinking we want basic checks for now that we could extend in the future.

  • Simple date format regex
  • Simple format regexp for language
  • magic number check for Illustrations
  • len >= 1 for mandatory texts
  • len <= max for bounded texts

And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.

Although it serves a different purpose, scraperlib now (not being used yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc) so most of what we produce shall be valid in this regard.

@holta

This comment was marked as off-topic.

@mgautierfr
Copy link
Collaborator

I have open a issue #340 to discuss what to test and how. Please continue the discussion there and keep this discussion for PR review.

@kelson42
Copy link
Contributor

kelson42 commented Mar 27, 2023

I just want to emphasis again that we need this basic metadata feature pretty quickly. Thank you for having open side tickets to continue the interesting discussion and keep this PR "simple" without "insulting" the future.

@kelson42
Copy link
Contributor

@veloman-yunkan I hope you got all the feedbacks needed, let us know othwerwise.

@veloman-yunkan
Copy link
Collaborator Author

With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?

  1. Most metadata being textual, it's better to specify the metadata size limits in characters and not impose any size constraints on illustration metadata.
    1.5 Like above, but handle illustration metadata as a special case in the C++ code and interpret its size limits in bytes.
  2. Specify size limits in bytes. Then languages with non-latin scripts will be penalized with lower human-comprehensible upper limits on the length of the description metadata.

@rgaudin
Copy link
Member

rgaudin commented Mar 28, 2023

With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?

  1. We'd only check the text length already in the spec for the couple that have limits. Illustration have no such constraint in the spec so we'd leave them out for now.

@veloman-yunkan veloman-yunkan force-pushed the metadata_table branch 4 times, most recently from 04d2355 to ecbd62c Compare March 28, 2023 15:17
@veloman-yunkan veloman-yunkan changed the title Centralized metadata table Centralized metadata constraints Mar 28, 2023
@veloman-yunkan veloman-yunkan marked this pull request as ready for review March 28, 2023 15:19
@veloman-yunkan
Copy link
Collaborator Author

A preliminary version of this PR (reflecting feedback received on the draft version) is ready. Note that the PR doesn't utilize the new Metadata utility in any of the zim tools (zimcheck, zimwriterfs).

@kelson42
Copy link
Contributor

kelson42 commented Apr 2, 2023

@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.

@kelson42
Copy link
Contributor

kelson42 commented Apr 5, 2023

@mgautierfr any feedback/review?

Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems globally good to me.
A small comment on the static pointer thing but the global structure is good.

How do you plan to implement checkTextLanguage ?
Or it is just a sample and it have to be removed before merge ?

src/metadata.cpp Show resolved Hide resolved
@veloman-yunkan
Copy link
Collaborator Author

How do you plan to implement checkTextLanguage ?
Or it is just a sample and it have to be removed before merge ?

In this PR checkTextLanguage() was introduced only to illustrate support for complex constraints. I will get rid of it before merging.

@mgautierfr
Copy link
Collaborator

In this PR checkTextLanguage() was introduced only to illustrate support for complex constraints. I will get rid of it before merging.

Ok. You can remove it right now (or comment it to not loose the illustration purpose) while you fix the issue with last pointer.

@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Apr 19, 2023

@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.

I added usage of this new utility in zimwriterfs. As a result a few issues (including two preexisting bugs related to the --longDescription option of zimwriterfs) were uncovered during testing. I will finalize this PR tomorrow.

@veloman-yunkan
Copy link
Collaborator Author

Note that metadata constraints were relaxed in order to let the zimcheck
unittests pass without updating/regenerating the unittest ZIM data.

How complex it is to regenerate the unittest data ?

The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs or the one used to create the current version of test ZIM files.

@mgautierfr
Copy link
Collaborator

The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs or the one used to create the current version of test ZIM files.

I think here this is not relevant (at least for the change we want to made).
But the question is not easy to answer for generic case. The solution is probably in creating a full test suite with zim files created/tested with different version (https://github.com/openzim/zim-testing-suite)

@veloman-yunkan
Copy link
Collaborator Author

The PR is final (though needs at least a rebase&fixup unless other issues are spotted).

Known issues:

@kelson42
Copy link
Contributor

@veloman-yunkan For now, I'm happy with this kind of regex error.

@mgautierfr
Copy link
Collaborator

Please rebase-fixup and we can merge.

This commit is intended to demonstrate how complex checks (involving more
than one metadata entries) can be added if required.
When -L|--longDesciption option was added to zimwriterfs its argument
remained unused (participating in constraint checks doesn't count).
-L option of zimwriterfs was not properly registered in the call to
`getopt_long()`
For some reason the PNG regex check fails on real data. But at least it
uncovered the issue with the error message which reads as bellow:

```
Illustration_48x48@1 doesn't match regex: ^�PNG
�
.+
```
Regexp is not the best tool for use with binary data where NUL characters
may occur.
Note that metadata constraints were relaxed in order to let the zimcheck
unittests pass without updating/regenerating the unittest ZIM data.
Test ZIM files (good.zim, poor.zim and bad_checksum.zim) under
test/data/zimfiles were regenerated using zimwriterfs v3.1.3 and the
updated script create_test_zimfiles.

Metadata constraints that were relaxed in the previous commit have
been restored.
Before this change, simple metadata checks were performed only if there
were no missing metadata entries.
@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Apr 26, 2023

Please rebase-fixup and we can merge.

Done

@mgautierfr
Copy link
Collaborator

Regexp constraints can duplicate length constraints. For example for Language metadata the following pair of errors is reported when the language is specified using a 2-char code:

  • Language must contain at least 3 characters
  • Language doesn't match regex: \w{3}(,\w{3})*

I think this is a problem of coherence between checks and it can be fixed later (and we probably have to update the checks). For now the structure is good and we can merge this long PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Common constraint: Description metadata should not be empty Don't check the existing of favicon by zimcheck
5 participants