Removing duplicates from the output? #26
Comments
Hi, |
@dachafra I do not think that the R2RML spec clarifies what processors need to do with duplicate triples. I think this might indeed become an issue for streaming data, but for static KGs generated with [R2]RML, I think it is up to library implementations and stores to deal with duplicates. E.g., some SPARQL endpoints deduplicate. All I see in the R2RML spec is:
but there is no clarification if the processors need to deduplicate. To the contrary, the latter sentence gives the feeling that it doesn't matter if there are duplicates |
@andimou I partially agree with you. Although the R2RML specification does not specifically say that RDF graphs with duplicates are invalid, it mentions On the other hand, I understand that this task may not be relevant for some of the parsers and that it can be delegated to SPARQL endpoints or other mechanisms that perform the deduplication |
I interpret it as redundant but not harmful: if it happens, it doesn't matter, because it has the same effect! As long as the spec allows different interpretations, as it does here between the two of us who are equally knowledgeable about [R2]RML, I think processors may choose to implement this as they like, because the spec does not specify what they should do, as it does in other cases. In this case, I like the vagueness :)

Forcing deduplication may become complicated in some scenarios. For instance, in a streaming scenario it might mean keeping in memory every triple ever generated so that you don't feed the stream with the same triple, eventually exhausting all your memory. I'd rather have duplicates then! They are not harmful after all ;)

Then again, what counts as a duplicate? The triple itself might be the same, but in the case of the Streamer its timestamp might be different. Is it the same triple or a different one?! In the RMLMapper, PROV is generated automatically: two identical triples may be considered duplicates, but if you capture their provenance they are not, because their provenance might differ. In one case the processor may remove them, in the other it may not.

I do not see, though, how having duplicates affects quality. I don't think the existence of duplicate triples is considered in any of the Linked Data quality dimensions/metrics. |
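To make the trade-off above concrete, here is a minimal sketch in plain Python. It is not how the RMLStreamer or the RMLMapper work; the function name `dedup_stream`, the `key` and `window` parameters, and the `(subject, predicate, object, timestamp)` record layout are all assumptions made for illustration. The `key` function decides what counts as a duplicate, and `window` bounds how many keys are remembered, at the cost of letting older duplicates through again.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Iterable, Iterator, Optional, Tuple

# Hypothetical record layout: (subject, predicate, object, timestamp)
Record = Tuple[str, str, str, int]


def dedup_stream(
    records: Iterable[Record],
    key: Callable[[Record], Hashable] = lambda r: r,
    window: Optional[int] = None,
) -> Iterator[Record]:
    """Yield records whose key has not been seen (recently).

    key    -- what "the same triple" means, e.g. r[:3] ignores the timestamp
    window -- max number of keys kept in memory; None means unbounded,
              which is exactly the memory problem described above
    """
    seen: "OrderedDict[Hashable, None]" = OrderedDict()
    for record in records:
        k = key(record)
        if k in seen:
            seen.move_to_end(k)  # mark as recently seen, drop the duplicate
            continue
        seen[k] = None
        if window is not None and len(seen) > window:
            seen.popitem(last=False)  # forget the least recently seen key
        yield record


if __name__ == "__main__":
    stream = [
        ("ex:s", "ex:p", "ex:o", 1),
        ("ex:s", "ex:p", "ex:o", 2),   # same triple, later timestamp
        ("ex:s", "ex:p", "ex:o2", 3),
        ("ex:s", "ex:p", "ex:o", 4),
    ]
    # Timestamp ignored: records 2 and 4 are dropped as duplicates.
    print(list(dedup_stream(stream, key=lambda r: r[:3])))
    # Timestamp included in the key: nothing is a duplicate.
    print(list(dedup_stream(stream)))
```

With `window=1` and the timestamp ignored, the last record gets through again because its key was already forgotten, which is the price of bounding memory.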
Hi all! For me, removing duplicates is not mandatory, but it is a very nice-to-have feature. About Linked Data quality, the article Quality Assessment for Linked Open Data: A Survey penalizes extensional conciseness if there are duplicated triples (if I understood correctly). Best, |
Indeed, I totally agree with the points about the Streamer and PROV+RMLMapper. I think that the timestamp in streaming data will definitely play a relevant role in the elimination of duplicates. I was asking more focused on the |
Hello all, I would be interested in this feature. I had a case with large files (~120k rows, not so extreme though; maybe only the Streamer is suitable then) where the generated knowledge graph was 900 MB, and after merging it with the TBox using Rdflib and the OWL API, which remove duplicates directly, it went down to 4 MB. Duplicates do not affect the result, indeed, but the mapping took almost 15 min and could probably have been much faster if duplicates had been ignored. Have a good day, Stéphane |
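For anyone who only needs the smaller output and not the faster mapping, a post-processing pass over the serialized triples is a possible workaround. The sketch below uses only the Python standard library and assumes the output is N-Triples with one statement per line; `unique_ntriples` is a hypothetical helper, not part of any RML tool. It compares lines as text, so two lines that encode the same triple with different whitespace or escaping would both be kept, and it still holds every distinct line in memory.

```python
from typing import Iterable, Iterator


def unique_ntriples(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct N-Triples statement once, in first-seen order.

    Blank lines and comment lines (starting with '#') are skipped.
    Comparison is purely textual (after stripping surrounding whitespace).
    """
    seen = set()
    for line in lines:
        statement = line.strip()
        if not statement or statement.startswith("#"):
            continue
        if statement not in seen:
            seen.add(statement)
            yield statement


if __name__ == "__main__":
    sample = [
        '<http://example.org/a> <http://example.org/p> "1" .\n',
        '<http://example.org/a> <http://example.org/p> "1" .\n',
        "# a comment\n",
        '<http://example.org/b> <http://example.org/p> "2" .\n',
    ]
    for statement in unique_ntriples(sample):
        print(statement)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the input file is read line by line rather than loaded whole.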
Hi!
Is there any way to tell the engine not to generate duplicates during the construction of the KG?