
Add file-based reconciliation service hangs #5

Open
ghost opened this issue Jun 20, 2013 · 17 comments

@ghost

ghost commented Jun 20, 2013

When trying to add a reconciliation service from a local RDF file (Turtle syntax), the system opens the import pane (spinning wheel), but it never gets past this window.

@ghost ghost assigned sparkica Jun 20, 2013
@sparkica
Owner

I'll look into it ASAP.

@erajabi

erajabi commented Jul 19, 2013

I have the same issue with the file-based reconciliation service. I cannot add an .nt file!

@sparkica
Owner

I'm working on it. Updating some libraries in the rdf extension caused all this havoc.

@sparkica
Owner

@erajabi can you please provide more details and your .nt file (or a part of it)? I was able to add Locations from the NY Times data as a reconciliation service.
What is the size of the file you're using, and how long did you wait?

@erajabi

erajabi commented Jul 22, 2013

I provided a list of countries, which I got from DBpedia as N-Triples or RDF/XML. I added the file to LODRefine, and it hangs at the "Adding new reconciliation service" status. You can find the example in DBpedia by running this query. I exported the file as N-Triples or RDF/XML, about 388 KB in size.

@sparkica
Owner

@erajabi Unfortunately, DBpedia query results are not formatted in a way that is useful for file-based reconciliation. Please take a look at one of the datasets from the NY Times and compare it to the RDF/XML output of the query you tried to use; you'll see the difference in the structure of the file. While the NY Times data can be imported, LODRefine cannot parse the DBpedia result.
Now, if you want to reconcile against DBpedia, I suggest you use the DBpedia reconciliation service: register http://dbpedia.org/sparql as a SPARQL endpoint.
If you want to use some other dataset for file-based reconciliation, make sure it is in an appropriate format (file structure), or first import the (tabular) data into LODRefine, define a schema, and export the data as RDF (Turtle or RDF/XML). Such a dataset will be formatted in the right way for file-based reconciliation.
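
As a rough illustration only (the URIs below are made up, not taken from any of these datasets), a suitable file is essentially a flat list of resources, each described with its own label, e.g. in Turtle:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/country/afghanistan> rdfs:label "Afghanistan"@en .
<http://example.org/country/albania> rdfs:label "Albania"@en .
<http://example.org/country/austria> rdfs:label "Austria"@en .

Additional properties per resource don't hurt; the important part is that each resource you want to reconcile against has its own description with a label.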

@paulzh When using large datasets for file-based reconciliation it can take a very long time before the file is indexed (hence the spinning wheel), mostly due to the Jena library this extension relies on. I updated the library and tried to speed up the import as much as possible, but some performance issues might still remain. Please note that if you have large datasets you want to reconcile against, it might be better to install the open-source version of Virtuoso and set up a SPARQL endpoint for your data. Refine itself is an awesome tool, but it has its limitations, and so do extensions.

I'm closing this issue.

@erajabi

erajabi commented Jul 23, 2013

I validated the NT file at rdfabout.com and it is VALID. You can simply copy and paste the following triples, validate them, and then add the file to LODRefine.
Even this simple file takes time and hangs! I wonder how it works for you!
Besides, as I mentioned before, the DBpedia reconciliation service is too slow. It also doesn't detect the data well; it doesn't suggest dbo:Country to me. That's why I chose the "add file" approach instead. The "against type..." option is also too slow...

<http://data.kasabi.com/dataset/european-election-results/def/ElectionResult> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://data.kasabi.com/dataset/european-election-results/def/ElectionResult> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.org/linked-data/cube#Observation> .
<http://data.kasabi.com/dataset/european-election-results/def/ElectionResult> <http://www.w3.org/2000/01/rdf-schema#label> "Election Result" .
<http://data.kasabi.com/dataset/european-election-results/def/Country> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://data.kasabi.com/dataset/european-election-results/def/Country> <http://www.w3.org/2000/01/rdf-schema#label> "Country"@en .
<http://data.kasabi.com/dataset/european-election-results/def/Country> <http://www.w3.org/2000/01/rdf-schema#comment> "A country"@en .
<http://data.kasabi.com/dataset/european-election-results/def/PoliticalGroup> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://data.kasabi.com/dataset/european-election-results/def/PoliticalGroup> <http://www.w3.org/2000/01/rdf-schema#label> "Political Group"@en .

@sparkica
Owner

First, let me make this clear: you are absolutely right that the file is valid, but its structure is not suitable for import.
This is an excerpt from the DBpedia RDF/XML output (notice the <res:solution> tags):

<rdf:RDF xmlns:res="http://www.w3.org/2005/sparql-results#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:nodeID="rset">
<rdf:type rdf:resource="http://www.w3.org/2005/sparql-results#ResultSet" />
    <res:resultVariable>country</res:resultVariable>
    <res:resultVariable>name</res:resultVariable>
    <res:resultVariable>label</res:resultVariable>
    <res:solution rdf:nodeID="r0">
      <res:binding rdf:nodeID="r0c0"><res:variable>country</res:variable><res:value rdf:resource="http://dbpedia.org/resource/Alamannia"/></res:binding>
      <res:binding rdf:nodeID="r0c1"><res:variable>name</res:variable><res:value xml:lang="en">Alamannia</res:value></res:binding>
    </res:solution>
    <res:solution rdf:nodeID="r1">
      <res:binding rdf:nodeID="r1c0"><res:variable>country</res:variable><res:value rdf:resource="http://dbpedia.org/resource/Alamannia"/></res:binding>
      <res:binding rdf:nodeID="r1c1"><res:variable>name</res:variable><res:value xml:lang="en">Alamannia</res:value></res:binding>
    </res:solution>

And this is how it should be formatted for import (notice the <rdf:Description> tags):

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:blah="http://example.com/blah#">

<rdf:Description rdf:about="http://data.nytimes.com/N7749429140577003771">
    <rdfs:label>Afghanistan</rdfs:label>
    <blah:test rdf:resource="http://localhost:3333/capital_city/Tahiti"/>
</rdf:Description>

<rdf:Description rdf:about="http://data.nytimes.com/66220885864277068001">
    <rdfs:label>Albania</rdfs:label>
    <blah:test rdf:resource="http://localhost:3333/capital_city/Tirana"/>
</rdf:Description>
</rdf:RDF>

The triples you pasted (from Kasabi) are from another dataset, so now I'm not quite sure what you're trying to achieve. If you explain your case in more detail, maybe I can provide some suggestions on what to do and how to do it.

About reconciliation time: it depends on the service and the amount of data you're trying to reconcile. It can take anywhere from a few minutes to half an hour or even more. Usually, this has little to do with LODRefine (or the rdf-extension).

@sparkica sparkica reopened this Jul 23, 2013
@erajabi

erajabi commented Jul 23, 2013

It is clear that the RDF file I sent is not from DBpedia. It was a sample, valid RDF file that nonetheless takes a long time to add. I just wanted to check whether adding a file works correctly at all; as I mentioned, it hangs.
I have only 5500 records; is that a lot? Some of them are numbers and some are words (strings). In particular, I want to know how many items of this collection can be interlinked with DBpedia. Just that.
First I tried the DBpedia SPARQL service, which took a lot of time. Then I changed strategy to adding a file, which didn't work. I set the specific type "dbpedia.org/ontology/Country", which didn't work either. Maybe I misunderstood and this tool is not the right fit for my collection (a collection in which some items are countries or places).
I hope it is clear now.

@sparkica
Owner

Thank you for the details. From what you wrote, LODRefine should be able to handle your data.
In my experience, 5500 records can be quite a load for a reconciliation service, and sometimes DBpedia queries also time out. It would be great if you could share the Refine project file with me so I can test with your actual data.

Some of the issues you're experiencing are probably due to the outdated LODRefine Windows binary on SourceForge. I plan to update it in the next few days.

@erajabi

erajabi commented Jul 23, 2013

Thanks in advance.
This is the project. Please let me know whether you were able to do the linking.

@sparkica
Owner

@erajabi Requests to DBpedia have been timing out (or taking several minutes to return results) if the reconciliation type was anything but owl:Thing. I tested the reconciliation queries (used in reconciliation requests) in the DBpedia web interface and got the same performance issues as in LODRefine. The funny thing is that these same requests worked just fine in the past. Removing a small part of the SPARQL query improved performance for types other than owl:Thing with DBpedia.

I was able to reconcile your dataset. I did some cleaning first, and after that I reconciled the data with DBpedia twice: once with the type set to http://dbpedia.org/ontology/Place and once with the type set to http://dbpedia.org/ontology/Country. The cleaned and reconciled dataset is available here. If you have any questions about the reconciliation or the resulting dataset, please email me; I'll be glad to help.

@erajabi

erajabi commented Jul 26, 2013

@sparkica: Thanks for your time and effort. I find this tool very useful. I was able to import the reconciled dataset and am now going through the data. Regarding the timeout issue, could you please tell me when I can get the new version? Did you resolve the timeout issue? How did you reconcile the data against DBpedia: did DBpedia propose dbo:Country or dbo:Place to you, or did you specify the type explicitly? It would be great if you could clarify these points, as I want to test other data as well and want to do the reconciliation by myself. You absolutely helped with this issue; I really appreciate it.

@sparkica
Owner

@erajabi The code here has been updated. If you have a JDK (javac) installed, you can clone the repository, build it, and you should be able to run it with refine.bat. You'll have to wait a little bit longer for a new Windows binary, as I plan to fix some more issues before building it.

@erajabi

erajabi commented Aug 27, 2013

Unfortunately, after adding a simple file in 7.0.1 (in the file-based reconciliation section) containing the following sample rows:
http://agencies.publicdata.eu/r/country/France http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://agencies.publicdata.eu/ontology/Country .
http://agencies.publicdata.eu/r/country/France http://purl.org/dc/terms/title """France"""@en .
http://agencies.publicdata.eu/r/country/France http://purl.org/dc/terms/title """Francie"""@cs .

....
It hangs on "adding reconciliation service" and takes a long time without any result. It seems to me that adding an RDF file to the system should be very straightforward regardless of the data; it is the reconciliation step that may take time depending on the data... am I wrong?

@sparkica
Owner

The rdf-extension uses the Jena library to read RDF files (which are later used for reconciliation), so the file has to be in one of the RDF formats Jena understands. Your sample rows make Jena throw a SAXParseException: Content is not allowed in prolog.

I updated it to proper Turtle (see below) and saved it as test.ttl:

<http://agencies.publicdata.eu/r/country/France> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://agencies.publicdata.eu/ontology/Country> .
<http://agencies.publicdata.eu/r/country/France> <http://purl.org/dc/terms/title> """France"""@en .
<http://agencies.publicdata.eu/r/country/France> <http://purl.org/dc/terms/title> """Francie"""@cs .

I was able to import test.ttl and use it to reconcile from file.

@erajabi

erajabi commented Oct 10, 2013

In my opinion, if it uses Jena it should be able to read either NT or TTL files. I am again testing a large N-Triples file (around 500 MB) and it seems to take a looong time. Could we have a status bar while the file is being read (a percentage read, or something like that)? I also think that if the file cannot be read, the tool should inform the user. Right?
