CollateX refuses Json input #76
Comments
Hi Hayim, I am able to reproduce the error. I am not yet sure what is causing it. The algorithm detects a transposition, but then something unexpected happens while that transposition is processed. |
During the alignment the algorithm traverses the graph. It turns out that not all the nodes are visited. The graph contains 29 nodes (excluding the start and end vertices) and only 20 (including the start vertex) are visited. The question now becomes why that is the case. |
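A generic way to see which nodes a traversal misses is sketched below. This is not CollateX's traversal code: it is a plain Kahn-style topological walk over a hypothetical `edges` dictionary, and the names `traverse` and `unvisited` are made up for illustration. One common reason such a walk stops short is a cycle, because a node on the cycle never reaches in-degree zero.

```python
from collections import deque

def traverse(edges, start):
    """Kahn-style topological walk from `start`.

    `edges` maps each node to a list of its successors (hypothetical format).
    Returns the nodes in the order they were visited.
    """
    # Count incoming edges for every node that appears in the graph.
    indegree = {}
    for node, successors in edges.items():
        indegree.setdefault(node, 0)
        for succ in successors:
            indegree[succ] = indegree.get(succ, 0) + 1

    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in edges.get(node, []):
            indegree[succ] -= 1
            # A node is only visited once all of its predecessors were visited.
            if indegree[succ] == 0:
                queue.append(succ)
    return order

def unvisited(edges, start):
    """Return the set of nodes the walk never reached."""
    all_nodes = set(edges) | {s for succs in edges.values() for s in succs}
    return all_nodes - set(traverse(edges, start))
```

Comparing `len(traverse(...))` with the total node count gives the same kind of "20 of 29" figure reported above, and `unvisited(...)` names the nodes that were skipped.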
Thank you for looking at this. It is very vexing b/c it appears unpredictable. |
I replaced the graph traversal algorithm with a well-known, tried and tested algorithm, and it did not change the result. With this specific dataset, for some reason, not the whole graph is traversed. So I will need to look into it further. |
Thank you.
I have reproduced this with other datasets. If you want another working
dataset for comparison please let me know.
FWIW, I have checked for hidden control characters etc., but could not find
any.
|
Still investigating. Do you have a dataset in a Roman-script language that triggers this bug, by any chance? I understand that this may be a weird request, but I have a hard time figuring out which tokens should be aligned or transposed because I can't read the Hebrew text. Right now the algorithm states that 004-P179204:16:'השנ' and 004-P179204:17:'הרת' are transposed compared to the previous witnesses. Does that sound plausible to you? |
I don't have examples that generate the error with a roman font, and since
we don't actually know what's causing it I'm not sure how to generate one.
I *could* generate 1:1 character correspondences to a Roman character set,
and see if this triggers the same error.
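A minimal sketch of such a 1:1 character mapping, assuming the tokens are available as plain strings. The helper names `build_mapping` and `transliterate` are hypothetical. The mapping is arbitrary rather than phonetic; it only guarantees that equal tokens stay equal and different tokens stay different.

```python
import string

def build_mapping(tokens):
    """Assign each distinct character in `tokens` a unique ASCII letter."""
    alphabet = string.ascii_lowercase + string.ascii_uppercase
    chars = sorted({ch for tok in tokens for ch in tok})
    if len(chars) > len(alphabet):
        raise ValueError("more distinct characters than replacement letters")
    # zip pairs each source character with exactly one letter, so the map is 1:1.
    return str.maketrans(dict(zip(chars, alphabet)))

def transliterate(tokens, table):
    """Apply the character table to every token."""
    return [tok.translate(table) for tok in tokens]

# Usage: build the table once over all witnesses, then reuse it for each one,
# so the same source character always becomes the same Roman letter.
sample = ["שלא", "בקדושה", "ולידתו", "בקדושה"]
table = build_mapping(sample)
roman = transliterate(sample, table)
```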
As for the transposition, no, it does not quite make sense. But S01520 has
what is likely missing text (homoioteleuton) at this point, and switching
the order of S01520 and P179204 IN THE JSON (this is the sole difference
between working.txt and nonworking.txt) triggers the error. Thus (cells in
LTR order):
![image](https://user-images.githubusercontent.com/1069517/98618019-46b7f380-22ce-11eb-8e90-f3abfb5cf6bc.png)
12 13 14 15 16 17 18 19 20 21
S00483 … שלא בקדושה ולידתו בקדושה והשיני הורתו ולידתו בקדושה וכן
S07326 … שלא בקדושה ולידתו בקדושה והשיני הרתו ולדתו בקדושה וכן
P179204 … ראשון שלא בקדושה ולידתו בקדושה והשני הורתו ולידתו בקדושה וכן
S01520 … ראשון שלא בקדושה ולידתו בקדושה וכן
[In case it matters: the prefixes 001-, 002- etc. in the JSON are there to
force CollateX to return responses in query order, not in alphabetical order; I could
handle that in post-processing instead. nonworking.txt swaps the position in the
JSON of 003- and 004- without changing the IDs.]
|
Thanks for pointing out that S01520 has what is likely missing text and that transpositions are not expected. That actually gives me a huge hint and a new direction to look into the issue. |
A short update. I got a bit further in identifying the problem. The algorithm consists of several steps: 1. find an optimal set of matches -> 2. identify transpositions -> 3. mark transpositions in the graph -> 4. graph traversal -> crash. At first I started looking at step 4, but that is not the cause of the crash. Then I turned my attention to step 2. If step 2 ignores a transposition, it causes a cycle in the graph, causing the traversal to fail. I thought that might be the problem. But after your previous post indicating that there is a gap in one of the witnesses and no transpositions, I realised that the problem is rather that too many transpositions are found. I checked that piece of code multiple times and could not find a mistake. Then I realised that the problem might actually be in step 1. Each token of a witness should align with a unique vertex in the graph. It turns out that there is a bug somewhere in the code of step 1 that causes multiple tokens of the witness to be aligned with one and the same vertex. That should not happen, but somehow it does, causing steps 2, 3 and 4 to fail. |
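The invariant described for step 1 (each token of a witness aligns with a unique vertex) can be checked with a few lines. This is only a sketch under assumptions: the `alignment` dictionary, the vertex ids and the function name `find_shared_vertices` are hypothetical and do not reflect CollateX's internal data structures.

```python
from collections import defaultdict

def find_shared_vertices(alignment):
    """Find vertices that more than one token of the same witness maps to.

    `alignment` maps a token position in one witness to a vertex id
    (hypothetical representation). Returns {vertex: [token positions]}
    for every vertex claimed by two or more tokens; an empty dict means
    the invariant holds.
    """
    by_vertex = defaultdict(list)
    for token, vertex in alignment.items():
        by_vertex[vertex].append(token)
    return {v: sorted(ts) for v, ts in by_vertex.items() if len(ts) > 1}

# Usage: tokens 14 and 16 both land on vertex "v3", which violates the invariant.
violations = find_shared_vertices({14: "v3", 16: "v3", 17: "v5"})
```

Running a check like this right after the matching step would flag the problem before transposition detection and traversal see the inconsistent alignment.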
Thanks so much for the update! |
Not sure if this repo is being maintained.
Possibly a version of #44
JSON of tokenized witnesses in order A (working.txt) works; in order B (nonworking.txt) CollateX returns an error.
Hand-editing nonworking.txt so that the witnesses and array of tokens are in the same order returns an alignment.
Sending data to CollateX via REST.
nonworking.txt
working.txt
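A possible client-side workaround, sketched under assumptions: the payload is taken to be CollateX's tokenized JSON input with a top-level `witnesses` array whose entries carry an `id` and a `tokens` list, and the error is taken to depend only on the position of the witnesses in that array. The function name `normalize_witness_order`, the first two witness ids and the token values are illustrative. Sorting by id before sending reproduces the hand edit described above; it may sidestep the error but does not fix the underlying bug.

```python
import json

def normalize_witness_order(payload):
    """Return a copy of `payload` with its witnesses sorted by id.

    The token order inside each witness is left untouched.
    """
    ordered = sorted(payload["witnesses"], key=lambda w: w["id"])
    return {**payload, "witnesses": ordered}

# Usage: 003- and 004- are swapped, as in nonworking.txt (token values are placeholders).
nonworking = {
    "witnesses": [
        {"id": "001-S00483", "tokens": [{"t": "a"}]},
        {"id": "002-S07326", "tokens": [{"t": "a"}]},
        {"id": "004-P179204", "tokens": [{"t": "a"}]},
        {"id": "003-S01520", "tokens": [{"t": "a"}]},
    ]
}
body = json.dumps(normalize_witness_order(nonworking), ensure_ascii=False)
```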