CollateX refuses Json input #76

hlapin · 2020-10-20T15:57:31Z

Not sure if this repo is being maintained.
Possibly a version of #44.
JSON of tokenized witnesses in order A (working.txt) works; in order B (nonworking.txt) CollateX returns an error.
Hand-editing nonworking.txt so that the witnesses and arrays of tokens are in the same order returns an alignment.
I am sending the data to CollateX via REST.

nonworking.txt
working.txt
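
To make the setup concrete, here is a minimal sketch of how two such requests could be built and sent. It assumes the CollateX JSON input shape (a `witnesses` list, each entry with an `id` and a list of `tokens` carrying `t` and `n`) and a server listening at `http://localhost:7369/collate`; the URL, the witness ids, the token values, and the helper names are all placeholders for illustration, not the contents of the attached files.

```python
import json
import urllib.request

COLLATE_URL = "http://localhost:7369/collate"  # assumed local CollateX server


def make_witness(wid, words):
    """Build one pre-tokenized witness in the CollateX JSON input shape."""
    return {"id": wid, "tokens": [{"t": w, "n": w} for w in words]}


def swap_witnesses(payload, i, j):
    """Return a copy of the payload with witnesses i and j swapped; ids stay unchanged."""
    witnesses = list(payload["witnesses"])
    witnesses[i], witnesses[j] = witnesses[j], witnesses[i]
    return {**payload, "witnesses": witnesses}


def collate(payload, url=COLLATE_URL):
    """POST the payload to the (assumed) REST endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder data; the real witnesses are in working.txt / nonworking.txt.
order_a = {"witnesses": [
    make_witness("001-W1", ["a", "b", "c"]),
    make_witness("002-W2", ["a", "b", "c"]),
    make_witness("003-W3", ["a", "c"]),
    make_witness("004-W4", ["a", "b", "c"]),
]}
# Same witnesses and ids, but the third and fourth entries trade places in the JSON.
order_b = swap_witnesses(order_a, 2, 3)
```

Calling `collate(order_a)` and `collate(order_b)` against a running server would then compare the two orderings; nothing is sent until `collate` is called.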

rhdekker · 2020-11-04T16:34:22Z

Hi Hayim, I am able to reproduce the error. I am not yet sure what is causing it. The algorithm detects a transposition, but then something unexpected happens in the processing of the transposition.

rhdekker · 2020-11-04T19:27:15Z

During the alignment the algorithm traverses the graph. It turns out that not all the nodes are visited. The graph contains 29 nodes (excluding the start and end vertices) and only 20 (including the start vertex) are visited. The question now becomes why that is the case.
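
One way to observe this kind of symptom is to compare the set of visited vertices with the full vertex set. The sketch below is an illustration only, not CollateX's traversal code: it assumes the graph is available as a plain adjacency dict (a hypothetical representation) and uses Kahn's topological traversal, in which any vertex on or behind a cycle never reaches in-degree zero and is therefore never visited.

```python
from collections import deque


def unvisited_vertices(edges):
    """Traverse in topological order (Kahn's algorithm); return the vertices never reached."""
    vertices = set(edges) | {t for targets in edges.values() for t in targets}
    indegree = {v: 0 for v in vertices}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    # Start from every vertex without incoming edges.
    queue = deque(sorted(v for v in vertices if indegree[v] == 0))
    visited = []
    while queue:
        v = queue.popleft()
        visited.append(v)
        for t in edges.get(v, ()):
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    return vertices - set(visited)


# Hypothetical toy graphs: the second one contains the cycle b -> c -> b.
acyclic = {"start": ["a"], "a": ["b"], "b": ["end"]}
cyclic = {"start": ["a"], "a": ["b"], "b": ["c"], "c": ["b", "end"]}
```

An empty result means the whole graph was traversed; a non-empty result lists the vertices that the traversal could not reach.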

hlapin · 2020-11-04T20:29:22Z

Thank you for looking at this. It is very vexing b/c it appears unpredictable.

rhdekker · 2020-11-07T22:13:41Z

I replaced the graph traversal algorithm with a well-known, tried and tested algorithm and it did not change the result. With this specific dataset, for some reason, not the whole graph is traversed. So I will need to look into it further.

hlapin · 2020-11-07T23:13:26Z

Thank you. I have reproduced this with other datasets. If you want another working dataset for comparison please let me know. FWIW, I have checked for hidden control characters etc., but could not find any.

rhdekker · 2020-11-09T23:08:02Z

Still investigating. Do you, by any chance, have a dataset in a language written in Roman script that triggers this bug? I understand that this may be a weird request, but I have a hard time figuring out which tokens should be aligned or transposed because I can't read the Hebrew text. Right now the algorithm states that 004-P179204:16:'השנ' and 004-P179204:17:'הרת' are transposed compared to the previous witnesses. Does that sound plausible to you?

hlapin · 2020-11-10T01:00:58Z

I don't have examples in Roman script that generate the error, and since we don't actually know what's causing it I'm not sure how to generate one. I *could* generate 1:1 character correspondences to a Roman character set and see whether this triggers the same error.

As for the transposition: no, it does not quite make sense. But S01520 has what is likely missing text (homoioteleuton) at this point, and switching the order of S01520 and P179204 IN THE JSON (this is the sole difference between working.txt and nonworking.txt) triggers the error. Thus (cells in LTR order):

![image](https://user-images.githubusercontent.com/1069517/98618019-46b7f380-22ce-11eb-8e90-f3abfb5cf6bc.png)

12 13 14 15 16 17 18 19 20 21
S00483 … שלא בקדושה ולידתו בקדושה והשיני הורתו ולידתו בקדושה וכן
S07326 … שלא בקדושה ולידתו בקדושה והשיני הרתו ולדתו בקדושה וכן
P179204 … ראשון שלא בקדושה ולידתו בקדושה והשני הורתו ולידתו בקדושה וכן
S01520 … ראשון שלא בקדושה ולידתו בקדושה וכן

[In case it matters: the prefixes 001-, 002- etc. in the JSON are there to force CollateX to return responses in query order, not in alpha order; I could handle that in post-processing. nonworking.txt swaps the position in the JSON of 003- and 004- without changing the IDs.]
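
The 1:1 character correspondence mentioned above could be sketched as follows. The mapping is an arbitrary, reversible stand-in (each Hebrew letter in U+05D0 to U+05EA gets one ASCII letter), not a scholarly transliteration, and the function names are hypothetical.

```python
import string

# Hebrew letters alef..tav, including final forms: U+05D0..U+05EA (27 code points).
HEBREW = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]
# Arbitrary stand-ins: the 26 lowercase ASCII letters plus "A" for the 27th letter.
LATIN = list(string.ascii_lowercase + "A")

TO_LATIN = str.maketrans(dict(zip(HEBREW, LATIN)))
TO_HEBREW = str.maketrans(dict(zip(LATIN, HEBREW)))


def romanize(text):
    """Replace each Hebrew letter with its stand-in; other characters pass through."""
    return text.translate(TO_LATIN)


def restore(text):
    """Invert romanize(); only valid for text that contained no ASCII letters originally."""
    return text.translate(TO_HEBREW)
```

Applying `romanize` to every `t` and `n` value in the JSON would give a dataset with the same token boundaries and the same equalities between tokens, which is what matters for checking whether the error still occurs.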

rhdekker · 2020-11-10T21:11:55Z

Thanks for pointing out that S01520 has what is likely missing text and that transpositions are not expected. That actually gives me a huge hint and a new direction in which to look into the issue.

rhdekker · 2020-11-12T22:11:58Z

A short update: I got a bit further in identifying the problem. The algorithm consists of several steps:

1. Find an optimal set of matches.
2. Identify transpositions.
3. Mark transpositions in the graph.
4. Traverse the graph (this is where it crashes).

At first I started looking at step 4, but that is not the cause of the crash. Then I turned my attention to step 2. If step 2 ignores a transposition, it causes a cycle in the graph, which makes the traversal fail. I thought that might be the problem. But after your previous post indicating that there is a gap in one of the witnesses and no transpositions, I realised that the problem is rather that too many transpositions are found. I checked that piece of code multiple times and could not find a mistake. Then I realised that the problem might actually be in step 1. Each token of a witness should align with a unique vertex in the graph. It turns out that there is a bug somewhere in the code of step 1 that causes multiple tokens of the witness to be aligned with one and the same vertex. That should not happen, but somehow it does, causing steps 2, 3 and 4 to fail.
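
The invariant described above (each token of a witness aligns with a unique vertex) can be written as a small check. This is a sketch under the assumption that the matches from step 1 are available as a mapping from witness token to graph vertex; that data shape, the vertex labels, and the function name are hypothetical and do not reflect CollateX's internal classes.

```python
from collections import defaultdict


def vertices_with_multiple_tokens(alignment):
    """Given {witness_token: vertex}, return the vertices that more than one token maps to."""
    tokens_by_vertex = defaultdict(list)
    for token, vertex in alignment.items():
        tokens_by_vertex[vertex].append(token)
    return {v: toks for v, toks in tokens_by_vertex.items() if len(toks) > 1}


# Hypothetical matches for one witness: tokens 16 and 20 both claim vertex "v7".
alignment = {
    ("W4", 16): "v7",
    ("W4", 17): "v8",
    ("W4", 20): "v7",
}
violations = vertices_with_multiple_tokens(alignment)
```

A non-empty `violations` dict would flag the situation described in the comment before transposition detection and traversal are attempted.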

hlapin · 2020-11-13T00:40:07Z

Thanks so much for the update!
For the time being I have the workaround of using the option "algorithm":"needleman-wunsch". In fact, since I am using JSON tabular output rather than graph output, I am not at present actually getting the benefit of detected transpositions.
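
As a rough sketch of that workaround, the option can be added as a top-level key of the JSON request body. The witness data and the helper name below are placeholders; only the `"algorithm": "needleman-wunsch"` pair comes from the comment above.

```python
import json


def with_algorithm(payload, algorithm="needleman-wunsch"):
    """Return a copy of a CollateX JSON request with the "algorithm" option set."""
    return {**payload, "algorithm": algorithm}


# Placeholder witnesses, for illustration only.
request_body = with_algorithm({
    "witnesses": [
        {"id": "001-W1", "tokens": [{"t": "a"}, {"t": "b"}]},
        {"id": "002-W2", "tokens": [{"t": "b"}, {"t": "a"}]},
    ]
})
print(json.dumps(request_body, ensure_ascii=False, indent=2))
```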

hlapin closed this as completed Oct 26, 2020

hlapin reopened this Oct 26, 2020
