CollateX refuses Json input #76

Open
hlapin opened this issue Oct 20, 2020 · 10 comments
@hlapin

hlapin commented Oct 20, 2020

Not sure if this repo is being maintained.
Possibly a version of #44
JSON of tokenized witnesses in order A (working.txt) works; in order B (nonworking.txt) CollateX returns an error.
Hand-editing nonworking.txt so that the witnesses and their arrays of tokens are in the same order as in working.txt returns an alignment.
The data is sent to CollateX via REST.

nonworking.txt
working.txt
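
For readers without the attachments, here is a minimal sketch of the kind of request described above. It is not the reporter's data or script: the witnesses are toy English tokens, the helper names `build_payload` and `build_request` are hypothetical, and the URL assumes a CollateX server running locally on its default port. The payload shape (a `witnesses` array whose entries carry an `id` and a `tokens` array of objects with a `t` key) follows the CollateX JSON input format as I understand it.

```python
import json
import urllib.request

# Assumed URL of a locally running CollateX server; adjust to your deployment.
COLLATEX_URL = "http://localhost:7369/collate"


def build_payload(witnesses):
    """Build pre-tokenized JSON input.

    witnesses: list of (sigil, [token strings]) pairs, kept in the given order.
    """
    return {
        "witnesses": [
            {"id": sigil, "tokens": [{"t": tok} for tok in tokens]}
            for sigil, tokens in witnesses
        ]
    }


def build_request(payload, url=COLLATEX_URL):
    """Prepare (but do not send) a POST request that asks for JSON output."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Hypothetical toy witnesses, submitted in two different orders.
order_a = [("W1", ["the", "black", "cat"]), ("W2", ["the", "cat", "black"])]
order_b = list(reversed(order_a))

payload_a = build_payload(order_a)
payload_b = build_payload(order_b)
request_a = build_request(payload_a)
# To actually send it: urllib.request.urlopen(request_a).read()
```

Building both orders from the same source list makes it easy to submit them one after the other and compare the responses.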

@hlapin hlapin closed this as completed Oct 26, 2020
@hlapin hlapin reopened this Oct 26, 2020
@rhdekker
Member

rhdekker commented Nov 4, 2020

Hi Hayim, I am able to reproduce the error. I am not yet sure what is causing it. The algorithm detects a transposition, but then in the processing of the transposition something unexpected happens.

@rhdekker
Member

rhdekker commented Nov 4, 2020

During the alignment the algorithm traverses the graph. It turns out that not all the nodes are visited. The graph contains 29 nodes (excluding the start and end vertices) and only 20 (including the start vertex) are visited. The question now becomes why that is the case.
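
A check of this kind (how many vertices a traversal actually reaches) can be sketched as follows. This is not CollateX's code: the graph is a toy adjacency dict and `topological_visit` is a hypothetical name. The sketch uses Kahn's algorithm, in which any vertex on or behind a cycle never reaches in-degree zero and is therefore never visited. That is one possible way for a traversal to stop short, not a claim about the cause here.

```python
from collections import deque


def topological_visit(graph):
    """Traverse a directed graph given as {vertex: [successors]}.

    Returns (visited_order, unvisited_vertices).
    """
    # Count incoming edges for every vertex.
    indegree = {v: 0 for v in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] = indegree.get(s, 0) + 1

    # Start from vertices with no incoming edges.
    queue = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for s in graph.get(v, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)

    unvisited = sorted(set(indegree) - set(order))
    return order, unvisited


# Toy graphs: the second one has a cycle b -> c -> b.
acyclic = {"start": ["a"], "a": ["b"], "b": ["end"], "end": []}
cyclic = {"start": ["a"], "a": ["b"], "b": ["c"], "c": ["b", "end"], "end": []}

order_ok, missing_ok = topological_visit(acyclic)
order_bad, missing_bad = topological_visit(cyclic)
print(len(order_ok), missing_ok)
print(len(order_bad), missing_bad)
```

Comparing the length of the visited order with the total vertex count gives the same kind of "visited versus total" figure quoted above.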

@hlapin
Author

hlapin commented Nov 4, 2020

Thank you for looking at this. It is very vexing b/c it appears unpredictable.

@rhdekker
Member

rhdekker commented Nov 7, 2020

I replaced the graph traversal algorithm with a well-known, tried and tested algorithm and it did not change the result. With this specific dataset, for some reason, not the whole graph is traversed. So I will need to look into it further.

@hlapin
Author

hlapin commented Nov 7, 2020 via email

@rhdekker
Member

rhdekker commented Nov 9, 2020

Still investigating. Do you have a dataset that triggers this bug in a language written in Roman script, by any chance? I understand that this may be a weird request, but I have a hard time figuring out what tokens should be aligned or transposed because I can't read the Hebrew text. Right now the algorithm states that 004-P179204:16:'השנ' and 004-P179204:17:'הרת' are transposed compared to the previous witnesses. Does that sound plausible to you?

@hlapin
Author

hlapin commented Nov 10, 2020 via email

@rhdekker
Member

Thanks for pointing out that S01520 has what is likely missing text and that transpositions are not expected. That actually gives me a huge hint and a new direction to look into the issue.

@rhdekker
Member

rhdekker commented Nov 12, 2020

A short update. I got a bit further in identifying the problem. The algorithm consists of several steps: 1. find an optimal set of matches -> 2. identify transpositions -> 3. mark transpositions in the graph -> 4. graph traversal -> crash. At first I looked at step 4, but that is not the cause of the crash. Then I turned my attention to step 2. If step 2 ignores a transposition, it creates a cycle in the graph, which causes the traversal to fail. I thought that might be the problem. But after your previous post indicating that there is a gap in one of the witnesses and no transpositions, I realised that the problem is rather that too many transpositions are found. I checked that piece of code multiple times and could not find a mistake. Then I realised that the problem might actually be in step 1. Each token of a witness should align with a unique vertex in the graph. It turns out that there is a bug somewhere in the code of step 1 that causes multiple tokens of the witness to be aligned with one and the same vertex. That should not happen, but somehow it does, causing steps 2, 3 and 4 to fail.
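
The invariant described for step 1 (each witness token aligns with its own vertex) can be expressed as a small check. This is only an illustrative sketch, not code from CollateX: the `(token_position, vertex_id)` pair representation and the name `find_shared_vertices` are assumptions made for the example.

```python
from collections import defaultdict


def find_shared_vertices(matches):
    """Report vertices that more than one token of a witness is aligned with.

    matches: iterable of (token_position, vertex_id) pairs for one witness.
    Returns {vertex_id: [token_positions]} for every violated vertex.
    """
    by_vertex = defaultdict(list)
    for token_pos, vertex in matches:
        by_vertex[vertex].append(token_pos)
    return {v: toks for v, toks in by_vertex.items() if len(toks) > 1}


# Hypothetical match sets: the second aligns tokens 1 and 2 with the same vertex.
good = [(0, "v1"), (1, "v2"), (2, "v3")]
bad = [(0, "v1"), (1, "v2"), (2, "v2")]

print(find_shared_vertices(good))
print(find_shared_vertices(bad))
```

An empty result means the match set is one-to-one on the witness side. Any entry in the result is a vertex claimed by several tokens, which is the situation the comment above says should never occur.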

@hlapin
Author

hlapin commented Nov 13, 2020

Thanks so much for the update!
For the time being I have the workaround of using the option "algorithm":"needleman-wunsch". In fact, since I am using JSON tabular output rather than graph output, I am not at present actually getting the benefit of detected transpositions.
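
For anyone wanting to try the same workaround, a minimal sketch of adding that option to an existing JSON payload follows. The `"algorithm": "needleman-wunsch"` key and value are the ones quoted above; the helper name `with_algorithm` and the toy witnesses are hypothetical.

```python
import copy
import json


def with_algorithm(payload, algorithm="needleman-wunsch"):
    """Return a copy of a CollateX JSON payload with the algorithm option set."""
    updated = copy.deepcopy(payload)  # leave the caller's payload untouched
    updated["algorithm"] = algorithm
    return updated


# Hypothetical toy payload, not the data from this issue.
payload = {
    "witnesses": [
        {"id": "W1", "tokens": [{"t": "the"}, {"t": "cat"}]},
        {"id": "W2", "tokens": [{"t": "a"}, {"t": "cat"}]},
    ]
}

workaround = with_algorithm(payload)
print(json.dumps(workaround, ensure_ascii=False, indent=2))
```

The witnesses themselves are unchanged; only the top-level option differs from the original request.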
