Better table header handling #14
Comments
Sure. Go for it. I don't really use the table information that much, but this does remind me that I need to make sure that CIViCmine is properly aware of the tables. Remind me, there's some metadata on these passages that indicates that they are tables, right?
yup, the xml_path infon can be used for that
Thanks
I've been using the linearized tables, but one thing I've noticed is that when we have something complex like a multi-level header, just linearizing means the number of cells doesn't always match up. So something like this
example article used: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873663/
Currently gets turned into
And we lose a lot of meaning, not to mention it becomes impossible to match these up properly to the cell text from the body of the table (see below).
So I'd like to try something more complex where we simplify the header into a single row before we linearize, but it would require making the text differ slightly from the original by repeating some words, which I am not sure about. The end result would look like this
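As a rough illustration of the idea (not the actual implementation), here is a minimal sketch of collapsing a multi-level header into a single row. It assumes each header cell has already been parsed into a `(text, colspan, rowspan)` tuple; the function name `flatten_header` and the example header are hypothetical and are not taken from the article above.

```python
def flatten_header(rows):
    """Collapse a multi-row table header into a single row of labels.

    rows: list of header rows, each a list of (text, colspan, rowspan)
    tuples (a hypothetical input format chosen for this sketch).
    """
    n_rows = len(rows)
    grid = {}  # (row, col) -> text of the cell covering that position

    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already covered by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            # Repeat the cell text across every position it spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < n_rows:
                        grid[(r + dr, c + dc)] = text
            c += colspan

    n_cols = max(col for _, col in grid) + 1 if grid else 0

    flat = []
    for c in range(n_cols):
        parts = []
        for r in range(n_rows):
            text = grid.get((r, c), "").strip()
            # Avoid repeating a label that only recurs because of a rowspan.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        flat.append(" ".join(parts))
    return flat


# Made-up two-level header, purely for illustration.
example = [
    [("Gene", 1, 2), ("Cases", 2, 1), ("Controls", 2, 1)],
    [("n", 1, 1), ("%", 1, 1), ("n", 1, 1), ("%", 1, 1)],
]
print(flatten_header(example))
```

With this approach the flattened header has one label per column, so it lines up with the cells in each body row, at the cost of repeating the spanning words ("Cases", "Controls") that appear only once in the original text.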
@jakelever what do you think? I've already been implementing this for my own purposes but would be happy to put up a PR if you like the idea