Proposed Parsing Updates #16

Open
creisle opened this issue Mar 8, 2023 · 4 comments
@creisle
Collaborator

creisle commented Mar 8, 2023

Since I've been going through these in such detail, I've noticed a few cases where the output doesn't look like what I would expect, but I want to clear them with you @jakelever before I make the appropriate changes. I've listed them in the table below.

| Input XML | Proposed Output | Current Output |
| --- | --- | --- |
| `incubator containing 5% CO<sub>2</sub>` | `incubator containing 5% CO2` | `incubator containing 5% CO 2` |
| `10<sup>4</sup>` | `10^4` | `10 4` |
| `especially in <italic>CBL</italic>-W802* cells` | `especially in CBL-W802* cells` | `especially in CBL -W802* cells` |
| `influenced by the presence of allelic variants&#x2014;GSTP1 Ile<sub>105</sub>Val (rs1695) and <italic>GSTP1</italic> Ala<sub>114</sub>Val (rs1138272), with homozygote` | `influenced by the presence of allelic variants--GSTP1 Ile105Val (rs1695) and GSTP1 Ala114Val (rs1138272), with homozygote` | `influenced by the presence of allelic variants—GSTP1 Ile 105 Val (rs1695) and GSTP1 Ala 114 Val (rs1138272), with homozygote` |
| `breast cancer, clear cell renal carcinoma, and colon cancer<xref ref-type="bibr" rid="b6">6</xref><xref ref-type="bibr" rid="b7">7</xref> <xref ref-type="bibr" rid="b8">8</xref> <xref ref-type="bibr" rid="b9">9</xref> <xref ref-type="bibr" rid="b10">10</xref> have successfully identified` | `breast cancer, clear cell renal carcinoma, and colon cancer have successfully identified` | `breast cancer, clear cell renal carcinoma, and colon cancerhave successfully identified` |
| `, and in the transgenic\nGATA-1,\n<sup>low</sup> mouse` | `, and in the transgenic GATA-1, low mouse` | `, and in the transgenicGATA-1, low mouse` |
| `we selected an allele (designated <italic>cic</italic><sup><italic>4</italic></sup>) that removes` | `we selected an allele (designated cic^4) that removes` | `we selected an allele (designated cic 4) that removes` |
| `regulation of the Wnt-&#x3B2;-catenin pathway` | `regulation of the Wnt-beta-catenin pathway` | `regulation of the Wnt-β-catenin pathway` |
| `the specific HPV<sup>+</sup> gene expression` | `the specific HPV+ gene expression` | `the specific HPV + gene expression` |
| `known to be resistant to 1<sup>st</sup> and 2<sup>nd</sup> generation EGFR-TKIS, osimertinib` | `known to be resistant to 1st and 2nd generation EGFR-TKIS, osimertinib` | `known to be resistant to 1 st and 2 nd generation EGFR-TKIS, osimertinib` |
| `at 37&#xB0;C in a humidified 5% CO<sub>2</sub> incubator` | `at 37 deg C in a humidified 5% CO2 incubator` | `at 37°C in a humidified 5% CO 2 incubator` |
| `seeded at concentrations below 1 &#xD7; 10<sup>6</sup>/ml, selected` | `seeded at concentrations below 1 x 10^6/ml, selected` | `seeded at concentrations below 1 × 10 6 /ml, selected` |
| `9 patients with a <italic>BRAF</italic>-mutant tumour` | `9 patients with a BRAF-mutant tumour` | `9 patients with a BRAF -mutant tumour` |
| `patients with <italic>BRAF</italic><sup>WT</sup> tumours` | `patients with BRAF-WT tumours` | `patients with BRAF WT tumours` |
| `MSI<sup>hi</sup> tumours` | `MSI-hi tumours` | `MSI hi tumours` |
| `upper limit of normal, creatinine clearance &#x2A7E;30&#x2009;ml&#x2009;min<sup>&#x2212;1</sup>,` | `upper limit of normal, creatinine clearance ⩾30 ml min^-1,` | `upper limit of normal, creatinine clearance ⩾30 ml min −1,` |
| `the oncometabolite R(&#x2013;)-2-hydroxyglutarate at the` | `the oncometabolite R(-)-2-hydroxyglutarate at the` | `the oncometabolite R-2-hydroxyglutarate at the` |
| `[<sup>3</sup>H]-Thymidine` | `[3H]-Thymidine` | `[ 3 H]-Thymidine` |
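
For discussion, here is a minimal sketch of how the superscript, subscript, italic and citation rows above could be handled. It is not the project's actual parser: `flatten_inline`, `_flatten` and the three superscript rules are hypothetical, inferred only from the proposed outputs in the table, and it does not attempt to cover every case raised in this issue or the character substitutions (`&#x3B2;` to `beta` and so on).

```python
import re
import xml.etree.ElementTree as ET

# Superscript content that is a (possibly signed) integer, e.g. "4" or "-1".
SIGNED_INT = re.compile(r"^[+-]?\d+$")


def _flatten(elem, text=""):
    """Append the flattened content of `elem` to `text` and return the result."""
    text += elem.text or ""
    for child in elem:
        if child.tag == "xref":
            pass  # assumption: citation markers are dropped entirely
        elif child.tag == "sup":
            # Normalise the Unicode minus sign so "min<sup>&#x2212;1</sup>" works.
            inner = "".join(child.itertext()).replace("\u2212", "-")
            prev = text[-1:]
            if SIGNED_INT.match(inner) and prev.isalnum():
                text += "^" + inner  # 10<sup>4</sup> -> 10^4
            elif inner.isalpha() and prev.isalpha():
                text += "-" + inner  # BRAF<sup>WT</sup> -> BRAF-WT
            else:
                text += inner  # 1<sup>st</sup> -> 1st, HPV<sup>+</sup> -> HPV+
        else:
            text = _flatten(child, text)  # sub, italic, ...: joined without padding
        text += child.tail or ""
    return text


def flatten_inline(fragment):
    """Flatten an XML fragment to plain text and collapse whitespace runs."""
    root = ET.fromstring(f"<p>{fragment}</p>")
    return re.sub(r"\s+", " ", _flatten(root))
```

With these rules, `flatten_inline("10<sup>4</sup>")` returns `10^4` and `flatten_inline("[<sup>3</sup>H]-Thymidine")` returns `[3H]-Thymidine`. The key point is that inline tags are joined without inserting a space, and the choice between `^`, `-` and plain concatenation depends on the character just before the superscript.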
creisle added the enhancement (New feature or request) label Mar 8, 2023
creisle self-assigned this Mar 8, 2023
creisle added a commit that referenced this issue Mar 8, 2023
Deals with several edge-cases to create a more human-readable output
that matches the input more as expected by the reader

resolves: #16
@jakelever
Owner

These all look good to me

@creisle
Collaborator Author

creisle commented Mar 9, 2023

Another weird case I am not sure what to do with:

<sec><title>Title of a thing</title><p>paragraph content</p></sec>

becomes

<passage>Title of a thing</passage><passage>paragraph content</passage>

which makes less sense for bioc, but when it gets concatenated together it comes out as

Title of a thingparagraph content

We need whitespace between the two. Should we add a trailing single space or a newline to the first passage when we parse the XML?
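
One way to picture the options: rather than appending whitespace to the passage text itself, the separator could be added at concatenation time and counted in the passage offsets. The sketch below is only an illustration of that idea. `layout_passages` is a hypothetical helper, the dictionaries are a stand-in for real BioC passage objects, and the newline separator is an assumption.

```python
def layout_passages(texts, sep="\n"):
    """Return (passages, full_text) where each passage offset indexes into full_text.

    `texts` is a list of passage strings in document order. The separator is
    not stored in any passage; it is only accounted for in the offsets.
    """
    passages = []
    offset = 0
    for text in texts:
        passages.append({"offset": offset, "text": text})
        offset += len(text) + len(sep)  # next passage starts after the separator
    return passages, sep.join(texts)


passages, full_text = layout_passages(["Title of a thing", "paragraph content"])
```

Here `full_text` is `"Title of a thing\nparagraph content"` and the second passage gets offset 17, so slicing `full_text` at a passage's offset gives back that passage's text unchanged.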

@creisle
Collaborator Author

creisle commented Mar 14, 2023

Another weird special case on the superscripts to add to the tests:

Compared with <italic>KRAS</italic> wild type and empty vector controls, <italic>KRAS</italic> <sup>10</sup>G<sup>11</sup> and <sup>11</sup>GA<sup>12</sup> significantly enhanced in vivo tumor growth

should be

Compared with KRAS wild type and empty vector controls, KRAS 10G11 and 11GA12 significantly enhanced in vivo tumor growth

@creisle
Collaborator Author

creisle commented Mar 14, 2023

| Input XML | Proposed Output | Current Output |
| --- | --- | --- |
| `The 2-year invasive disease-free survival rate was 93·9%` | `The 2-year invasive disease-free survival rate was 93.9%` | `The 2-year invasive disease-free survival rate was 93*9%` |
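
The character substitutions proposed in this issue (middle dot to `.`, `&#x3B2;` to `beta`, `&#xB0;` to ` deg `, and so on) could be expressed as a single translation table. This is a sketch under assumptions: `ASCII_REPLACEMENTS` and `normalize_chars` are hypothetical names, the mapping contains only the characters that appear in the examples here, and treating every middle dot as a decimal point may be wrong for text where it means multiplication.

```python
# Hypothetical mapping, assembled only from the proposed outputs in this issue.
ASCII_REPLACEMENTS = str.maketrans({
    "\u2014": "--",     # em dash
    "\u2013": "-",      # en dash
    "\u2212": "-",      # minus sign
    "\u00d7": "x",      # multiplication sign
    "\u00b0": " deg ",  # degree sign
    "\u03b2": "beta",   # Greek small letter beta
    "\u00b7": ".",      # middle dot used as a decimal point
})


def normalize_chars(text):
    """Replace the special characters listed above with plain-text equivalents."""
    return text.translate(ASCII_REPLACEMENTS)
```

For example, `normalize_chars("93\u00b79%")` returns `93.9%`. Characters that are not in the table, such as `⩾`, pass through unchanged, which matches the proposed output for the creatinine clearance row.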

creisle added a commit that referenced this issue Mar 15, 2023
creisle added a commit that referenced this issue Mar 15, 2023