Character markup in ToC results in invalid XML in OSIS output #102

mmartin9684-sil · 2020-01-14T22:08:05Z

When a Table of Contents line in the source USFM file includes character markup (e.g., italics), the OSIS output file is not well-formed XML and fails with a parsing error.

Source USFM Line:

\toc2 \it (Lista de lectura)\it*

OSIS Output File:

36225:<milestone type="x-usfm-toc2" n="(Lista de lectura)" />

Error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 36225, column 33

adyeths · 2020-01-15T02:55:16Z

The OSIS line you shared here is valid. The problem must be on a different line. I would need to see more of the file to tell what is actually going on here. (It may very well be the character markup. I didn't anticipate that in this location.)

Is the USFM character markup even valid here?

mmartin9684-sil · 2020-01-15T14:51:45Z

Ryan,

Apologies, the GitHub issue tracking page seems to have interpreted some of the OSIS markup as formatting for the issue itself. I didn’t notice that when posting the issue.

Here’s the original USFM markup:

\id XXB - Aguaruna (Awajun) NT -Peru 2009 (DBL-2013)
\toc1 A'na pi'i marëáchin ya'ipi Yosë quiricanën nontahua'
\toc2 \it (Lista de lectura)\it*
\mt1 Lista de lectura para leer este volumen en un año

Here is the OSIS markup that’s being produced:

<div type="x-other">
<milestone type="x-usfm-toc1" n="A'na pi'i marëáchin ya'ipi Yosë quiricanën nontahua'" />
<milestone type="x-usfm-toc2" n="<hi type="italic">(Lista de lectura)</hi>" />
<title level="1" type="main">Lista de lectura para leer este volumen en un año</title>
<div type="introduction">

I suspect the issue has to do with the double quotes (") both surrounding and embedded in the value for n.

I’ve tried using Paratext to process this source file, and it does appear that it recognizes italics markup on a ToC entry.

Again, my apologies for not catching the problem with the way the GitHub issue tracker was stripping out the OSIS markup.

Regards,
Michael A. Martin
SIL
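To make the failure concrete, here is a minimal Python sketch that feeds the reported toc2 milestone to xml.etree.ElementTree and then tries two possible workarounds. It is not u2o's actual code: the helper names toc_milestone and strip_char_markup are hypothetical, and the regular expression covers only simple character markers such as \it ... \it*. Counting from zero, column 33 of the milestone line is the `<` that opens the embedded hi element, so the raw `<` inside the attribute value may be what the parser rejects first, before it ever reaches the embedded double quotes.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The toc2 milestone as reported in this thread: n contains raw markup.
bad = ('<milestone type="x-usfm-toc2" '
       'n="<hi type="italic">(Lista de lectura)</hi>" />')

try:
    ET.fromstring(bad)
    bad_parses = True
except ET.ParseError as err:
    bad_parses = False
    print("parse failed:", err)


def toc_milestone(level, text):
    """Hypothetical helper: build a toc milestone with a safely quoted n."""
    # quoteattr escapes &, < and > and picks a quote character that does
    # not collide with quotes already present in the text.
    return '<milestone type="x-usfm-toc{}" n={} />'.format(level, quoteattr(text))


def strip_char_markup(usfm_text):
    """Hypothetical helper: drop simple USFM character markers, keep the text."""
    # Matches a backslash, an optional +, a marker name, an optional
    # closing *, and one optional trailing space.
    return re.sub(r'\\\+?[a-z]+\d*\*? ?', '', usfm_text).strip()


# Workaround 1: keep the converted markup, but escape it inside the attribute.
escaped = toc_milestone(2, '<hi type="italic">(Lista de lectura)</hi>')

# Workaround 2: remove the character markup before building the attribute.
plain = toc_milestone(2, strip_char_markup(r'\it (Lista de lectura)\it*'))

for line in (escaped, plain):
    print(line, "->", repr(ET.fromstring(line).get("n")))
```

Both workarounds produce a milestone that parses. Which one is appropriate depends on whether character markup is actually valid in a \toc line, which is the open question in this thread.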

adyeths · 2020-01-15T17:22:15Z

I can't find anything in the USFM documentation to suggest that \it tags are valid in this location, and the default Paratext stylesheet doesn't indicate that they are valid either. If they are, I'll need to see something that says so before I do anything about this particular issue.

DavidHaslam · 2020-01-15T17:32:40Z

It’s not likely to be valid USFM to have character level markup within the ToC strings.

Aside: I did once come across a language in which the alphabet included an italicised letter that was pronounced differently to the normal letter. I suppose there’s a remote possibility that a book name in that language might contain one or more such letters.

I suggest putting the validity question to the UBS ICAP team.
