Schemas and Validation

Part 1 (see README)

Schemas for XML documents (including TEI documents) are a complex topic. We might start by asking what a schema is and why you would care whether your TEI document is valid. XML documents have two "modes" of correctness. The first is well-formedness, which asks whether the basic syntax of the document is OK: are all of my opening tags closed in the right order? Have I remembered to put quotes around attribute values? That sort of thing. XML is very strict about well-formedness. If your document isn't well-formed, an XML parser is supposed to fail immediately with an error message. Contrast this with a web browser, which will patch broken HTML on the fly and attempt to display something useful regardless.
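To make the "fail immediately" behaviour concrete, here is a minimal sketch using the XML parser in Python's standard library. The function name `is_well_formed` and the sample strings are mine, invented for illustration; the point is only that the parser rejects bad syntax outright instead of patching it up.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> tuple[bool, str]:
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser stops at the first syntax error it meets.
        return False, str(err)
    return True, ""


# Tags closed in the right order, attribute value quoted.
good = '<p rend="italic">Some <hi>text</hi></p>'
# <hi> and <p> are closed in the wrong order.
bad_nesting = "<p>Some <hi>text</p></hi>"
# The attribute value has no quotes around it.
bad_attribute = "<p rend=italic>Some text</p>"

for sample in (good, bad_nesting, bad_attribute):
    print(is_well_formed(sample))
```

Note that this says nothing about validity: a document can pass this check and still use elements that no schema allows.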

Schemas provide the next level of correctness checking: not only do all your tags line up, but they use the vocabulary defined in the schema, and your document has the right tags in the right places. No `<div>`s inside `<p>` elements, for example. Schema validation provides some of the same kind of security that tests do for programs: it can alert you when something has gone wrong, and since that something might make it difficult to process the document, it's good to have a way to check for problems other than seeing if your system breaks.

There are arguments to be made against the TEI's overall document model as defined, and it's my sense that some TEI people worry too much about "the rules" and about people breaking them, and the consequences thereof. But at the same time, it's really all-or-nothing. If you want to have a vocabulary of elements and attributes at all, then you need a schema.

There are, confusingly, several ways to validate an XML document. Most people have heard of DTDs at some point. Document Type Definitions are a legacy of XML's SGML past, and their use is actively discouraged by many TEI practitioners. If you take a TEI class from me, you will be told to just forget about them. They use a non-XML syntax and are both less powerful and more dangerous than the other schema types. A lot of the complexity of XML processing is due to the need to support DTDs. There are newer methods available, like XML Schema (XSD), Relax NG, and Schematron, and you should use one of those, or a combination of them. TEI people tend to use Relax NG plus Schematron.

A TEI schema is generally defined in TEI, using a document called an ODD (One Document Does it all), which contains both the documentation for your project and the definition of your rule set, and from which a schema in DTD, XML Schema, Relax NG, and/or Schematron can be generated. Typically, a TEI document will specify what schemas should be used to validate it. If your document has a DTD, it will have a Document Type Declaration; if an XSD, there will be attributes on the root element that link to your schema(s); if Relax NG or Schematron, there will be processing instructions that link to the schemas.
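As a rough illustration of those three linking mechanisms, the sketch below scans a document for a DOCTYPE system identifier, `xml-model` processing instructions, and `xsi:schemaLocation` attributes on the root element. It uses only Python's standard library. The function name `find_schema_links`, the sample document, and the `example.org` URLs are all made up for the example, and the pseudo-attribute parsing is deliberately simple.

```python
import re
from xml.dom import minidom

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Namespaces commonly given in the schematypens pseudo-attribute.
SCHEMA_KINDS = {
    "http://relaxng.org/ns/structure/1.0": "relaxng",
    "http://purl.oclc.org/dsdl/schematron": "schematron",
}

# Matches name="value" or name='value' inside a processing instruction.
PSEUDO_ATTR = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""")

# Hypothetical document; the schema URLs do not point at real files.
SAMPLE = """<?xml version="1.0"?>
<?xml-model href="https://example.org/schemas/project.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://example.org/schemas/project.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"/>
"""


def find_schema_links(xml_text: str) -> list[tuple[str, str]]:
    """Return (kind, location) pairs for every schema the document names."""
    doc = minidom.parseString(xml_text)
    links = []

    # DTD: the Document Type Declaration carries a system identifier.
    if doc.doctype is not None and doc.doctype.systemId:
        links.append(("dtd", doc.doctype.systemId))

    # Relax NG / Schematron: xml-model processing instructions before the root.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            attrs = {name: value
                     for name, _quote, value in PSEUDO_ATTR.findall(node.data)}
            kind = SCHEMA_KINDS.get(attrs.get("schematypens", ""), "unknown")
            if "href" in attrs:
                links.append((kind, attrs["href"]))

    # XSD: attributes on the root element.
    root = doc.documentElement
    pairs = root.getAttributeNS(XSI, "schemaLocation").split()
    for location in pairs[1::2]:  # the value is "namespace location" pairs
        links.append(("xsd", location))
    no_namespace = root.getAttributeNS(XSI, "noNamespaceSchemaLocation")
    if no_namespace:
        links.append(("xsd", no_namespace))

    return links


for kind, location in find_schema_links(SAMPLE):
    print(kind, location)
```

All this does is find out where the schemas are supposed to live. Actually getting hold of them is a separate problem.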

This leads us immediately to problem #1. If our putative TEI editor is going to validate our document, then it will have to fetch the schema, read it, and do things with it. The schema will have to be available online in such a way that we can retrieve it, and we can't just straightforwardly download it unless some work has been done up front on the server hosting the schema. Interestingly, this is not true for DTDs, which a validating parser is required to retrieve. DTDs might define "entities" which your document might reference, and these could actually contain parts of the document: you might have boilerplate bits of the header defined in your DTD, for example. But the XML parser bundled with your browser is a non-validating one, so that does us no good at all. For us to have any hope of reading your schema, the server it lives on will need to have Cross-Origin Resource Sharing (CORS) set up. If not, you will have to supply the schema in some other way. In sum: there is a usability issue beyond our control here that might confuse and annoy potential users. Fortunately, the pre-built "exemplars" hosted by the TEI-C (e.g. TEI Lite) would be available in this way, as would any schema hosted on GitHub.
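The CORS requirement comes down to one response header. The sketch below is a simplified model, in Python, of the decision a browser makes for a plain cross-origin fetch without credentials. It is not how any browser is implemented, and the function name `browser_may_read`, the origin `https://editor.example`, and the header dictionaries are all hypothetical.

```python
def browser_may_read(response_headers: dict[str, str], page_origin: str) -> bool:
    """Simplified model: may a page at page_origin read this response?

    Covers only a simple GET without credentials. Real browsers apply
    further rules (preflight requests, credentialed requests, and so on).
    """
    allow = None
    for name, value in response_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "access-control-allow-origin":
            allow = value.strip()
    if allow is None:
        # No CORS header at all: the response is hidden from the page.
        return False
    return allow == "*" or allow == page_origin


editor = "https://editor.example"  # hypothetical origin of the TEI editor

# A server that has CORS set up for everyone.
print(browser_may_read({"Access-Control-Allow-Origin": "*"}, editor))
# A server that serves the schema but says nothing about CORS.
print(browser_may_read({"Content-Type": "application/xml"}, editor))
```

The second case is the annoying one: the schema is sitting right there on a public URL, and the editor still can't read it.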

If we can get the schema, next we'll have to do something with it. Now it gets tricky. Note that there may be a shortcut that lets us bypass all that's to follow, thanks to the miracle of modern technology. When we tried building a TEI editor during the Angles project a few years ago, there was no avoiding the fact that if we wanted proper validation we'd need to do it on the server side, rather than in the browser. This is messy, because it means uploading your document, retrieving the schema, validating against the schema, returning the resulting error messages (if any) in a form that your editor can do something useful with, and then doing it again. A lot. It's all doable, but it means you're going to incur a recurring cost for running a process on one or more servers, not to mention dealing with the maintenance costs and mitigating any security problems. Your process will involve people posting random URIs to your server process, you downloading and caching whatever happens to be on the other end of those, reading it, and then feeding it into a program, along with random text posted from a website. This is the sort of thing that makes a developer with any experience at least mildly nervous. It is, frankly, not something I would do without an ongoing revenue stream to support it.
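To give a flavour of the "random URIs" worry, here is one small mitigation a server-side validator might apply before fetching anything: checking the posted schema URI against an allow-list of hosts. This is a sketch of one possible precaution, not what Angles did, and it is nowhere near a complete defence. The name `is_acceptable_schema_uri`, the host list, and the example URIs are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; a real service would choose its own hosts.
ALLOWED_SCHEMA_HOSTS = {"www.tei-c.org", "raw.githubusercontent.com"}


def is_acceptable_schema_uri(uri: str) -> bool:
    """Return True only for https URIs on an allow-listed host."""
    try:
        parts = urlsplit(uri)
        host = parts.hostname
        has_credentials = bool(parts.username or parts.password)
    except ValueError:
        # urlsplit rejects some malformed URIs outright.
        return False
    if parts.scheme != "https":
        # Rules out file:, ftp:, plain http:, and so on.
        return False
    if has_credentials:
        # "https://trusted-host@evil.example/" style tricks.
        return False
    return host in ALLOWED_SCHEMA_HOSTS


candidates = [
    "https://www.tei-c.org/some/path/tei_lite.rng",   # illustrative path
    "file:///etc/passwd",
    "http://169.254.169.254/latest/meta-data/",
    "https://www.tei-c.org@evil.example/schema.rng",
]
for uri in candidates:
    print(uri, is_acceptable_schema_uri(uri))
```

Even with a check like this, you are still downloading and parsing whatever the other end sends back, which is where the maintenance and security costs really pile up.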

Validation Part 2