NLP Pipelines
Steven Bethard and Jason Baldridge are working on a new scheme for creating and using NLP processing pipelines, based in part on ideas from UIMA, but bringing in the capabilities of functional programming and actors.
The central data structure is something we're thinking of calling a Slab, which is akin to the CAS (Common Analysis System) objects in UIMA. Slab can be read as standing for something like "Standoff Layered Annotated Blob". The main thing is to think of it as something you can write annotations on, where the thing being annotated could be a text, an image, or anything else (though we'll primarily be dealing with text). That means Slab is a parameterized class, e.g. we'll have Slab[String] and so on.
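As a first rough sketch, a Slab might look something like the following (all names here are illustrative, not settled API; the real design will likely index annotations by type rather than by string keys):

```scala
// A standoff annotation: a span over the content plus a label.
// (Illustrative only; the draft code may represent spans differently.)
case class Annotation(begin: Int, end: Int, label: String)

// Slab is parameterized by the type of the content being annotated,
// so text gives Slab[String], an image might give Slab[BufferedImage], etc.
case class Slab[Content](
    content: Content,
    layers: Map[String, Vector[Annotation]] = Map.empty) {

  // Adding a layer returns a new Slab and leaves the original untouched,
  // which fits the functional-programming style of the pipeline.
  def addLayer(name: String, annotations: Vector[Annotation]): Slab[Content] =
    copy(layers = layers + (name -> annotations))
}
```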
NLP components are then, unsurprisingly, functions of type Slab => Slab, where the output has additional annotations layered on it. One question that arises is whether we enforce pipeline consistency at compile time via the type system or at run time. Steven has already put together some draft code for a compile-time scheme: https://github.com/bethard/nlp
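For illustration, here is a hypothetical run-time-checked version of that idea, reusing the Slab sketch above (the compile-time scheme in the draft code instead encodes the required and provided annotation layers in the types):

```scala
// Two toy analyzers as plain Slab => Slab functions. The "analyses"
// are deliberately naive; they exist only to show the shape of the API.
val sentenceSegmenter: Slab[String] => Slab[String] = slab => {
  val spans = """[^.]+\.""".r.findAllMatchIn(slab.content)
    .map(m => Annotation(m.start, m.end, "sentence")).toVector
  slab.addLayer("sentence", spans)
}

val tokenizer: Slab[String] => Slab[String] = slab => {
  val spans = """\S+""".r.findAllMatchIn(slab.content)
    .map(m => Annotation(m.start, m.end, "token")).toVector
  slab.addLayer("token", spans)
}

// A pipeline is then just function composition.
val pipeline = sentenceSegmenter andThen tokenizer
val result = pipeline(Slab("This is a test. It has two sentences."))
```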
Jason thinks using Akka actors would be great for constructing pipelines and getting the concurrency benefits that framework has to offer. We'd have a message class like ProcessSlab(slab: Slab), and every analyzer would ensure it can handle this message in its receive method (which can be enforced with a trait and an auxiliary receive method, slabReceive). This should make it fairly straightforward to wrap an actor around existing implementations and have them conform to the pipeline requirements.
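A rough sketch of how that might look with classic Akka actors, again reusing the sketches above (AnalysisActor and its analyze method are made-up names for the enforcing trait, not anything that exists yet):

```scala
import akka.actor.Actor

// The message every analyzer must handle.
case class ProcessSlab(slab: Slab[String])

// The trait pins down receive so every analyzer goes through slabReceive;
// implementers only supply the Slab => Slab analysis itself.
trait AnalysisActor extends Actor {
  def analyze(slab: Slab[String]): Slab[String]

  final def receive: Receive = slabReceive

  def slabReceive: Receive = {
    case ProcessSlab(slab) => sender() ! ProcessSlab(analyze(slab))
  }
}

// Wrapping an existing component is then just delegating to it.
class TokenizerActor extends AnalysisActor {
  def analyze(slab: Slab[String]): Slab[String] = tokenizer(slab)
}
```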
To get things off the ground, we'll build the initial rough drafts around the annotations in the OANC MASC (Open American National Corpus, Manually Annotated Sub-Corpus), which has exactly the sort of standoff markup we need to build a decent pipeline.