Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement/make brat reader more forgiving take2 #1448

Open
wants to merge 55 commits into
base: 1.12.x
Choose a base branch
from

Conversation

alaindesilets
Copy link
Contributor

This is Take2 of my attempt to improve BratReader. Please ignore the previous pull request.

I am not done yet but I am struggling with the method Mapping.merge(). Right now, there is a line at the start of the method:

boolean justCustomMapping = true;

If you leave it at true, all the tests pass, but only because the merge() method basically just returns a mapping that is equivalent to the first of the two arguments.

If you set the first line to false, a bunch of tests start failing and I have no idea why.

If someone can resolve that mystery, I should be OK for doing the rest of the work which is:

  • Making sure that BratReader assigns a catch-all type (ex: NamedEntity) to any unexpected brat label it encounters.

@ukp-svc-jenkins
Copy link

Can one of the admins verify this patch?

@reckart
Copy link
Member

reckart commented Dec 22, 2019

There is something wrong with this PR. I'm sure there shouldn't be over 3000 changed files in it.

@reckart reckart changed the base branch from 1.2.x to 1.12.x December 22, 2019 14:33
@reckart
Copy link
Member

reckart commented Dec 22, 2019

Changed target branch from 1.2.x to 1.12.x ;)

@reckart reckart added Module-io.brat ⭐️ Enhancement New feature or request labels Dec 22, 2019
@reckart reckart modified the milestones: 1.12.1, 1.13.0 Dec 22, 2019
@alaindesilets
Copy link
Contributor Author

alaindesilets commented Dec 22, 2019 via email

@alaindesilets
Copy link
Contributor Author

I figured out the problem with Mapping.merge(). As it turns out, I wasn't generating the default mappings correctly (I had the wrong order for the arguments when constructing TypeMapping instances).

I will continue with the rest of the work and let you know when the branch is ready for pulling.

@reckart
Copy link
Member

reckart commented Dec 23, 2019

I have a question about the part regarding the whole globbing stuff and also your claim that IOTestRunner would add a .ann extension and thus work around the issues you are facing.

AFAIK, IOTestRunner does not add anything. The unit tests in BratReaderWriterTest call it with a specific .ann file - but it can be used with wildcards as well.

The normal use-case if you wanted to read multiple files would be to either use PARAM_SOURCE _LOCATION with a wildcard, e.g. annotations_folder/*.ann or to use PARAM_SOURCE _LOCATION and PARAM_PATTERNS together, e.g.

        CollectionReader reader = createReader(BratReader.class,
                BratReader.PARAM_SOURCE _LOCATION, "annotations_folder/*.ann");

        CollectionReader reader = createReader(BratReader.class,
                BratReader.PARAM_SOURCE _LOCATION, "annotations_folder",
                BratReader.PARAM_PATTERNS, "*.ann");

My understanding is that you want to avoid having to specify *.ann all the time - is that correct?

@alaindesilets
Copy link
Contributor Author

alaindesilets commented Dec 23, 2019 via email

@alaindesilets
Copy link
Contributor Author

AFAIK, IOTestRunner does not add anything. The unit tests in
BratReaderWriterTest call it with a specific .ann file.

Hum... not sure how I had come to that conclusion but I confirm that if you pass a directory path without a *.ann pattern, the old BratReader raises an exception.

So I CAN write a failing test with testOneWay(). However, I then run into the problem that testOneWay() expects the aExpectedFile and aFile to be individual files, not directories. I tried to modify IOTestRunner so it checks whether or not those are individual files or directories, but I run into all sorts of difficulties.

Is it OK if I just write my own tests using my testReadWrite method()?

@alaindesilets
Copy link
Contributor Author

alaindesilets commented Dec 24, 2019 via email

@reckart
Copy link
Member

reckart commented Dec 24, 2019

Basically, testOneWay() uses a location in the project's target folder for
the PARAM_TARGET_LOCATION. This location will be exactly the same for every
execution of a test with a given name. That means that successive execution
of a same test ARE NOT INDENPENDANT of each other. I encountered that
problem just now and it took me a while to find the problem.

This is intentional because often I need to look at the output after a test has run to verify by eye or to pick the output up and drop it into src/test/resources as reference data for the test.

Normally, it should be the DkproTestContext.getTestOutputFolder() which is invoked to generate the target location and that method removes any potentially previously existing data. If for some reason that method is not used, then I think that would be a bug in the test code. Best open a separate issue/PR for it if this is the case.

@alaindesilets
Copy link
Contributor Author

Mostly done!

I just pushed my final draft changes to this branch. I will now create a series of threads on this page that describe

  • The verious improvements I implemented
  • A list of TODOs that I couldn't deal with by myself and where I need some help (those will have at 'TODO-n' at the start of the thread).

Please take a look at those and respond with concerns and suggestions.

@alaindesilets
Copy link
Contributor Author

Impr 1: BratReader.PARAM_SOURCE_LOCATION can be the path of a brat dir without *.ann suffix

  • Reader assumes that it needs to process all .ann files in that dir

@alaindesilets
Copy link
Contributor Author

Impr 2: BratReader.PARAM_SOURCE_LOCATION can be a single .txt file

BratReader will automatically find the corresponding .ann file and process it.

@alaindesilets
Copy link
Contributor Author

Impr 3: If a brat file to be processed only has a .txt file it the BratReader will still work

  • It will simply assume an empty .ann file

@alaindesilets
Copy link
Contributor Author

Impr 4: BratReader comes with pre-activated default mappings

  • Those mappings define short-names for most of the UIMA types defined in the dkpro-core type system
  • Those defaults are ALWAYS loaded by the BratReader
  • If user provides PARAM_MAPPING, those are PRE-PENDED to the default ones, in such a way that the user defined mappings will take precedence over the default ones.
  • Note that if the BratReader is used to read a file that was saved without mappings, it will
    still work because BratReader ALWAYS recognizes labels that are fully qualified class names.

@alaindesilets
Copy link
Contributor Author

alaindesilets commented Jan 22, 2020

Impr 5: BratWriter always has some default mappings, and those are always enabled by default

If the user provies additional mappings, those are PRE-PENDED to the default ones, in such a way that the user defined mappings will take precedence over the default ones.

PARAM_ENABLE_MAPPINGS defaults to true, which means that the defaults are always enabled.

On the face of it, this sounds like it could introduce some problems for legacy code which assume that PARAM_ENABLE_MAPPINGS defaults to 'false'. Let's look at some cases.

Case 1: Code that does not provide mappings

Such code does not provide either PARAM_ENABLE_MAPPINGS nor any of the mappings PARAM.

In this case, the default mappings will be loaded and be active. Since no user-defined mappings are
defined, there is no risk that the default mappings will conflict with user-defined ones.

The code will behave a bit differently in the sense that the .ann file will use short names for
the annotations, instead of the long, fully qualified class name. I am assuming that users who
invoke BratReader that way are really just using the brat files as a persistence mechanism, as opposed to
a way to load the CAS into the Brat viewer (cause the long labels are pretty much unusable in the Brat viewer)
So they shouldn't care about the labels used in the .ann file, as long as
the original CAS can be re-created from the .ann file.

Case 2: Code that DOES provide custom mappings

  • In those cases, the custom mappings will be pre-pended to the default ones in such a way
    that they take precedence.
  • One danger here is that some of those custom mappings will conflict with default ones, i.e.
    a UIMA type can be mapped to two different brat labels
  • Note that this has ALWAYS been the case. The fact that we have default mappings just increases
    the risk that it will happen.
  • To that effect, the BratReader checks for this sort of conflict and raises an exception.
  • If the user really wants to create a mapping whose name is already used in the defaults (not
    a good idea in my opinion), they can always set the new parameter PARAM_CHECK_MAPPING_CONFLICTS = false
    (default is true)

@alaindesilets
Copy link
Contributor Author

Impr 6: BratReader won't fail if it encounters an unmapped, attribute-less brat annotation

If BratReader encounters a brat label that is neither:

  • mapped
  • nor a fully qualified class name

then it will capture it as a new class of annotation called BratTag. This type has a
single attribute: 'label', which is set to the brat label that was seen in the .ann file.

Note that this only works for annotation that do not have attributes.

@alaindesilets
Copy link
Contributor Author

Impr 7: DKProTestContext now separates input and output directories

In the course of testing PARAM_SOURCE_LOCATION=/path/to/a/directory/, I found
I needed to delete the -ref. files because the Reader complained that
it could not find a .txt file for those.

Since I didn't want to delete -ref. from the src/test/resources directory
I modified .readingFrom() so that it copies the files to a test input directory,
and deletes the -ref. files (unless the second argument is false).

In order to support this, I changed DKProContext so that it generates two directories
for the test (input and output), instead of just one.

@alaindesilets
Copy link
Contributor Author

TODO 1: Get rid of the FileGlog and FileCopy classes

These are homegrown classes I use in my various projects for manipulating
files and directories. In the course of writing and testing the various
improvements, I found myself craving some of their methods, so I
brought them over to dkpro-core. I am happy to donate them to the project if that
seems warranted, but I suspect that dkpro-core already has methods for doing what
I needed from FileGlog and FileCopy. I just didn't want to spent too much time
figuring that out.

It would be nice if someone with more knoledge of dkpro-core looked at the places
where I invoke FileGlob and FileCopy to see if a more dkpro-core native approach
could be used instead.

@alaindesilets
Copy link
Contributor Author

alaindesilets commented Jan 22, 2020

TODO 2: Reset failOnMissingMetaData=true in dkpro-core/pom.xml

I had to it to 'false' in dkpro-core/pom.xml
, because otherwise, the mvn-uimafit plugin complained about one of the new
PARAMs I added to BratReader or BratWriter (forget which one).

@alaindesilets
Copy link
Contributor Author

TODO 3: Refactor code that deals with un-mapped entities

I made a change so that if BratReader encounters a brat label that is

  • not mapped
  • not a fully qualified annotation class

then the annotation is stored in a generic BratLabel, with its 'label' attribute
set to the brat label that yielded it.

There are a couple of places in the code where I had to introduce special
processing if an annotation correspond to a BratLabel. This introduced a number of
'if' along the lines of:

if (TypeMappings.isGenericBratTag(x)) { etc... }

This smells of bad design to me, but I couldn't figure out how to avoid those 'if's.

Also, I couldn't make it work in situations where the unmapped brat annotation has some
attributes. The problem is that the code that reads/writes brat annotations expects the
annotation's attributes to be attributes of the UIMA type. But BratTag only has a single
attribute 'label'.

I gave it a shot and was able to read the attributes of an unmapped brat annoation into
a FSList attribute of BratTag. This was at the cost of one of more of those
hacky "if (TypeMappings.isGenericBratTag" statements. But I coulndt figure out how to make
the BratWriter write those read attributes, in the exact same way that they were read.

Maybe someone more experienced with the BratWriter can figure this out?

@alaindesilets
Copy link
Contributor Author

TODO 4: Failing ImsCwbReaderWriterTest.testWacky

This seems to be a discrepancy in the encoding used by the BratWriter versus
encoding of the reference file.

Not sure how to deal with that.

@alaindesilets
Copy link
Contributor Author

TODO 5: Some tests are unable to initialize the test workspace

In several test cases, I had to add a setUp() method to initialize the test workspace:

@Before
public void setUp() throws IOException {
    DkproTestContext.get().initializeTestWorkspace();
} 

For some reason that I cannot fathom, some test cases start with DKproTeestContext.context == null,
which means that DkproTestContext.get() returns null, with setUp() raising a null exception.

There two testcases that I know if which have this problem:

  • NifReaderWriterTest
  • XcesXmlReaderTest

@alaindesilets
Copy link
Contributor Author

TODO FINAL: Style checks

There are many violations to style checking which make the build fail.

I ignore those during development, and built with switches that ignore style checking.

Once we have decided that the code is exactly the way we want it to be, I will
fix all the style check problems.

Copy link
Member

@reckart reckart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first look...

Can you please remind me: why do we need these test workspaces?

@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not having a reader/format-specific type here. Types should be generally useful and be defined in one of the API modules - not in component or I/O modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test__SingleDir__ThatDoesContainAnnFiles__AssumesEmptyAnnFiles() (which should be renamed to move the AssumesEmpty etc.. bit) illustrates why I needed to modify the structure of the test workspaces.

If I run BratReader with PARAM_SOURCE_LOCATION set to one of the resoures directories, the reader sees some files of the form *-ref.ann and goes looking for its corresponding *-ref.txt file. Not finding them, it raises an exception.

In order to test situations where I pass a directory path in PARAM_SOURCE_LOCATION, I therefore need to create an input directory that only contains .ann files for which there is a corresponding .txt file. That means erasing all the *-ref.ann files from the input directory.

Obviously I don't want to erase them from src/resoources, so instead I copy those files over to a test workspace input dir, then erase the *-ref files from it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not having a reader/format-specific type here. Types should be generally useful and be defined in one of the API modules - not in component or I/O modules.

I thought TypeMapping only applied to Brat files, hence the reason why I created a Brat-specific type for unmapped labels.

If the concept of TypeMapping is also used for other file formats, then I agree with you that we should have a generic, non-format-specific type that is used for un-mapped 'labels' seen in a file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be possible to use PARAM_PATTERNS: "[-]*-ref.ann"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would defeat the point of the test. The point of the test is to ensure that if you have a sound brat directory ("sound" meaning that every .ann file has a .txt file) then you can run the BratReader on that directory without having to specify a pattern that identifies which files need to be analyzed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case i would suggest to simply set up multiple subfolders of src/test/resources each with exactly one scenario instead of assembling scenarios by copying data to these test workspaces - if necessary with an input and reference folder inside them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case i would suggest to simply set up multiple subfolders of src/test/resources each with exactly one scenario instead of assembling scenarios by copying data to these test workspaces - if necessary with an input and reference folder inside them.

I think copying the original directory to a test input dir offers you the flexibility to "customize" a set of files for a particular test, without having to create a different directory of input files just to introduce a small test variant.

Also, I remembered that there was another reason for copying the files over. Basically, test__SingleDir__ThatDoesContainAnnFiles__AssumesEmptyAnnFiles() requires that you feed it a directory that contains .txt files, and no .ann files. But if this test uses the src/test/resources for PARAM_SOURCE_LOCATION, then it will end up creating empty .ann files in that directory. This means that the next time you run the same test, it will not be testing with the initial conditions that it's trying to test (because the .ann files are now present).

@reckart
Copy link
Member

reckart commented Feb 28, 2020

Jenkins, can you test this please?

…__Take2

* 1.12.x:
  dkpro#1473 - Upgrade dependencies
  dkpro#1473 - Upgrade dependencies
  dkpro#1473 - Upgrade dependencies
  No issue. Disable tests using GermEvl 2014 because that is currently not downloadable.
  dkpro#1473 - Upgrade dependencies
  dkpro#1473 - Upgrade dependencies
  dkpro#1473 - Upgrade dependencies
  dkpro#1471 - Hunpos binaries do not run on OS X Catalina
  dkpro#1469 - Plain text versions of CC license files have disappeared from CC website
  dkpro#1469 - Plain text versions of CC license files have disappeared from CC website
  dkpro#1463 - Upgrade to OpenNLP 1.9.2
  dkpro#1463 - Upgrade to OpenNLP 1.9.2
  dkpro#1461 - BinaryCasWriter does not support BINARY_TSI
  dkpro#1455 - Several format examples cannot be used copy-paste
  dkpro#1455 - Several format examples cannot be used copy-paste
@reckart reckart modified the milestones: 1.13.0, Feature backlog Jan 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants