-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvement/make brat reader more forgiving take2 #1448
base: 1.12.x
Are you sure you want to change the base?
Improvement/make brat reader more forgiving take2 #1448
Conversation
Can one of the admins verify this patch? |
There is something wrong with this PR. I'm sure there shouldn't be over 3000 changed files in it. |
Changed target branch from |
Oups, sorry about that. I am such a cluts with git.
…On Sun, Dec 22, 2019 at 9:33 AM Richard Eckart de Castilho < ***@***.***> wrote:
Changed target branch from 1.2.x to 1.12.x ;)
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1448?email_source=notifications&email_token=AAIMA4ARIFYYAAKJSRLKLALQZ53EBA5CNFSM4J6L7542YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHPRNGY#issuecomment-568268443>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIMA4BIM7VFY7MAZDIGWLDQZ53EBANCNFSM4J6L754Q>
.
|
I figured out the problem with Mapping.merge(). As it turns out, I wasn't generating the default mappings correctly (I had the wrong order for the arguments when constructing TypeMapping instances). I will continue with the rest of the work and let you know when the branch is ready for pulling. |
I have a question about the part regarding the whole globbing stuff and also your claim that AFAIK, The normal use-case if you wanted to read multiple files would be to either use
My understanding is that you want to avoid having to specify |
On Mon, Dec 23, 2019 at 10:16 AM Richard Eckart de Castilho < ***@***.***> wrote:
I have a question about the part regarding the whole globbing stuff and
also your claim that IOTestRunner would add a .ann extension and thus
work around the issues you are facing.
AFAIK, IOTestRunner does not add anything. The unit tests in
BratReaderWriterTest call it with a specific .ann file.
I'll have a second look. But I am pretty sure that when I tried to write a
test where I invoked testOneWay() with a reader that received a directory
path in PARAM_SOURCE_LOCATION (without a wild card), the test passed. Yet,
when I did the same test using my testReadWrite(), the test failed. I
didn't completely follow execution to figure out exactly what testOneWay()
was doing but I will do and report to you.
The normal use-case if you wanted to read multiple files would be to
either use PARAM_SOURCE _LOCATION with a wildcard, e.g.
annotations_folder/*.ann or to use PARAM_SOURCE _LOCATION and
PARAM_PATTERNS together, e.g.
CollectionReader reader = createReader(
BratReader,
BratReader, "annotations_folder",
BratReader, "*.ann");
My understanding is that you want to avoid having to specify *.ann all
the time - is that correct?
That is correct.
Alain
… |
Hum... not sure how I had come to that conclusion but I confirm that if you pass a directory path without a *.ann pattern, the old BratReader raises an exception. So I CAN write a failing test with testOneWay(). However, I then run into the problem that testOneWay() expects the aExpectedFile and aFile to be individual files, not directories. I tried to modify IOTestRunner so it checks whether or not those are individual files or directories, but I run into all sorts of difficulties. Is it OK if I just write my own tests using my testReadWrite method()? |
I decided to bite the bullet and try to test my improvements using
testOneWay(). I am pretty successful so far, but I have just discovered
something that I think is a major design flaw in that method.
Basically, testOneWay() uses a location in the project's target folder for
the PARAM_TARGET_LOCATION. This location will be exactly the same for every
execution of a test with a given name. That means that successive execution
of a same test ARE NOT INDENPENDANT of each other. I encountered that
problem just now and it took me a while to find the problem.
Essentially, there was a bug in my code that caused an extra file to be
written to PARAM_TARGET_LOCATION. My test failed because it checked the
content of the PARAM_TARGET_LOCATION directory and found that it had this
one unexpected file. I then fixed the bug, but the test still failed,
because that file was left there from the previous run.
As a general rule, if I write a unit tests that needs to write to a
directory, I make it write to a new temp directory everytime to avoid this
sort of thing.
Is OK if I modify testOneWay accordingly?
|
This is intentional because often I need to look at the output after a test has run to verify by eye or to pick the output up and drop it into Normally, it should be the |
The default mappings are hardcoded. PARAM_TYPE_MAPPINGS is used only to ADD to those hard coded defaults.
Mostly done! I just pushed my final draft changes to this branch. I will now create a series of threads on this page that describe
Please take a look at those and respond with concerns and suggestions. |
Impr 1: BratReader.PARAM_SOURCE_LOCATION can be the path of a brat dir without *.ann suffix
|
Impr 2: BratReader.PARAM_SOURCE_LOCATION can be a single .txt file BratReader will automatically find the corresponding .ann file and process it. |
Impr 3: If a brat file to be processed only has a .txt file it the BratReader will still work
|
Impr 4: BratReader comes with pre-activated default mappings
|
Impr 5: BratWriter always has some default mappings, and those are always enabled by default If the user provies additional mappings, those are PRE-PENDED to the default ones, in such a way that the user defined mappings will take precedence over the default ones. PARAM_ENABLE_MAPPINGS defaults to true, which means that the defaults are always enabled. On the face of it, this sounds like it could introduce some problems for legacy code which assume that PARAM_ENABLE_MAPPINGS defaults to 'false'. Let's look at some cases. Case 1: Code that does not provide mappingsSuch code does not provide either PARAM_ENABLE_MAPPINGS nor any of the mappings PARAM. In this case, the default mappings will be loaded and be active. Since no user-defined mappings are The code will behave a bit differently in the sense that the .ann file will use short names for Case 2: Code that DOES provide custom mappings
|
Impr 6: BratReader won't fail if it encounters an unmapped, attribute-less brat annotation If BratReader encounters a brat label that is neither:
then it will capture it as a new class of annotation called BratTag. This type has a Note that this only works for annotation that do not have attributes. |
Impr 7: DKProTestContext now separates input and output directories In the course of testing PARAM_SOURCE_LOCATION=/path/to/a/directory/, I found Since I didn't want to delete -ref. from the src/test/resources directory In order to support this, I changed DKProContext so that it generates two directories |
TODO 1: Get rid of the FileGlog and FileCopy classes These are homegrown classes I use in my various projects for manipulating It would be nice if someone with more knoledge of dkpro-core looked at the places |
TODO 2: Reset failOnMissingMetaData=true in dkpro-core/pom.xml I had to it to 'false' in dkpro-core/pom.xml |
TODO 3: Refactor code that deals with un-mapped entities I made a change so that if BratReader encounters a brat label that is
then the annotation is stored in a generic BratLabel, with its 'label' attribute There are a couple of places in the code where I had to introduce special if (TypeMappings.isGenericBratTag(x)) { etc... } This smells of bad design to me, but I couldn't figure out how to avoid those 'if's. Also, I couldn't make it work in situations where the unmapped brat annotation has some I gave it a shot and was able to read the attributes of an unmapped brat annoation into Maybe someone more experienced with the BratWriter can figure this out? |
TODO 4: Failing ImsCwbReaderWriterTest.testWacky This seems to be a discrepancy in the encoding used by the BratWriter versus Not sure how to deal with that. |
TODO 5: Some tests are unable to initialize the test workspace In several test cases, I had to add a setUp() method to initialize the test workspace:
For some reason that I cannot fathom, some test cases start with DKproTeestContext.context == null, There two testcases that I know if which have this problem:
|
TODO FINAL: Style checks There are many violations to style checking which make the build fail. I ignore those during development, and built with switches that ignore style checking. Once we have decided that the code is exactly the way we want it to be, I will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a first look...
Can you please remind me: why do we need these test workspaces?
@@ -0,0 +1,21 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not having a reader/format-specific type here. Types should be generally useful and be defined in one of the API modules - not in component or I/O modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test test__SingleDir__ThatDoesContainAnnFiles__AssumesEmptyAnnFiles() (which should be renamed to move the AssumesEmpty etc.. bit) illustrates why I needed to modify the structure of the test workspaces.
If I run BratReader with PARAM_SOURCE_LOCATION set to one of the resoures directories, the reader sees some files of the form *-ref.ann and goes looking for its corresponding *-ref.txt file. Not finding them, it raises an exception.
In order to test situations where I pass a directory path in PARAM_SOURCE_LOCATION, I therefore need to create an input directory that only contains .ann files for which there is a corresponding .txt file. That means erasing all the *-ref.ann files from the input directory.
Obviously I don't want to erase them from src/resoources, so instead I copy those files over to a test workspace input dir, then erase the *-ref files from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not having a reader/format-specific type here. Types should be generally useful and be defined in one of the API modules - not in component or I/O modules.
I thought TypeMapping only applied to Brat files, hence the reason why I created a Brat-specific type for unmapped labels.
If the concept of TypeMapping is also used for other file formats, then I agree with you that we should have a generic, non-format-specific type that is used for un-mapped 'labels' seen in a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be possible to use PARAM_PATTERNS
: "[-]*-ref.ann"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would defeat the point of the test. The point of the test is to ensure that if you have a sound brat directory ("sound" meaning that every .ann file has a .txt file) then you can run the BratReader on that directory without having to specify a pattern that identifies which files need to be analyzed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case i would suggest to simply set up multiple subfolders of src/test/resources each with exactly one scenario instead of assembling scenarios by copying data to these test workspaces - if necessary with an input and reference folder inside them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case i would suggest to simply set up multiple subfolders of src/test/resources each with exactly one scenario instead of assembling scenarios by copying data to these test workspaces - if necessary with an input and reference folder inside them.
I think copying the original directory to a test input dir offers you the flexibility to "customize" a set of files for a particular test, without having to create a different directory of input files just to introduce a small test variant.
Also, I remembered that there was another reason for copying the files over. Basically, test__SingleDir__ThatDoesContainAnnFiles__AssumesEmptyAnnFiles() requires that you feed it a directory that contains .txt files, and no .ann files. But if this test uses the src/test/resources for PARAM_SOURCE_LOCATION, then it will end up creating empty .ann files in that directory. This means that the next time you run the same test, it will not be testing with the initial conditions that it's trying to test (because the .ann files are now present).
…__Take2 * 1.12.x: No issue. Upgrade checkstyle.
Jenkins, can you test this please? |
…__Take2 * 1.12.x: dkpro#1473 - Upgrade dependencies dkpro#1473 - Upgrade dependencies dkpro#1473 - Upgrade dependencies No issue. Disable tests using GermEvl 2014 because that is currently not downloadable. dkpro#1473 - Upgrade dependencies dkpro#1473 - Upgrade dependencies dkpro#1473 - Upgrade dependencies dkpro#1471 - Hunpos binaries do not run on OS X Catalina dkpro#1469 - Plain text versions of CC license files have disappeared from CC website dkpro#1469 - Plain text versions of CC license files have disappeared from CC website dkpro#1463 - Upgrade to OpenNLP 1.9.2 dkpro#1463 - Upgrade to OpenNLP 1.9.2 dkpro#1461 - BinaryCasWriter does not support BINARY_TSI dkpro#1455 - Several format examples cannot be used copy-paste dkpro#1455 - Several format examples cannot be used copy-paste
This is Take2 of my attempt to improve BratReader. Please ignore the previous pull request.
I am not done yet but I am struggling with the method Mapping.merge(). Right now, there is a line at the start of the method:
boolean justCustomMapping = true;
If you leave it at true, all the tests pass, but only because the merge() method basically just returns a mapping that is equivalent to the first of the two arguments.
If you set the first line to false, a bunch of tests start failing and I have no idea why.
If someone can resolve that mystery, I should be OK for doing the rest of the work which is: