Bug: Nuclei detection on cellSens VSI fails #255
I would think this would fail:
Just mentioning that I am now able to run inference with a deep learning model using my own custom plugin (https://github.com/andreped/FP-dsa-plugin), but the end users mostly have their images in the VSI format. Hence, having good support for the format is essential.
Basically, the worker is passed "the" image file, but VSI is a multi-file format, and the worker doesn't have all of it to work with inside the docker container. As most of our use to date has been on single-file WSI formats, this hasn't been addressed -- it probably isn't that much work, but the worker would need to be updated to properly pull references to the other files used in the WSI. We are using the largest of the multiple stacks through the bioformats reader. I'd have to hunt through the metadata that bioformats exposes to find where it reports the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up needing some per-format code.
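To illustrate the kind of update meant here, a minimal sketch follows. This is not actual girder_worker code; `stage_vsi` and the sibling-directory layout are assumptions based on this thread:

```python
import os
import shutil


def stage_vsi(vsi_path, work_dir):
    """Sketch: copy a .vsi file plus its _<name>_ sibling directory
    (which holds the .ets pixel data) into the worker's scratch space,
    so the multi-file format is complete inside the container."""
    os.makedirs(work_dir, exist_ok=True)
    shutil.copy(vsi_path, work_dir)
    base = os.path.splitext(os.path.basename(vsi_path))[0]
    sibling = os.path.join(os.path.dirname(vsi_path), '_%s_' % base)
    if os.path.isdir(sibling):
        shutil.copytree(sibling, os.path.join(work_dir, '_%s_' % base))
    return os.path.join(work_dir, os.path.basename(vsi_path))
```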
That explains it. There exist other WSI formats, even in OpenSlide, which also store their data in this kind of folder structure. Hence, it is a feature I guess others would also welcome, at least if they wish to use the power of HistomicsTK to run algorithms on their own images through the web UI.
As I mentioned, we were also unable to find the magnification level in the cellSens VSI format. However, we made our own custom reader in C++, as BioFormats is poorly supported in C++. QuPath uses BioFormats, and I would think one should be able to find there how to get the correct magnification level from the cellSens VSI format.
Just tried running the Nuclei Detection algorithm that is part of HistomicsTK on the full CMU-1.tiff WSI from the OpenSlide test data, with default settings and without setting the bounding box or anything. I observed the same issue as with my own plugin (which is heavily inspired by the built-in NucleiDetection method). The job is successful, but the annotations are not made available as annotations, nor are they rendered. The analysis took about 7-8 minutes to run. This was observed using the Chromium browser on Ubuntu 18.04, on the same machine where DSA is running. I have attached the full terminal logs from the DSA docker-compose up, from the start of running the nuclei detection method to success (albeit with no stored annotations); I also refreshed to see if the annotations would show up (which is also part of the logs at the end).
Can you look in the actual girder log file (info.log) rather than the docker logs (which are the stdout/stderr of the docker container)? If you used the reference deployment, this will be at devops/dsa/logs/info.log. Otherwise, the web API can be used to fetch parts of the log without knowing where it is stored.
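For example, fetching the tail of the log via the REST API could look like the sketch below. This assumes Girder's `GET /system/log` endpoint and an admin token (`ADMIN_TOKEN` is a placeholder):

```python
import requests

ADMIN_TOKEN = '...'  # placeholder: an admin authentication token

# Fetch the last 64 KiB of the info log through the Girder REST API.
resp = requests.get(
    'http://localhost:8080/api/v1/system/log',
    params={'log': 'info', 'bytes': 65536},
    headers={'Girder-Token': ADMIN_TOKEN},
)
print(resp.text)
```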
Below I have shared the girder log file. Seems like a GirderException was raised due to
From this it looks like there is a 16 MB limit when reading from a single JSON file. Right now all annotations are stored in a single JSON file (from the plugin). Same applies to the
No -- it is an unintended limitation. I've made an issue to track it: girder/large_image#1073, but I may not get to it until next week.
See girder/large_image#1075 for a fix, though not a total solution. Apparently that restriction wasn't there before we made a change to better support S3 assetstores -- it could just use too much memory, though.
The solution you suggested makes sense to me. It looks like you are reading the annotations in chunks. I guess if the annotation file itself is extremely large there could still be memory issues, but this should work fine for my application. Thanks a lot! Just tried pulling the latest DSA image and it seems to work well, at least for the built-in nuclei detection method in HistomicsTK :] Will try my other custom pipelines now, but I believe they should work. Of course the method itself does funny stuff in some regions, but I believe this method is just there to demonstrate a pipeline. The large annotation storage reading issue was more related to a separate issue; I have therefore closed that one, but this issue remains open until we are able to parse the magnification level information from the cellSens VSI format.
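For reference, the chunked-read idea looks roughly like this. A sketch only, not the actual #1075 patch:

```python
def iter_chunks(fileobj, chunk_size=16 * 1024 * 1024):
    """Yield fixed-size pieces of a file instead of one giant read()."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


with open('annotation.json', 'rb') as f:
    # The chunks can be accumulated (as here) or fed to a streaming
    # parser; accumulating still holds the whole file in memory, but
    # no single read() call has to return more than chunk_size bytes.
    data = b''.join(iter_chunks(f))
```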
Oh, note that after the job is done, it takes quite a while until the annotation becomes available under
Anyways, I ran a deep learning model I have on an image that is about 160k x 120k pixels packed with nuclei, and at least annotations are stored and I can render the result :]
If you look at the logs you'll see the time it takes for various parts of the annotation JSON parsing and ingest. Currently, we use the orjson package to read the JSON. This means that the file and the decoded JSON are in memory at the same time. Using something like ijson would reduce the memory footprint, but would actually be substantially slower (maybe by a factor of 3 to 5 in this stage). We validate the JSON against the schema -- this step could be parallelized. We then insert the data into the database; this step IS parallelized based on the CPU count.

On my faster desktop, where I ingested ~320,000 nuclei polygons, decoding takes 16s, validation less than 10s, and ingest 115s. On a slower machine, the decoding is 38s, validation 20s, and ingest 141s. The ingest is limited by mongo's speed (we should have used Postgres, but that is a big refactor). I think if an iterative JSON parser were used, its slowness would be amortized over the multiple threads, but because mongo is currently the limiting factor, looking into how to speed up those inserts would have the most impact on speed (but we already do unsorted inserts, so I'm not sure how to improve things).
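To make the trade-off concrete, a streaming ingest might look like the sketch below. Assumptions: an `annotation.json` with a top-level `elements` array, a local MongoDB, and a hypothetical `annotationelement` collection; this is not the large_image implementation:

```python
import ijson
import pymongo

coll = pymongo.MongoClient()['demo']['annotationelement']

batch, batch_size = [], 10000
with open('annotation.json', 'rb') as f:
    # Stream objects from the top-level "elements" array one at a time,
    # so the whole document is never decoded into memory at once.
    # use_float=True avoids Decimal values, which pymongo cannot encode.
    for element in ijson.items(f, 'elements.item', use_float=True):
        batch.append(element)
        if len(batch) >= batch_size:
            # ordered=False lets mongo apply the batch without
            # serializing on document order (an unsorted insert).
            coll.insert_many(batch, ordered=False)
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)
```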
Thank you for sharing your performance benchmarks. They seem similar to what I am observing on the desktop I was testing this on. If I understand it correctly, all the objects are inserted into the database -- that is, every single annotation object with its corresponding boundaries? I guess that is done to be able to easily modify single objects and whatnot, but wouldn't it be a lot faster to simply store the JSON file externally and refer to it from the database, similar to what is done when importing WSIs to a collection? But I guess for practical reasons, such as annotation work and versioning, it is better to have all objects stored in the database, or am I missing something? Anyways, I'm fine with the current delay. I'm more interested in having good support for the cellSens VSI format.
As you said, every object can be versioned as well as edited or relabeled; that was by design.
@manthey I found a way to parse the magnification level from the cellSens VSI format. See the gist. I parse the OME-XML metadata of the VSI file to get the magnification level. For convenience, I also shared the implementation below:
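A minimal sketch of this approach, assuming `python-bioformats` and `javabridge` are installed; the gist itself may differ in details:

```python
import xml.etree.ElementTree as ET

import bioformats
import javabridge

javabridge.start_vm(class_path=bioformats.JARS)
try:
    # get_omexml_metadata returns the OME-XML string for the file.
    xml_str = bioformats.get_omexml_metadata('OS-1.vsi')
finally:
    javabridge.kill_vm()

ns = {'ome': 'http://www.openmicroscopy.org/Schemas/OME/2016-06'}
root = ET.fromstring(xml_str)

# The objective's nominal magnification, if the scanner recorded it.
objective = root.find('.//ome:Instrument/ome:Objective', ns)
if objective is not None and 'NominalMagnification' in objective.attrib:
    print('magnification:', float(objective.attrib['NominalMagnification']))

# Physical pixel size (microns per pixel) as a fallback.
pixels = root.find('.//ome:Pixels', ns)
if pixels is not None and 'PhysicalSizeX' in pixels.attrib:
    print('microns per pixel:', float(pixels.attrib['PhysicalSizeX']))
```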
See girder/large_image#1074, which uses data exposed a different way but should work.
Oh, OK. I was not aware that you had found a fix for that issue. Then I guess the final thing that remains is being able to upload the corresponding `.ets` files alongside the `.vsi`.
What's the status on this issue? That is, having proper cellSens VSI support during compute with HistomicsTK and the girder worker. @MarkusDrange, who opened issue #257, also needs this to work for his Master's thesis project.
I just tried to run the Nuclei Detection algorithm that exists in HistomicsTK.
I observed this for OS-1.vsi, which is openly available as part of the OpenSlide test data (see here).
I don't know if this affects just the nuclei detection algorithm, but I would guess "all" algorithms do the same trick when processing this format. I also observed the same for my own custom plugin, but it was greatly inspired by the nuclei detection method, hence it is hard to say. I also observed the same on different WSIs in the same cellSens VSI format.
In the log, I can see this:
First of all, both DSA and HistomicsTK render the WSI just fine. It is just inference/analysis with the `.vsi` format that fails.

Second, it seems like DSA fails to parse the magnification/resolution information of the WSI. That is not that surprising; we have the same problem in FastPathology with this format. There we just assume the WSI is 40x if we are unable to parse it, and hope for the best.

However, as you are using bioformats to parse, which QuPath also uses, I would think you should be able to parse the magnification level, no?
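For reference, with large_image the assume-a-default trick from FastPathology could look like this sketch. Assumptions: a tile source that can open `.vsi` (e.g. the bioformats source for large_image) is installed, and the 40x fallback is our heuristic, not DSA behavior:

```python
import large_image

DEFAULT_MAGNIFICATION = 40.0  # assumed when the metadata cannot be parsed

# Requires a source that can open .vsi, e.g. large-image-source-bioformats.
source = large_image.getTileSource('OS-1.vsi')
meta = source.getNativeMagnification()  # keys: magnification, mm_x, mm_y
magnification = meta.get('magnification') or DEFAULT_MAGNIFICATION
print('using magnification:', magnification)
```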
Lastly, since the format handler fails to find the `.ets` file, maybe it assumes the wrong folder structure during analysis? From reading the logs, I would guess it assumed that the `.ets` file would be directly within `_OS-1_` and not one level deeper. The correct structure is like so (the tree below is a reconstruction; only `stack10001/` appears in the logs, and the other names are placeholders):
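```
OS-1.vsi
_OS-1_/
├── stack10001/          <- full-resolution WSI (~512 MB)
│   └── frame_t.ets
└── stack1xxxx/          <- second stack (thumbnail, ~9 MB); placeholder name
    └── frame_t.ets
```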
which I believe is standard for the Olympus format.
Alternatively, perhaps it fails because there are two stacks? If I recall correctly, the two correspond to the WSI (512 MB) and the thumbnail (9 MB) images. Could it be that HistomicsTK accidentally chooses the wrong one? In this case, I believe `stack10001/` corresponds to the WSI (see the sketch below).