
Bug: Nuclei detection on cellSens VSI fails #255

Open
andreped opened this issue Feb 25, 2023 · 19 comments

Comments
@andreped
Contributor

andreped commented Feb 25, 2023

I just tried to run the Nuclei Detection algorithm that exists in HistomicsTK, and it fails. I observed this for OS-1.vsi, which is openly available as part of the OpenSlide test data (see here).

I don't know if this only affects the nuclei detection algorithm, but I would guess "all" algorithms fail the same way when processing this format. I also observed the same for my own custom plugin, but it was heavily inspired by the nuclei detection method, so it is hard to say. I also observed the same behavior on other WSIs in the same cellSens VSI format.

In the log, I can see this:

>> Reading input image ... 

2023-02-25 22:47:21,991 - distributed.core - INFO - Starting established connection to tcp://172.18.0.5:59936
INFO:distributed.core:Starting established connection to tcp://172.18.0.5:59936
INFO:large_image:Using memcached for large_image caching
INFO:large_image:Started JVM for Bioformats tile source.
22:47:24.117 [Thread-0] WARN loci.formats.FormatHandler - Missing expected .ets files in /mnt/girder_worker/678342f3d939490db8b0ce5e893c23a5/_OS-1_
{
  "levels": 1,
  "sizeX": 262,
  "sizeY": 512,
  "tileWidth": 262,
  "tileHeight": 512,
  "magnification": null,
  "mm_x": null,
  "mm_y": null
}

First of all, both DSA and HistomicsTK render the WSI just fine. It is just inference/analysis with the .vsi format that fails.

Second, it seems like DSA fails to parse the magnification/resolution information of the WSI. That is not that surprising; we have the same problem in FastPathology with this format. There we just assume the WSI is 40x if we are unable to parse it, and hope for the best.

However, since you are using bioformats for parsing, which QuPath also uses, I would think you should be able to parse the magnification level, no?

Lastly, since the format handler fails to find the .ets files, maybe it assumes the wrong folder structure during analysis? From reading the logs, I would guess it assumed that the .ets files would be directly within _OS-1_/ and not one level deeper.

The correct structure is like so:

    └── WSI
        ├── OS-1.vsi
        └── _OS-1_/
            ├── stack1/
            │   └── frame_t.ets
            └── stack10001/
                └── frame_t.ets

which I believe is standard for the Olympus format.

Alternatively, perhaps it fails because there are two stacks? If I recall correctly, the two correspond to the WSI (512 MB) and thumbnail (9 MB) images. Could it be that HistomicsTK accidentally chooses the wrong one? In this case, I believe stack10001/ corresponds to the WSI.
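For illustration, a minimal sketch (a hypothetical helper, not part of DSA) that locates the .ets files belonging to a .vsi file, assuming the standard Olympus layout shown above:

from pathlib import Path


def find_ets_files(vsi_path):
    vsi_path = Path(vsi_path)
    # the companion directory is the file stem wrapped in underscores, e.g. _OS-1_
    companion_dir = vsi_path.parent / f"_{vsi_path.stem}_"
    # each stack*/ subdirectory holds a single frame_t.ets file
    return sorted(companion_dir.glob("stack*/frame_t.ets"))


print(find_ets_files("WSI/OS-1.vsi"))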

@andreped andreped changed the title Nuclei detection on cellSens VSI fails Bug: Nuclei detection on cellSens VSI fails Feb 25, 2023
@andreped
Contributor Author

Just mentioning that I am now able to run inference with a deep learning model using my own custom plugin (https://github.com/andreped/FP-dsa-plugin), but the end users mostly have their images in the VSI format. Hence, having good support for the format is essential.

[Screenshot from 2023-02-26 13-02-46]

@manthey
Contributor

manthey commented Feb 28, 2023

Basically, the worker is passed "the" image file, but VSI is a multi-file format, and the worker doesn't have all of it to work with inside the docker. As most of our use to date has been on single-file WSI formats, this hasn't been addressed -- it probably isn't that much work, but the worker would need to be updated to properly pull references to the other files used in the WSI.

We are using the largest of the multiple stacks through the bioformats reader.
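A minimal sketch of what picking the largest stack can look like via python-bioformats (the helper name is hypothetical and this is not the actual large_image code; it assumes a running JVM):

import bioformats as bf


def largest_series(wsi_path):
    # iterate the series exposed by the Bio-Formats reader and
    # return the index of the one with the most pixels
    reader = bf.ImageReader(wsi_path)
    try:
        java_reader = reader.rdr  # underlying loci.formats reader
        best_index, best_pixels = 0, 0
        for i in range(java_reader.getSeriesCount()):
            java_reader.setSeries(i)
            pixels = java_reader.getSizeX() * java_reader.getSizeY()
            if pixels > best_pixels:
                best_index, best_pixels = i, pixels
        return best_index
    finally:
        reader.close()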

I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up needing some code for each format we read.

@andreped
Contributor Author

andreped commented Mar 1, 2023

> Basically, the worker is passed "the" image file, but VSI is a multi-file format, and the worker doesn't have all of it to work with inside the docker. As most of our use to date has been on single-file WSI formats, this hasn't been addressed -- it probably isn't that much work, but the worker would need to be updated to properly pull references to the other files used in the WSI.

That explains it. There exist other WSI formats that also store their data in this kind of folder structure, even among those supported by OpenSlide. Hence, this is a feature I guess others would also welcome, at least if they wish to use the power of HistomicsTK to run algorithms on their own images through the web UI.

> I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up needing some code for each format we read.

As I mentioned, we were also unable to find the magnification level in the cellSens VSI format. However, we made our own custom reader in C++, as Bio-Formats support in C++ is poor. QuPath uses Bio-Formats, and I would think from here one should be able to find how to get the correct magnification level from the cellSens VSI format.

@andreped
Contributor Author

andreped commented Mar 1, 2023

Just tried running the Nuclei Detection algorithm which is part of HistomicsTK on the full CMU-1.tiff WSI from the OpenSlide test data, with default settings and without setting a bounding box or anything.

I observed the same issue as for my own plugin (which is heavily inspired by the built-in NucleiDetection method). The job is successful, but the annotations are neither made available under Annotations nor rendered. The analysis took about 7-8 minutes to run.

This was observed using the Chromium browser on Ubuntu 18.04, on the same machine where DSA is running.

I have attached the full terminal logs from the DSA docker-compose up, from the start of running the nuclei detection method to success (albeit with no stored annotations). I also refreshed to see if the annotations would show up (which is also part of the logs at the end).
nuclei_detection_full_wsi_logs.txt

@manthey
Contributor

manthey commented Mar 1, 2023

Can you look in the actual girder log file (info.log) rather than the docker logs (which are the stdout/stderr of the docker container)? If you used the reference deployment, these will be at devops/dsa/logs/info.log. Otherwise, the web API can be used to fetch parts of the log without knowing where it is stored.

@andreped
Contributor Author

andreped commented Mar 2, 2023

> Can you look in the actual girder log file (info.log)

Below I have shared the girder log file. It seems a GirderException was raised due to "Read exceeds maximum allowed size".
info.log

From this it looks like there is a 16 MB limit when reading a single JSON file. Right now all annotations are stored in a single JSON file (from the plugin), and the same applies to the NucleiDetection example. Was I supposed to store them differently?

@manthey
Contributor

manthey commented Mar 2, 2023

No -- it is an unintended limitation. I've made an issue to track it: girder/large_image#1073, but I may not get to it until next week.

@manthey
Contributor

manthey commented Mar 2, 2023

See girder/large_image#1075 for a fix but not a total solution. Apparently that restriction wasn't there before we made a change to better support S3 assetstores -- it could just use too much memory, though.

@andreped
Contributor Author

andreped commented Mar 3, 2023

> See girder/large_image#1075 for a fix but not a total solution. Apparently that restriction wasn't there before we made a change to better support S3 assetstores -- it could just use too much memory, though.

The solution you suggested makes sense to me. It looks like you are reading the annotations in chunks. I guess if the annotation file itself is extremely large, there could still be memory issues, but it should work fine for my application. Thanks a lot!
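If I understand the idea, it is roughly the following (a minimal sketch; the 16 MB chunk size and the file handling are assumptions, not the actual large_image implementation):

CHUNK_SIZE = 16 * 1024 * 1024  # assumed per-read limit (16 MB)


def read_in_chunks(path):
    # accumulate a large file by issuing many bounded reads
    # instead of a single read() that may exceed the allowed size
    parts = []
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(CHUNK_SIZE)
            if not chunk:
                break
            parts.append(chunk)
    return b"".join(parts)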

Just tried pulling the latest DSA image and it seems to work well! At least for the built-in nuclei detection method in HistomicsTK :] Will try my other custom pipelines now, but I believe they should work. Of course, the method itself does funny stuff in some regions, but I believe it was just there to demonstrate a pipeline.


The large annotation storage reading issue was more related to a separate issue:
DigitalSlideArchive/HistomicsTK#977 (comment)

I therefore closed that issue, but this issue remains open until we are able to parse the magnification level information from the cellSens VSI format.

[Screenshot from 2023-03-03 07-59-02]

@andreped
Contributor Author

andreped commented Mar 3, 2023

Oh, note that after the job is done, it takes quite a long time until the annotation becomes available under Annotations. Is the JSON loading thread-safe? Could loading be sped up by using multiple workers?

Anyways, I ran a deep learning model I have on an image that is about 160k x 120k and packed with nuclei, and at least the annotations are stored and I can render the result :]

[Screenshot from 2023-03-03 08-19-10]

@manthey
Contributor

manthey commented Mar 3, 2023

If you look at the logs you'll see the time it takes for various parts of the annotation json parsing and ingest. Currently, we use the orjson package to read the json. This means that the file and the decoded json are in memory at the same time. Using something like ijson would reduce the memory footprint, but actually be substantially slower (maybe by a factor of 3 to 5 in this stage). We validate the json against the schema -- this step could be parallelized. We then insert the data into the database; this step IS parallelized based on the CPU count.

On my faster desktop, where I ingested ~320,000 nuclei polygons, decoding takes 16s, validation less than 10s, and ingest 115s. On a slower machine, the decoding is 38s, validation 20s, and ingest 141s. The ingest is limited by mongo's speed (we should have used Postgres, but that is a big refactor). I think if an iterative json parser was used, its slowness would be amortized over the multiple threads, but because mongo is currently the limiting factor, looking into how to speed up those inserts would have the most impact on speed (but we already do unsorted inserts, so I'm not sure how to improve things).
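For illustration, a minimal sketch contrasting the two parsing approaches (the file name and the "elements" path are assumptions about a typical annotation document, not the exact large_image code):

import orjson
import ijson

# whole-file decode: the raw bytes and the decoded objects
# are held in memory at the same time
with open("annotation.json", "rb") as f:
    doc = orjson.loads(f.read())
for element in doc["elements"]:
    pass  # validate / insert each element

# iterative decode: yields one element at a time, with a much
# smaller memory footprint, but substantially slower
with open("annotation.json", "rb") as f:
    for element in ijson.items(f, "elements.item"):
        pass  # validate / insert each element as it streams in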

@andreped
Contributor Author

andreped commented Mar 6, 2023

Thank you for sharing your performance benchmarks. They seem similar to what I am observing on the desktop I was testing this on.

If I understand it correctly, all the objects are inserted into the database - that is, every single annotation object with its corresponding boundaries? I guess that is done to make it easy to modify single objects and whatnot, but wouldn't it be a lot faster to simply store the JSON file externally and refer to it from the database? Similar to what is done when importing WSIs to a collection. But I guess for practical reasons, such as annotation work and versioning, it is better to have all objects stored in the database, or am I missing something?

Anyways, I'm fine with the current delay. I'm more interested in having good support for the cellSens VSI format.

@dgutman
Contributor

dgutman commented Mar 6, 2023 via email

@andreped
Contributor Author

andreped commented Mar 6, 2023

> I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up needing some code for each format we read.

@manthey I found a way to parse the magnification level from the cellSens VSI format. See gist.

I parse the OME-XML metadata of the VSI file to get the NominalMagnification values of all the objective lenses used. However, I was surprised to find that for two lenses, the final value was the total magnification, which in this case was 20 (at least according to what QuPath reports). I was expecting the values to be 2 and 10, such that the total magnification was 2 x 10 = 20. I guess that is what they mean by nominal: it seems consecutive lenses report the magnification of the current state given the previous lenses used. Hence, taking the magnification of the final lens (or simply the maximum magnification) should yield the magnification used to capture the image.

For convenience, I also shared the implementation below:

import javabridge as jb
import bioformats as bf
import xmltodict


def get_vsi_magnification(wsi_path):
    # get the OME-XML metadata from the VSI file
    ome_xml = bf.get_omexml_metadata(wsi_path)

    # convert the XML to a nested dict
    metadata = dict(xmltodict.parse(ome_xml))

    # extract objective information (there can be more than one lens/objective)
    objectives = metadata["OME"]["Instrument"]["Objective"]

    # xmltodict returns a single dict (not a list) when there is only one objective
    if isinstance(objectives, dict):
        objectives = [objectives]

    # extract the nominal magnification of each objective
    magnifications = [float(obj["@NominalMagnification"]) for obj in objectives]

    # assumption: the highest magnification value is the total magnification
    return max(magnifications)


if __name__ == "__main__":
    # start the Java virtual machine required by Bio-Formats
    jb.start_vm(class_path=bf.JARS, max_heap_size="4G")

    magn = get_vsi_magnification("/path/to/image.vsi")
    print(magn)

    # shut down the Java virtual machine
    jb.kill_vm()

@manthey
Contributor

manthey commented Mar 7, 2023

See girder/large_image#1074, which uses data exposed in a different way but should work.

@andreped
Contributor Author

andreped commented Mar 8, 2023

> See girder/large_image#1074, which uses data exposed in a different way but should work.

Oh, OK. I was not aware that you had found a fix for that issue. Then I guess the final thing that remains is being able to somehow upload the corresponding _ID_/ directory along with the ID.vsi file.

@andreped
Contributor Author

What's the status on this issue? That is, having proper cellSens VSI support during compute with HistomicsTK and the girder worker.

@MarkusDrange, who made issue #257, also needs this to work for his Master's thesis project.

@andreped
Contributor Author

andreped commented May 4, 2023

Minor bump @manthey @dgutman.

This is a feature of great interest for our project, as most of our WSIs are stored in the .vsi format.
