Inquiry related to the size of the Mimic-CXR V2.0.0 dataset #1474

Debodeep94 · 2023-01-30T20:50:10Z

I am working with the MIMIC-CXR dataset and already have access. The problem is its size: at 4.6 TB, the dataset is difficult to download. Is there a way to work with a subset of the data on a local machine, or do you have any other suggestions? Thank you.

alistairewj · 2023-03-28T18:00:17Z

Maintainer

You can download a subset of records, e.g. just data for 100 subjects. I think clever use of wget ~~with the -A flag should work~~, as the website has an index which lists all the files.

edit: it wasn't -A, it was -I, see below

2 replies

Asaad-Pak Jan 28, 2024

Hi @alistairewj, will the command you mentioned above download the first 100 subjects? Could you share a complete command for downloading a small subset of this data?

alistairewj Feb 10, 2024
Maintainer

Something like this would work:

wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password -I /files/mimic-cxr-jpg/2.0.0/files/p10/p10000032,/files/mimic-cxr-jpg/2.0.0/files/p10/p10000764 https://physionet.org/files/mimic-cxr-jpg/2.0.0/files/p10/

Here the -I option specifies two patients to download data for, but you can extend it to as many as you want. The options I used here are:

-r: Recursive download.
-N: Skip re-downloading a file if the timestamp matches.
-c: Continue getting a partially-downloaded file.
-np: Do not follow links to parent directories.
--user: Specify the username for authentication.
--ask-password: Prompt for a password for authentication.
-I: Include directories in the list of directories to follow.

Note how you need to specify the full path to each directory as it appears after the host name (starting with /files/), otherwise it won't work. I also changed the base URL to point at the p10/ folder, otherwise the include-directories filter won't work.
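
To extend the -I list to more subjects without typing each path by hand, a small shell helper can assemble it. This is a minimal sketch, not part of wget or PhysioNet: build_include_list is a hypothetical function name, and it assumes the folder layout visible in the command above (a top-level folder named p plus the first two digits of the subject ID). The IDs in the usage comment are the two from the command above.

```shell
# Hypothetical helper: turn subject IDs into a comma-separated list for wget -I.
# Assumes each subject lives under files/p<first two digits>/p<subject_id>.
build_include_list() {
  base="/files/mimic-cxr-jpg/2.0.0/files"
  list=""
  for sid in "$@"; do
    group=$(printf '%.2s' "$sid")   # first two digits, e.g. 10000032 -> 10
    list="${list:+$list,}${base}/p${group}/p${sid}"
  done
  printf '%s\n' "$list"
}

# Usage (commented out so that sourcing this file downloads nothing):
# wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password \
#   -I "$(build_include_list 10000032 10000764)" \
#   https://physionet.org/files/mimic-cxr-jpg/2.0.0/files/p10/
```

All subjects passed in one call should share the same top-level folder (p10 here), since the base URL points at that folder.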

tusharagg1 · 2023-11-20T06:31:55Z

"MIMIC-IV-CXR is over 4.7 TB, almost entirely due to the size of the DICOMs. Users should strongly consider not downloading the data, and instead using it within Google Cloud Platform (GCP), which we support natively. GCP does not charge for data transfer within a region in GCP (see this page for more details about network charges)."

I have access to the MIMIC-IV-CXR dataset on Google Cloud Storage, but I am having trouble accessing the data. I am not sure how I can read directly from the dataset stored at https://console.cloud.google.com/storage/browser/mimic-cxr-2.0.0.physionet.org.
Whenever I try to access the dataset, for example from Google Colab, I receive the following error:

$ !gcloud storage ls gs://mimic-cxr-2.0.0.physionet.org/

ERROR: (gcloud.storage.ls) HTTPError 400: Bucket is a requester pays bucket but no user project provided.

$ !gsutil ls gs://mimic-cxr-2.0.0.physionet.org/

BadRequestException: 400 Bucket is a requester pays bucket but no user project provided.

tompollard · 2023-11-20T14:13:10Z

Maintainer

PhysioNet covers storage costs for datasets, but is unable to cover all compute/usage costs for the research community. We therefore use the Requester Pays option on Google Cloud.

The error message you are seeing indicates that the Google Cloud Storage bucket you are trying to access is set up as a "Requester Pays" bucket. This means that the requester (in this case, you) must provide a billing project to be charged for the data access and egress fees.

To fix this error, you need to specify your billing project. With gsutil, add the top-level -u flag followed by your project ID, placed before the ls subcommand:

!gsutil -u [YOUR_PROJECT_ID] ls gs://mimic-cxr-2.0.0.physionet.org/

With gcloud storage, the equivalent is the --billing-project flag:

!gcloud storage ls --billing-project=[YOUR_PROJECT_ID] gs://mimic-cxr-2.0.0.physionet.org/

Replace [YOUR_PROJECT_ID] with your actual Google Cloud project ID. Make sure that the Google Cloud account you're using has the necessary permissions to access the bucket and that billing is enabled for your Google Cloud project. Be aware that cloud compute can sometimes be costly, especially for very large datasets. You can find information on pricing at: https://cloud.google.com/pricing
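
Since every request to a Requester Pays bucket needs the billing project, it can help to wrap gsutil so the flag is never forgotten. The following is a minimal sketch under stated assumptions: gsutil_rp and the BILLING_PROJECT variable are hypothetical names chosen for illustration, not part of gsutil itself.

```shell
# Hypothetical wrapper: run any gsutil command billed to BILLING_PROJECT.
gsutil_rp() {
  if [ -z "${BILLING_PROJECT:-}" ]; then
    echo "Set BILLING_PROJECT to your Google Cloud project ID" >&2
    return 1
  fi
  # -u is a top-level gsutil option, so it goes before the subcommand.
  gsutil -u "$BILLING_PROJECT" "$@"
}

# Usage (commented out; requires gsutil and a billing-enabled project):
# BILLING_PROJECT=YOUR_PROJECT_ID
# gsutil_rp ls gs://mimic-cxr-2.0.0.physionet.org/
```

In a notebook, the function definition and the calls that use it would need to run in the same shell session, because each ! line starts a fresh shell.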

Asaad-Pak · 2024-02-11T13:51:52Z

Hi @alistairewj @tompollard, can I use the split CSV "mimic-cxr-2.0.0-split.csv" to download only the JPEG images that belong to the test set, using wget against the MIMIC-CXR-JPG version? There are almost 5,000 JPEG files. If this is possible, could you share the command?

1 reply

alistairewj Feb 11, 2024
Maintainer

Yes you can. If you want to go that route, then prepare a list of URLs in a text file, and use --input-file:

--input-file=file

Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)

If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.

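If you use this, you don't need the -I flag from the above command, nor the recursive options -r and -np, since every URL in the file points directly at one image. (Lowercase -i is the short form of --input-file.) Something like wget -N -c --user YOUR_PHYSIONET_USERNAME --ask-password --input-file=YOUR_TEXT_FILE would work.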
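
One way to prepare that text file is to generate it from the split CSV. The sketch below rests on assumptions that should be checked against your copy of the data: that the decompressed CSV has the columns dicom_id, study_id, subject_id, split in that order, and that images are stored as files/p<first two digits of subject_id>/p<subject_id>/s<study_id>/<dicom_id>.jpg. The function name split_to_urls and the output file name test_urls.txt are hypothetical.

```shell
# Hypothetical helper: print one image URL per row of the requested split.
# Assumed CSV columns: dicom_id,study_id,subject_id,split (header on line 1).
split_to_urls() {
  csv="$1"
  split="$2"
  awk -F, -v want="$split" '
    { sub(/\r$/, "") }                      # tolerate Windows line endings
    NR > 1 && $4 == want {
      printf "https://physionet.org/files/mimic-cxr-jpg/2.0.0/files/p%s/p%s/s%s/%s.jpg\n",
             substr($3, 1, 2), $3, $2, $1
    }
  ' "$csv"
}

# Usage (commented out so that sourcing this file downloads nothing):
# split_to_urls mimic-cxr-2.0.0-split.csv test > test_urls.txt
# wget -N -c --user YOUR_PHYSIONET_USERNAME --ask-password --input-file=test_urls.txt
```

By default wget places every file from the list in the current directory; adding -x (--force-directories) should recreate the server's folder structure locally instead.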
