ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB when adding image to Dataset #4796

Open
NielsRogge opened this issue Aug 5, 2022 · 18 comments · May be fixed by #4828
Labels: bug (Something isn't working)

@NielsRogge (Contributor) commented Aug 5, 2022

Describe the bug

When adding a Pillow image to an existing Dataset on the hub, add_item fails due to the Pillow image not being automatically converted into the Image feature.

Steps to reproduce the bug

from datasets import load_dataset
from PIL import Image

dataset = load_dataset("hf-internal-testing/example-documents")

# load any random Pillow image
image = Image.open("/content/cord_example.png").convert("RGB")

new_image = {'image': image}
dataset['test'] = dataset['test'].add_item(new_image)

Expected results

The image should be automatically cast to the Image feature when using add_item. For now, this can be fixed by using encode_example:

import datasets

feature = datasets.Image(decode=False)
new_image = {'image': feature.encode_example(image)}
dataset['test'] = dataset['test'].add_item(new_image)

Actual results

ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB size=576x864 at 0x7F7CCC4589D0> with type Image: did not recognize Python value type when inferring an Arrow data type
NielsRogge added the bug label Aug 5, 2022
mariosasko self-assigned this Aug 8, 2022
mariosasko linked a pull request Aug 11, 2022 that will close this issue
@NielsRogge (Contributor, Author) commented Aug 12, 2022

@mariosasko I'm getting a similar issue when creating a Dataset from a Pandas dataframe, like so:

from datasets import Dataset, Features, Image, Value
import pandas as pd
import requests
import PIL

# we need to define the features ourselves
features = Features({
    'a': Value(dtype='int32'),
    'b': Image(),
})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)

df = pd.DataFrame({"a": [1, 2], 
                   "b": [image, image]})

dataset = Dataset.from_pandas(df, features=features) 

results in

ArrowInvalid: ('Could not convert <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F7991A15C10> with type JpegImageFile: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')

Will the PR linked above also fix that?

@mariosasko (Collaborator)

I would expect this to work, but it doesn't. Shouldn't be too hard to fix tho (in a subsequent PR).

@darraghdog

Hi @mariosasko, just wanted to check in to see if there is a PR to follow for this. I was looking to create a demo app using this. If it's not working, I can just use byte-encoded images in the dataset, which are not displayed.
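
For reference, here is a minimal sketch of that byte-encoding fallback (my own illustration, not darraghdog's code): the images are serialized to PNG bytes by hand and stored in a plain binary column, so the Image feature is bypassed entirely. The column name image_bytes and the file name example.png are assumptions.

import io

from datasets import Dataset, Features, Value
from PIL import Image as PILImage

def pil_to_png_bytes(img):
    # Serialize a PIL image to raw PNG bytes.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

image = PILImage.open("example.png").convert("RGB")
features = Features({"image_bytes": Value("binary")})
ds = Dataset.from_dict({"image_bytes": [pil_to_png_bytes(image)]}, features=features)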

@mariosasko (Collaborator)

Hi @darraghdog! No PR yet, but I plan to fix this before the next release.

@stas00 (Contributor) commented Sep 20, 2022

I was just pointed here by @mariosasko, meanwhile I found a workaround using encode_example like so:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(
    mapping={k: [] for k in ds1[99].keys()},
    features=ds1.features,
)
for i in range(2):
    # could add several representative items here
    row = ds1[99]
    row_encoded = ds2.features.encode_example(row)
    ds2 = ds2.add_item(row_encoded)

@stas00 (Contributor) commented Sep 20, 2022

Hmm, interesting. If I create the dataset on the fly:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [v]*2 for k, v in ds1[99].items()},
                        features=ds1.features)

it doesn't fail with the error in the OP, as from_dict performs encode_batch.

However if I try to use this dataset it fails now with:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2775, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "debug_leak2.py", line 235, in split_pack_and_pad
    images.append(image_transform(image.convert("RGB")))
AttributeError: 'dict' object has no attribute 'convert'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "debug_leak2.py", line 418, in <module>
    train_loader, val_loader = get_dataloaders()
  File "debug_leak2.py", line 348, in get_dataloaders
    dataset = dataset.map(mapper, batch_size=32, batched=True, remove_columns=dataset.column_names, num_proc=4)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2500, in map
    transformed_shards[index] = async_result.get()
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'dict' object has no attribute 'convert'

but if I create that same dataset one item at a time as in the previous comment's code snippet it doesn't fail.

The features of this dataset are set to:

{'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 
'images': Sequence(feature=Image(decode=True, id=None), length=-1, id=None)}
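
One defensive pattern for that AttributeError (a sketch of mine, not something from this thread) is to decode the bytes/path dict back into a PIL image before calling .convert(). The helper name to_pil and the variable item are illustrative:

import PIL.Image
from datasets import Image

def to_pil(item):
    # Already a decoded PIL image: nothing to do.
    if isinstance(item, PIL.Image.Image):
        return item
    # Undecoded {"bytes": ..., "path": ...} dict: decode it with the Image feature.
    return Image().decode_example(item)

# e.g. inside the map function:
# images.append(image_transform(to_pil(image).convert("RGB")))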

@MaxxTr commented Feb 12, 2023

(Quoting @NielsRogge's earlier comment above, where Dataset.from_pandas on a DataFrame containing PIL images fails with the same ArrowInvalid error.)

It looks like the problem still exists.
Any news? Any good workaround?

Thank you

@MaxxTr commented Feb 13, 2023

There is a workaround:
Create a dataset loading script in Python and upload the dataset to the Hugging Face Hub.

Here is an example of how to do that:

https://huggingface.co/datasets/jamescalam/image-text-demo/tree/main

and here are videos with explanations:

https://www.youtube.com/watch?v=lqK4ocAKveE and https://www.youtube.com/watch?v=ODdKC30dT8c
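
For the record, here is a minimal sketch of what such a loading script can look like, loosely following the linked repo; the class name ImageTextDemo, the metadata.csv file, and the image/text columns are illustrative assumptions, not details from this thread. Yielding file paths for the image column lets the declared Image feature handle the encoding, which sidesteps the PIL-conversion error above.

import os

import datasets
import pandas as pd

class ImageTextDemo(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # "data" is an illustrative directory shipped alongside the script.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"data_dir": "data"}
            )
        ]

    def _generate_examples(self, data_dir):
        metadata = pd.read_csv(os.path.join(data_dir, "metadata.csv"))
        for idx, row in metadata.iterrows():
            # Yield a file path; the Image feature encodes it when the dataset is built.
            yield idx, {
                "image": os.path.join(data_dir, row["file_name"]),
                "text": row["text"],
            }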

@NielsRogge (Contributor, Author)

cc @mariosasko gentle ping for a fix :)

@chumpblocckami

Any update on this? I'm still facing this issue. Any workaround?

@umarpreet1

I was facing the same issue. Downgrading datasets from 2.11.0 to 2.4.0 solved the issue.

mariosasko added this to the 3.0 milestone Apr 12, 2023
@chumpblocckami commented Apr 13, 2023

Any update on this? I'm still facing this issue. Any workaround?

I was able to resolve my issue with a quick workaround:

from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm

data = defaultdict(list)
for idx in tqdm(range(len(dataloader)), desc="Captioning..."):
    img = dataloader[idx]
    data['image'].append(img)
    data['text'].append(f"img_{idx}")

dataset = Dataset.from_dict(data)
dataset = dataset.filter(lambda example: example['image'] is not None)
dataset = dataset.filter(lambda example: example['text'] is not None)

dataset.push_to_hub('path-to-repo', private=False)

Hope it helps!
Happy coding

@thinh-huynh-re

(Quoting @chumpblocckami's workaround above.)

It works!!

@LanceGao97

How did this work? How do you use this script, and where do you paste it?

@Ekhao commented Apr 11, 2024

I had a similar issue to @NielsRogge where I was unable to create a dataset from a Pandas DataFrame containing PIL.Images.

I found another workaround that works in this case, which involves converting the DataFrame to a Python dictionary and then creating a dataset from that dictionary.

This is a generic example of my workaround. The example assumes that your data is in a Pandas DataFrame variable called "dataframe" and that your data's features are defined in a variable called "features".

import datasets

dictionary = dataframe.to_dict(orient='list')
dataset = datasets.Dataset.from_dict(dictionary, features=features)

@NielsRogge (Contributor, Author)

cc @mariosasko this issue has been open for 2 years, would be great to resolve it :)

@tanyav2 commented Apr 21, 2024

I have the same issue; my current workaround is saving the DataFrame to a CSV and then loading the dataset from the CSV. Would also appreciate a fix :)
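
For completeness, a minimal sketch of that CSV round-trip (my assumption of it, not tanyav2's exact code): it only works if the image column holds file paths rather than PIL objects, since PIL images cannot be written to a CSV; cast_column then turns the path column into an Image feature. The column names and file names are illustrative.

import pandas as pd
from datasets import Image, load_dataset

# Assumes the "image" column contains file paths, not PIL objects.
df = pd.DataFrame({"image": ["img_0.png", "img_1.png"], "text": ["a", "b"]})
df.to_csv("data.csv", index=False)

dataset = load_dataset("csv", data_files="data.csv", split="train")
dataset = dataset.cast_column("image", Image())  # paths are now decoded to PIL images on access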

@hunter2009pf

data = defaultdict(list)

awesome, it really works~
