ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB when adding image to Dataset #4796

Open
NielsRogge opened this issue Aug 5, 2022 · 18 comments · May be fixed by #4828
Labels: bug (Something isn't working)

@NielsRogge (Contributor) commented Aug 5, 2022

Describe the bug

When adding a Pillow image to an existing Dataset on the hub, add_item fails due to the Pillow image not being automatically converted into the Image feature.

Steps to reproduce the bug

from datasets import load_dataset
from PIL import Image

dataset = load_dataset("hf-internal-testing/example-documents")

# load any random Pillow image
image = Image.open("/content/cord_example.png").convert("RGB")

new_image = {'image': image}
dataset['test'] = dataset['test'].add_item(new_image)

Expected results

The image should be automatically cast to the Image feature when using add_item. For now, this can be fixed by using encode_example:

import datasets

feature = datasets.Image(decode=False)
new_image = {'image': feature.encode_example(image)}
dataset['test'] = dataset['test'].add_item(new_image)

Actual results

ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB size=576x864 at 0x7F7CCC4589D0> with type Image: did not recognize Python value type when inferring an Arrow data type
NielsRogge added the bug label Aug 5, 2022
mariosasko self-assigned this Aug 8, 2022
mariosasko linked a pull request Aug 11, 2022 that will close this issue
@NielsRogge (Contributor, Author) commented Aug 12, 2022

@mariosasko I'm getting a similar issue when creating a Dataset from a Pandas dataframe, like so:

from datasets import Dataset, Features, Image, Value
import pandas as pd
import requests
import PIL

# we need to define the features ourselves
features = Features({
    'a': Value(dtype='int32'),
    'b': Image(),
})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)

df = pd.DataFrame({"a": [1, 2], 
                   "b": [image, image]})

dataset = Dataset.from_pandas(df, features=features) 

results in

ArrowInvalid: ('Could not convert <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F7991A15C10> with type JpegImageFile: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')

Will the PR linked above also fix that?

@mariosasko (Collaborator)

I would expect this to work, but it doesn't. Shouldn't be too hard to fix tho (in a subsequent PR).

@darraghdog

Hi @mariosasko, just wanted to check in to see if there is a PR to follow for this. I was looking to create a demo app using this. If it's not working, I can just use byte-encoded images in the dataset, which are not displayed.
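
For reference, here is a minimal sketch of that byte-encoding fallback (my own illustration, not darraghdog's code): the images are serialized to PNG bytes by hand and stored in a plain binary column, so the Image feature is bypassed entirely. The column name image_bytes and the file name example.png are assumptions.

import io

from datasets import Dataset, Features, Value
from PIL import Image as PILImage

def pil_to_png_bytes(img):
    # Serialize a PIL image to raw PNG bytes.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

image = PILImage.open("example.png").convert("RGB")
features = Features({"image_bytes": Value("binary")})
ds = Dataset.from_dict({"image_bytes": [pil_to_png_bytes(image)]}, features=features)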

@mariosasko (Collaborator)

Hi @darraghdog! No PR yet, but I plan to fix this before the next release.

@stas00 (Contributor) commented Sep 20, 2022

I was just pointed here by @mariosasko, meanwhile I found a workaround using encode_example like so:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(
    mapping={k: [] for k in ds1[99].keys()},
    features=ds1.features,
)
for i in range(2):
    # could add several representative items here
    row = ds1[99]
    row_encoded = ds2.features.encode_example(row)
    ds2 = ds2.add_item(row_encoded)

@stas00 (Contributor) commented Sep 20, 2022

Hmm, interesting. If I create the dataset on the fly:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [v]*2 for k, v in ds1[99].items()},
                        features=ds1.features)

it doesn't fail with the error in the OP, as from_dict performs encode_batch.

However if I try to use this dataset it fails now with:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2775, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "debug_leak2.py", line 235, in split_pack_and_pad
    images.append(image_transform(image.convert("RGB")))
AttributeError: 'dict' object has no attribute 'convert'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "debug_leak2.py", line 418, in <module>
    train_loader, val_loader = get_dataloaders()
  File "debug_leak2.py", line 348, in get_dataloaders
    dataset = dataset.map(mapper, batch_size=32, batched=True, remove_columns=dataset.column_names, num_proc=4)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2500, in map
    transformed_shards[index] = async_result.get()
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'dict' object has no attribute 'convert'

but if I create that same dataset one item at a time as in the previous comment's code snippet it doesn't fail.

The features of this dataset are set to:

{'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 
'images': Sequence(feature=Image(decode=True, id=None), length=-1, id=None)}
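
One defensive pattern for that AttributeError (a sketch of mine, not something from this thread) is to decode the bytes/path dict back into a PIL image before calling .convert(). The helper name to_pil and the variable item are illustrative:

import PIL.Image
from datasets import Image

def to_pil(item):
    # Already a decoded PIL image: nothing to do.
    if isinstance(item, PIL.Image.Image):
        return item
    # Undecoded {"bytes": ..., "path": ...} dict: decode it with the Image feature.
    return Image().decode_example(item)

# e.g. inside the map function:
# images.append(image_transform(to_pil(image).convert("RGB")))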

@MaxxTr commented Feb 12, 2023

(Quoting @NielsRogge's earlier comment above, where Dataset.from_pandas on a DataFrame containing PIL images fails with the same ArrowInvalid error.)

It looks like the problem still exists.
Any news? Any good workaround?

Thank you

@MaxxTr commented Feb 13, 2023

There is a workaround:
Create a dataset loading script in Python and upload the dataset to the Hugging Face Hub.

Here is an example of how to do that:

https://huggingface.co/datasets/jamescalam/image-text-demo/tree/main

and here are videos with explanations:

https://www.youtube.com/watch?v=lqK4ocAKveE and https://www.youtube.com/watch?v=ODdKC30dT8c
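
For the record, here is a minimal sketch of what such a loading script can look like, loosely following the linked repo; the class name ImageTextDemo, the metadata.csv file, and the image/text columns are illustrative assumptions, not details from this thread. Yielding file paths for the image column lets the declared Image feature handle the encoding, which sidesteps the PIL-conversion error above.

import os

import datasets
import pandas as pd

class ImageTextDemo(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # "data" is an illustrative directory shipped alongside the script.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"data_dir": "data"}
            )
        ]

    def _generate_examples(self, data_dir):
        metadata = pd.read_csv(os.path.join(data_dir, "metadata.csv"))
        for idx, row in metadata.iterrows():
            # Yield a file path; the Image feature encodes it when the dataset is built.
            yield idx, {
                "image": os.path.join(data_dir, row["file_name"]),
                "text": row["text"],
            }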

@NielsRogge (Contributor, Author)

cc @mariosasko gentle ping for a fix :)

@chumpblocckami

Any update on this? I'm still facing this issue. Any workaround?

@umarpreet1

I was facing the same issue. Downgrading datasets from 2.11.0 to 2.4.0 solved the issue.

mariosasko added this to the 3.0 milestone Apr 12, 2023
@chumpblocckami commented Apr 13, 2023

Any update on this? I'm still facing this issue. Any workaround?

I was able to resolve my issue with a quick workaround:

from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm

data = defaultdict(list)
for idx in tqdm(range(len(dataloader)), desc="Captioning..."):
    img = dataloader[idx]
    data['image'].append(img)
    data['text'].append(f"img_{idx}")

dataset = Dataset.from_dict(data)
dataset = dataset.filter(lambda example: example['image'] is not None)
dataset = dataset.filter(lambda example: example['text'] is not None)

dataset.push_to_hub('path-to-repo', private=False)

Hope it helps!
Happy coding

@thinh-huynh-re

(Quoting @chumpblocckami's workaround above.)

It works!!

@LanceGao97

How did this work? How do you use this script, and where do you paste it?

@Ekhao commented Apr 11, 2024

I had a similar issue to @NielsRogge where I was unable to create a dataset from a Pandas DataFrame containing PIL.Images.

I found another workaround that works in this case, which involves converting the DataFrame to a Python dictionary and then creating a dataset from that dictionary.

This is a generic example of my workaround. The example assumes that your data is in a Pandas DataFrame variable called "dataframe" and that your data's features are defined in a variable called "features".

import datasets

dictionary = dataframe.to_dict(orient='list')
dataset = datasets.Dataset.from_dict(dictionary, features=features)

@NielsRogge (Contributor, Author)

cc @mariosasko this issue has been open for 2 years, would be great to resolve it :)

@tanyav2 commented Apr 21, 2024

I have the same issue; my current workaround is saving the DataFrame to a CSV and then loading the dataset from the CSV. Would also appreciate a fix :)
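
For completeness, a minimal sketch of that CSV round-trip (my assumption of it, not tanyav2's exact code): it only works if the image column holds file paths rather than PIL objects, since PIL images cannot be written to a CSV; cast_column then turns the path column into an Image feature. The column names and file names are illustrative.

import pandas as pd
from datasets import Image, load_dataset

# Assumes the "image" column contains file paths, not PIL objects.
df = pd.DataFrame({"image": ["img_0.png", "img_1.png"], "text": ["a", "b"]})
df.to_csv("data.csv", index=False)

dataset = load_dataset("csv", data_files="data.csv", split="train")
dataset = dataset.cast_column("image", Image())  # paths are now decoded to PIL images on access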

@hunter2009pf

data = defaultdict(list)

awesome, it really works~
