# Pass Image color channels information to Transformers #2846
base: master
Conversation
Background: In Hugging Face Transformers' image processors, e.g. CLIPImageProcessor, the constructor accepts an input_data_format argument, which states whether the image's color channels sit in the first or the last position of its shape. For example, if an image's shape is (512, 512, 3), its resolution is 512*512 pixels and it has 3 RGB color channels; in this case input_data_format is ImageChannelDimension.LAST (ChannelDimension.LAST in Transformers).
Sometimes people use a customized image format with shape (3, 512, 512) for performance purposes. Transformers requires users to specify this, or it tries to infer it from the shape. Generally, an image has 1 or 3 color channels, representing grayscale or RGB, so the inference algorithm in Transformers looks for the values 1 or 3 in the image's shape. If your input images have shape (3, xxx, 1) or (1, xxx, 3), the inference algorithm gets confused and raises the following errors: 'The channel dimension is ambiguous. Got image shape (1, xxx, 3). Assuming channels are the first dimension.' 'ValueError: mean must have 1 elements if it is an iterable, got 3'
Fix: 1. Add a class ImageChannelDimension defining the 2 possible positions of the color channels in an image's shape. 2. Accept this information in the model.encode method and pass it through to the tokenizer and Transformers' image processor.
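The inference behavior described above can be sketched as follows. This is a simplified reconstruction based on the error messages quoted in the description, not the actual Transformers implementation; the function name is made up for illustration:

```python
def infer_channel_dimension(shape):
    """Simplified sketch: look for a 1 or 3 in the first and last
    positions of the image shape, as described above."""
    first_is_channel = shape[0] in (1, 3)
    last_is_channel = shape[-1] in (1, 3)
    if first_is_channel and last_is_channel:
        # Ambiguous case, as in the warning quoted above; Transformers then
        # assumes channels-first, which can later trigger the mean ValueError.
        print(f"The channel dimension is ambiguous. Got image shape {shape}. "
              "Assuming channels are the first dimension.")
        return "channels_first"
    if first_is_channel:
        return "channels_first"
    if last_is_channel:
        return "channels_last"
    raise ValueError(f"Unable to infer channel dimension from shape {shape}")

print(infer_channel_dimension((512, 512, 3)))  # channels_last
print(infer_channel_dimension((3, 512, 512)))  # channels_first
print(infer_channel_dimension((1, 512, 3)))    # ambiguous -> channels_first
```

The explicit image_channel_dimension parameter added by this PR lets callers bypass this guesswork entirely.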
With the fix, we can tokenize an image in the shape of (xxx, xxx, 3) like:
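The snippet that followed here was posted as an image in the original thread; below is a hedged reconstruction based on the PR description. The encode signature follows this PR's branch (not a released API), so the call itself is shown as a comment:

```python
import numpy as np

# Stand-in for the class this PR adds; the string values mirror
# Transformers' ChannelDimension.
class ImageChannelDimension:
    FIRST = "channels_first"  # shape (C, H, W)
    LAST = "channels_last"    # shape (H, W, C)

image = np.random.rand(512, 512, 3)  # channels-last RGB image
# Hypothetical call on this PR's branch:
# embeddings = model.encode([image],
#                           image_channel_dimension=ImageChannelDimension.LAST)
```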
And tokenize an image in the shape of (3, xxx, xxx) like:
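Again as a hedged reconstruction of the image posted in the thread, the channels-first variant:

```python
import numpy as np

# Transpose a (H, W, C) image into the channels-first (C, H, W) layout.
image = np.transpose(np.random.rand(512, 512, 3), (2, 0, 1))
print(image.shape)  # (3, 512, 512)
# Hypothetical call on this PR's branch:
# embeddings = model.encode([image],
#                           image_channel_dimension=ImageChannelDimension.FIRST)
```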
@tomaarsen Hi Tom, would you help to check my PR? Thank you.
@fpgmaas Hi Florian, would you take some time to review my PR? Thank you.
@@ -28,6 +28,12 @@
from sentence_transformers.cross_encoder.CrossEncoder import CrossEncoder
from sentence_transformers.SentenceTransformer import SentenceTransformer

class ImageChannelDimension():
This should probably be an Enum
This class is copied from Transformers' repo, where it is defined like this, because the strings defined in the class are needed by Transformers' image processor. If we use an Enum, I think we'd get integer values, and we'd need to convert them to strings before passing them to Transformers?
Maybe
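For what it's worth, an Enum can keep string values by mixing in str; this is a sketch of what the reviewer may have had in mind (an assumption on my part, though it mirrors how Transformers itself defines string-backed enums):

```python
from enum import Enum

# A str-mixin Enum keeps the string values the Transformers image
# processor expects, so no conversion is needed before passing it on.
class ImageChannelDimension(str, Enum):
    FIRST = "channels_first"
    LAST = "channels_last"

print(ImageChannelDimension.LAST == "channels_last")  # True
print(ImageChannelDimension.LAST.value)               # channels_last
```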
Hey @davychxn , I am not a maintainer of the project and I lack in-depth knowledge to judge your proposed changes, so I'm afraid I cannot approve the PR. However, I left a few review comments anyway to help the PR along.
1. Added a docstring for the newly added 'image_channel_dimension' parameter of the 'encode' function. 2. Changed the parameter's name from 'input_data_format' to 'image_channel_dimension'.
@fpgmaas Fixed.
Thank you for your great help, Florian. I'll try to reach Tom separately.
1. To make the 'tokenize' interface compatible between texts and images.
Hi @tomaarsen , would you review my change? Are there any improvements we'll need here? Thank you.
This PR is needed for writing code in this project: https://github.com/Immortalise/SearchAnything I submitted the initial fix as a PR to Transformers, but @amyeroberts suggested fixing it in this repo instead. The closed PR is: huggingface/transformers#31950