
avfilter/tonemapx: fix more range handling #477

Merged
gnattu merged 1 commit into jellyfin from fix-more-range-tonemapx on Oct 9, 2024

Conversation

gnattu
Member

@gnattu gnattu commented Oct 9, 2024

It appears that the fix for range handling in #472 was incomplete, and there was a misunderstanding about the use of the mysterious 28672. This value wasn't related to limited range; rather, it is an implementation borrowed from vf_colorspace. The decision not to use the full int16 range in that code was made to leave headroom for overflows and underflows, which I also ran into during debugging. While this approach works for the upstream conversion, it's not straightforward and makes it impossible to use the CPU's saturating instructions for efficient clamping.
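
For illustration, a minimal sketch of the clamping difference (not the filter's actual code; the helper name is made up):

```c
#include <stdint.h>

/* With the vf_colorspace-style scale, 28672 (= 0.875 * 32768) is the top of
 * the representable range, so clamping needs explicit compares: */
static inline int16_t clamp_headroom(int32_t v)
{
    return (int16_t)(v < 0 ? 0 : (v > 28672 ? 28672 : v));
}

/* When values are scaled to 0..32767 instead, the upper bound coincides with
 * INT16_MAX, so a saturating int32 -> int16 narrow (e.g. _mm_packs_epi32 on
 * SSE or vqmovn_s32 on NEON) clamps the overflow side in a single instruction;
 * only negative underflow still needs a max-with-zero afterwards. */
```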

Also, even though limited-range 10-bit videos generally use the range 64-940 for luma and 64-960 for chroma, some videos extend below black (sub-blacks) and above the nominal peak (super-whites), producing signals ranging from 4 to 1019. These values are still considered TV range but can cause underflows and overflows if not handled carefully. The Samsung video sample we've tested carries signals in this extended limited range.
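
For reference, the nominal vs. extended 10-bit limits (illustrative constants, not filter code):

```c
/* Nominal 10-bit limited range vs. the extended signal seen in the sample: */
enum {
    BLACK_10    = 64,   /* nominal black            */
    PEAK_Y_10   = 940,  /* nominal luma peak        */
    SUB_BLACK   = 4,    /* sub-black sample value   */
    SUPER_WHITE = 1019  /* super-white sample value */
};

/* After black-level subtraction a sub-black sample goes negative
 * (4 - 64 = -60) and a super-white one exceeds the nominal span
 * (1019 - 64 = 955 vs. 940 - 64 = 876), so intermediates must either be wide
 * enough to hold the excursion or be saturated rather than allowed to wrap. */
```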

To fix these issues, this PR implements a more direct approach: all input values, whether limited or full range, are scaled to a 0-32767 range after the YUV-to-RGB conversion, and the reverse function always expects the 0-32767 range. Previously, the scaling for limited range was set incorrectly, leading to over-bright outputs when converting from full to limited range and to overflow on the output end. Overflow and underflow are handled naturally in the AVX and SSE code paths once the range scaling is fixed, but significant changes were needed for the NEON code path.
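
A sketch of the intended normalization (names and rounding are illustrative assumptions, not the filter's code):

```c
/* Both limited- and full-range input end up on a 0..32767 scale after
 * yuv2rgb; rgb2yuv applies the inverse factor. */
static double luma_scale_to_15bit(int depth, int limited)
{
    const int maxval = (1 << depth) - 1;                       /* 1023 for 10-bit */
    const int black  = limited ? 16  << (depth - 8) : 0;       /* 64              */
    const int peak   = limited ? 235 << (depth - 8) : maxval;  /* 940             */
    return 32767.0 / (peak - black);  /* ~37.4 for 10-bit limited, ~32.0 for full */
}
/* The output side uses the inverse, (peak - black) / 32767.0, which is what
 * removes the over-bright full -> limited results and the output-end overflow. */
```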

Previously, the NEON implementation used half-width int16 to perform the YUV-to-RGB conversion. This no longer works with the updated coefficients and overflows in cases like extended limited-range signals. Now, YUV-to-RGB operations are performed with full-width int32 values, as in the AVX and SSE paths, and saturated back to int16. This makes the NEON code path slightly slower, but the performance impact on a fast CPU (M1) is marginal (~3% on average, with a worst case of ~5%), which should be acceptable.
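
A minimal sketch of the widened NEON idea (the fixed-point shift and helper name are assumptions here, not the actual kernel):

```c
#include <arm_neon.h>
#include <stdint.h>

#define FP_SHIFT 14  /* assumed fixed-point shift, for illustration only */

/* Widen the int16 Y/U/V lanes to int32, accumulate against int32 coefficients,
 * then do a saturating rounding shift-narrow so overflow and underflow are
 * clamped back into the int16 range in a single instruction. */
static inline int16x4_t yuv2rgb_channel(int16x4_t y, int16x4_t u, int16x4_t v,
                                        int32_t cy, int32_t cu, int32_t cv)
{
    int32x4_t acc = vmulq_n_s32(vmovl_s16(y), cy);   /* full-width Y term   */
    acc = vmlaq_n_s32(acc, vmovl_s16(u), cu);        /* + full-width U term */
    acc = vmlaq_n_s32(acc, vmovl_s16(v), cv);        /* + full-width V term */
    return vqrshrn_n_s32(acc, FP_SHIFT);             /* saturate to int16   */
}
```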

Additionally, the coefficient matrix now uses full-width int32, because int16 can't hold the matrix values when scaling limited-range input up to 32767. Most code paths already interpret these values as full-width integers, so there's no observable performance impact and, in theory, only a slight increase in memory usage.
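
Rough arithmetic behind that choice (the fixed-point layout here is an assumption, not the filter's actual format):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Mapping 10-bit limited-range luma (64..940, i.e. 876 codes) onto
     * 0..32767 needs a gain of about 32767 / 876 ~= 37.4. */
    const double gain   = 32767.0 / (940 - 64);
    const int    q_bits = 10;                    /* hypothetical fractional bits */
    const long   fixed  = (long)(gain * (1 << q_bits) + 0.5);

    /* Prints roughly 38300, which already exceeds INT16_MAX (32767),
     * hence the move to an int32 coefficient matrix. */
    printf("coefficient = %ld, INT16_MAX = %d\n", fixed, INT16_MAX);
    return 0;
}
```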

Changes

  • use full int32 for neon yuv2rgb and the matrix
  • fix range scaling in yuv2rgb/rgb2yuv matrix derivation
  • clean up some unused branching after p016 support removal

Issues

@gnattu gnattu requested a review from a team October 9, 2024 04:14
@gnattu gnattu force-pushed the fix-more-range-tonemapx branch from d02fa6b to eecdbb4 on October 9, 2024 05:57
@gnattu gnattu force-pushed the fix-more-range-tonemapx branch from eecdbb4 to 2fdd1c5 on October 9, 2024 06:13
@nyanmisaka
Member

         * General design:
         * - yuv2rgb converts from whatever range the input was ([16-235/240] or
         *   [0,255] or the 10/12bpp equivalents thereof) to an integer version
         *   of RGB in pseudo-restricted 15+sign bits. That means that the float
         *   range [0.0,1.0] is in [0,28672], and the remainder of the int16_t
         *   range is used for overflow/underflow outside the representable
         *   range of this RGB type. rgb2yuv is the exact opposite.

Now that I remember, I had similar overflows with my custom impl before switching to the one in vf_colorspace.

@nyanmisaka
Member

Just a nitpick, one drawback of our vf_tonemapx (420p*) compared to the original vf_tonemap (gbrp*) is that we don't use a more proper upscaling method for 4:2:0 chroma -> 4:4:4 chroma but simply do AVG. This may cause aliasing, which is more noticeable in red. Might be worth improving in the future.

@gnattu
Member Author

gnattu commented Oct 9, 2024

Just a nitpick, one drawback of our vf_tonemapx (420p*) compared to the original vf_tonemap (gbrp*) is that we don't use a more proper upscaling method for 4:2:0 chroma -> 4:4:4 chroma but simply do AVG. This may cause aliasing, which is more noticeable in red. Might be worth improving in the future.

But the upstream one is also using averaging?

@gnattu gnattu merged commit ac98a44 into jellyfin Oct 9, 2024
27 checks passed
@gnattu gnattu deleted the fix-more-range-tonemapx branch October 9, 2024 13:18
@nyanmisaka
Member

Just a nitpick, one drawback of our vf_tonemapx (420p*) compared to the original vf_tonemap (gbrp*) is that we don't use a more proper upscaling method for 4:2:0 chroma -> 4:4:4 chroma but simply do AVG. This may cause aliasing, which is more noticeable in red. Might be worth improving in the future.

But the upstream one is also using averaging?

Oh my bad, it's not the AVG in the YUV output but the YUV input - four luma samples share the same chroma sample without any interp method.

But the yuv420p10 -> gbrpf32 conversion is done before vf_tonemap, which includes a chroma upscaling step.

It's noticeable on the edge of the red lines. Video players usually apply an upscaling algorithm so it's only noticeable in the browser.

@gnattu
Member Author

gnattu commented Oct 9, 2024

Oh my bad, it's not the AVG in the YUV output but the YUV input - four luma samples share the same chroma sample without any interp method.

I know zimg can use bilinear interpolation, but I'm a bit concerned about doing something like this in our software filter as the performance penalty could be high, and our GPU implementation is already using some sort of averaging sampler which performs similarly to bilinear interpolation.

@nyanmisaka
Member

Oh my bad, it's not the AVG in the YUV output but the YUV input - four luma samples share the same chroma sample without any interp method.

I know zimg can use bilinear interpolation, but I'm a bit concerned about doing something like this in our software filter as the performance penalty could be high, and our GPU implementation is already using some sort of averaging sampler which performs similarly to bilinear interpolation.

This is what I am worried about: there is no free image sampler on the CPU. For now we may have to deal with it. Our GPU impl uses no interp for Y and linear samplers for UV. That's why it looks okayish.
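
For reference, a minimal sketch of what a linear chroma upsample could look like on the CPU (simplified siting and edge handling; purely illustrative, not a proposal for the filter's code):

```c
#include <stdint.h>

/* Upsample one chroma row 2x horizontally with linear interpolation; a
 * vertical pass would work the same way. The current filter instead lets
 * neighbouring luma positions reuse the same chroma sample. */
static void chroma_upsample_row_2x(const uint16_t *src, uint16_t *dst, int src_w)
{
    for (int x = 0; x < src_w; x++) {
        const uint16_t c0 = src[x];
        const uint16_t c1 = src[x + 1 < src_w ? x + 1 : x]; /* clamp at edge   */
        dst[2 * x]     = c0;                                /* co-sited sample */
        dst[2 * x + 1] = (uint16_t)((c0 + c1 + 1) >> 1);    /* halfway sample  */
    }
}
```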
