[Performance] Convert() with chroma scale in single pass #354
Comments
In my opinion introducing cache control or thread and core number dependency is not something which should be introduced to Avisynth core just because of a YV12 to RGB32 conversion (or in general any other conversion). Way too special, and though it can be parameterized it may behave differently in each and every different (and real-world) script, memory type and processor configuration. |
Yes. A second, closely related question: how do I make a colorspace conversion plugin? I tried to make a tech demo of single-pass YV12 to RGB24 conversion but am currently stuck at creating the RGB24 colorspace output: is it set in GetFrame() or somewhere at plugin init? I tried looking into avsresize.cpp at https://github.com/TomArrow/avsresize/blob/master/avsresize/avsresize.cpp but still do not understand where it is switched. When trying to create new video frame with The dst is returned with RGB24 line size, but VirtualDub crashes when loading the script and AVSMeter still reports the YV12 color format (same as the input). So how do I set a different colorspace at plugin output? Only a few plugins do color format conversion, so it is not easy to find a sample to modify. |
Well, the RGB-output plugin started working after setting vi.pixel_type= CS_RGBP8 in the class constructor. It turned out to be easier to start from planar RGB output. The current tech demo of single-pass conversion with a sort of point-resize for the 2x UV upscale is https://github.com/DTL2020/ConvertYV12toRGB/blob/main/DecodeYV12toRGB.cpp . But when I request pointers to the planes of the planar RGB frame, it returns a good pointer for PLANAR_R and null pointers for PLANAR_G and PLANAR_B. PLANAR_R_ALIGNED (and G, B) also do not work. |
Example of YV12 to RGB32 (interleaved) single-pass AVX2-based decoding - https://github.com/DTL2020/ConvertYV12toRGB/releases/tag/0.0.3 It runs about 6x faster single-threaded compared with ConvertToRGB32(), though it still uses a sort of point-resize for the UV plane upscale (draft/preview quality). A better implementation could support something close to bilinear/bicubic UV resize for better quality (or maybe AVX512 for a larger register file and better performance). Multi-threaded runs typically saturate the host RAM bandwidth of low-end PCs very quickly, so running on all cores is typically not required. Two threads come close to saturating a 6-channel DDR4 Xeon-based workstation (about 500..600 fps of UHD 4K in RGB32). |
Added control of load and store caching. https://github.com/DTL2020/ConvertYV12toRGB/releases/tag/0.2.1 . Load caching is hard, maybe impossible, to control (stream_load seems to require WC memory mode, and it may not be fast to switch with VirtualProtect?). Adding a prefetch with the NTA hint ahead of the current cache line load sometimes helps slightly and sometimes makes it visibly slower (CPU-dependent). Streaming stores generally work very well and about double the store-to-RAM performance with many writing threads. They also help in the intermediate script process (readback from RAM still looks faster than cached store + cached load). Example of AVSMeter on a Xeon Gold 6134 (looks like 6 channels of DDR4-2666?): |
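The single-pass idea described above can be sketched in scalar C++. This is only an illustration and not the plugin's AVX2 code: the function name, the Rec.709 limited-to-full coefficients, the 13 fractional bits and the top-down row order are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: clamp an int to the 0..255 byte range.
static inline uint8_t clamp8(int v)
{
    return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

// One pass over a YV12 frame: each Y sample is read once, each U/V sample is
// reused for its 2x2 luma block (point resize), and each BGRA pixel is written once.
// Assumed matrix: Rec.709, limited-range YUV to full-range RGB, 13 fractional bits.
// Rows are written top-down here; AviSynth RGB32 frames are stored bottom-up.
void yv12_to_bgra_point(const uint8_t* y, int y_pitch,
                        const uint8_t* u, const uint8_t* v, int uv_pitch,
                        uint8_t* dst, int dst_pitch, int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    const double Kr = 0.2126, Kb = 0.0722, Kg = 1.0 - Kr - Kb;
    const double sy = 255.0 / 219.0, suv = 255.0 / 224.0, one = 8192.0;
    const int cy  = (int)std::lround(sy * one);
    const int crv = (int)std::lround(2.0 * (1.0 - Kr) * suv * one);
    const int cbu = (int)std::lround(2.0 * (1.0 - Kb) * suv * one);
    const int cgv = (int)std::lround(2.0 * Kr * (1.0 - Kr) / Kg * suv * one);
    const int cgu = (int)std::lround(2.0 * Kb * (1.0 - Kb) / Kg * suv * one);

    for (int row = 0; row < height; ++row) {
        const uint8_t* yr = y + row * y_pitch;
        const uint8_t* ur = u + (row / 2) * uv_pitch;  // chroma row shared by 2 luma rows
        const uint8_t* vr = v + (row / 2) * uv_pitch;
        uint8_t* out = dst + row * dst_pitch;
        for (int x = 0; x < width; ++x) {
            const int yy = (yr[x] - 16) * cy + 4096;   // 4096 = rounding term for >> 13
            const int cb = ur[x / 2] - 128;            // chroma column shared by 2 pixels
            const int cr = vr[x / 2] - 128;
            out[4 * x + 0] = clamp8((yy + cbu * cb) >> 13);             // B
            out[4 * x + 1] = clamp8((yy - cgu * cb - cgv * cr) >> 13);  // G
            out[4 * x + 2] = clamp8((yy + crv * cr) >> 13);             // R
            out[4 * x + 3] = 255;                                       // A
        }
    }
}
```

The frame is touched with three read streams (Y, U, V) and one write stream, which is the property the two-step ConvertToYV24() + ConvertToRGB32() path does not have.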
You mentioned that "Though still uses sort of point-resize for UV planes upscale (draft/preview quality)". You could better match the original conversion process (for the sake of the speed test) if you split it to explicitly use ConvertToYV24 first, then ConvertToRGB32. |
I do not use planar RGB store now (it looks like no one needs it), and with the RGB32 format I also adjust the integer parameter of GetWritePtr() manually to get the right pointer to the RGB32 frame buffer. For the RGB32 buffer the working pointer is |
At i3-9100T CPU (also 1ch of DDR4 RAM) LoadPlugin("DecodeYV12toRGB.dll") |
Hint: For non-planar (such as RGB32) colorspaces the convention is |
Yep, nice 6x speed for this setup. Anyway, what is the use case of this type of conversion? |
My benchmarks
|
I checked that the YV24 to RGB32 conversion is still SSE2-only. I have now simply copied the same code to a separate AVX2 path and will make it use 256-bit ymm registers instead of 128-bit xmm registers. I expect it to be a bit quicker, but not by much. |
Profiling of ConvertToRGB32 also shows some visible time in the C++ runtime library (about 20% of the time in vcruntime140). But if it does the UV upsample and the dematrix in separate RAM scans, it will be significantly limited by RAM performance at 4K frame sizes and larger (when a significant part of the buffers to process does not fit in L2/L3 cache). So it is good to make single-scan conversion functions for large frame sizes, at least for commonly used conversions like 4:4:4 and 4:2:0 with 8 bit and maybe 16 bit for HDR. |
I tried that first: it returns a null pointer. AviSynth+ 3.7.3 (r3940, master, x86_64) (3.7.3.0) |
It was requested by gispos at the doom9 forum https://forum.doom9.org/showthread.php?p=1986641#post1986641 - it looks like he uses it in the AvsPmod software to send frames to GDI for display. I do not understand why he cannot send the YV12 or NV12 format to the display accelerator, but it looks like RGB32 is still used by some people to send monitoring data to the display. Maybe he cannot do proper Win + DirectX (DirectDraw) programming to create a window with some hardware-accelerated YV12/NV12 decoding. In a perfect world, for users of low-end PCs with old and slow memory, some AVS core filter would prepare data in NV12 formatted for DMA to the display accelerator (this may require special row start address alignment and row pitch), so the display driver can program the DMA engine to fetch data from host RAM to the accelerator in the fastest way. But that requires designing some Windows-oriented API to display accelerators (using GDI? or DirectX, which is highly recommended by Microsoft). Eventually there could also be a direct monitoring window drawn from the AVS core, with some user control like a window RECT(). |
It looks like Intel changes the memory subsystem significantly in each family and generation (auto hardware prefetcher tuning and maybe cache design and more), so skipping load caching may sometimes help and sometimes cause a significant performance penalty. So it is also a parameter for fine tuning on each user's compute host. Also, as far as I know, users who earn real money with mining software sometimes disable the auto hardware prefetchers in Intel chips (using a kernel driver or service register programming from ring3?). But that affects all running software, so it may decrease the performance of all other filters. Intel may optimize each chip generation for whatever marketing software pack is current at the time, so no single setting can be tuned for all chips in use. |
It must return a proper pointer. What happens if your parameter is PVideoFrame &dst (or PVideoFrame *dst ?), I suspect that GetWritePtr is returning a pointer only if the reference count is 1, but in this case you've 'cloned' the smart PVideoFrame pointer. |
You can have a look into sources if I do something not good with this C++ stuff - https://github.com/DTL2020/ConvertYV12toRGB/blob/bd88a2b5be6c84bb5650860eafb14fa6a4ea2716/DecodeYV12toRGB.cpp#L127 |
I mean not at the usage, but at the function parameter, no problem, I'm gonna try it out tomorrow if you won't succeed, |
Hello, hm... well then. AvsPmod is a bit older (20 years?) and is based on Python 2.7. I have been upgrading it for a few years. There is a lot of love and work not only from me but also from the developers at that time. DirectX Display should be very difficult to implement, AvsPmod is very extensive, many functions are based on existing. if dc:
There is the Avisynth function BitBlt... which I find quite superfluous, because then I can use the WinApi right away. I have a wish dream: |
This commit 3c26fd5 can help a little bit. AVX2 code path to YV24->RGB32/24 conversion. |
I added comments to commit text directly. About AvsPmod frame display I think it is better to post comments into doom9 thread about it ? |
Thanks to both of you, will take a look. |
When will it be available? Next test version? |
I didn't plan a new build today, maybe you can check if someone build it until then. I didn't have time to follow the doom9 conversation, but if I understand correctly, the final steps - before avspmod displays the result - are so time consuming, that they are comparable to the speed of the script itself? |
@DTL2020 Your method also does not round before shifting back by 6 bits to get the final integer values. That is another speed gain, but at the cost of quality. So DecodeYV12toRGB cannot replace the original YUV->RGB conversion, and in its present form can only be used for monitoring purposes. That is still very good for just viewing the result of a YV12 clip in AvsPMod. |
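The rounding point can be shown with a small sketch, assuming 6 fractional bits as mentioned in the comment above. The helper names are made up for the illustration and are not taken from either code base.

```cpp
#include <cassert>
#include <cstdint>

// Truncating shift: simply drops the 6 fractional bits
// (rounds down for non-negative values).
inline int shift_truncate(int32_t v) { return v >> 6; }

// Rounding shift: adds half of the divisor (32 = 1 << 5) before dropping
// the 6 fractional bits, so the result is rounded to nearest.
inline int shift_round(int32_t v) { return (v + 32) >> 6; }
```

For a fixed-point value of 96 (1.5 with 6 fractional bits) the truncating version gives 1 and the rounding version gives 2. The rounding costs one extra add per channel unless the constant is merged into an existing addition.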
" intermediate data of the matrix calculations must be extended into 32 bit register size during the process. That alone halves the available number of registers for storing the calculation results." If AVX2 registerfile of 512 bytes is not enough to process 64 sample positions per pass - it can be redesigned to AVX512 registerfile of 2048 bytes. It will support both 32bit precision and 128 samples per SIMD pass. Or 256 samples with 16bit intermediate precision. Also if 32bit intermediate precision is required - it can be redesigned to process 32 samples per pass with 32bit intermediates. Though I thought 16bit precision is enough. So we can have some family or performance/quality balanced conversions in AVX2/AVX512. The only reason I not plan to make AVX512 version very fast is because typically AVX2 2 threads/cores typically saturates 2ch DDR4/DDR5 cheap PCs and the AVX512 implementation will do it with 1 core/thread. But if high-performance and also high-32bit precision version required - it can be designed in AVX512. |
About precision with 8-bit processing: it looks like the PC.709 (narrow range) matrix is somehow broken in the AVS core - With script Output for colours (R,G,B): With the full range (Rec709) conversion matrix the high and low values (though lows may be clipped to black) run more evenly. |
Dunno, not very likely. Try doing the math manually, e.g. for blue. It's quite possible that the YUV representation of the colorbar cannot be so accurately given to have the exact RGB values in the result after the calculation. |
The poisondeathray comment at the forum: The equivalent way in avisynth as what's used a Studio RGB NLE (like vegas) for 8bit for that example above would be ColorBarsHD(640, 480, pixel_type="YV24") Y 180,179,16 So it looks like, for some unknown (?) reason, the PC matrix does not really 'keep range unchanged' but requires additional range compression at the input to produce RGB really mapped to the 16..235 narrow range. Maybe it is one more shadow of the AVS past, kept for compatibility with some other plugins and/or internals? An additional Levels() lowers precision and performance, so maybe add one more single-filter 'matrix' to ConvertToRGB() to create 16..235 RGB from 16..235 standard YUV? |
"It's quite possible that the YUV representation of the colorbar cannot be so accurately given to have the exact RGB values in the result after the calculation." My version, now finally fixed in its math and using only 16-bit intermediates, converts the ColorBarsHD() color bars into narrow RGB that fits within 15..17 for low and 179..181 for high code values - https://github.com/DTL2020/ConvertYUVtoRGB/releases/tag/0.4.0 . It really requires -16 for Y in all cases and also an additional division of all UVs by 1.02283 (224/219), because in standard digital YUV the UV is not in the same scale as Y. So it mostly looks like some special matrix coefficients in the AVS core were not prepared for 'keep range unchanged'. It also looks like 16-bit intermediates are mostly enough for 8-bit YUV, and the standard narrow range of ColorBarsHD() in YV12 cannot be decoded precisely to 16,180 because YUV truncated to 8 bits already carries some quantization noise. Maybe only 10-bit and higher YUV can be converted to precise 16,180 8-bit RGB. 32-bit integer intermediate processing in YUV to RGB is mostly required to handle 10..16 bit input/output. Also, one more widely used use case for ConvertToRGB32() was found: it looks like VirtualDub plugins only support input/output in RGB32, so users must convert to RGB32 and back if they use a VD plugin in the script. So this issue, with the special narrow-to-narrow level preparation required, may cause additional distortions in processing (if the VD plugin expects all RGB to be aligned to black and white levels?) when the user does not know about the requirement. |
I tried 2 more plugins for the conversion, avsresize and fmtc: they both output equal RGB in narrow range and can accept a feed from ColorBarsHD() in YV24 directly: So it looks like only the AVS internal 'narrow' matrix is something special, and of the lowest precision even if fed with 'double-narrow' 8-bit YV24. It may be good to either fix the current PC matrix or keep it for compatibility (?) and add one more matrix for 'narrow RGB' that directly accepts 'standard YUV' in the 16..235 range. |
Regarding the PC.709. http://avisynth.nl/index.php/Convert says: Let's see the following script:
ColorbarsHD sets the _ColorRange property to 1 (limited = 1/full = 0). This is true. zimg, when converting to RGB, defaults to RGB full-range, when we do not set the target color range. So when the source and target range is the same, then the bytes after conversion are the very same regardless of the limited/full setting. I don't know when zimg applies and deals with the extra limited->full conversion, but if range is not changed then it matches with Avisynth. And this is where Avisynth has a small bug, I think. Not in the conversion matrix or calculation. |
"Interleave(orig,avs,zimg_same_as_avs,zimg_same_as_avs2,zimg_auto,zimg)" It looks like you did not check the output values and only compared the images on screen? The main issue of the PC.709 matrix is its strangely low precision compared with avsresize (and fmtc): Test script is: Output for colours (R,G,B): It looks something like narrow/limited range, but the low values (at Yellow/Cyan/Green) and the highs (at Magenta/Red/Blue) are off by much more than +-1 LSB from 180/16 . If change to The output RGB is within +-1 LSB of the expected ideal 180/16. So it looks like the PC.709 matrix is somehow not built correctly? If the AVS convert engine calculates with high precision (32-bit intermediates?), why do we get such a significant error? Here is an example of correction: Now RGBs are So it looks like the matrix coefficients for PC.709 (and maybe other PC.X matrices) need to be checked and somehow adjusted. Currently the dematrix to RGB is performed with more errors compared with other plugins and compared with YUVs pre-corrected for saturation. The same applies to the rec.2020 matrix: ConvertToRGB32(matrix="PC.2020") The ConvertToRGB32 RGB are: avsresize RGB are: With the addition of ColorYUV(cont_u=-6, cont_v=-6) before ConvertToRGB32 the errors also become much lower. Maybe there is a common error somewhere with UV scaling in the dematrix for the PC.x matrix calculation? |
Yep, I look it visually, when range conversion happened, it was very visible, indeed. However, now I compared They are different. The full->full z_ConvertFormat method is giving exactly the same values as Avisynth's PC.709 result for most colorbar entries. so I suppose, there is another conversion in the z_ConvertFormat chain, when PC.709 is given to a limited-in limited-out flow. I suppose matrices are good, we just need to know what exactly happens in this case. (Regarding the 32 bit precision, the real precision is 13 fractional bits for YUV->RGB, which needs 32 bit integer intermediate. RGB to YUV is using 15 fractional bits) |
It looks like I found the design error: in the full_scale=true mode of matrix creation, the ratio between the digital UV and digital Y scales (224/219) was missing. So the patched function (in convert_matrix.cpp) is: `static void BuildMatrix_Yuv2Rgb_core(double Kr, double Kb, int int_arith_shift, bool full_scale, int bits_per_pixel, ConversionMatrix& matrix) if (bits_per_pixel <= 16) {
}
} /*
*/ const double mulfac = double(1 << int_arith_shift); // integer aritmetic precision scale const double Kg = 1. - Kr - Kb; if (bits_per_pixel <= 16) { double Srgb_f = bits_per_pixel == 32 ? 1.0 : ((1 << bits_per_pixel) - 1); Added Now Also may be same patch required for Rgb2yuv matrix calculation. Not checked yet the backward conversion. |
Good find, I was also just doing trial and error experiments, but did not find this one. O.K., now I just want to understand it. |
The commented-out equations only work correctly when all Y, U, V data are in the same 'scale domain'. But in (some: 601/709/2020) digital YUV standards the Y and UV data are stored as integers in different 'scale domains' (maybe to have lower quantization noise in the UV data?). So an additional scale of 219/224 is required before applying the RGB equations. It looks like this was somehow implemented in the full-RGB 'Rec709' (and other 'full') processing paths but was missing from the rarely used narrow/limited-RGB processing (PC.x matrices). For RGB2YUV the same fix is required (with the correction multiplier reversed to 224.0f / 219.0f) - if (bits_per_pixel <= 16) {
}
} /*
*/ const double Kg = 1. - Kr - Kb; if (bits_per_pixel <= 16) {
} // for 16 bits, float is used, no unsigned 16 bit arithmetic Not add to special toYVY2 yet - not tested. Test script is With fixed version the residual error in UV channels reduced from up to +-2 LSB to +-0..1 LSB. |
I don't know yet the logic behind the additional scale of 219/224 factor. I changed Srgb (based on the logic of Sy and Suv) which must be 219 (235-16) for a limited RGB target.
Plus the limited range +16 offset must be added back after the matrix multiplication to the final R, G and B. Regarding the perfection; Now Cyan is 10 B4 B4 in Avisynth and 10 B4 B5 in zimg. At other bars they are equal to Avisynth or they are off by 1 lsb. |
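The two points discussed above (the 219/224 chroma scale and the +16 offset added back for limited RGB) can be checked with a small floating-point sketch. This is a hypothetical helper written for this illustration, not the BuildMatrix code from convert_matrix.cpp. The Rec.709 constants Kr = 0.2126 and Kb = 0.0722 and the 8-bit yellow bar value (Y=168, U=44, V=136) are assumptions of the example.

```cpp
#include <cassert>
#include <cmath>

struct Rgb { int r, g, b; };

// Hypothetical helper: 8-bit limited-range YUV to 8-bit limited-range RGB.
// With fix_uv_scale = true, chroma (224 code steps) is brought to the luma
// scale (219 code steps) by multiplying with 219/224 before the RGB equations.
Rgb yuv_to_limited_rgb(int y, int u, int v, double Kr, double Kb, bool fix_uv_scale)
{
    const double Kg = 1.0 - Kr - Kb;
    const double uv_scale = fix_uv_scale ? 219.0 / 224.0 : 1.0;
    const double yy = y - 16.0;                  // black at zero
    const double cb = (u - 128.0) * uv_scale;
    const double cr = (v - 128.0) * uv_scale;
    const double r = yy + 2.0 * (1.0 - Kr) * cr;
    const double b = yy + 2.0 * (1.0 - Kb) * cb;
    const double g = yy - (2.0 * Kr * (1.0 - Kr) / Kg) * cr
                        - (2.0 * Kb * (1.0 - Kb) / Kg) * cb;
    // The limited-range +16 offset is added back after the matrix step.
    return { (int)std::lround(r + 16.0), (int)std::lround(g + 16.0), (int)std::lround(b + 16.0) };
}
```

With the scale applied, the yellow bar lands on 180, 180, 16. Without it, blue drops several code values below 16, which is the kind of error reported earlier in the thread.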
It looks there are different ways possible to bring Y and UV to equal scale:
Both ways make the equations for RGB from YUV work correctly (though integer math may produce slightly different rounding errors), but they produce results with different ranges. The source of the 224 and 219 magic numbers is 3.4 in the same document https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.709-6-201506-I!!PDF-E.pdf - quantization of analog RGB and YUV data when going into the digital domain: the YUV data have different scales in digital, 219 for Y and 224 for UV. As far as I compared, these numbers stay the same from rec.601 to rec.2020, so they can be set once (but must be re-checked for other possible systems?). About +16: yes. In my version of the converter I also arrived at always computing with black at zero (signed integer bipolar data). So producing narrow/limited range RGB output requires adding +16 at the end. The total computation for narrow/limited YUV to narrow/limited RGB looks a bit overloaded with operations:
The idea was that using a 'matrix' transform can somehow make the computation simpler (fewer operations, so faster?). The additional scale multiplier of 224/219 is quite small, about 1.02283, so if it is missing the error may not be very visible in 8-bit data but may be more visible with 16-bit and float data. |
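For reference, the quantization that produces the two magic numbers can be written out as follows. This is my reading of the BT.709 digital representation for an n-bit system; the symbol names follow the recommendation.

```latex
D'_{Y}   = \mathrm{INT}\!\left[\left(219\,E'_{Y}   + 16\right)\cdot 2^{\,n-8}\right] \\
D'_{C_B} = \mathrm{INT}\!\left[\left(224\,E'_{C_B} + 128\right)\cdot 2^{\,n-8}\right] \\
D'_{C_R} = \mathrm{INT}\!\left[\left(224\,E'_{C_R} + 128\right)\cdot 2^{\,n-8}\right] \\
\text{so for } n = 8:\quad
\left(D'_{C_B} - 128\right)\cdot\frac{219}{224} \approx 219\,E'_{C_B},
\qquad \frac{224}{219} \approx 1.02283
```

Luma spans 219 code steps for a unit signal range and chroma spans 224, so chroma has to be multiplied by 219/224 (or luma by 224/219) before both can go through equations that assume a common scale.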
Maybe I'm going to reorganize the equations (http://avisynth.nl/index.php/Color_conversions) whether if the two methods are equivalent and bring us the same results, I guess, they are. Or if not, I'd like to understand, why. |
Some more thinking may also be required for processing 'full-YUV' data. Are the UVs simply scaled additionally to the same scale as Y (255/219)? Centered at 128, so that some UVs possible in 'narrow/limited YUV' are clipped to 0 and 255? So in full-YUV the digital UV is (224 * (255/219) * E'cr/cb + 128) * 2^(n-8)? The digital quantized data looks like an additional 'compression domain' (integer and unipolar), and to go into some abstract 0..1 (float?) data domain (bipolar) for conversion a 'decompression' must be performed (using different scales for Y and UV); next is the matrix-like YUV to RGB (or back), and next is 'compression' back into the quantized (integer and unipolar) domain (using different scales for Y and UV in case of YUV output). I am not good at computing math, but maybe that 3-step processing can somehow be made simpler in computation using a 'matrix' operation? I currently do not see the benefit of a 'matrix' operation (using 2 redundant zero operations) over the 3 separate processing steps:
Luckily the final integer-arithmetic rounding (shifting right before returning to integer domain: +4096) can easily be extended to not only add the rounding but add (rounding and final offset). So there is no additional cost of returning to the limited RGB range. I just finished this parts (--> planar/packed rgbs, 8-14 bits, C, SSE2, SSE4.1, AVX2, uhhh), 16 bit planar is using floating point arithmetic inside, that's one more addition per channel (but I will template it to ignore the final zero addition when converting to full-range RGB, thus saving three floating point add_ps) |
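A minimal sketch of the folding described above, assuming 13 fractional bits (so the rounding term is 4096) and a limited-range offset of 16. The function names are invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Separate steps: round, shift back to integer, then add the limited-range offset.
inline int finish_separate(int32_t acc)
{
    return ((acc + 4096) >> 13) + 16;
}

// Folded: the offset is pre-scaled by 2^13 and merged into the rounding constant,
// so the per-sample work is still one add and one shift.
inline int finish_folded(int32_t acc)
{
    const int32_t round_and_offset = 4096 + (16 << 13);
    return (acc + round_and_offset) >> 13;
}
```

Both give the same result because 16 << 13 is an exact multiple of the shift divisor. The only cost is a slightly larger accumulator value before the shift.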
" SSE2, SSE4.1, AVX2" It looks it take more time to see commits so I hope you will implement processing with enough 'workunit size' for SIMD so it can use all possible speedups by design - superscalarity (several no-data dependant operations at several dispatch ports at parallel) and several equal operations with no-dependent data at Throughput rate (not at Latency). " (shifting right before returning to integer domain: +4096) can easily be extended to not only add the rounding but add (rounding and final offset)" It may be dangerous for integer overflow. Though typically required offset for output RGB is 16/255 is not large. But the YUV to RGB already have danger to overflow because as I read YUV color space is larger in compare of RGB and some valid 0..255 YUV triplets may produce significantly out of range RGB triplets. So may be in best case each implementation of integer YUV to RGB decode need to be tested at some simple YUV 3x(0..255) integers full triplets walking generator test program and to see if not overflow with data inversion occur somewhere at any input YUV triplet. Or it may be documented that high-precision implementation only work valid with some 'valid YUV to RGB' data. To handle all possible cases (addition of bias/offset +some 'out of range' YUV decoding to single output sum before final rounding shift may require to use lower multiplier and it cause lower precision of computing. As I see the most suffering from lowered precision with 32bit integer immediates are 10..14 bit computing modes ? I not sure if ColorBarsHD test data generator cover all possible YUV triplets to decode. So we may have some balance of performance/quality in processing program design - if implementation can handle all possible YUV data and with max possible precision of integer computing (using max possible multipliers) - it may require additional 'add offset' operation at the end so total performance may degrade a bit. 
Though as I see it, if a SIMD program is designed with maximum usage of the register file and all available SIMD performance features, it is already much faster and typically saturates the RAM bandwidth very quickly (until next-gen XeonMax-like general-purpose compute platforms with integrated HBM RAM are widely used). |
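The exhaustive triplet walk suggested above could look like the sketch below. It is a test-program outline under assumptions: Rec.709 limited-to-full coefficients with 13 fractional bits, a hypothetical `final_offset` parameter for a folded-in bias, and 64-bit accumulators so that an overflow of a 32-bit accumulator can be detected rather than triggered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

struct WalkResult {
    long long max_abs_acc;   // largest |accumulator| seen before the final shift
    long      out_of_range;  // triplets where R, G or B falls outside 0..255
};

// Walks all 256^3 8-bit YUV triplets through an integer YUV->RGB matrix.
WalkResult walk_all_yuv_triplets(long long final_offset)
{
    const double Kr = 0.2126, Kb = 0.0722, Kg = 1.0 - Kr - Kb;
    const double sy = 255.0 / 219.0, suv = 255.0 / 224.0, one = 8192.0;
    const long long cy  = std::llround(sy * one);
    const long long crv = std::llround(2.0 * (1.0 - Kr) * suv * one);
    const long long cbu = std::llround(2.0 * (1.0 - Kb) * suv * one);
    const long long cgv = std::llround(2.0 * Kr * (1.0 - Kr) / Kg * suv * one);
    const long long cgu = std::llround(2.0 * Kb * (1.0 - Kb) / Kg * suv * one);

    WalkResult res = {0, 0};
    for (int y = 0; y < 256; ++y)
        for (int u = 0; u < 256; ++u)
            for (int v = 0; v < 256; ++v) {
                const long long yy = (y - 16) * cy + 4096 + final_offset;
                const long long cb = u - 128, cr = v - 128;
                const long long acc[3] = { yy + crv * cr,
                                           yy - cgu * cb - cgv * cr,
                                           yy + cbu * cb };
                bool bad = false;
                for (long long a : acc) {
                    res.max_abs_acc = std::max(res.max_abs_acc, std::llabs(a));
                    const long long c = a >> 13;
                    if (c < 0 || c > 255) bad = true;
                }
                if (bad) ++res.out_of_range;
            }
    return res;
}
```

Comparing `max_abs_acc` against INT32_MAX shows how much headroom a 32-bit accumulator has for a given multiplier and offset. `out_of_range` counts the YUV triplets that need clamping. The same loop can be repeated for higher bit depths with larger ranges, at the cost of a much longer run.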
... not to mention MMX (yes!), cases to implement multiplied by supporting packed rgbs, planar rgb, along with YUY specialities. Offset is never adjusted on 8 bit data, for 8-16 bit videos, its already 16-32 bit where this happens. Or even float, some parts of 16 bit input is using float inside because it either does not fit into integer arithmetic or it has so much overhead to do the "tricks" that float is much better. The plus one step of beginning offset removal affects only "limited" RGB ranges. Then there is establishing the input/output _ColorRange format: the source format can come from frame property / direct hint / assumed by being RGB or YUV / hint from matrix name (e.g. PC.xxx keeps the input range) Then the same matrices and syntax is used in ConvertToY, GreyScale, with their own MMX/SSE2/SSE4.1/... routines. |
This case was always handled of course. |
…709" or "PC.601"; Recognize studio rgb source (_ColorRange=1 limited RGB) in Greyscale and ConvertToYxx
It looks like I understand why the AVS core operation YV12 -> RGB32 with ConvertToRGB32() is not fully optimized:
The operation of
ColorBars(3840, 2160, pixel_type="YV12")
ConvertToRGB32()
is equal in speed to
ColorBars(3840, 2160, pixel_type="YV12")
ConvertToYV24()
ConvertToRGB32()
So it looks like the very complex Convert() core functions are sometimes a sequence of Convert() calls, so they make 2 or more close-to-full-frame RAM scans, and the performance of YV12 -> ConvertToRGB32() is 2x slower compared with a BicubicResize to 2x size for the UV planes. Performance is also close to independent of the resampler kernel (and support) used, because the memory transfer penalty is the main speed-limiting factor. It also visibly benefits from hosts with faster RAM.
The really top-performance YV12 to RGB32 SIMD function in a single pass (3 RAM read streams and 1 write stream) is about the following:
So maybe add some alternative single-pass, higher-performance Convert() functions for the most frequently used colorspaces, maybe using a draft-quality resizer (point and bilinear). Many users may not see a great difference between bilinear and higher-quality chroma upsampling (even the best linear upsampling methods are still not perfect for chroma-subsampled compression systems, which are flawed by design) but will be happier with the much better performance of single-pass processing. The expected performance benefit from single-pass YV12->RGB (interleaved) conversion may be > 2x.
Also, as an experiment, add a user-controllable option to high-performance filters for load and store cache control, for tuning the performance of complex scripts (the best setting may depend on frame size, the number of cores/threads, host RAM performance and cache size). For example params cl and cs (cache load and cache store), or a single param with bit fields 0,1,2,3. To keep the C program text shorter it can be a templated function with bool params for cache load and cache store (different intrinsics are used for cached and uncached operations).
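The templated form of the cl/cs idea could be sketched as below. The function names and the bit assignment of the combined parameter are assumptions for the illustration and not an existing AviSynth API. The sketch uses SSE2 intrinsics only, with a non-temporal prefetch hint standing in for "uncached load", and falls back to memcpy on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// CacheLoad  = false: issue a non-temporal prefetch hint ahead of the read.
// CacheStore = false: use streaming stores that bypass the cache.
// src and dst must be 16-byte aligned and bytes a multiple of 16.
template <bool CacheLoad, bool CacheStore>
void copy_row(const uint8_t* src, uint8_t* dst, size_t bytes)
{
#ifdef SKETCH_HAVE_SSE2
    for (size_t i = 0; i < bytes; i += 16) {
        if (!CacheLoad && i + 256 < bytes)
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 256), _MM_HINT_NTA);
        const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        if (CacheStore)
            _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);
        else
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    if (!CacheStore)
        _mm_sfence();  // make the streaming stores visible before returning
#else
    std::memcpy(dst, src, bytes);
#endif
}

// Single user parameter with bit fields (assumed layout):
// bit 0 = cached load, bit 1 = cached store.
void copy_row_dispatch(int mode, const uint8_t* src, uint8_t* dst, size_t bytes)
{
    switch (mode & 3) {
    case 0:  copy_row<false, false>(src, dst, bytes); break;
    case 1:  copy_row<true,  false>(src, dst, bytes); break;
    case 2:  copy_row<false, true >(src, dst, bytes); break;
    default: copy_row<true,  true >(src, dst, bytes); break;
    }
}
```

The template parameters are compile-time constants, so each of the four instantiations contains only the intrinsics it needs and the runtime choice is made once per call, outside the inner loop.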