
How to generate kernel_params, and go beyond 6 of them? #2

Open · rickbrew opened this issue Aug 26, 2022 · 48 comments

@rickbrew commented Aug 26, 2022

Context: https://twitter.com/rickbrewPDN/status/1562499054952534018

The kernel parameters listed at https://github.com/mikepound/convolve/blob/master/complex_kernels.py are a set of 6 hardcoded 4-vector arrays. I've been having difficulty determining how these are actually generated, and how I might generate kernel parameters beyond the 6 that are listed here (as in, how to calculate for 7, 8, 9, etc.). The explanations I've found are not close enough to my comprehension of the math involved ("it's just a Gaussian kernel with a complex width" ... ??!! 😂 from: http://yehar.com/blog/?p=1495). Every example of Bokeh blur code I've found online lists the same set of pregenerated kernel values, with no code showing how to actually generate them.

My motivation is to experiment with higher "quality" settings in Paint.NET's new Bokeh blur filter. "Quality" is a slider in the UI that controls which set of kernel parameters are used, which is then tied to the number of rendering passes.

image

Ringing artifacts are present when there's an imbalance between quality, radius, and the gamma exponent (default is 3). I'm hoping to extend the range at which these values can be combined before artifacts are noticeable (or even eliminating them, if possible). My rendering pipeline is fully 32-bit floating point per pixel component (128-bits total), so there's a lot of room for extra precision vs. the typical 8-bit RGBA32.

My dream solution here would be a function that takes in an integer for the # of components, and returns an array of 4 vectors for the kernel parameters. For example, generate_kernel_parameters(2) would return [[0.886528, 5.268909, 0.411259, -0.548794], [1.960518, 1.558213, 0.513282, 4.56111]] (the numbers from https://github.com/mikepound/convolve/blob/master/complex_kernels.py). I'm not sure how feasible or computationally expensive that is though.

cc @Sergio0694 @saucecontrol @JimBobSquarePants

@mikepound (Owner)

This is a very interesting problem :) But it's also turning out to be quite a challenge! Not having the facility to do this myself was something that bugged me at the time of writing my original code, so this has been fun. This said, I have little in the way of optimisation experience, so this was a steep learning curve!

TL;DR

After some moderately successful attempts of my own, I actually found the original source code for the optimisation at the end of the blog post. With my own optimisation in SciPy I was fairly close, but unable to match the extraordinary accuracy of the original kernels beyond about 4 components. I think there are a couple of reasons, but primarily Python is terribly slow for this, maybe 1000 times slower than the C++ implementation. I could probably close the gap a bit with some careful optimisation, but it doesn't seem like a good use of time. I'm going to focus instead on cleaning up and adapting the C++ implementation - it is super quick.

The best result I have so far comes from running the C++ global optimiser (more on this below) and then passing the result through the Python local optimiser for a polish. This also has the side effect of increasing the precision of the decimals we have for all component sizes.

Optimisation of kernel parameters

Prior to finding the original code, I came up with my own optimisation approach. As it happens this is fairly close to the one used in the C++ implementation. The function we're trying to fit is some kind of finite impulse response: it's generally zero outside of the range -1 to 1 (called the stop band), and 1 within this range (the pass band). There is a small transition between 0 and 1; it looks like this:

target

To optimise you need to code a cost function that measures how good a given set of parameters is. My cost function ended up being pretty similar to the C++ one, which is the sum of squared error between the computed kernel and this ideal curve. I also had some success measuring the actual ripple in both the pass and stop bands. The problem with the optimisation is that it's extremely easy to get stuck in local minima, i.e. a broadly good fit to the data but not the actual minimum ripple possible. I tried many different optimisation algorithms, but because it takes so long in Python, even when they are finding their way to a great solution it's too slow to tell!
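For illustration, here's a minimal sketch of the kind of cost function described above (not the exact C++ one), assuming the [a, b, A, B] component layout used in complex_kernels.py, where each component contributes roughly exp(-a*x^2) * (A*cos(b*x^2) + B*sin(b*x^2)) along a 1D slice through the kernel centre; the sample grid and the hard pass/stop-band target are simplified placeholders:

import numpy as np

def kernel_profile(params, x):
    """1D slice of the summed kernel; params is a flat array
    [a0, b0, A0, B0, a1, b1, A1, B1, ...]."""
    y = np.zeros_like(x)
    for a, b, A, B in np.asarray(params).reshape(-1, 4):
        y += np.exp(-a * x**2) * (A * np.cos(b * x**2) + B * np.sin(b * x**2))
    return y

def ideal_response(x):
    """Target curve: 1 in the pass band (|x| < 1), 0 in the stop band."""
    return (np.abs(x) < 1.0).astype(float)

def cost(params, x=np.linspace(-2.0, 2.0, 2001)):
    """Sum of squared error between the computed profile and the ideal curve."""
    return np.sum((kernel_profile(params, x) - ideal_response(x)) ** 2)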

Here is an example of a 6 component kernel that fits pretty well, but the ripple is just a little too big:

ok6

I also made use of the SciPy differential evolution approach that is used in the C++ version, but it's too slow to converge in Python. My best result had a ripple about 50% bigger than the original kernel's (which in the grand scheme of things is nothing, but it's not good enough). I could probably match the performance of the original if I worked at it, but I've come to the conclusion my time would be better spent adding features to the C++ version - for example, donut bokeh, exploring different transition band sizes and so on.
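As a rough illustration only, a SciPy global-then-local pass along those lines might look like the sketch below (using the cost function from the earlier sketch; the bounds, method, and iteration counts are arbitrary placeholders, not the settings actually used):

import numpy as np
from scipy.optimize import differential_evolution, minimize

n_components = 6  # illustrative

# One (low, high) bound per parameter: a, b, A, B for each component.
bounds = [(0.1, 20.0), (0.1, 40.0), (-2000.0, 2000.0), (-2000.0, 2000.0)] * n_components

# Global search (painfully slow in pure Python for larger component counts)...
result = differential_evolution(cost, bounds, maxiter=2000, tol=1e-12, polish=False)

# ...then a local polish of the best candidate, as described above.
polished = minimize(cost, result.x, method="Nelder-Mead",
                    options={"maxiter": 20000, "xatol": 1e-12, "fatol": 1e-12})
print(polished.x.reshape(-1, 4))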

C++ Optimiser

The original source code is right at the bottom of the blog. Most of it is aimed at reading and writing PNGs and doing the complex kernel convolutions (which here are implemented via Fourier transforms - probably faster, but I don't think it's a huge issue either way in PDN's case; does PDN use FFT convolution?). I've stripped out all of the code to do with images and left only the kernel optimisation. I ran the 7 component kernel on our servers overnight for 1 billion (lol) iterations, and it's absolutely nailed it. I then took the parameters and tweaked them using a local search (DE and others are global searches, better for getting out of local minima, but they need fine tuning at the end). The result is great:

7vs6

It's small, but you can basically see the ripples of the 6 component above and below the 7. The original 6 component model has a ripple size of 0.0012 compared to the new 7 component ripple size of 0.0038, 3x smaller!

Next Steps

I'm running 8, 9, and 10 component models now on our servers. I'll add some numbers when they come in. I'm not sure what's causing the ringing in PDN's implementation, but I suspect that when you use a high gamma / kernel width, this essentially scales up the kernels massively, causing even small ripples to become a bigger deal. We will see I guess if higher component sizes help. There may be other issues such as the size of the kernel (in pixels) compared to the mathematical one within it, but that's probably something to look at later.

I'm going to re-optimise the lower component counts too as I think I can push the accuracy higher than the original post through a combination of running it longer, and fine tuning. For now, here is the 7 component model:

[a b A B]
[5.635755002716984, 2.0161846499938942, -127.67050821204298, 189.13366250400748]
[6.2265180958586, 6.010948636588568, 255.34251414243556, 37.55094949608352]
[6.189230711552051, 8.269383035533139, -132.2590521372958, -101.7059257653572]
[4.972166727344845, 12.050001393751478, -0.1843113559893084, 27.06823846423038]
[4.323578237784037, 16.00101043380645, 5.837168074459592, 0.3359847314948253]
[3.6920668221834534, 19.726797144782385, 0.010115759114852045, -1.091291088554394]
[2.2295702188720004, 23.527764286361837, -0.07655024461742256, 0.01001768577317681]

It's probably worth putting this in and seeing if the banding goes down in extreme cases. It was noted by the original author that pushing the components up and ripple down has the side effect of increasing some of the component weights (A, B). This probably isn't an issue at the precision you're using, but with gamma we might want to go over some numbers to make sure there isn't a chance of overflow in extreme cases.

I'll post again when I have more numbers, and I'll start messing around with adapting the C++ code - I'll post that code up soon.

@rickbrew (Author) commented Aug 27, 2022

Wow, thanks for the detailed explanation! Well, if it takes server(s) running overnight to generate these, that certainly answers my question about whether they could be generated at runtime: no :)

I added the 7-component kernel and it does reduce the ringing a little bit:

(with 6)
image

(with 7)
image

The first 6 kernels I have are only specified to 6 decimal places of precision:

// 3 components
new Vector4Double[]
{
    new Vector4Double(2.17649, 5.043495, 1.621035, -2.105439),
    new Vector4Double(1.019306, 9.027613, -0.28086, -0.162882),
    new Vector4Double(2.81511, 1.597273, -0.366471, 10.300301)
},

Do you have the first 6 with these higher precision values like you do for the 7th? That might also help the ringing a little. It may be prudent for me to cap the gamma at a lower value as well. Edit: I plugged in the set of 6 higher-precision numbers from the blog and it produced a null output (all zeros), so those numbers seem incompatible.

Using the high gamma values may require 10 or more passes to get rid of the ringing for good, but we'll see as the numbers come in

@rickbrew (Author)

Also, I assume that the kernel scale for the 7th set of numbers is also 1.2?

private static IReadOnlyList<double> KernelScales
{
    get;
} = new[] { 1.4, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2 };

@mikepound (Owner)

That looks much better, I agree that (assuming ripple continues to decrease) 9-10 components should see very negligible ringing. How long do these effects take to compute on the GPU? There are a lot of convolutions happening here, but I consistently underestimate GPUs :) Are we talking "live preview" kind of speed? Or just acceptable for a click and see how it looks.

I will re-run the optimisation with different component amounts and try to extract the higher precision numbers, they should all support this but it'll probably only make a notable difference at 5+ components.

My guess is the width multiplier will be the same for 7+ components, but we should probably check e.g. a 64 width and see if point sources are approx 64 pixels across once it's done. I can't remember exactly why I did this, but I think I tried to match the width to what looked good because the edge of the disc has a kind of slope to it. Maybe I decided that the blurred edge of the disc should count within the width, rather than outside it - a judgement call for PDN I think. I suppose the main goal is that regardless of which kernel quality you use, you want the actual blur to be roughly the same size. I think the reason that most of the kernels are the same size and only the 1 component is different is simply because it's a much worse shape, and so the ring ends up smaller and needs to be scaled.

Passable kernels probably appear fairly quickly even on a laptop, but I don't think it'll be possible to optimise at high quality sufficiently quickly. If we wanted to do donuts, tighter and blurrier edges and so on, regrettably I think the best thing would be to do a massive grid search for all the parameters we want at all combinations of inputs.

I do think arbitrary kernels computed dynamically might be possible with the sum table approach, which I hope to have a go at over the coming weeks - at least as a proof of concept. The idea is you turn each convolution into a series of summations which can be made much faster by calculating an integral image. I did a video on Viola-Jones, which uses a similar shortcut, here: https://www.youtube.com/watch?v=uEJ71VlUmMQ, about 6 mins in. The challenge won't be the image stuff really, it'll be creating nice rectangular regions from arbitrary kernels. But this is essentially a rasterisation problem, and I'm sure there's a nice solution somewhere. The other thing is that these kinds of kernels would work mostly with hard edges, and I don't know if that looks better or worse.

@rickbrew (Author)

How long do these effects take to compute on the GPU?

For a 4,896 x 3,264 px image (this one: https://unsplash.com/photos/Ys-DBJeX0nE ), at default settings (radius=25, gamma=3, quality=3), it takes maybe half a second to run on an NVIDIA RTX A4000 16GB (approx. GeForce 3070). Bumping quality up to 6 makes it take about twice as long. Not real-time enough for gaming, but real-time enough for a photo editing app. Bumping the radius up to the max (300) gives us ~2.5s at quality=3, and about ~5s at quality=6 (these are just rough timings with a stopwatch). Gamma doesn't affect performance.

Performance could be improved by utilizing compute shaders instead of pixel shaders. A pixel shader, at least in Direct2D, can only emit one value -- it can't write to multiple render targets. So the real and imaginary components for the vertical convolution are processed separately, then plugged in as the inputs for the horizontal convolution shader. A compute shader should be able to drop 3 shaders down to 2, in other words. This is at best about a 30% speedup; worth doing, but we (@Sergio0694 and I) don't have the infrastructure in place for it yet. That'll be a 2023 project.

It's easy to underestimate GPU performance, but also just as easy to overestimate it. I ported some fractals over and they were 100-1000x faster. They work great because they're output-only shaders. However, in some cases the GPU can actually be slower. This usually happens when the CPU rendering code benefits from the ability to randomly access pixels without any real penalty, whereas on the GPU the rendering has to be split up into tiles and multiple passes that can each fit within the rendering engine's maximum intermediate texture size (for Direct2D that's 4096 x 4096 pixels, but Direct3D's ultimate maximum is 16384 x 16384 anyway so you'd run into this regardless). I have many effects that are immensely faster on small to large images, but start getting a lot slower once we get into enormous images (like 32K x 32K). The rendering engine has to buffer, subdivide, and shuffle things around a lot more and it isn't efficient anymore.

@Sergio0694 commented Aug 27, 2022

"Performance could be improved by utilizing compute shaders instead of pixel shaders. A pixel shader, at least in Direct2D, can only emit one value -- it can't write to multiple render targets. So the real and imaginary components for the vertical convolution are processed separately, then plugged in as the inputs for the horizontal convolution shader."

Yeah, this is a compromise the D2D version of the bokeh blur effect had to make; I don't think there's a way around it while we're using pixel shaders. My compute shader version of the bokeh blur effect, which uses DX12 compute shaders, doesn't have this limitation, and in theory we should be able to convert this to D2D compute shaders as well. Like Rick said though, that'll require significant work, as right now I haven't added the necessary infrastructure to support that (tracked at Sergio0694/ComputeSharp#351), nor has Rick added support for this on the Paint.NET side. Definitely a 2023 work item, not sooner. But, even with that, I don't see how it'd bring us more than a ~30% improvement or so. It'd pretty much just allow the pixel shader version of bokeh to do 2 1D convolutions per pixel instead of 3, per component. Which is definitely not bad (on paper), but not something that would fully address the scaling issue with more and more kernels and with larger radii. But then again I guess it's also fine given we don't really care about real time support here 🙂

"It's small, but you can basically see the ripples of the 6 component above and below the 7. The original 6 component model has a ripple size of 0.0012 compared to the new 7 component ripple size of 0.0038, 3x smaller!"

I'm a bit confused by that point, did you write those two numbers backwards or am I missing something? As in, if using 7 kernels results in a smaller ripple size, shouldn't that be 0.0012 compared to the 0.0038 we had before? 😅

Anyway, really awesome work on all this Mike, and thank you so much for helping! It's so cool to work together on something, and at least for me, not something I'd have imagined ever happening when I first watched your bokeh video years ago and I was just starting to mess with the original Python code you wrote to port it to C# and ImageSharp. This all makes me so happy 😄

@mikepound (Owner)

Yes a typo :)

It's no problem - Actually the last 6 months have been so busy with academic stuff it's nice to do something a little different!

Optimising the 8 component kernel has gone well - the ripple is 2x smaller than at 7!

[6.6430131554059075, 2.33925731610851, -665.7557728544768, 445.83362839529286]
[8.948432332999396, 5.775418437190626, 1130.5906034230607, 15.626805026300797]
[6.513143649767612, 8.05507417830653, -419.50196449095665, -9.275778572724292]
[6.245927989258722, 12.863350894308521, -100.85574814870866, 79.1599400003683]
[6.713191682126933, 17.072272272191718, 36.65346659449611, 118.71908139892597]
[7.071814347005397, 18.719212513078034, 21.63902100281763, -77.52385953960055]
[4.932882961391405, 22.545463415981025, -1.9683109176118303, 3.0163201264848736]
[3.456372395841802, 26.088356168016503, 0.19835893874241894, 0.08089803872063023]

9 and 10 are looking a little more tricky. The optimisation process is random; it's fairly predictable for lower component counts, but a lot more prone to getting stuck in local minima for higher ones. What this means, essentially, is that the weightings of the components get adjusted into a position where they cannot be improved, and the true optimal position is so wildly different that the optimiser cannot jump to it. My first two runs didn't work, producing accuracy around the level of a 7 component filter. I think the solution is simply to run about 5 of each and see which ones hit - I'm doing this now and at least one is looking good.

I've discovered another reason this optimisation works better than my own: there's some peculiar code within the cost function that actually seems to go in and overwrite parameters during the optimisation. This isn't usually how it works; normally the optimiser comes up with parameters, uses the cost function to evaluate them, and then makes decisions on what to do next. Here the cost function is actually changing the parameters too. My guess is the author (who seems to have a background in signal processing) is constraining some of the parameters to keep them within reasonable ranges, to coax the optimisation in the right direction. The good news obviously is that this works; the less good news is that it might make this more difficult to port to other bokeh types.

@mikepound (Owner) commented Aug 28, 2022

Optimised 9 kernel:

[7.393797857697906, 2.4737002456790207, -1796.6881230069646, 631.9043430000561]
[13.246479495224113, 6.216076882495199, 3005.0995149934884, 169.0878309991149]
[7.303628653874887, 7.783952969919921, -1058.5279460078423, 459.6898389991668]
[8.154742557454425, 13.430399706116823, -1720.108330007715, 810.6026949975844]
[8.381657431347698, 14.90360902110027, 1568.5705749924186, 285.01830799719926]
[6.866935986644192, 20.281841043506173, 90.55436499314388, -59.610040004419275]
[9.585395987559902, 21.80265398520623, -93.26089100639886, -111.18596800373774]
[5.4836869943565825, 25.89243600015612, 5.110650995956478, 0.009999997374460896]
[5.413819000655994, 28.96548499880915, 0.2499879943861626, -0.8591239990799346]

Current ripple sizes are:

6: 0.003890
7: 0.001285
8: 0.000694
9: 0.000417

We're nearly at the limit I think; I'll focus on 10 to finish. Notice also that the multipliers are getting quite big, e.g. ~3000 for the A value of component 2. I think you're using high precision behind the scenes, so this is probably fine. Imagine a pixel at max 255 brightness, scaled by the maximum gamma, then multiplied by 3000 - that's the kind of max range we need to handle.
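As a very rough back-of-the-envelope check of that range (reading "scaled by maximum gamma" as raising a 255-scale value to the gamma exponent of 8 from the PDN slider, and using the largest weight seen so far, ~3005 - these are pessimistic, illustrative assumptions only):

# Worst-case magnitude: max-brightness pixel, raised to the maximum gamma
# exponent, then scaled by the largest kernel weight quoted in this thread.
max_value = 255.0 ** 8      # ~1.8e19
max_weight = 3005.0         # largest |A| in the 9 component kernel above
print(max_value * max_weight)   # ~5.4e22, well below float32's ~3.4e38 max,
                                # though single precision loses low-order bits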

@mikepound (Owner)

This one wasn't easy :) I ran the optimisation about 50 times for 10 components. 49 of them failed to find a decent optimum, but one made it! Ripple is 0.00029.

[7.148185516924103, 2.4927349464132558, -1242.6335320030207, 435.1383849993189]
[13.97011128857182, 3.895575022534836, 1392.3314399989267, 638.7307769993176]
[6.8989770619005375, 7.476990508899397, -375.25432200238583, 485.81362499651027]
[7.0295872483302295, 12.172942127710726, 50.24416900119528, 358.28113199585056]
[7.602767849720905, 15.65397904113909, 251.22341800281518, -98.10971700247157]
[6.624758031834333, 18.75368396970372, -70.80337099593586, -33.03615600114058]
[5.841447009473684, 23.267768013476132, -13.118463995076524, 11.5558290020063]
[6.836643992578203, 25.436927996489974, 9.560691003252611, 3.0842850032656344]
[4.31348099995382, 28.96548400078732, -0.5601639992634236, 0.010000007078592587]
[3.0536740000345017, 32.57196899995972, 0.009999996171068724, 0.03756300474029547]

I've also marginally improved on some of the others. The complete list is:

optim4 = [4.894547282763508, 0.929376567571601, 0.714407681087016, 84.512659879888560, 3.666703743886918, 4.336932586332215, 2.801952481408896, -17.838314975955342,2.626272224663539, 8.135593719310219, -2.537157440802193, 0.589049137537433, 1.290766926824529, 12.347721342563711, 0.010000000000002, 0.227466315879772]
optim5 = [4.716883447930945, 1.7358319178135824, -21.69215371604072, 72.39689418968186, 4.617433689271565, 5.182588501065085, 35.32177160227391, -21.0988424879328, 4.142932558052223, 8.360699330164213, -13.251867272261515, -3.711126526016351, 2.987295576935829, 11.904916752786518, 0.47666554573981107, 1.9589101025300553, 1.5300037016315535, 16.110869110006078, 0.1415257483220136, -0.009987947127719764]
optim6 = [4.8250215031288795, 1.863061689623832, -36.782352280221346, 81.36352349123085, 4.770507936049109, 5.703686744609449, 53.4515131597662, 0.010211273746183322, 5.036597150326645, 9.380850085354819, -12.064235688807356, -32.14410659609454, 4.696499100343814, 12.165429883418184, -5.184640171161125, 12.868037804173614, 3.3515804060339978, 15.658860435220012, 1.589545666571149, -0.4191098554782176, 1.8025717988071053, 19.850286335894605, -0.011516257341898792, -0.09291397508712977]
optim7 = [5.635755002716984, 2.0161846499938942, -127.67050821204298, 189.13366250400748, 6.2265180958586, 6.010948636588568, 255.34251414243556, 37.55094949608352, 6.189230711552051, 8.269383035533139, -132.2590521372958, -101.7059257653572, 4.972166727344845, 12.050001393751478, -0.1843113559893084, 27.06823846423038, 4.323578237784037, 16.00101043380645, 5.837168074459592, 0.3359847314948253, 3.6920668221834534, 19.726797144782385, 0.010115759114852045, -1.091291088554394, 2.2295702188720004, 23.527764286361837, -0.07655024461742256, 0.01001768577317681]
optim8 = [6.6430131554059075, 2.33925731610851, -665.7557728544768, 445.83362839529286, 8.948432332999396, 5.775418437190626, 1130.5906034230607, 15.626805026300797, 6.513143649767612, 8.05507417830653, -419.50196449095665, -9.275778572724292, 6.245927989258722, 12.863350894308521, -100.85574814870866, 79.1599400003683, 6.713191682126933, 17.072272272191718, 36.65346659449611, 118.71908139892597, 7.071814347005397, 18.719212513078034, 21.63902100281763, -77.52385953960055, 4.932882961391405, 22.545463415981025, -1.9683109176118303, 3.0163201264848736, 3.456372395841802, 26.088356168016503, 0.19835893874241894, 0.08089803872063023]
optim9 = [7.393797857697906, 2.4737002456790207, -1796.6881230069646, 631.9043430000561, 13.246479495224113, 6.216076882495199, 3005.0995149934884, 169.0878309991149, 7.303628653874887, 7.783952969919921, -1058.5279460078423, 459.6898389991668, 8.154742557454425, 13.430399706116823, -1720.108330007715, 810.6026949975844, 8.381657431347698, 14.90360902110027, 1568.5705749924186, 285.01830799719926, 6.866935986644192, 20.281841043506173, 90.55436499314388, -59.610040004419275, 9.585395987559902, 21.80265398520623, -93.26089100639886, -111.18596800373774, 5.4836869943565825, 25.89243600015612, 5.110650995956478, 0.009999997374460896, 5.413819000655994, 28.96548499880915, 0.2499879943861626, -0.8591239990799346]
optim10 = [7.148185516924103, 2.4927349464132558, -1242.6335320030207, 435.1383849993189, 13.97011128857182, 3.895575022534836, 1392.3314399989267, 638.7307769993176, 6.8989770619005375, 7.476990508899397, -375.25432200238583, 485.81362499651027, 7.0295872483302295, 12.172942127710726, 50.24416900119528, 358.28113199585056, 7.602767849720905, 15.65397904113909, 251.22341800281518, -98.10971700247157, 6.624758031834333, 18.75368396970372, -70.80337099593586, -33.03615600114058, 5.841447009473684, 23.267768013476132, -13.118463995076524, 11.5558290020063, 6.836643992578203, 25.436927996489974, 9.560691003252611, 3.0842850032656344, 4.31348099995382, 28.96548400078732, -0.5601639992634236, 0.010000007078592587, 3.0536740000345017, 32.57196899995972, 0.009999996171068724, 0.03756300474029547]

I couldn't match the performance on 2 or 3 components, and I don't think the precision is necessary anyway, so I recommend leaving them as they are.

@rickbrew (Author)

Fantastic 🥳 Here are the results at each quality level. I think we're seeing diminishing returns at 9-10, so that's a good place to stop.

First, the input image, cropped from https://unsplash.com/photos/Ys-DBJeX0nE :
image

For each, radius=25 (default), gamma=8 (max, with 3 being the default).

Quality = 1, obviously not great, but expected
image

Quality = 2
image

Quality = 3
image

Quality = 4
image

Quality = 5
image

Quality = 6
image

Quality = 7
image

Quality = 8
image

Quality = 9
image

Quality = 10. Finally the ringing around the bright spot near the top-left is almost gone
image

Once we're at high values for quality, the rippling only seems to be apparent at high gamma values (>=5), so this should be perfect.

@rickbrew (Author)

Now, if we take this photo, https://unsplash.com/photos/jV-SWhl7eHM , and abuse the gamma value, we still have quite a bit of ringing at max radius, but I don't think there's really anything that can be done about that since the values are just blown way out to begin with (due to the gamma exponent). At reasonable gamma values everything looks fantastic.

Radius=300, Gamma=8, Quality=6
image

Radius=300, Gamma=8, Quality=10
image

@rickbrew (Author) commented Aug 30, 2022

I did also implement an experimental "Max Quality" mode, where the effect switches to a regular (non-separated) convolution kernel implementation. Ringing is absent in that version. Performance, as you'd expect, was fine at small radii but then at large/max radii it was bad ... really really bad. So bad, in fact, that setting the radius to 300 immediately caused a GPU driver timeout (soft crash) 😂 I implemented tiling for the convolution kernel (that is, split the kernel up into 64x64 tiles and run them separately, then sum them all up) and it worked better.

But when I switched to testing on a lower-end GPU (a Quadro P2000 -- still a quite decent card!) it refused to run and Windows forced things to run on the higher-end GPU through some kind of silent fallback mechanism (well, and a user notification).

image

@mikepound (Owner)

Haha - well it was worth a go :) This looks great though, I think massively pushing the gamma is going to do strange things unless we can ultimately implement a blur that doesn't have any ripple at all. It also might be worse on the image you're testing on due to the dark background.

There won't be a separable kernel to solve this, but along the lines of your "maximum" I'm going to have a stab at implementing the rectangular integral image version, which I think should have similar computational requirements regardless of the kernel size - although maybe not in version 1 :) Don't worry you aren't compelled to use it, I've wanted to implement it for a while and add it to this repo, so if it ends up being useful, great, if not no worries. Quick question regarding computational requirements, as it informs how I implement it:

An integral image (done separately for R, G, B) is an image in which each pixel contains the sum of all the pixels above and to the left. For example for this image and its integral

image

The sum of the middle 4 pixels is 24 - 6 - 6 + 1. Once the integral image is calculated, any rectangular region can be calculated using 4 lookups. You can see how this would be helpful for circular kernel convolutions where all of the weights are 1, as we have here, and also for arbitrarily shaped kernels. The main thing I need to work on is creating an efficient rectangle configuration from any kernel - I'll start with circles and go from here.
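A minimal NumPy sketch of that lookup, for illustration (the extra zero row/column is just one convenient way of handling the x0-1 / y0-1 edge cases):

import numpy as np

def integral_image(channel):
    """Summed-area table, padded with a zero row/column on the top/left so
    that the x0-1 / y0-1 lookups never fall out of bounds."""
    sat = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    sat[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return sat

def rect_sum(sat, x0, y0, x1, y1):
    """Sum of channel[y0:y1+1, x0:x1+1] from 4 lookups (D - B - C + A)."""
    return (sat[y1 + 1, x1 + 1] - sat[y0, x1 + 1]
            - sat[y1 + 1, x0] + sat[y0, x0])

# e.g. the middle 2x2 block of a 4x4 image of ones sums to 4:
print(rect_sum(integral_image(np.ones((4, 4))), 1, 1, 2, 2))  # 4.0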

I guess it's not a big deal because you're already creating a number of intermediate component outputs in the existing version, but is it acceptable to create and store one of these integral images for each channel? In an ideal world we'd also store the 1d horizontal and vertical versions, where each pixel is the sum of only those to the left or above, rather than both. For rectangles of height or width 1 we can reduce the problem to 2 lookups if we store these representations too, at the cost of 3 representations instead of 1. Is this acceptable?

@rickbrew (Author) commented Aug 31, 2022

Well, let's figure out how much memory that would take. We'll likely need a full 32-bit integer value for each pixel component: with every pixel at full brightness (A=R=G=B=255) the running total grows by 255 per pixel, so a 32-bit sum covers roughly 16.7M (2^32 / 255 ≈ 2^24) all-white pixels - around 4,096 x 4,096 in the worst case, and far more for real images, which are nowhere near all white. Dropping to 16-bit integers caps the all-white case at only a couple of hundred pixels, which doesn't seem like a reasonable limitation. So that's a full 128-bits / 16-bytes per pixel, or 96-bits / 12-bytes if we don't care about alpha, which is then 3-4x the size of the image we're starting with. It'll take a lot of memory on the CPU, which is where we'd do the calculations for the integral table, and then it'd also require a lot of memory on the GPU to store so that the shaders can access it.

The next thing is that Direct2D can only really take inputs of float or byte, the latter of which is normalized (e.g. value / 255.0f so the range is actually [0, 1]). We can't use uint so we'd have to use float, and precision would suffer once we have an image larger than 65,536 x 65,536, because 65536 x 256 = 16,777,216, which is where float stops being able to represent every integer value.

Alright, so that basically limits us to 65,536 x 65,536 images for a fully general purpose algorithm, which is probably fine in practice. Especially since we likely won't be working with all white images so the actual limit is much higher.

For one of these sample images, like the cat, it's 5,028 x 3,652 pixels. That takes up ~73.5MB of memory CPU-side to store at 32-bit BGRA. For the integral table, quadruple that to about 293.8MB, for a total consumption of ~367.2MB. That's not too bad, but it is a lot, especially since GPUs have relatively small pools of memory compared to CPUs.

For larger images, or just up-sized versions of existing ones, let's calculate it for 20,112 x 14,608 (that same image at 400%): that's 1.175GB for the image and 4.701GB for the integral table, or ~5.876GB total. That's about the maximum we can handle with a modern consumer card such as a GeForce 1060 w/ 6GB. A lot of those 1060s were sold with only 3GB. They can still work with 6GB data sets but performance will be significantly reduced -- essentially it has to page, but instead of CPU memory <--> disk, it's GPU memory <--> CPU memory.
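(For reference, a quick sketch reproducing those figures, assuming 4 bytes per pixel for the 32-bit BGRA source and 16 bytes per pixel for a 4-channel 32-bit integral table, in decimal MB:)

def integral_memory_mb(width, height):
    pixels = width * height
    image_mb = pixels * 4 / 1e6    # 32-bit BGRA source
    table_mb = pixels * 16 / 1e6   # 4 channels x 32-bit integral table
    return image_mb, table_mb, image_mb + table_mb

print(integral_memory_mb(5028, 3652))     # ~ (73.4, 293.8, 367.2)
print(integral_memory_mb(20112, 14608))   # ~ (1175.2, 4700.7, 5875.9)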

The unknown here is how Direct2D will handle all of the intermediate buffers that are necessary for the whole transform graph, which will depend a lot on how (where) we need to read from the image. Direct2D is happiest when it only needs to handle "simple" inputs -- where the input and output coordinates are 1:1 and there's no need (or ability) to read from neighboring pixels. A simple brightness/contrast adjustment is an example of a "simple" transform.

"Complex" inputs allow you to sample elsewhere, but you pay a price, in memory usage but especially in flexibility. If you only need to read from a small neighborhood of pixels, like with a convolution matrix transform, you're fine. But, if you need to sample from anywhere, it doesn't matter how many samples you need because Direct2D needs to bind a texture for the bounding rectangle and it'll refuse to go above 4096 x 4096 pixels. So the real price you pay is related to locality.

I have found ways to get around this, but it depends on what you need to do with the pixels, and how much you're willing to pay in terms of performance. I have something called a SampleMapRenderer that takes an input image and a special "sample map" pixel shader whose output is interpreted as [X, Y, alpha, (unused)] values. Then, through multiple rendering passes, it gathers all of the requested pixels, sums up the partials, and produces an output. So if the sample map shader emits [ 3, 4, 0.5, 0 ] for its output pixel at x=0, y=0, then the output image will have input[3, 4] * 0.5 at its [0, 0] pixel location.

Here's an example of a sample map at work to produce an image through a polar inversion transform:

image
image

The performance cost of an effect like this is O(n*m), where n is the number of pixels, and m is the number of 4K x 4K tiles that the image must be divided into to fit within Direct2D's 4K x 4K intermediate texture limit (the hardware's limit is 16K x 16K so we'd still hit a limit even w/ OpenGL, Vulkan, or Direct3D11 / 12). Each of the m rendering passes must process n pixels, so it's basically quadratic, since m is derived from n.

Each rendering pass of the sample map renderer gathers the pixels from 1 of those 4Kx4K tiles -- but the sample map shader can "place" those pixels anywhere, so each rendering pass produces a full-size image. Then the output of each rendering pass is summed to create the final image. Thankfully you don't need to store each rendering pass's output until the very end, you can sum as you go. For an image less than 4Kx4K, we only need 1 rendering pass, but for 16Kx16K we need 16, and 64Kx64K needs 256 rendering passes. Adding in multisample antialiasing with subpixel offsets will multiply that by 1, 4, 16, 64, etc. so we can achieve a nice looking output.

I use this sample map renderer for all of the effects in Paint.NET's "Distort" submenu. They perform really quite well when there's only 1 tile, but at large sizes -- let's say 30K pixels on either dimension -- they start dropping below the performance of the old CPU-based implementations. And that's on a 24GB GeForce 3090! So they have limits, but luckily those limits are right around the size of images we'd actually want to work with anyway (outside of my own debugging and testing).

Anyway, my point is not "let's use a sample map renderer", but rather that where you need to read pixels from is much more important than how many samples you need, and that sometimes you can work around it, but when you can it still has quite a large cost. So I'd need much more information about that in order to evaluate whether what you are interested in doing is feasible for GPU processing.

@rickbrew (Author)

From watching your video where you talk about the integral tables, it sounds like you do need to access these values from "anywhere," even though you only need to read a few of them. In order to get this sort of thing to work on the GPU, the work must be subdividable. Sometimes you can subdivide into tiles, sometimes you can subdivide into multiple rendering passes, sometimes you need both. If we're lucky, these subdivisions won't be overwhelming and we can actually get some work done.

So if you want to start figuring out how/if this can work on the GPU, consider how you would break up your algorithm such that you can only ever read a maximum of 1000 pixels away from the current location. (I know I said 4K x 4K, but if every output pixel needs to read from a 4K x 4K region then the pixel shader has to be run once for every output pixel, with a 4K x 4K input, which means that the intermediate has to be prepared for every pixel and.... it doesn't work. It's kind of complicated...)

@Sergio0694

Just wanted to say that all this back and forth also makes me think we could have a lot of fun if Mike also ever wanted to try out writing his own plugins for Paint.NET, especially now that they can use the GPU too through ComputeSharp.D2D1 😄

It's especially interesting (and challenging) at first because due to how shaders work, it's a completely different way of thinking compared to the traditional "sequential, scanline-like" way of programming some CPU-powered image effect. Just a note eheh

@rickbrew (Author)

I can think of several tricks for transforming all of the data into forms that Direct2D and a GPU can actually process, but I need to know more about how the integral image is sampled. How do you determine which pixel from the integral image to read from? For the output pixel at (x,y), do you only need data stemming from the integral image at (x,y)? Or do you need to jump around more? Hmm ... maybe some pseudo-code that shows how 1 output pixel is calculated would be helpful!

@mikepound (Owner)

I may have to have a go at writing a plugin sometime :)

I think if we're careful the version with only one integral image shouldn't be much slower, this is something to experiment with. The good news is the sampling strategy is quite easy to predict: Given an integral image and a rectangle x0,y0,x1,y1, the sampled points are (x1,y1), (x0-1,y1), (x1,y0-1) and (x0-1, y0-1). Basically the maximum distance you'll need to sample is the size of the kernel + 1. If you limited your kernel size to 300px as you've done in the implementation above, then that is approximately the distance we're talking. Given you're also sampling around some point, then 150px from the centre of the kernel.

The viola jones face detector takes the concept further because faces could take up a huge amount of the image, so it has to sample from anywhere, we can avoid this.

@rickbrew (Author)

How timely, I just saw this pop up in my GitHub feed from my friend @alexreinking
image

gpufilter stands for GPU Recursive Filtering (paper 1). Additionally, filtering can also be regarded as Prefix Scans or Sums and Summed-area Tables (paper 1) or Integral Images.

@rickbrew (Author) commented Aug 31, 2022

Alex pointed me to the paper that gpufilter is based on: https://hhoppe.com/recursive.pdf

(referring back to the post above ...) Is it possible to split these calculations into two passes, i.e. is it separable? I'm still catching up on morning coffee and I haven't been able to tackle this just yet, but presumably if you have an integral table

A-----B
|     |
|     |
C-----D

And if the result you want to calculate in a pixel shader is f = D - C - B + A, you can hopefully fold that into horizontal and vertical passes. This would not change the result of the calculation (modulo some precision, possibly), but it would provide Direct2D with easier-to-manage input/output rectangles for buffering through intermediate textures.

@mikepound (Owner) commented Aug 31, 2022

That paper looks like a cool way to produce the initial summed area table, which is a large proportion of the processing required to perform a filter. After that it's a little clever indexing, I hope not too challenging.

You could turn a rectangle into e.g. a series of horizontal 1px rows, or vertical columns, but I don't think we'd necessarily need to. The big advantage we have is that the kernel is flat, i.e. all pixels that are summed have an identical weight of 1. This basically means it doesn't matter what position each pixel is in, only that it is included in the sum. This isn't true of the complex kernel, and it's not true of e.g. Gaussian blur. When calculating a Gaussian blur you need not only the pixel values but also their associated weights, which means you have to take one pixel at a time, multiply it by its weight at the corresponding position of the kernel, and then sum it up. That Gaussians are separable (g(x,y) = g(x) * g(y)) is the only reason it's practical.

In this case there is essentially no kernel, at least there are no weights, you only need the sums and you can shortcut this using the summed area table. The blog highlighted one strategy for fitting rectangles to a circle:

image

Each rectangle needs 4 positions A B C D in order to calculate the sum of all pixels within that rectangle. There are 15 in this image that comprise the circle, so that's ~60 reads. Here is the same image where I've highlighted the pixels that would need reading for the big black rectangle, and another smaller white one:

image

If you use the column and row sum tables too, then 1px rectangles are 2 reads not 4, at the cost that you need to calculate and store those additional tables.

The process for filtering an image with a circular blur then would be something like this:

  1. Calculate a set of rectangles that fill the circle of desired radius, preferably as few large ones as possible. The rectangle coordinates are relative to the centre of the kernel. E.g. negative coords ok.
  2. Calculate the summed (integral) image for each channel.
  3. For all pixels (x, y):
       sum = 0
       for each rectangle:
         sum += D - B - C + A   // where A, B, C, D are the rectangle's corner lookups offset by (x, y)
       sum /= total area of all rectangles
     (see the sketch after this list)
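A slow but hopefully clear NumPy sketch of steps 2-3, reusing integral_image and rect_sum from the earlier sketch; the rectangles are assumed to be given as inclusive (x0, y0, x1, y1) offsets from the kernel centre, and border handling is deliberately crude:

import numpy as np

def flat_bokeh_channel(channel, rects):
    """Filter one channel with a flat (all-weights-one) kernel given as rects."""
    h, w = channel.shape
    sat = integral_image(channel)
    area = sum((x1 - x0 + 1) * (y1 - y0 + 1) for x0, y0, x1, y1 in rects)
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            total = 0.0
            for x0, y0, x1, y1 in rects:
                rx0, ry0, rx1, ry1 = x + x0, y + y0, x + x1, y + y1
                if rx0 < 0 or ry0 < 0 or rx1 >= w or ry1 >= h:
                    continue  # crude border handling: skip clipped rectangles
                total += rect_sum(sat, rx0, ry0, rx1, ry1)
            out[y, x] = total / area
    return out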

For a Gaussian or complex blur the number of reads is the width / height of the kernel * 2 passes. For this it's ~60 reads plus the additions - having already expended the effort to calculate the integral image, which is not insignificant.

There are a couple of nice advantages:

  1. The edge is razor sharp, but there might be aliasing issues, so I don't know how this will look until we try it. I have some ideas for aliasing but it's hard.
  2. It scales well for any size of kernel as long as the rectangles are computed fairly efficiently (if all rectangles are 1x1 pixel it's disastrously bad!). Edit: Oh and yes while it scales well, the larger your kernel the larger the rectangles, and so theoretically the more spread out your A B C D points might be. But they'd never be further apart than your max kernel width / height.
  3. It handles circles, hexagons, octagons, PDN logo, any kind of bokeh shape you want, as long as there is an algorithm to fit rectangles to that shape. You would create either a pixel-wise kernel that you then fitted rectangles to, or you could do it faster algorithmically for some common shapes.

The main hindrance is fitting nice rectangles to the bokeh shapes you want. The circle is, I think, fairly easy; arbitrary shapes would need some careful thought to ensure they don't degenerate into tiny rectangles, which becomes inefficient.

@rickbrew (Author)

Once we have an integral table, the remainder of what you describe is very doable. From a data transfer standpoint, it's pretty similar to a convolution matrix shader.

For generating the rectangles to sum, we can use standard polygon scan-conversion over on the CPU. This list of rectangles can be fed to the shader through a resource texture containing a 1D array of 4-vectors that describe each rectangle.

So in a way this is a convolution matrix transform, but with clever RLE compression.

For antialiasing, I think it might be possible, but it requires more thought, and it might be best achieved with some approximation (cheating). I do think we'd have aliasing artifacts without some kind of antialiasing or feathering.

The main obstacle then is ensuring that the integral table can be calculated efficiently. I'm still thinking about how that can be done on the GPU instead of the CPU.

@mikepound (Owner)

I agree, this is absolutely a convolution, just with only weights of zero or one which makes summation much faster. Actually there is a suggestion for aliasing on the original blog:

"To get an anti-aliased edge for the circle, the edge pixels can be summed or subtracted with weights separately, doable in the ready-for-prefix-sum representation."

I'm not sure how fast this would be, but if we could calculate scalar weights for the anti-aliased pixels around the edge, we could simply include these in the original sum. Pixels that were weighted down would not be included in any rectangles. The blog also notes that everything here could be precomputed and stored for every possible radius, but I think that's less appropriate here where there are hundreds of possible radii, and where arbitrary kernel shapes are possible.

Scan conversion was definitely the right keyword I wasn't aware of - loads of resources! I'm not super clued up on graphics etc., but it looks like there are quick algorithms for common polygons and the general case. Presumably also aliased versions?

Just fired up the alpha build, it looks really good! How did you do the original "unfocus" feature?

@mikepound (Owner)

Ooh, subtracting with weights actually is really neat, saves messing about removing these pixels from the rectangles. For an aliased pixel of weight t you subtract val*(1-t) from the overall sum.
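A tiny sketch of that idea, purely for illustration - leave the partially covered edge pixels inside the rectangles, then subtract the uncovered fraction afterwards:

def antialiased_sum(rect_total, edge_pixels):
    """rect_total: sum over all rectangles (edge pixels included at weight 1).
    edge_pixels: iterable of (value, coverage t in [0, 1]) for the aliased edge."""
    for value, t in edge_pixels:
        rect_total -= value * (1.0 - t)  # pixel ends up weighted by t overall
    return rect_total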

@rickbrew (Author)

For anti-aliasing, I agree that we would treat them separately. The pixel shader would take 2 input arrays (via resource texture)

  • a 1D array of float4 for each of the solid rectangles, encoded as { dx, dy, w, h }
  • a 1D array of float4 encoded as { dx, dy, alpha, <unused> }

The second array need not even use the integral table -- it could sample from the regular input image. Although, if we can make it work via the integral table, that's one less buffer for Direct2D to marshal around, which could be better.

These two arrays can be generated fairly easily on the CPU side:

  • Using Direct2D and a CPU-side bitmap, render a filled ellipse (or any kind of geometry) with antialiasing enabled to an Alpha8 target (can also use BGRA32 and ignore BGR)
  • Scan the bitmap and RLE-encode the pixels with A=255. This generates the first array. (You could also just tesselate the geometry to a polygon, then scan-convert the polygon -- I can do this easily in the PDN codebase)
  • Scan the bitmap for pixels with A between 1 and 254, which generates the second array

As for Unfocus, it's one of the few effects that I am not sure how to port to the GPU. It uses a "local histogram" of the surrounding pixels, one histogram per channel. This is tough to do on the GPU because I need 4x 256-wide arrays. On the CPU I can incrementally process this by subtracting the left edge and adding the right edge as I move from x to x+1. But a pixel shader can only output 1 pixel and it can't really share any information for the neighboring pixel.

@mikepound (Owner) commented Sep 1, 2022

I've been messing around with a function to return some rectangles from the RLE of a bitmap. It's OK for a first attempt I think, and it works with a variety of shapes. They're rendered in random colours for viewing; as you suggested, the aliasing is ignored but is present (which is why the circles don't look quite circular).

image

It does RLE per row, then merges down rectangles as it can. I could do further merging passes easily enough, but actually I'm not sure it makes things any more efficient. You could make larger rectangles within some of these objects, but doing so would split the existing columns, which may only need 2 reads anyway. In fact it might be more efficient for me to re-split a couple of them so that it's all either rectangles or columns, and there are no rows or single pixels.
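For reference, a simplified sketch of that kind of approach - run-length encode each row of a boolean kernel mask, then merge runs that line up exactly with the previous row into taller rectangles. This is an illustrative reconstruction, not the actual code being described; coordinates are relative to the mask's top-left corner, so you'd subtract the kernel centre to get the offsets used earlier:

import numpy as np

def mask_to_rects(mask):
    """Boolean kernel mask -> list of inclusive (x0, y0, x1, y1) rectangles."""
    open_runs = {}   # (x0, x1) -> y0 of the rectangle currently being grown
    rects = []
    h, w = mask.shape
    for y in range(h + 1):               # one extra row flushes open runs
        runs = set()
        if y < h:
            x = 0
            while x < w:
                if mask[y, x]:
                    x0 = x
                    while x < w and mask[y, x]:
                        x += 1
                    runs.add((x0, x - 1))
                else:
                    x += 1
        # Close rectangles whose run does not continue on this row.
        for (x0, x1), y0 in list(open_runs.items()):
            if (x0, x1) not in runs:
                rects.append((x0, y0, x1, y - 1))
                del open_runs[(x0, x1)]
        # Open rectangles for runs that are not already being grown.
        for run in runs:
            open_runs.setdefault(run, y)
    return rects

# e.g. a solid 3x2 block becomes a single rectangle:
m = np.zeros((4, 5), dtype=bool)
m[1:3, 1:4] = True
print(mask_to_rects(m))   # [(1, 1, 3, 2)]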

@rickbrew (Author) commented Sep 3, 2022

So from what I understand, we don't actually need a full integral table. The full integral table is problematic to calculate in Direct2D because it means that calculating its bottom-right pixel ultimately requires access to the whole input image. It'll work up to 4096x4096 pixels, and refuse to render for anything larger.

I think this is what the paper was describing, but we can get away with tiled/partial integral tables. That is, we split the image into tiles of size b x b, and each tile contains the sums as if the image had been cropped to that tile. If we choose a tile size equal to the size of our "convolution matrix", then the number of accesses we need is still constant: pixel for the bottom of the rectangle, pixel from the bottom of the tile above, and then the pixel for the top of the rectangle within that tile. (and likewise horizontally)

So, for 4x4 tiles, for a 1 pixel-wide, 7-pixel tall rectangle extending from B to D, assuming B->D spans two tiles, we do D + C - A:

+----+----+
|    |  A |
|    |  B |
|    |    |
|    |  C |
+----+----+
|    |    |
|    |    |
|    |    |
|    |  D |
+----+----+

There would be 4 cases to consider: whether the rectangle spans 1 or 2 tiles horizontally, and then vertically. So there'd be some branching in the shader. Might be able to mask that away with some clever use of sign() and multiplication.

Is that correct? The inner loop would be slower, but all the work to get there should be much faster and simpler. But, I'm not confident in my math. This would limit the size of our RLE convolution matrix, but should still allow calculations to be separable into H and V passes.

@rickbrew (Author) commented Sep 3, 2022

For a rect that spans all 4 tiles (expanded to 5x5 tiles for legibility), we'd have:

I + H - G + F - E + D - C - B + A

Note that only I is one of the rectangle's vertices

+-----+-----+
|     |     |
| A  B|   E |
|  +------+ |
|  |  .   | |
| C| D.   F |
+--|..+...|-+
|  |  .   | |
|  |  .   | |
|  |  .   | |
| G+-H----I |
|     |     |
+-----+-----+

@rickbrew (Author) commented Sep 3, 2022

Also I think the shader can take 4 sets of RLE data:

  • rectangles
  • vertical strips
  • horizontal strips
  • pixels

Each of these has a slightly different implementation strategy, and the latter 3 can be optimized a bit. At least this way if we have an irregular shape that has a lot of horizontal or vertical strips, we aren't using the full rectangle evaluator.

Might actually use 4 different shaders, which would permit omitting any that aren't needed for a particular convolution. It might also just be easier to write the code that way. But, that's something I'd let the performance dictate.

@mikepound (Owner) commented Sep 3, 2022

My lack of knowledge of Direct2D is showing :) I agree that I don't think you need access to the full table as long as you can visit neighbouring tiles, and if the size of the tiles matched the max convolution, you'd only need two tiles horizontally or vertically. Here are some questions so I can understand it better:

  1. Suppose for the sake of argument calculating large integral tables was cost free, and that the max size of the convolution operator was e.g. 300px by 300px. In such a case theoretically the maximum distance between read points would be 300px, but are you saying that this still wouldn't work without tiling, because to do this across the image the shader would eventually need access to anything?

  2. How did you deal with this in the complex kernel implementation? In my python implementation I lazily just held the whole image and all intermediate convolutions in RAM as full size, did you tile it?


I've continued fiddling about with the rectangle generation code, should have another implementation of it shortly that fixes a couple of niggles in the ones above. My original implementation somewhat overcomplicated things. Having played with it a bit I'm not convinced it's worth the effort to utilise all three of rectangle, horizontal and vertical integral images.

image

This is basically vertical RLE followed by merging of aligned columns. I'm trying to decide if it's worth doing more here to maximise the size of the rectangles in use. Theoretically there are some large "wide" rectangles that could span most of the kernel, for instance the blue and yellow ones here:

image

I did those two by hand simply to illustrate. Actually this configuration is almost 2x less efficient, because all those low cost vertical columns are now split in two. To make this kind of thing work we'd then have to reorient many of the vertical sums into horizontal - but not the ones at the side! This is a lot more complex as an algorithm, and I'm not sure how effective it becomes when your kernel is very complex in shape. For simple shapes maybe, but at that point you'd be better off hand crafting an optimal layout for every possible shape.

One advantage of having only rectangles and 1px vertical columns is that you no longer need to store the horizontal integral map at all, saving 1/3 of the memory? I think super optimising the rectangles vs horizontal vs vertical components for each kernel would maybe get you 50% savings at most, but I think this approach needs many fewer reads than are currently required in a complex convolution using e.g. 10 components, which is still pretty quick.

@rickbrew (Author) commented Sep 3, 2022

  1. Suppose for the sake of argument calculating large integral tables was cost free, and that the max size of the convolution operator was e.g. 300px by 300px. In such a case theoretically the maximum distance between read points would be 300px, but are you saying that this still wouldn't work without tiling, because to do this across the image the shader would eventually need access to anything?

If the integral table was provided separately -- let's say we calculated it on the CPU, stored it in a bitmap, and then sent it over to the GPU in the same manner as the regular image -- then no, this would work fine. You'd only need access to the pixels you originally described, which is tightly bound by the size of the convolution. It's easy to do all sorts of random access on the CPU with huge bitmaps. I can access (x=0, y=0) and (x=300,000, y=25,000) in two adjacent instructions with no trouble at all (assuming monolithic bitmap storage).

The difficulty is when we want to calculate the integral table on the GPU. In that case our input image needs to run through an effect with two pixel shaders, separated into horizontal and vertical passes. Pixel shaders do calculations one pixel at a time and cannot share state between pixels, which means output pixel [x=width-1, y=0] would need to read all the way back to [x=0, y=0]. So, some form of tiling and some other magic (as described in that paper, I believe) would be necessary to get Direct2D to render this.

  2. How did you deal with this in the complex kernel implementation? In my python implementation I lazily just held the whole image and all intermediate convolutions in RAM as full size, did you tile it?

The bokeh blur is implemented as a series of transforms. Here's the code over in ComputeSharp.D2D1: https://github.com/Sergio0694/ComputeSharp/blob/main/tests/ComputeSharp.D2D1.Tests/Effects/BokehBlurEffect.cs#L415 . Paint.NET's code looks very different but is basically doing the same thing, also using Direct2D.

// Build the effect graph:
//
// [INPUT] ---> [BORDER] --> [GAMMA EXPOSURE] ---> [VERTICAL REAL] ----
//                                 \                                   \
//                                  -----> [VERTICAL IMAGINARY] ----> [HORIZONTAL] ---> [INVERSE GAMMA EXPOSURE] ---> [OUTPUT]

(The 'border' node provides a means for a sampling clamp at the borders, so that trying to read from, say, [-3, -3] produces the same result as [0, 0]. Sometimes this is necessary or not, and you can use other border modes like Wrap and Mirror, and I have also implemented Transparent. There are some pictures here that illustrate how it works: https://docs.microsoft.com/en-us/windows/win32/direct2d/border )

Each one of these nodes in the effect graph is essentially a function that maps from N input images (1 or 2 in this case) and M configurable property values to 1 output image. Each node has a pixel shader for the GPU-side processing. Each node also has two internal functions, MapInputRectsToOutputRects and MapOutputRectToInputRects, that define the area of the output based on the areas of the inputs, and then which areas of the inputs are needed to produce areas of the output.

There is no manual management of buffers or bitmaps. I produce the GPU-side [INPUT] image from a CPU-side bitmap. I then create the other transforms and connect them together. I have a device context, aka render target, and once I have [OUTPUT], I call DrawImage() to draw the effect graph to the render target. The render target is attached to a GPU-side bitmap that I can then present to the display or copy back to a CPU-side bitmap for storage or other processing.

(BTW, "image" vs "bitmap" is an important distinction. A bitmap is a monolithic allocation that holds pixels and provides direct access. An "image" is an object that defines an output area and that can copy pixels into your buffer. A bitmap is an image, in the OOP sense. See also IWICBitmapSource and IWICBitmap for the CPU-side equivalents. Also, an effect is an image too. Effect inputs are images.)

(Also, I'm referring to "effects" and "transforms" loosely as if they're the same thing. They're not, but that's a lower-level detail. Effects are connected together to form an effect graph. Each effect internally defines a transform graph. The effect graph is "flattened" to produce the full transform graph, and that's what Direct2D actually renders. Effects are a high-level construct for regular folks who are doing image processing tasks, and transforms are utilized by effect implementors. One transform = one pixel shader.)

During DrawImage(), Direct2D does all the work of asking the nodes in the transform graph about how all the input regions and output regions are mapped, and then it figures out a way to buffer and shuffle pixels from one node to the next. It executes the pixel shaders in the manner prescribed by the transform graph and the input/output mappings. If any node's output requires too much area from an input, Direct2D will refuse to render the effect graph. Direct2D will try to subdivide rendering into tiles, splitting output areas to try and get input areas that are small enough, but it can only go so far and it can't do magic.

The gamma exposure nodes will say that their input/output rects have a 1:1 mapping. Each output pixel is strictly a function of the input pixel at the same coordinate. Easy. Direct2D refers to this as a "simple" transform.

The "VERTICAL" convolution nodes will say that any given output rect maps to { input.Left, input.Top - kernelLength, input.Right, input.Bottom + kernelLength }, which defines the area that the pixel shader can read from. Similarly for the "HORIZONAL" convolution node, but along the other axis. Direct2D refers to these as "complex" transforms.

A pixel shader can (try to) read from areas outside of the areas declared by MapOutputRectToInputRects. Out-of-bounds accesses on the GPU do not produce an error, but essentially produce undefined results. You'll either get 0's (transparent black) or colors that are leftover from other rendering passes (gibberish, in other words).

Direct2D may determine that it can do everything in one rendering pass and bind the whole input image to the pixel shader's inputs. This will be true for smaller images, probably 4K x 4K or less. Otherwise it'll copy into intermediate textures and render across multiple passes using only sub-regions of the image.

@rickbrew
Copy link
Author

rickbrew commented Sep 3, 2022

One advantage of having only rectangles and 1px vertical columns is that you no longer need to store the horizontal integral map at all, saving 1/3 of the memory? I think super optimising the rectangles vs horizontal vs vertical components for each kernel would maybe get you 50% savings at most, but I think this approach needs far fewer reads than are currently required in a complex convolution using e.g. 10 components, which is still pretty quick.

I don't think 3 integral tables are necessary, just the 1 that has sums for both vertical and horizontal.

I think for any given convolution shape, optimization of its rectangles should not try too hard. Scan-convert it to horizontal rects, scan-convert it to vertical rects, and use whichever one has fewer rects. Of course there's opportunity to spend more time optimizing it, but I think that's a separate research topic. The optimal scan-conversion strategy may very well be GPU-dependent based on how the hardware implements and optimizes memory access, and how parallel execution amongst cores is structured.
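Just to spell out what I mean by the single table: it's a standard 2D summed-area table, and each scan-converted rectangle then costs only four reads. A minimal sketch (numpy, not code from either implementation):

import numpy as np

def make_sat(image):
    # 2D summed-area table with a zero row/column on the top/left so
    # lookups at the image edge need no special casing.
    sat = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = np.cumsum(np.cumsum(image, axis=0), axis=1)
    return sat

def rect_sum(sat, left, top, right, bottom):
    # Sum of image[top:bottom, left:right] (right/bottom exclusive),
    # regardless of how big the rectangle is.
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]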

I think the most I'd do is identify disjoint regions (polygons) and optimize them separately. There's no reason we can't have a convolution shape like this, for instance, as it's just polygons that need scan-conversion:
image

@mikepound
Copy link
Owner

mikepound commented Sep 4, 2022

Thanks a lot for this, loads to think about! I've made some progress. I've got an initial implementation working, and have learned quite a bit! I've done calculation of integral images, rectangles, aliasing and so on. I've not sorted the border for the filter yet because it's quite complex, and I'm trying to think of the most efficient way to do it. At the moment it just remains black and I've cropped it off.

Here's a few examples:

Circle:
image

Octagon:
image

Circle:
image

Octagon (it is subtly different I swear):
image

Donut:
image

The natural blurriness of images helps somewhat with aliasing; it's more of an issue where sharp gradients coincide with sharp kernel edges. Handling the aliasing does make a difference:

Without / With
imageimage

imageimage

It's hard to gauge speed because this is python and the complex filters are also really slow, but I think it'll be fine. The integral image is quick on the smallish images I'm using, and rectangle finding is almost instant. It scales ok; obviously larger kernels and more aliased pixels add computation.

There are no ripples in this approach, so you can ramp up the gamma a bit:

Gamma 9
image

Not perfect: any imperfections in the circles become more pronounced, and there are some odd areas, probably where the extreme gamma has amplified dark areas and revealed issues, but to me it looks pretty good.

@rickbrew
Copy link
Author

rickbrew commented Sep 6, 2022

I've not sorted the border for the filter yet because it's quite complex, and I'm trying to think of the most efficient way to do it.

You can expand the bitmap by the size of the needed border and fill it using transparent pixels or with clamping or mirroring. This is how we do it in Direct2D using the BorderEffect -- although we get the nice benefit that, while it affects the size of the image, it doesn't affect its coordinate space. The pixel at (0,0) is the same, the border extends to infinity from all edges.

Direct2D's Gaussian Blur effect documents that it uses a mirror border when you use a "hard" border mode. Using a "soft" border will treat all out-of-bounds pixels as if they're transparent black (all zeros), and will produce an output that is larger than the input by an amount equal to the radius (so they say; I've measured some variance there). A hard border will produce an output image that is the same size/location as the input image, and out-of-bounds pixels will be pulled from inside the image, e.g. if (inputX < imageLeft) { inputX = imageLeft + (imageLeft - inputX); }

You'll probably need to create a copy of the bitmap to do this, but then you won't have to worry about anything in the rendering code. Just clamp your for-loops to only process the un-bordered portion of the bitmap. You should only have to expand the image by the size of the convolution. Direct2D can extend things to infinity because it virtualizes it -- it backtracks from the final output region to determine what input areas are needed, and only worries about those.
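On the CPU side, numpy's pad modes map pretty directly onto those border modes; a tiny sketch (single-channel image, names mine):

import numpy as np

def add_border(image, k, mode):
    # "clamp" repeats the edge pixel, "mirror" reflects across the edge,
    # "transparent" fills with zeros -- roughly Direct2D's Clamp / Mirror
    # border modes and the treat-out-of-bounds-as-transparent-black case.
    numpy_mode = {"clamp": "edge", "mirror": "symmetric", "transparent": "constant"}[mode]
    return np.pad(image, k, mode=numpy_mode)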

@mikepound
Copy link
Owner

Would you advocate adding any kind of border, mirroring etc prior to calculating the integral image? I can't quite wrap my head around border stuff when we're talking about an integral image, and rectangles that may sit fully outside the image, or partly outside the image. It could mean a lot of slow code for checking border / edge intersections. In essence: what is the correct sum for a rectangle which has some part of it lying outside the normal image region?

We've got exam resits this week and I'm exams officer so progress will slow for a few days while I sort through that mess. I'll get back to it soon!

@rickbrew
Copy link
Author

rickbrew commented Sep 7, 2022

Because the convolution will be sampling from the border region, it should be included in the integral image.

So you take your input image, apply a border of (kernel size) pixels (might be able to get away with kernel size - 1). Calculate the integral image from that. Run the convolution with the bordered image and the integral image that was calculated from it, and write to a new bitmap of the original size. The loop that calculates output pixels only needs to run on pixels inside the border (that is, not including the border), and won't need to worry about checking that the pixels it's reading are in-bounds.

So the input is (w x h), bordered and integral image are (w + k*2) x (h + k*2), output is (w x h). The rendering loop can run on pixels (left=k, top=k, right=w+k, bottom=h+k).
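In numpy-ish pseudocode, that recipe looks something like this (just a sketch for a single-channel image; apply_kernel_rects is a hypothetical stand-in for whatever sums the kernel's scan-converted rectangles at each position):

import numpy as np

def bokeh_with_border(image, kernel_rects, k):
    # 1. Pad by the kernel size so the convolution never reads out of bounds.
    padded = np.pad(image, k, mode="edge")            # (h + 2k) x (w + 2k)

    # 2. Integral image of the *padded* input, with a zero row/column on top/left.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = np.cumsum(np.cumsum(padded, axis=0), axis=1)

    # 3. Output is the original size; only loop over the un-bordered region.
    h, w = image.shape
    out = np.empty_like(image, dtype=np.float64)
    for y in range(k, h + k):
        for x in range(k, w + k):
            out[y - k, x - k] = apply_kernel_rects(sat, kernel_rects, x, y)  # hypothetical helper
    return out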

@Sergio0694
Copy link

Applying this border is also a very common practice in deep learning frameworks such as TensorFlow, where convolutional nodes give you the ability to choose between different padding modes. Traditionally, common ones are no padding, which means the convolution only computes values for an inner area of the image excluding the outer frame of thickness (kernel size - 1), so the result is smaller than the original image along each axis. What Rick is suggesting instead maps to "same" padding, where the image is pre-padded with an extra frame filled appropriately (e.g. could be all 0s, or integral values here), so that the convolution code doesn't need to worry about bounds checks. This is generally a very good idea because preparing the padding is usually cheap, and it's worth the tradeoff considering the performance boost you get from removing bounds checks (i.e. eliding a ton of conditional branches) from the inner convolution code, which is the hot path for the operation 🙂

@rickbrew
Copy link
Author

rickbrew commented Sep 7, 2022

BTW here's a link to the paper by Michael A. Shantzis, "A model for efficient and flexible image computing," that describes the algorithm used in D2D's imaging/effects system: https://dl.acm.org/doi/10.1145/192161.192191 ... it's a PostScript file though, and I don't even have an app installed that can load that right now. The paper is linked from this page in D2D's documentation for ID2D1DrawTransform: https://docs.microsoft.com/en-us/windows/win32/api/d2d1effectauthor/nn-d2d1effectauthor-id2d1drawtransform

I tried downloading this a long time ago but it was behind a paywall or some other confusion.

@mikepound
Copy link
Owner

Thanks, very interesting :) I found the PDF, might be a uni thing, a lot of these paywalls seem to disappear depending on the machine I use.

I should be able to test out both mirror border, repeating border etc fairly easily. It definitely sounds like adding some padding here is easier than messing about with rectangle intersections. I was trying to be clever with my coordinates but it was turning into a mess.

As a complete aside, did you see this?

https://twitter.com/nousr_/status/1563912133120647168

Maybe there's a future plugin to be written. They ended up implementing a locally listening python server that handles the stable diffusion. I've been using SD the last couple of days, getting it doing image 2 image is fairly easy.

@rickbrew
Copy link
Author

Yeah the StableDiffusion stuff is ... a whole new thing that's going to be very interesting to watch.

On another note, I know you said you weren't able to improve on the values for kernel parameters 1 through 4, but did you generate them?

I've run into what seems to be a bug ... if I set it to quality=1, radius=0.9, the output is .... blank! The reason is that the sum of the generated kernel values is negative, then the sqrt gives us NaN and everything rolls downhill from there. A value of radius=0.85 produces some kind of sharpened output, as does even radius=1

Original:
image

Radius=1, Quality=1
image

Radius=0.85, Quality=1 (you can type 0.85 into PDN, and it'll display 0.9 but it is using 0.85)
image

And r=0.9 is just blank due to the NaN.

Soooooo my question is, did you generate those values? If so, what were they? I could plug them in and see if they work around this bug. If I can't find a resolution to this I may just have to set the min radius to 1. Which isn't terrible, but it's nice to be able to pull the slider all the way down to 0 for easily comparing against the unprocessed image.

@rickbrew
Copy link
Author

rickbrew commented Sep 26, 2022

If I had to take a guess at what's going on, it's that the kernel values are only reasonable when sampled within +/- radius from the center. Since the kernel must be generated for integer pixel offsets, grabbing the value at dx=-1 (and dx=+1, of course) when r<1 means we end up with values that dip way out of bounds.

And, PDN's code is special in this case because it actually permits non-integer radius values. Everything else I've seen is using integers.

@rickbrew
Copy link
Author

(btw that cat picture was generated with Stable Diffusion, I found it on the Discord server by user Jim|MaximumJames|40 https://discord.com/channels/1002292111942635562/1002300536906858637/1021288145049886730 )

@mikepound
Copy link
Owner

It takes no time to generate the 1-4 component kernel parameters, so I'll do this tomorrow. I'd never considered what happens with fractional or very small radius values; are you using similar code to mine to map radius to kernel? I may need to plot these on a graph to see what happens, but it could also be the normalisation code - it's been a while since I looked at that bit.

The sum under the kernel should be close to 1, otherwise the image will get brighter or darker when you convolve. In the normalisation function, I pass over the 2D coords of the kernel and calculate the kernel value at that position, summing these all up. You also have to do it across all components.
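Roughly, that normalisation amounts to something like this (a sketch from memory, not line-for-line the repo code; params holds each component's (a, b, A, B), and the 1D kernels are the complex ones built from exp(-a*x^2) * (cos(b*x^2) + i*sin(b*x^2))):

import math

def normalisation_scale(kernels_1d, params):
    # kernels_1d: one complex 1D kernel per component
    # params:     (a, b, A, B) per component; A and B weight the real and
    #             imaginary parts of the 2D kernel value
    total = 0.0
    for kernel, (_, _, A, B) in zip(kernels_1d, params):
        for kx in kernel:          # the 2D kernel is the outer product
            for ky in kernel:      # of the 1D kernel with itself
                k2d = kx * ky
                total += A * k2d.real + B * k2d.imag
    # Each 2D weight is a product of two 1D values, so the 1D kernels get
    # scaled by 1/sqrt(total). A negative total here is the failure mode
    # Rick hit (his sqrt turned into a NaN).
    return 1.0 / math.sqrt(total)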

I'm not sure quite what happens when r<1; I assume that the actual kernel you produce has some minimum size, e.g. 3x3? The sharpening is also interesting. My guess is the same scaling down is happening, but rather than the kernel being negative, you're ending up with something where the peaks and troughs lie above and below zero, and it accidentally becomes a kernel that amplifies edges.

@rickbrew
Copy link
Author

Yes, I believe the minimum kernel size is 3, even for r<1. It's not clear to me how this should be done. Maybe I need to just clamp values so that non-integer radii don't sample the "nonsense" values. But then when r<1, we would only get 1 real sample since the first and last values of the 3-width kernel would just be 0. An integral of some kind might be the proper approach, but I'm not sure how to approach that. Might be able to get it good enough by checking whether the kernel position being generated is outside of the radius, clamping to that radius, using that value instead, and then multiplying by its fractional coverage within that "pixel".

Mapping radius to kernel size is just 1 + (int)ceiling(2 * r). So for r=0.9 we get 3.

I imagine the problem is something like this. The values outside of [-r, +r] "make no sense" because it's a function that approximates what we need in the range [-r, +r], and values outside that range should really be treated as "undefined".
image

@mikepound
Copy link
Owner

mikepound commented Oct 2, 2022

I got a bit delayed because I forgot my laptop got reset, and I had to track down some ssh keys! My head keeps switching between radius and width, so apologies if I say 1 somewhere and mean 0.5 or vice versa. Also this post ended up more meandering and messy than I hoped!

If we had some infinite number of components the function would be perfect: any value outside [-r, +r] would be zero, and any value inside would be 1. I've been thinking about it and I don't think there is any mathematically sound way of computing this function with a circular diameter less than 1 pixel. In this case the centre pixel should have a weight of 1, and all other pixels should have a weight of zero. Once the width of the kernel extends beyond 1 then it does make sense to compute the function. So personally I would simply skip processing for any radii less than 0.5. This is not the same as a gaussian, where a standard deviation of e.g. 0.5 still has some small weighting from neighbouring pixels, it just approaches 0. Here it is actually 0.

The radius of the 1 component kernel is about 1. I scaled it slightly, by 1.2 or 1.4? I think this is because the edge is very sloped, and I wanted the flat bit at the top to appear to be of radius 1. In the end it's essentially radius 1, but with a blurry edge. For this reason it makes sense to include the surrounding pixels, so a kernel of 3x3 or maybe even 5x5. I might clamp the minimum kernel size to 5x5 to ensure everything of interest is included in all cases.

For the slightly noisier low component kernels you're probably right that an integral would calculate a more realistic size for the kernel, and thus a better normalisation. More at the end. I don't think this is worth doing for radius < 0.5. Mathematically, I think a circle of width <1 would essentially represent an aperture so small that the amount of light passing through goes down, and actually darkens the image based on the size of the circle. I don't think you want to implement this as it's not really what the filter was for :) And it wouldn't work with your "compare with zero using the slider" thing.

I've recomputed components 1 - 4. I don't think the fits are quite as good as the numbers you have (it's marginal) because the author tweaked some parts of his algorithm for these fits. There are a few if statements and bits of code commented out, and no documentation on which was enabled when. He may have simply run it 100 times and picked the best.

1
0.862714498622641, 1.624794263463742, 0.767551612601682, 1.862919511588507

2
3.155404442724232, 0.703198414661034, 0.920921046139557, 18.021201181546445
1.182294056719569, 4.600526022607052, -0.010000000000000, -1.260260937600167

3
3.953448779157470, 0.991843974081225, 0.943505668188858, 31.338597194679636
2.620116382571133, 4.389281049366630, 0.486142877584094, -5.467335340779987
1.175491238656623, 8.526000503448413, -0.459506784623476, 0.010028006959629

4
4.505962124055110, 1.429535015131939, -4.020358713665534, 53.040071780505855
3.800557009750039, 4.561837825928055, 7.869489653436118, -16.793808735808717
2.734472908424185, 8.161625744063489, -2.869958144842295, 0.424002509294800
1.324829807722022, 12.334896037853868, 0.010000202441957, 0.238695589424511

image

I've been thinking about this image of the 1 component kernel. I think the integral is an issue, as is the general shape of this kernel with its highly negative values. At about 0.85, where you saw issues, I'm pretty sure the two negative dips at the edge would sit directly over the neighbouring pixels, and this would cause the sharpening effect you noticed. Consider also when the centre sits over 0 and the two dips over the -1 and +1 pixels. The sum gets calculated to determine by how much to normalise, but this measurement is not including the two peaks on the edge of the disk, so it doesn't surprise me that it ends up negative. To solve this we could approximate an integral by sub-sampling the pixels, e.g. calculating at -0.5 and +0.5 as well. This causes knock-on effects, because again looking above, there is no sampling happening at the two peaks when the blur is computed, and that is going to do odd things. I think this is less of a problem for >2 components because the centre of the disc is much flatter, so it has less effect.

Does what I'm saying make sense? If so, maybe we can come up with something. I think it might be a hack - <0.5 radius is zero blur, >1 radius is bokeh blur, and 0.5-1 is something else :)

@rickbrew
Copy link
Author

rickbrew commented Nov 5, 2022

Alright I'm back from a 3 week vacation in Europe (Romania and Ireland), so I'll be able to take a look at this again soon -- I'm entering the home stretch for PDN v5. Wrapping up a ton of proofreading for the new plugin APIs, and should be able to tackle this bug in Bokeh too :)

@rickbrew
Copy link
Author

@mikepound fyi I've released a public alpha for PDN v5: https://forums.getpaint.net/topic/121340-paintnet-50-alpha-build-8361-welcome-to-paintnet-50/

Feel free to share it out and show folks the new ultra-high quality Bokeh blur that you helped with :)

@rickbrew
Copy link
Author

rickbrew commented Dec 4, 2022

Okay I believe I have found a good fix for this.

There are two problems that come up when generating the convolution kernel values:

  1. For non-integer values of r, we end up generating kernel values outside the range +/- r. For instance, if r = 1.5, we are creating values with dx at -2, -1, 0, +1, and +2. However, the values for -2 and +2 are outside the "good" zone for the kernel; they dip below zero in ways that cause funky things to happen, e.g. the sharpening or "blank" effect I described above.

  2. The kernel is a continuous function that we are effectively point sampling. If r is 1, for instance, we end up with a triangle (I'm borrowing your image from above):
    image

As r becomes larger, the generated kernel values more closely approximate the true continuous kernel, but for small kernels the error can be brutal.

Ideally we'd take the integral of the continuous kernel, weighted by distance from the pixel center, for each pixel offset in the kernel. So if r=2, for kernel[0] (which is for dx=-2), we'd want the integral of the kernel over dx=[-2.5, -1.5], weighted by w, where w is a triangle function (0 at -0.5, 1 at 0, 0 at +0.5). But, that's ... well, I didn't want to figure out the integral stuff since I was pretty sure I'd get it wrong.

Instead, what I'm doing is generating the kernel values at small intervals (in the code below, 1024 samples per pixel, but that's probably overkill). When generating a kernel value at a non-integer offset, it must be spread across 2 kernel indices. The way we do this is kind of a reverse of linear sampling: instead of telling the shader to sample a pixel at a non-integer coordinate (using linear sampling), we bake the weights into the kernel values themselves and use regular point sampling. In other words, telling the GPU to take a linear sample at x=0.25 is the same as having it take a point sample at x=0 and at x=1 and weighting those two samples correctly.

So if r=2 and we're generating the kernel value at dx=1.25, we put 75% of the value into the slot for dx=1 and 25% of it into the slot for dx=2.

With this I'm able to have a great looking bokeh blur at non-integer r, and for r < 1. Dragging the slider through these values shows very smooth transitions between all values without any jumpy transitions in the output as the effective kernel switches from crude triangles to something that better approximates the true continuous kernel.

private static Complex[] CreateComplex1DKernel(
    double radius,
    int kernelSize,
    double kernelsScale,
    double a,
    double b)
{
    Complex[] kernel = new Complex[kernelSize];
    double radiusCeil = Math.Ceiling(radius);
    double invRadius = 1.0 / radius;

    double nGranularity = 1024.0; // this can probably be much smaller, but it does produce very good results
    double nInterval = 1.0 / nGranularity;

    // Instead of calculating a kernel value at 1-step intervals, we are calculating them at very small intervals.
    // When we are 'between' pixels (n and i are not integers), we spread the kernel value across the two adjacent
    // pixels using the same principles as (bi)linear sampling. For example, when iF=0.25, we calculate the kernel
    // value at n=(0.25-radiusCeil), and then put 75% of that into kernel[i] and 25% of it into kernel[i+1].
    // This approximates taking the integral of the kernel, rather than point sampling it. For small kernel radii,
    // especially for radius<2, this avoids ending up with a triangle-shaped kernel, and it also avoids a large
    // jump in blurring and quality between r<1 and r>=1.

    for (double n = -radiusCeil; n <= radiusCeil; n += nInterval)
    {
        if (n < -radius || n > radius)
        {
            // The kernel values outside of +/- r will often be goofy since we are approximating the kernel,
            // the parameters of which were generated only to promise good values within +/- r.
            // Values outside the range will often be negative and cause weird things to happen.
            continue;
        }

        double iF = n + radiusCeil;
        int iLeft = (int)Math.Floor(iF);
        int iRight = (int)Math.Ceiling(iF);
        double weightLeft = 1.0 - (iF - iLeft);
        double weightRight = 1.0 - weightLeft;

        double value = n * kernelsScale * invRadius;
        double valueSqr = value * value;
        double expA = Math.Exp(-a * valueSqr);
        double cosB = Math.Cos(b * valueSqr);
        double sinB = Math.Sin(b * valueSqr);

        Complex newKernelValue = new Complex(expA * cosB, expA * sinB);

        if (weightLeft > 0)
        {
            Complex oldKernelValueLeft = kernel[iLeft];
            kernel[iLeft] = new Complex(
                oldKernelValueLeft.Real + (newKernelValue.Real * weightLeft),
                oldKernelValueLeft.Imaginary + (newKernelValue.Imaginary * weightLeft));
        }

        if (weightRight > 0)
        {
            Complex oldKernelValueRight = kernel[iRight];
            kernel[iRight] = new Complex(
                oldKernelValueRight.Real + (newKernelValue.Real * weightRight),
                oldKernelValueRight.Imaginary + (newKernelValue.Imaginary * weightRight));
        }
    }

    return kernel;
}
