How to generate kernel_params, and go beyond 6 of them? #2
This is a very interesting problem :) But it's also turning out to be quite a challenge! Not having the facility to do this myself was something that bugged me at the time of writing my original code, so this has been fun. This said, I have little in the way of optimisation experience, so this was a steep learning curve!

TL;DR: After some moderately successful attempts of my own, I actually found the original source code for the optimisation at the end of the blog post. With my own optimisation in SciPy I was fairly close, but unable to obtain the extraordinary accuracy of the original kernels beyond about 4 components. I think there are a couple of reasons, but primarily Python is terribly slow for this, maybe 1000 times slower than the C++ implementation. I could probably close the gap a bit with some careful optimisation, but it doesn't seem like a good use of time. I'm going to focus instead on cleaning up and adapting the C++ implementation - it is super quick. The best results I have so far come from running the C++ global optimiser (more on this below) and then passing the result through the Python local optimiser for a polish. This also has the side effect of increasing the precision of the decimals we have for all component sizes.

Optimisation of kernel parameters

Prior to finding the original code, I came up with my own optimisation approach. As it happens this is fairly close to the one used in the C++ implementation. The function we're trying to fit is some kind of finite impulse response. It's generally zero outside of the range -1 to 1 (called the stop band), and 1 within this range (pass band). There is a small transition between 0 and 1; it looks like this:

To optimise you need to code a cost function that measures how good a given set of parameters is. My cost function ended up being pretty similar to the C++ one, which is a sum of squared error between the computed kernel and this ideal curve. I also had some success in measuring the actual ripple in both the pass and stop bands. The problem with the optimisation is that it's extremely easy to get stuck in local minima, i.e. a broadly good fit to the data but not the actual minimum ripple possible. I tried many different optimisation algorithms, but because it takes a long time in Python, even if they are finding their way to a great solution, it's too slow to spot! Here is an example of a 6 component kernel that fits pretty well, but the ripple is just a little too big:

I also made use of the SciPy differential evolution approach that is used in the C++ version, but it's too slow to converge in Python. My best result was about 50% bigger than the original kernel (which in the grand scheme of things is nothing, but it's not good enough). I could probably match the performance of the original if I worked at it, but I've come to the conclusion my time would be better spent adding features to the C++ version. For example, donut bokeh, exploring different transition band sizes and so on.

C++ Optimiser

The original source code is right at the bottom of the blog. Most of it is aimed at reading and writing PNGs and doing the complex kernel convolutions (which here are implemented via Fourier transforms - probably faster, but I don't think it's a huge issue either way in PDN's case - does PDN use FFT convolution?). I've stripped out all of the code to do with images and left only the kernel optimisation. I ran the 7 component kernel on our servers overnight for 1 billion (lol) iterations, and it's absolutely nailed it.
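For illustration, a minimal sketch of the kind of cost function and SciPy differential-evolution setup described above. It assumes the fitted quantity is the weighted sum of complex-Gaussian components exp(-a x²)(A cos(b x²) + B sin(b x²)) along the radius, with an ideal response of 1 in the pass band (|x| < 1) and 0 in the stop band; the bounds, sample counts and iteration limits here are purely illustrative, not the values the C++ optimiser uses:

```python
import numpy as np
from scipy.optimize import differential_evolution

def profile(params, x):
    # Weighted sum of complex-Gaussian components evaluated along the radius.
    # params is a flat array [a1, b1, A1, B1, a2, b2, A2, B2, ...].
    y = np.zeros_like(x)
    for a, b, A, B in params.reshape(-1, 4):
        y += np.exp(-a * x**2) * (A * np.cos(b * x**2) + B * np.sin(b * x**2))
    return y

def cost(params, x, target):
    # Sum of squared error against the ideal pass/stop-band curve.
    return np.sum((profile(params, x) - target) ** 2)

n_components = 6                      # hypothetical component count
x = np.linspace(0.0, 2.0, 2000)       # the profile is symmetric, so x >= 0 is enough
target = (x < 1.0).astype(float)      # 1 in the pass band, 0 in the stop band (transition ignored)

# Hypothetical bounds for each (a, b, A, B) tuple
bounds = [(0.5, 20.0), (0.5, 20.0), (-2000.0, 2000.0), (-2000.0, 2000.0)] * n_components

result = differential_evolution(cost, bounds, args=(x, target), maxiter=2000, tol=1e-12)
print(result.x.reshape(-1, 4))
```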
I then took the parameters and tweaked them using a local search (DE and the others are global searches, better for getting out of local minima, but they need fine tuning at the end). The result is great: it's small, but you can basically see the ripples of the 6 component above and below the 7. The original 6 component model has a ripple size of 0.0012 compared to the new 7 component ripple size of 0.0038, 3x smaller!

Next Steps

I'm running 8, 9, and 10 component models now on our servers. I'll add some numbers when they come in. I'm not sure what's causing the ringing in PDN's implementation, but I suspect that when you use a high gamma / kernel width, this essentially scales up the kernels massively, causing even small ripples to become a bigger deal. We will see, I guess, if higher component counts help. There may be other issues such as the size of the kernel (in pixels) compared to the mathematical one within it, but that's probably something to look at later. I'm going to re-optimise the lower component counts too, as I think I can push the accuracy higher than the original post through a combination of running it longer and fine tuning. For now, here is the 7 component model: [a b A B]

It's probably worth putting this in and seeing if the banding goes down in extreme cases. It was noted by the original author that pushing the components up and the ripple down has the side effect of increasing some of the component weights (A, B). This probably isn't an issue at the precision you're using, but with gamma we might want to go over some numbers to make sure there isn't a chance of overflow in extreme cases. I'll post again when I have more numbers, and I'll start messing around adapting the C++ code - will post that code up soon.
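A rough sketch of that local "polish" step, assuming the global result is refined with a derivative-free local optimiser such as Nelder-Mead (the exact local method and tolerances aren't specified above; global_best, cost, x and target are hypothetical names carried over from the previous sketch):

```python
from scipy.optimize import minimize

# global_best is assumed to be the flat parameter vector found by the global (DE) run;
# cost, x and target are the same objects as in the sketch above.
polished = minimize(cost, global_best, args=(x, target), method="Nelder-Mead",
                    options={"maxiter": 50000, "xatol": 1e-14, "fatol": 1e-14})
print(polished.x.reshape(-1, 4))
```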
Wow, thanks for the detailed explanation! Well, if it takes server(s) running overnight to generate these, that certainly answers my question about whether these could be generated at runtime: no :) I added the 7th component and it does reduce ringing a little bit:

The first 6 kernels I have are only specified with 6 decimal points of precision:

// 3 components
new Vector4Double[]
{
new Vector4Double(2.17649, 5.043495, 1.621035, -2.105439),
new Vector4Double(1.019306, 9.027613, -0.28086, -0.162882),
new Vector4Double(2.81511, 1.597273, -0.366471, 10.300301)
},

Do you have the first 6 with these higher-precision values like you do for the 7th? That might also help the ringing a little. It may be prudent for me to cap the gamma at a lower value as well.

Edit: I plugged in the set of 6 higher-precision numbers from the blog and it produced a null output (all zeros), so those numbers seem incompatible. Using the high gamma values may require 10 or more passes to get rid of the ringing for good, but we'll see as the numbers come in.
Also, I assume that the kernel scale for the 7th set of numbers is also 1.2?

private static IReadOnlyList<double> KernelScales
{
get;
} = new[] { 1.4, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2 };
That looks much better. I agree that (assuming ripple continues to decrease) 9-10 components should see very negligible ringing. How long do these effects take to compute on the GPU? There are a lot of convolutions happening here, but I consistently underestimate GPUs :) Are we talking "live preview" kind of speed? Or just acceptable for a click and see how it looks?

I will re-run the optimisation with different component amounts and try to extract the higher precision numbers; they should all support this but it'll probably only make a notable difference at 5+ components. My guess is the width multiplier will be the same for 7+ components, but we should probably check, e.g. a 64 width, and see if point sources are approx 64 pixels across once it's done. I can't remember exactly why I did this, but I think I tried to match the width to what looked good because the edge of the disc has a kind of slope to it. Maybe I decided that the blurred edge of the disc should count within the width, rather than outside it - a judgement call for PDN I think. I suppose the main goal is that regardless of which kernel quality you use, you want the actual blur to be roughly the same size. I think the reason that most of the kernels are the same size and only the 1 component is different is simply because it's a much worse shape, and so the ring ends up smaller and needs to be scaled.

Passable kernels probably appear fairly quickly even on a laptop, but I don't think it'll be possible to optimise at high quality sufficiently quickly. If we wanted to do donuts, tighter and blurrier edges and so on, regrettably I think the best thing would be to do a massive grid search for all the parameters we want at all combinations of inputs.

I do think arbitrary kernels computed dynamically might be possible with the sum table approach, which I hope to have a go at over the coming weeks - at least as a proof of concept. The idea is you turn each convolution into a series of summations which can be made much faster by calculating an integral image. I did a video on Viola-Jones which uses a similar shortcut here: https://www.youtube.com/watch?v=uEJ71VlUmMQ, about 6 mins in. The challenge won't really be the image stuff, it'll be creating nice rectangular regions from arbitrary kernels. But this is essentially a rasterisation problem, I'm sure there's a nice solution somewhere. The other thing is that these kinds of kernels would work mostly with hard edges, and I don't know if that looks better or worse.
For a 4,896 x 3,264 px image (this one: https://unsplash.com/photos/Ys-DBJeX0nE ), at default settings (radius=25, gamma=3, quality=3), it takes maybe half a second to run on an NVIDIA RTX A4000 16GB (approx. GeForce 3070). Bumping quality up to 6 makes it take about twice as long. Not real-time enough for gaming, but real-time enough for a photo editing app. Bumping the radius up to the max (300) gives us ~2.5s at quality=3, and about ~5s at quality=6 (these are just rough timings with a stopwatch). Gamma doesn't affect performance.

Performance could be improved by utilizing compute shaders instead of pixel shaders. A pixel shader, at least in Direct2D, can only emit one value -- it can't write to multiple render targets. So the real and imaginary components for the vertical convolution are processed separately, then plugged in as the inputs for the horizontal convolution shader. A compute shader should be able to drop 3 shaders down to 2, in other words. This is at best about a 30% speedup; worth doing, but we (@Sergio0694 and I) don't have the infrastructure in place for it yet. That'll be a 2023 project.

It's easy to underestimate GPU performance, but also just as easy to overestimate it. I ported some fractals over and they were 100-1000x faster. They work great because they're output-only shaders. However, in some cases the GPU can actually be slower. This usually happens when the CPU rendering code benefits from the ability to randomly access pixels without any real penalty, whereas on the GPU the rendering has to be split up into tiles and multiple passes that can each fit within the rendering engine's maximum intermediate texture size (for Direct2D that's 4096 x 4096 pixels, but Direct3D's ultimate maximum is 16384 x 16384 anyway so you'd run into this regardless). I have many effects that are immensely faster on small to large images, but start getting a lot slower once we get into enormous images (like 32K x 32K). The rendering engine has to buffer, subdivide, and shuffle things around a lot more and it isn't efficient anymore.
Yeah, this is a compromise the D2D version of the bokeh blur effect had to make; I don't think there's a way around that due to us using pixel shaders right now. My compute shader version of the bokeh blur effect, which uses DX12 compute shaders, doesn't have this limitation, and in theory we should be able to convert this to D2D compute shaders as well. Like Rick said though, that'll require significant work, as right now I haven't added the necessary infrastructure to support that (tracked at Sergio0694/ComputeSharp#351), nor has Rick added support for this on the Paint.NET side. Definitely a 2023 work item, not sooner. But even with that, I don't see how it'd bring us more than a ~30% improvement or so. It'd pretty much just allow the pixel shader version of bokeh to do 2 1D convolutions per pixel instead of 3, per component. Which is definitely not bad (on paper), but not something that would fully address the scaling issue with more and more kernels and with larger radius. But then again I guess it's also fine given we don't really care about real time support here 🙂
I'm a bit confused by that point, did you write those two numbers backwards or am I missing something? As in, if using 7 kernels results in a smaller ripple size, shouldn't that be 0.0012 compared to the 0.0038 we had before? 😅 Anyway, really awesome work on all this Mike, and thank you so much for helping! It's so cool to work together on something, and at least for me, not something I'd have imagined ever happening when I first watched your bokeh video years ago and I was just starting to mess with the original Python code you wrote to port it to C# and ImageSharp. This all makes me so happy 😄
Yes, a typo :) It's no problem - actually the last 6 months have been so busy with academic stuff it's nice to do something a little different! Optimising the 8 kernel has gone well - 2x smaller than at 7! [6.6430131554059075, 2.33925731610851, -665.7557728544768, 445.83362839529286]

9 and 10 are looking a little more tricky. The optimisation process is random; it's fairly predictable for lower component counts, but a lot more prone to getting stuck in local minima for higher component counts. What this means essentially is the weighting of the components is adjusted into a position where it cannot be improved, and the true optimal position is so wildly different that it cannot jump to it. My first two trainings didn't work, producing accuracy around the level of a 7 component filter. I think the solution is simply to run about 5 of each and see which ones hit; I'm doing this now and at least one is looking good.

I've discovered another reason this optimisation works better than my own: there's some peculiar code within the cost function that actually seems to go in and overwrite parameters within the optimisation. This isn't usually how it works - normally the optimisation comes up with parameters, uses the cost function to evaluate them, and then makes decisions on what to do next. Here the cost function is actually changing the parameters too. My guess is the author (who seems to have a background in signal processing) is controlling the values of some of the parameters to ensure they stay within reasonable ranges, to coax the optimisation in the right direction. The good news obviously is that this works; the less good news is that it might make this more difficult to port to other bokeh types.
Optimised 9 kernel: [7.393797857697906, 2.4737002456790207, -1796.6881230069646, 631.9043430000561]

Current ripple sizes are: 6: 0.003890

We're nearly at the limit I think; I'll focus on 10 to finish. Notice also the multipliers are getting quite big, e.g. 3000 for the A of component 2. I think you're using high precision behind the scenes so this is probably fine. Imagine a pixel at max 255 brightness, scaled by maximum gamma, then multiplied by 3000 - that's the kind of max range we need to handle.
This one wasn't easy :) I ran the optimisation about 50 times for 10 components. 49 of them failed to find a decent optimum, but one made it! Ripple is 0.00029. [7.148185516924103, 2.4927349464132558, -1242.6335320030207, 435.1383849993189] I've also marginally improved on some of the others. The complete list is:
I couldn't match the performance on 2 or 3 components, and I don't think the precision is necessary anyway, so I recommend leaving them as they are.
Fantastic 🥳 Here are the results at each quality level. I think we're seeing diminishing returns at 9-10, so that's a good place to stop. First, the input image, cropped from https://unsplash.com/photos/Ys-DBJeX0nE :

For each, radius=25 (default), gamma=8 (max, with 3 being the default).

Quality = 1: obviously not great, but expected.

Quality = 10: finally the ringing around the bright spot near the top-left is almost gone.

Once we're at high values for quality, the rippling only seems to be apparent at high gamma values (>=5), so this should be perfect.
Now, if we take this photo, https://unsplash.com/photos/jV-SWhl7eHM , and abuse the gamma value, we still have quite a bit of ringing at max radius, but I don't think there's really anything that can be done about that since the values are just blown way out to begin with (due to the gamma exponent). At reasonable gamma values everything looks fantastic.
I did also implement an experimental "Max Quality" mode, where the effect switches to a regular (non-separated) convolution kernel implementation. Ringing is absent in that version. Performance, as you'd expect, was fine at small radii but then at large/max radii it was bad ... really really bad. So bad, in fact, that setting the radius to 300 immediately caused a GPU driver timeout (soft crash) 😂 I implemented tiling for the convolution kernel (that is, split the kernel up into 64x64 tiles and run them separately, then sum them all up) and it worked better. But when I switched to testing on a lower-end GPU (a Quadro P2000 -- still a quite decent card!) it refused to run and Windows forced things to run on the higher-end GPU through some kind of silent fallback mechanism (well, and a user notification).
Haha - well it was worth a go :) This looks great though. I think massively pushing the gamma is going to do strange things unless we can ultimately implement a blur that doesn't have any ripple at all. It also might be worse on the image you're testing on due to the dark background. There won't be a separable kernel to solve this, but along the lines of your "maximum" I'm going to have a stab at implementing the rectangular integral image version, which I think should have similar computational requirements regardless of the kernel size - although maybe not in version 1 :) Don't worry, you aren't compelled to use it; I've wanted to implement it for a while and add it to this repo, so if it ends up being useful, great, if not, no worries.

Quick question regarding computational requirements, as it informs how I implement it: An integral image (done separately for R, G, B) is an image in which each pixel contains the sum of all the pixels above and to the left. For example, for this image and its integral, the sum of the middle 4 pixels is 24 - 6 - 6 + 1. Once the integral image is calculated, any rectangular region can be calculated using 4 lookups. You can see how this would be helpful for circular kernel convolutions where all of the weights are 1, as we have here, and also for arbitrarily shaped kernels. The main thing I need to work on is creating an efficient rectangle configuration from any kernel - I'll start with circles and go from there.

I guess it's not a big deal because you're already creating a number of intermediate component outputs in the existing version, but is it acceptable to create and store one of these integral images for each channel? In an ideal world we'd also store the 1D horizontal and vertical versions, where each pixel is the sum of only those to the left or above, rather than both. For rectangles of height or width 1 we can reduce the problem to 2 lookups if we store these representations too, at the cost of 3 representations instead of 1. Is this acceptable?
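Since the example images above don't survive in text form, here is a small hypothetical stand-in showing how the integral image (and the 1D row/column variants mentioned) would be built, and how a region sum reduces to four lookups:

```python
import numpy as np

img = np.array([[1, 2, 1, 0],
                [2, 3, 2, 1],
                [1, 2, 3, 2],
                [0, 1, 2, 1]], dtype=np.int64)   # hypothetical 4x4 single-channel image

# Full integral image: each entry holds the sum of all pixels above and to the left (inclusive)
integral = img.cumsum(axis=0).cumsum(axis=1)

# 1D variants for 1-pixel-wide/tall regions (2 lookups instead of 4)
row_integral = img.cumsum(axis=1)   # sum of pixels to the left within the same row
col_integral = img.cumsum(axis=0)   # sum of pixels above within the same column

# Sum of the middle 2x2 block via four lookups: D - B - C + A
y0, x0, y1, x1 = 1, 1, 2, 2
total = (integral[y1, x1] - integral[y0 - 1, x1]
         - integral[y1, x0 - 1] + integral[y0 - 1, x0 - 1])
assert total == img[y0:y1 + 1, x0:x1 + 1].sum()   # 10 for this example
```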
Well, let's figure out how much memory that would take. We'll likely need a full 32-bit integer value for each pixel component, which can handle an image 16.7M x 16.7M pixels (2^24 x 2^24) where every pixel is full brightness (A=R=G=B=255). Dropping to 16-bit integers reduces the maximum size we can handle to 256x256, which doesn't seem like a reasonable limitation. So that's a full 128-bits / 16-bytes per pixel, or 96-bits / 12-bytes if we don't care about alpha, which is then 3-4x the size of the image we're starting with. So it'll take a lot of memory on the CPU, which is where we'd do the calculations for the integral table, and then it'd also require a lot of memory on the GPU to store so that the shaders can access it. The next thing is that Direct2D can only really take inputs of a certain size and precision. Alright, so that basically limits us to 65,536 x 65,536 images for a fully general purpose algorithm, which is probably fine in practice. Especially since we likely won't be working with all-white images, so the actual limit is much higher.

For one of these sample images, like the cat, it's 5,028 x 3,652 pixels. That takes up ~73.5MB of memory CPU-side to store at 32-bit BGRA. For the integral table, quadruple that to about 293.8MB, for a total consumption of ~367.2MB. That's not too bad, but it is a lot, especially since GPUs have relatively small pools of memory compared to CPUs. For larger images, or just up-sized versions of existing ones, let's calculate it for 20,112 x 14,608 (that same image at 400%): that's 1.175GB for the image and 4.701GB for the integral table, or ~5.876GB total. That's about the maximum we can handle with a modern consumer card such as a GeForce 1060 w/ 6GB. A lot of those 1060s were sold with only 3GB. They can still work with 6GB data sets but performance will be significantly reduced -- essentially it has to page, but instead of CPU memory <--> disk, it's GPU memory <--> CPU memory.

The unknown here is how Direct2D will handle all of the intermediate buffers that are necessary for the whole transform graph, which will depend a lot on how (where) we need to read from the image. Direct2D is happiest when it only needs to handle "simple" inputs -- where the input and output coordinates are 1:1 and there's no need (or ability) to read from neighboring pixels. A simple brightness/contrast adjustment is an example of a "simple" transform. "Complex" inputs allow you to sample elsewhere, but you pay a price, in memory usage but especially in flexibility. If you only need to read from a small neighborhood of pixels, like with a convolution matrix transform, you're fine. But if you need to sample from anywhere, it doesn't matter how many samples you need, because Direct2D needs to bind a texture for the bounding rectangle and it'll refuse to go above 4096 x 4096 pixels. So the real price you pay is related to locality.

I have found ways to get around this, but it depends on what you need to do with the pixels, and how much you're willing to pay in terms of performance. I have something called a sample map renderer. Here's an example of a sample map at work to produce an image through a polar inversion transform:

The performance cost of an effect like this is O(n*m). Each rendering pass of the sample map renderer gathers the pixels from 1 of those 4Kx4K tiles -- but the sample map shader can "place" those pixels anywhere, so each rendering pass produces a full-size image. Then the output of each rendering pass is summed to create the final image.
Thankfully you don't need to store each rendering pass's output until the very end, you can sum as you go. For an image less than 4Kx4K, we only need 1 rendering pass, but 16Kx16K needs 16, and 64Kx64K needs 256 rendering passes. Adding in multisample antialiasing with subpixel offsets will multiply that by 1, 4, 16, 64, etc. so we can achieve a nice looking output.

I use this sample map renderer for all of the effects in Paint.NET's "Distort" submenu. They perform really quite well when there's only 1 tile, but at large sizes -- let's say 30K pixels on either dimension -- they start dropping below the performance of the old CPU-based implementations. And that's on a 24GB GeForce 3090! So they have limits, but luckily those limits are right around the size of images we'd actually want to work with anyway (outside of my own debugging and testing).

Anyway, my point is not "let's use a sample map renderer", but rather that where you need to read pixels from is much more important than how many samples you need, and that sometimes you can work around it, but when you can, it still has quite a large cost. So I'd need much more information about that in order to evaluate whether what you are interested in doing is feasible for GPU processing.
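To make the pass-count arithmetic concrete, a tiny sketch assuming one gather pass per 4096 x 4096 tile of the input:

```python
import math

def render_passes(width, height, tile=4096):
    # One gather pass per tile of the input image
    return math.ceil(width / tile) * math.ceil(height / tile)

print(render_passes(4096, 4096))     # 1
print(render_passes(16384, 16384))   # 16
print(render_passes(65536, 65536))   # 256
```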
From watching your video where you talk about the integral tables, it sounds like you do need to access these values from "anywhere," even though you only need to read a few of them. In order to get this sort of thing to work on the GPU, the work must be subdividable. Sometimes you can subdivide into tiles, sometimes you can subdivide into multiple rendering passes, sometimes you need both. If we're lucky, these subdivisions won't be overwhelming and we can actually get some work done.

So if you want to start figuring out how/if this can work on the GPU, consider how you would break up your algorithm such that you only ever read a maximum of 1000 pixels away from the current location. (I know I said 4K x 4K, but if every output pixel needs to read from a 4K x 4K region then the pixel shader has to be run once for every output pixel, with a 4K x 4K input, which means that the intermediate has to be prepared for every pixel and... it doesn't work. It's kind of complicated...)
Just wanted to say that all this back and forth also makes me think we could have a lot of fun if Mike also ever wanted to try out writing his own plugins for Paint.NET, especially now that they can use the GPU too through ComputeSharp.D2D1 😄 It's especially interesting (and challenging) at first because due to how shaders work, it's a completely different way of thinking compared to the traditional "sequential, scanline-like" way of programming some CPU-powered image effect. Just a note eheh
I can think of several tricks for transforming all of the data into forms that Direct2D and a GPU can actually process, but I need to know more about how the integral image is sampled. How do you determine which pixel from the integral image to read from? For the output pixel at (x,y), do you only need data stemming from the integral image at (x,y)? Or do you need to jump around more? Hmm ... maybe some pseudo-code that shows how 1 output pixel is calculated would be helpful!
I may have to have a go at writing a plugin sometime :) I think if we're careful the version with only one integral image shouldn't be much slower; this is something to experiment with. The good news is the sampling strategy is quite easy to predict: given an integral image and a rectangle x0,y0,x1,y1, the sampled points are (x1,y1), (x0-1,y1), (x1,y0-1) and (x0-1,y0-1). Basically the maximum distance you'll need to sample is the size of the kernel + 1. If you limited your kernel size to 300px as you've done in the implementation above, then that is approximately the distance we're talking about. Given you're also sampling around some point, that's about 150px from the centre of the kernel. The Viola-Jones face detector takes the concept further because faces could take up a huge amount of the image, so it has to sample from anywhere; we can avoid this.
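A sketch of that lookup pattern as code, reusing the integral image layout from the earlier example (the x0-1 / y0-1 terms are simply dropped when they fall off the top or left edge):

```python
def box_sum(integral, x0, y0, x1, y1):
    # Sum of pixels in the inclusive rectangle (x0, y0)-(x1, y1) via the four lookups
    # (x1, y1), (x0-1, y1), (x1, y0-1) and (x0-1, y0-1).
    total = integral[y1, x1]
    if x0 > 0:
        total -= integral[y1, x0 - 1]
    if y0 > 0:
        total -= integral[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += integral[y0 - 1, x0 - 1]
    return total
```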
How timely, I just saw this pop up in my GitHub feed from my friend @alexreinking
Alex pointed me to the paper that gpufilter is based on: https://hhoppe.com/recursive.pdf (referring back to the post above ...) Is it possible to split these calculations into two passes, i.e. is it separable? I'm still catching up on morning coffee and I haven't been able to tackle this just yet, but presumably if you have an integral table
And the result you want to calculate in a pixel shader is f = D - C - B + A, you can hopefully fold that into horizontal and vertical passes. This would not change the result of the calculation (modulo some precision, possibly), but it would provide Direct2D with easier-to-manage input/output rectangles for buffering through intermediate textures.
That paper looks like a cool way to produce the initial summed area table, which is a large proportion of the processing required to perform a filter. After that it's a little clever indexing, I hope not too challenging. You could turn a rectangle into e.g. a series of horizontal 1px rows, or vertical columns, but I don't think we'd necessarily need to.

The big advantage we have is that the kernel is flat, i.e. all pixels that are summed have an identical weight of 1. This basically means it doesn't matter what position each pixel is in, only that it is included in the sum. This isn't true of the complex kernel, and it's not true of e.g. Gaussian blur. When calculating a Gaussian blur you need not only the pixel values but also their associated weights, which means you have to do one pixel at a time, multiply it by its weight at the corresponding position of the kernel, and then sum it up. That Gaussians are separable (g(x,y) = g(x) * g(y)) is the only reason it's practical. In this case there is essentially no kernel, or at least there are no weights; you only need the sums, and you can shortcut this using the summed area table.

The blog highlighted one strategy for fitting rectangles to a circle: each rectangle needs 4 positions A B C D in order to calculate the sum of all pixels within that rectangle. There are 15 in this image that comprise the circle, so that's ~60 reads. Here is the same image where I've highlighted the pixels that would need reading for the big black rectangle, and another smaller white one:

If you use the column and row sum tables too, then 1px rectangles are 2 reads not 4, at the cost that you need to calculate and store those additional tables. The process for filtering an image with a circular blur then would be something like this:
For a Gaussian or complex blur the number of reads is the width/height of the kernel * 2 passes. For this it's ~60 reads and adding them up - having already expended the effort to calculate the integral image, which is not insignificant. There are a couple of nice advantages:
The main hindrance is fitting nice rectangles to the bokeh shapes you want. The circle is, I think, fairly easy; arbitrary shapes would need some careful thought to ensure they didn't degenerate into tiny rectangles, which would become inefficient.
Once we have an integral table, the remainder of what you describe is very doable. From a data transfer standpoint, it's pretty similar to a convolution matrix shader. For generating the rectangles to sum, we can use standard polygon scan-conversion over on the CPU. This list of rectangles can be fed to the shader through a resource texture containing a 1D array of 4-vectors that describe each rectangle. So in a way this is a convolution matrix transform, but with clever RLE compression. For antialiasing, I think it might be possible, but it requires more thought, and it might be best achieved with some approximation (cheating). I do think we'd have aliasing artifacts without some kind of antialiasing or feathering. The main obstacle then is ensuring that the integral table can be calculated efficiently. I'm still thinking about how that can be done on the GPU instead of the CPU.
I agree, this is absolutely a convolution, just with only weights of zero or one, which makes summation much faster. Actually there is a suggestion for anti-aliasing on the original blog: "To get an anti-aliased edge for the circle, the edge pixels can be summed or subtracted with weights separately, doable in the ready-for-prefix-sum representation." I'm not sure how fast this would be, but if we could calculate scalar weights for the anti-aliased pixels around the edge, we could simply include these in the original sum. Pixels that were weighted down would not be included in any rectangles. The blog also notes that everything here could be precomputed and stored for every possible radius, but I think that's less appropriate here, where there are hundreds of possible radii, and where arbitrary kernel shapes are possible.

Scan conversion was definitely the right keyword I wasn't aware of - loads of resources! I'm not super clued up on graphics etc., but it looks like there are quick algorithms for common polygons and the general case. Presumably also anti-aliased versions? Just fired up the alpha build, it looks really good! How did you do the original "unfocus" feature?
Ooh, subtracting with weights actually is really neat - it saves messing about removing these pixels from the rectangles. For an aliased pixel of weight t you subtract val*(1-t) from the overall sum.
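A quick sketch of that weighting trick, assuming a box_sum helper like the one sketched earlier and a coverage value t per edge pixel (all names here are hypothetical):

```python
def weighted_kernel_sum(img, integral, rects, edge_pixels):
    # rects: list of (x0, y0, x1, y1) rectangles covering the kernel footprint
    # edge_pixels: list of (x, y, t) anti-aliased pixels with coverage t in [0, 1],
    #              which are included in the rectangles at full weight.
    total = sum(box_sum(integral, x0, y0, x1, y1) for x0, y0, x1, y1 in rects)
    for x, y, t in edge_pixels:
        total -= img[y, x] * (1.0 - t)   # subtract val*(1-t) to leave an effective weight of t
    return total
```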
For anti-aliasing, I agree that we would treat them separately. The pixel shader would take 2 input arrays (via resource texture)
The second array need not even use the integral table -- it could sample from the regular input image. Although, if we can make it work via the integral table, that's one less buffer for Direct2D to marshal around, which could be better. These two arrays can be generated fairly easily on the CPU side:
As for Unfocus, it's one of the few effects that I am not sure how to port to the GPU. It uses a "local histogram" of the surrounding pixels, one histogram per channel. This is tough to do on the GPU because I need 4x 256-wide arrays. On the CPU I can incrementally process this by subtracting the left edge and adding the right edge as I move from x to x+1. But a pixel shader can only output 1 pixel and it can't really share any information with the neighboring pixel.
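For reference, a sketch of the incremental update described here (per channel, 8-bit values); what Unfocus actually reads out of each histogram isn't specified above, so only the sliding update is shown:

```python
import numpy as np

def local_histograms_row(channel, y, radius):
    # Slide a (2*radius+1)^2 window along interior row y of an 8-bit channel,
    # updating a 256-bin histogram by removing the left column and adding the right one.
    h, w = channel.shape
    hist = np.zeros(256, dtype=np.int64)

    # Initial histogram for the window centred at x = radius (assumes y is an interior row)
    window = channel[y - radius:y + radius + 1, 0:2 * radius + 1]
    np.add.at(hist, window.ravel(), 1)

    for x in range(radius, w - radius):
        yield x, hist.copy()   # reduce this however the effect needs (median, mode, ...)
        if x + 1 < w - radius:
            left = channel[y - radius:y + radius + 1, x - radius]
            right = channel[y - radius:y + radius + 1, x + radius + 1]
            np.subtract.at(hist, left, 1)
            np.add.at(hist, right, 1)
```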
I've been messing around with a function to return some rectangles from the RLE of a bitmap. It's OK for a first attempt I think; it works with a variety of shapes. Rendered in random colours for viewing; as you suggested the aliasing is ignored but is there (which is why the circles don't look quite circular). It does RLE per row, then merges down rectangles as it can. I could do further merging passes easily enough, but actually I'm not sure it makes it any more efficient. You could make larger rectangles within some of these objects, but doing so would split the existing columns, which may only need 2 reads anyway. In fact it might be more efficient for me to re-split a couple of them so that it's all either rectangles or columns, and there are no rows or single pixels.
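A rough sketch of that per-row RLE followed by merging of vertically aligned runs (mask is a boolean kernel footprint; anti-aliasing ignored, as above; not the exact code used):

```python
def mask_to_rects(mask):
    # Convert a boolean 2D mask into inclusive rectangles (x0, y0, x1, y1):
    # run-length encode each row, then merge identical runs on consecutive rows downwards.
    h, w = len(mask), len(mask[0])
    open_runs = {}   # (x0, x1) -> y0 of the run currently being extended
    rects = []

    for y in range(h + 1):
        # RLE of this row (an empty row past the end flushes any still-open runs)
        runs, x = set(), 0
        while y < h and x < w:
            if mask[y][x]:
                x0 = x
                while x < w and mask[y][x]:
                    x += 1
                runs.add((x0, x - 1))
            else:
                x += 1
        # Close runs that do not continue on this row, then open any new ones
        for (x0, x1), y0 in list(open_runs.items()):
            if (x0, x1) not in runs:
                rects.append((x0, y0, x1, y - 1))
                del open_runs[(x0, x1)]
        for x0, x1 in runs:
            open_runs.setdefault((x0, x1), y)
    return rects
```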
So from what I understand, we don't actually need a full integral table. The full integral table is problematic to calculate in Direct2D because it means that calculating its bottom-right pixel ultimately requires access to the whole input image. It'll work up to 4096x4096 pixels, and refuse to render for anything larger. I think this is what the paper was describing, but we can get away with tiled/partial integral tables. That is, we split the image into tiles of some size. So, for 4x4 tiles, for a 1 pixel-wide, 7-pixel tall rectangle extending from B to D, assuming B->D spans two tiles, we do D + C - A:
There would be 4 cases to consider: whether the rectangle spans 1 or 2 tiles horizontally, and then vertically. So there'd be some branching in the shader. Might be able to mask that away with something clever. Is that correct? The inner loop would be slower, but all the work to get there should be much faster and simpler. But, I'm not confident in my math. This would limit the size of our RLE convolution matrix, but should still allow calculations to be separable into H and V passes.
For a rect that spans all 4 tiles (expanded to 5x5 tiles for legibility), we'd have: I + H - G + F - E + D - C - B + A. Note that only …
Also I think the shader can take 4 sets of RLE data:
Each of these has a slightly different implementation strategy, and the latter 3 can be optimized a bit. At least this way if we have an irregular shape that has a lot of horizontal or vertical strips, we aren't using the full rectangle evaluator. Might actually use 4 different shaders, which would permit omitting any that aren't needed for a particular convolution. It might also just be easier to write the code that way. But, that's something I'd let the performance dictate.
My lack of knowledge of Direct2D is showing :) I agree that I don't think you need access to the full table as long as you can visit neighbouring tiles, and if the size of the tiles matched the max convolution, you'd only need two tiles horizontally or vertically. Here are some questions so I can understand it better:
I've continued fiddling about with the rectangle generation code, and should have another implementation of it shortly that fixes a couple of niggles in the ones above. My original implementation somewhat overcomplicated things. Having played with it a bit I'm not convinced it's worth the effort to utilise all three of rectangle, horizontal and vertical integral images. This is basically vertical RLE followed by merging of aligned columns. I'm trying to decide if it's worth doing more here to maximise the size of the rectangles in use. Theoretically there are some large "wide" rectangles that could span most of the kernel, for instance the blue and yellow ones here:

I did those two by hand simply to illustrate. Actually this configuration is almost 2x less efficient, because all those low-cost vertical columns are now split in two. To make this kind of thing work we'd then have to reorient many of the vertical sums into horizontal - but not the ones at the side! This is a lot more complex as an algorithm, and I'm not sure how effective it becomes when your kernel is very complex in shape. For simple shapes maybe, but at that point you'd be better off hand-crafting an optimal configuration for every possible shape. One advantage of having only rectangles and 1px vertical columns is that you no longer need to store the horizontal integral map at all, saving 1/3 of the memory? I think super-optimising the rectangles vs horizontal vs vertical components for each kernel would maybe get you 50% savings at most, but I think this approach already needs many fewer reads than currently required in a complex convolution using e.g. 10 components, which is still pretty quick.
If the integral table was provided separately -- let's say we calculated it on the CPU, stored it in a bitmap, and then sent it over to the GPU in the same manner as the regular image -- then no, this would work fine. You'd only need access to the pixels you originally described, which is tightly bound by the size of the convolution. It's easy to do all sorts of random access on the CPU with huge bitmaps. I can access (x=0, y=0) and (x=300,000, y=25,000) in two adjacent instructions with no trouble at all (assuming monolithic bitmap storage). The difficulty is when we want to calculate the integral table on the GPU. In that case our input image needs to run through an effect with two pixel shaders, separated into horizontal and vertical passes. Pixel shaders do calculations one pixel at a time and cannot share state between pixels, which means output pixel [x=width-1, y=0] would need to read all the way back to [x=0, y=0]. So, some form of tiling and some other magic (as described in that paper, I believe) would be necessary to get Direct2D to render this.
The bokeh blur is implemented as a series of transforms. Here's the code over in ComputeSharp.D2D1: https://github.com/Sergio0694/ComputeSharp/blob/main/tests/ComputeSharp.D2D1.Tests/Effects/BokehBlurEffect.cs#L415 . Paint.NET's code looks very different but is basically doing the same thing, also using Direct2D.
(The 'border' node provides a means for a sampling clamp at the borders, so that trying to read from, say, [-3, -3] produces the same result as [0, 0]. Sometimes this is necessary or not, and you can use other border modes like Wrap and Mirror, and I have also implemented Transparent. There are some pictures here that illustrate how it works: https://docs.microsoft.com/en-us/windows/win32/direct2d/border )

Each one of these nodes in the effect graph is essentially a function that maps from N input images (1 or 2 in this case) and M configurable property values to 1 output image. Each node has a pixel shader for the GPU-side processing. Each node also has two internal functions, … There is no manual management of buffers or bitmaps. I produce the GPU-side …

(BTW, "image" vs "bitmap" is an important distinction. A bitmap is a monolithic allocation that holds pixels and provides direct access. An "image" is an object that defines an output area and that can copy pixels into your buffer. A bitmap is an image, in the OOP sense. See also …)

(Also, I'm referring to "effects" and "transforms" loosely as if they're the same thing. They're not, but that's a lower-level detail. Effects are connected together to form an effect graph. Each effect internally defines a transform graph. The effect graph is "flattened" to produce the full transform graph, and that's what Direct2D actually renders. Effects are a high-level construct for regular folks who are doing image processing tasks, and transforms are utilized by effect implementors. One transform = one pixel shader.)

During …, the gamma exposure nodes will say that their input/output rects have a 1:1 mapping. Each output pixel is strictly a function of the input pixel at the same coordinate. Easy. Direct2D refers to this as a "simple" transform. The "VERTICAL" convolution nodes will say that any given output rect maps to a larger input rect. A pixel shader can (try to) read from areas outside of the areas declared by those functions. Direct2D may determine that it can do everything in one rendering pass and bind the whole input image to the pixel shader's inputs. This will be true for smaller images, probably 4K x 4K or less. Otherwise it'll copy into intermediate textures and render across multiple passes using only sub-regions of the image.
You can expand the bitmap by the size of the needed border and fill it using transparent pixels or with clamping or mirroring. This is how we do it in Direct2D using the BorderEffect -- although we get the nice benefit that, while it affects the size of the image, it doesn't affect its coordinate space. The pixel at (0,0) is the same; the border extends to infinity from all edges.

Direct2D's Gaussian Blur effect documents that it uses a mirror border when you use a "hard" border mode. Using a "soft" border will treat all out-of-bounds pixels as if they're transparent black (all zeros), and will produce an output that is larger than the input by an amount equal to the radius (so they say; I've measured variance there). A hard border will produce an output image that is the same size/location as the input image, and out-of-bounds pixels will be pulled from inside the image, e.g. …

You'll probably need to create a copy of the bitmap to do this, but then you won't have to worry about anything in the rendering code. Just clamp your for-loops to only process the un-bordered portion of the bitmap. You should only have to expand the image by the size of the convolution. Direct2D can extend things to infinity because it virtualizes it -- it backtracks from the final output region to determine what input areas are needed, and only worries about those.
Would you advocate adding any kind of border, mirroring, etc. prior to calculating the integral image? I can't quite wrap my head around border stuff when we're talking about an integral image, and rectangles that may sit fully outside the image, or partly outside the image. It could mean a lot of slow code for checking border / edge intersections. In essence: what is the correct sum for a rectangle which has some part of it lying outside the normal image region?

We've got exam resits this week and I'm exams officer, so progress will slow for a few days while I sort through that mess. I'll get back to it soon!
Because the convolution will be sampling from the border region, it should be included in the integral image. So you take your input image and apply a border of (kernel size) pixels (might be able to get away with kernel size - 1). Calculate the integral image from that. Run the convolution with the bordered image and the integral image that was calculated from it, and write to a new bitmap of the original size. The loop that calculates output pixels only needs to run on pixels inside the border (that is, not including the border), and won't need to worry about checking that the pixels it's reading are in-bounds. So the input is (w x h), the bordered and integral images are (w + k*2) x (h + k*2), and the output is (w x h). The rendering loop can run on pixels (left=k, top=k, right=w+k, bottom=h+k).
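Putting those sizes into a sketch (clamp-to-edge is assumed as the border mode, k is the kernel radius in pixels, box_sum is the helper sketched earlier, and rects is a hypothetical rectangle decomposition of the flat kernel given as offsets from its centre):

```python
import numpy as np

def bordered_integral(img, k):
    # Pad the (h x w) image by k pixels on every side (clamp-to-edge assumed here),
    # then build the integral image of the (h + 2k) x (w + 2k) result.
    padded = np.pad(img, k, mode="edge")
    return padded, padded.cumsum(axis=0, dtype=np.int64).cumsum(axis=1)

def flat_bokeh(img, k, rects):
    # rects: (x0, y0, x1, y1) offsets relative to the kernel centre, e.g. a rectangle
    # decomposition of the disc shifted by -k. Output is (h x w); the loop runs over
    # (left=k, top=k, right=w+k, bottom=h+k) in bordered coordinates, with no bounds checks.
    h, w = img.shape
    _, integral = bordered_integral(img, k)
    area = sum((x1 - x0 + 1) * (y1 - y0 + 1) for x0, y0, x1, y1 in rects)
    out = np.empty((h, w), dtype=np.float64)
    for y in range(k, h + k):
        for x in range(k, w + k):
            total = sum(box_sum(integral, x + x0, y + y0, x + x1, y + y1)
                        for x0, y0, x1, y1 in rects)
            out[y - k, x - k] = total / area
    return out
```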
Applying this border is also a very common practice in deep learning frameworks such as TensorFlow, where convolutional nodes give you the ability to choose between different padding modes. Traditionally, a common one is no padding, which means the convolution only computes values for an inner area of the image excluding the outer frame of thickness (kernel size - 1), and the result is smaller than the original image along each axis. What Rick is suggesting instead maps to "same" padding, where the image is pre-padded with an extra frame filled appropriately (e.g. it could be all 0s, or integral values here), so that the convolution code doesn't need to worry about bounds checks. This is generally a very good idea because preparing the padding is usually cheap, and is worth the tradeoff considering the performance boost you get from removing bounds checks (i.e. eliding a ton of conditional branches) from the inner convolution code, which is the hot path for the operation 🙂
BTW, here's a link to the paper by Michael A. Shantzis, "A model for efficient and flexible image computing," that describes the algorithm used in D2D's imaging/effects system: https://dl.acm.org/doi/10.1145/192161.192191 ... it's a PostScript file though, and I don't even have an app installed that can load that right now. The paper is linked from this page in D2D's documentation for … I tried downloading this a long time ago but it was behind a paywall or some other confusion.
Thanks, very interesting :) I found the PDF - might be a uni thing, a lot of these paywalls seem to disappear depending on the machine I use. I should be able to test out mirror border, repeating border, etc. fairly easily. It definitely sounds like adding some padding here is easier than messing about with rectangle intersections. I was trying to be clever with my coordinates but it was turning into a mess.

As a complete aside, did you see this? https://twitter.com/nousr_/status/1563912133120647168 Maybe there's a future plugin to be written. They ended up implementing a locally listening Python server that handles the Stable Diffusion. I've been using SD the last couple of days; getting it doing image-to-image is fairly easy.
If I had to take a guess at what's going on, it's that the kernel values are only reasonable when sampled within +/- radius from the center. Since the kernel must be generated for integer pixel offsets, grabbing the value at dx=-1 (and dx=+1 of course) when r<1 means we end up with values that dip way out of bounds. And PDN's code is special in this case because it actually permits non-integer radius values. Everything else I've seen is using integers.
(btw that cat picture was generated with Stable Diffusion, I found it on the Discord server by user …)
It takes no time to generate the 1-4 component kernel parameters, so I'll do this tomorrow. I'd never considered what happens with fractional or very small radius values - are you using similar code to mine to map radius to kernel? I may need to plot these on a graph to see what happens, but it could also be the normalisation code - it's been a while since I looked at that bit. The sum under the kernel should be close to 1, otherwise the image will get brighter or darker when you convolve. In the normalisation function, I pass over the 2D coords of the kernel and calculate the kernel value at that position, summing these all up. You also have to do it across all components. I'm not sure quite what happens when r<1; I assume that the actual kernel you produce has some minimum radius, e.g. 3x3?

The sharpening is also interesting. My guess is the same scaling down is happening, but rather than the kernel being negative, you're ending up with something where the peaks and troughs lie above and below zero, and it accidentally becomes a kernel that amplifies edges.
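A sketch of the normalisation step described (not the exact repo code): evaluate the weighted 2D kernel value A·Re(k_i·k_j) + B·Im(k_i·k_j) at every position for every component, and scale so the weighted total is 1.

```python
import numpy as np

def normalise_kernels(kernels_1d, params):
    # kernels_1d: one complex 1D kernel per component, sampled at the pixel positions.
    # params: matching list of (a, b, A, B) tuples.
    total = 0.0
    for k1d, (a, b, A, B) in zip(kernels_1d, params):
        k2d = np.outer(k1d, k1d)                      # separable 2D kernel for this component
        total += np.sum(A * k2d.real + B * k2d.imag)  # weighted 2D kernel values
    # The 1D kernels are applied twice (horizontal then vertical pass), so scaling them
    # by 1/sqrt(total) makes the effective weighted 2D sum equal to 1.
    scale = 1.0 / np.sqrt(total)
    return [k * scale for k in kernels_1d]
```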
I got a bit delayed because I forgot my laptop got reset, and I had to track down some SSH keys! My head keeps switching between radius and width, so apologies if I say 1 somewhere and mean 0.5 or vice versa. Also this post ended up more meandering and messy than I hoped!

If we had some infinite number of components the function would be perfect: any value outside -r to +r would be zero, and any value inside would be 1. I've been thinking about it and I don't think there is any mathematically sound way of computing this function with a circular diameter less than 1 pixel. In this case the centre pixel should have a weight of 1, and all other pixels should have a weight of zero. Once the width of the kernel extends beyond 1 then it does make sense to compute the function. So personally I would simply skip processing for any radii less than 0.5. This is not the same as a Gaussian, where a standard deviation of e.g. 0.5 still has some small weighting from neighbouring pixels - it just approaches 0. Here it is actually 0.

The radius of the 1 component kernel is about 1. I scaled it slightly, by 1.2 or 1.4? I think this is because the edge is very sloped, and I wanted the flat bit at the top to appear of radius 1. In the end then it's essentially radius 1, but there's a blurry edge. For this reason it makes sense to include the surrounding pixels, so a kernel of 3x3 or maybe even 5x5. I might clamp the minimum kernel size at 5x5 to ensure everything of interest is included in all cases. For the slightly noisier low component kernels you're probably right that an integral would calculate a more realistic size for the kernel, and thus a better normalisation. More at the end. I don't think this is worth doing for radius < 0.5. Mathematically, I think a circle of width <1 would essentially represent an aperture so small that the amount of light passing through goes down, and it would actually darken the image based on the size of the circle. I don't think you want to implement this as it's not really what the filter was for :) And it wouldn't work with your "compare with zero using the slider" thing.

I've recomputed components 1 - 4. I don't think the fits are quite as good as the numbers you have (it's marginal) because the author tweaked some parts of his algorithm for these fits. There's a few if statements and stuff commented out, and no documentation on which was enabled when. He may have simply run it 100 times and picked the best. 1 2 3 4

I've been thinking about this image of the 1 component kernel. I think the integral is an issue, as is the general shape of this kernel with its highly negative values. At about 0.85, where you saw issues, I'm pretty sure the two negative dips at the edge would sit directly over the neighbouring pixels, and this would cause the sharpening effect you noticed. Consider also when the centre sits over 0, and the two dips over the -1 and +1 pixels. The sum gets calculated to determine by how much to normalise, but this measurement does not include the two peaks on the edge of the disk, and so it doesn't surprise me this ends up negative. To solve this we could approximate an integral by sub-sampling the pixels, e.g. calculating at -0.5 and +0.5 also. This causes knock-on effects, because again, looking above, there is no sampling happening at the two peaks when the blur is computed, and this is going to do odd things. I think this is less of a problem for >2 components because the centre of the disc is much flatter, so this has less effect. Does what I'm saying make sense?
If so, maybe we can come up with something. I think it might be a bit of a hack - <0.5 radius is zero blur, >1 radius is bokeh blur, and 0.5-1 is something else :)
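A quick sketch of the sub-sampling idea from the previous comment, assuming a hypothetical kernel_value(r_squared) function that evaluates the weighted component sum at a given squared distance from the centre:

```python
import numpy as np

def supersampled_kernel(kernel_value, radius_px, s=4):
    # Build a (2*radius_px+1)^2 kernel where each entry averages s*s sub-pixel samples
    # of kernel_value(r_squared), instead of sampling only at the pixel centre.
    size = 2 * radius_px + 1
    out = np.zeros((size, size))
    offsets = (np.arange(s) + 0.5) / s - 0.5   # e.g. [-0.375, -0.125, 0.125, 0.375] for s=4
    for j in range(size):
        for i in range(size):
            x, y = i - radius_px, j - radius_px
            samples = [kernel_value((x + dx) ** 2 + (y + dy) ** 2)
                       for dy in offsets for dx in offsets]
            out[j, i] = np.mean(samples)
    return out
```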
Alright, I'm back from a 3 week vacation in Europe (Romania and Ireland), so I'll be able to take a look at this again soon -- I'm entering the home stretch for PDN v5. Wrapping up a ton of proofreading for the new plugin APIs, and I should be able to tackle this bug in Bokeh too :)
@mikepound fyi I've released a public alpha for PDN v5: https://forums.getpaint.net/topic/121340-paintnet-50-alpha-build-8361-welcome-to-paintnet-50/ Feel free to share it out and show folks the new ultra-high quality Bokeh blur that you helped with :) |
Context: https://twitter.com/rickbrewPDN/status/1562499054952534018
The kernel parameters listed at https://github.com/mikepound/convolve/blob/master/complex_kernels.py are a set of 6 hardcoded 4-vector arrays. I've been having difficulty determining how these are actually generated, and how I might generate kernel parameters beyond the 6 that are listed here (as in, how to calculate for 7, 8, 9, etc.). The explanations I've found are not close enough to my comprehension of the math involved ("it's just a Gaussian kernel with a complex width" ... ??!! 😂 from: http://yehar.com/blog/?p=1495). Every example of Bokeh blur code I've found online lists the same set of pregenerated kernel values, with no code showing how to actually generate them.
My motivation is to experiment with higher "quality" settings in Paint.NET's new Bokeh blur filter. "Quality" is a slider in the UI that controls which set of kernel parameters are used, which is then tied to the number of rendering passes.
Ringing artifacts are present when there's an imbalance between quality, radius, and the gamma exponent (default is 3). I'm hoping to extend the range at which these values can be combined before artifacts are noticeable (or even eliminating them, if possible). My rendering pipeline is fully 32-bit floating point per pixel component (128-bits total), so there's a lot of room for extra precision vs. the typical 8-bit RGBA32.
My dream solution here would be a function that takes in an integer for the # of components, and returns an array of 4 vectors for the kernel parameters. For example,
generate_kernel_parameters(2)
would return [[0.886528, 5.268909, 0.411259, -0.548794], [1.960518, 1.558213, 0.513282, 4.56111]]
(the numbers from https://github.com/mikepound/convolve/blob/master/complex_kernels.py). I'm not sure how feasible or computationally expensive that is though.

cc @Sergio0694 @saucecontrol @JimBobSquarePants