I'm sorry if this is the wrong place to post this, but I'm curious what these two training modes do. What is the Image training mode even doing? The testbed image trainer leads me to believe it's trying to replicate one specific image and become very good at reproducing it. How is this useful? What am I missing? How can I make the gigapixel images described in the paper?
In this lecture, one of the authors of the NeRF paper describes the image replication you are referring to as a toy problem:

https://youtu.be/nRyOzHpcr4Q?t=1454

As he explains, the question they were exploring is why teaching an image to a neural network, even just to overfit and memorize the input data, is surprisingly hard, and how an input encoding can help. In their publication the encoding was Fourier-analysis based, or frequency based as instant-ngp refers to it. If I understand correctly, this parallels the positional encodings used in transformer-based models like GPT: both provide structural information for inputs that would otherwise look unstructured to the network. You can compare this yourself by trying to fit an image representation into a vanilla MLP (fully connected neural network) without any such "positional encoding"; there is a minimal sketch of the frequency encoding below.

What is proposed with instant-ngp is, in part, a comparison between different methods of encoding. The authors found that multi-resolution spatial hash grid structures can actually replace frequency-based encodings with better performance (see the second sketch below). Data structures like these are typical in computer graphics for describing and accelerating high-dimensional (e.g., three-dimensional) data, and they are highly optimized for parallel computing on GPUs. Beyond the encoding, the work also builds on the authors' previous tiny-cuda-nn, which I believe they published as well; it describes optimizations specific to newer NVIDIA GPUs, with bigger L1/L2 caches and different shared-memory architectures, that fuse typical MLP operations such as activations. Together, these have made this work outperform prior methods in spectacular ways.

SDF refers to signed distance functions, and as far as I know they were made popular through various articles. SDF representations are nothing new (just stating that, if not obvious, to make sure I am not misleading you when I refer to their popularity), but they are popular now partly because there are specific optimizations you can do with them to produce better final pixels (if the SDF can be modeled in a convenient way, which usually isn't the case) and to make spatial calculations faster, since the representation lends itself better to certain geometric computation paradigms (the third sketch below shows the classic sphere-tracing trick).

Having said all this, I believe the main goal of the authors, along with the other examples like radiance caching for ray tracers, volume renderers, or NeRF itself, is to show how generic this encoding can be across different tasks. I hope this helps you understand, but take my comments with a pinch of salt :D I am not one of the authors.
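To make the frequency-encoding point concrete, here is a minimal NumPy sketch of my own (not the instant-ngp implementation): it maps 2D pixel coordinates to sin/cos features at geometrically increasing frequencies, roughly what the NeRF paper's positional encoding does, and these features are what the MLP would be fed instead of raw coordinates.

```python
import numpy as np

def frequency_encode(xy, num_bands=6):
    """NeRF-style frequency encoding: map 2D coordinates in [0, 1]
    to sin/cos features at geometrically increasing frequencies."""
    feats = [xy]  # keep the raw coordinates as well
    for band in range(num_bands):
        freq = (2.0 ** band) * np.pi
        feats.append(np.sin(freq * xy))
        feats.append(np.cos(freq * xy))
    return np.concatenate(feats, axis=-1)

# Encode every pixel coordinate of a tiny 4x4 "image".
h = w = 4
ys, xs = np.mgrid[0:h, 0:w]
coords = np.stack([xs.ravel() / (w - 1), ys.ravel() / (h - 1)], axis=-1)
print(frequency_encode(coords).shape)  # (16, 26): the MLP sees these, not raw (x, y)
```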
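Here is a simplified sketch of the multi-resolution hash grid idea as well (again my own toy code; the real implementation is fused CUDA inside tiny-cuda-nn). The prime-based XOR hash follows the form given in the instant-ngp paper; in the real thing the feature tables are trainable parameters optimized jointly with the MLP.

```python
import numpy as np

PRIMES = (1, 2654435761)  # per-dimension hash primes, as in the instant-ngp paper (2D case)

def hash_coords(ix, iy, table_size):
    """Spatially hash an integer grid coordinate into a feature-table index."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % table_size

def hash_encode(xy, tables, resolutions):
    """Concatenate bilinearly interpolated features from several grid resolutions."""
    feats = []
    for table, res in zip(tables, resolutions):
        pos = xy * res
        ix, iy = int(pos[0]), int(pos[1])
        fx, fy = pos[0] - ix, pos[1] - iy  # bilinear interpolation weights
        corners = [(ix, iy), (ix + 1, iy), (ix, iy + 1), (ix + 1, iy + 1)]
        weights = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
        f = sum(w * table[hash_coords(cx, cy, len(table))]
                for (cx, cy), w in zip(corners, weights))
        feats.append(f)
    return np.concatenate(feats)  # coarse-to-fine features fed to the MLP

# Four resolution levels with tiny random tables of 2 features per entry.
rng = np.random.default_rng(0)
resolutions = [16, 32, 64, 128]
tables = [rng.normal(size=(2 ** 10, 2)) for _ in resolutions]
print(hash_encode(np.array([0.3, 0.7]), tables, resolutions).shape)  # (8,)
```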
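And on the SDF side, a tiny sketch of why the representation makes spatial calculations fast: because the function returns the distance to the nearest surface, a ray can safely advance by exactly that distance each iteration (sphere tracing). This is a hard-coded toy sphere of my own, not anything from the repo.

```python
import numpy as np

def sphere_sdf(p, center=np.array([0.0, 0.0, 3.0]), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4):
    """March a ray by stepping exactly the SDF value each iteration;
    the nearest surface is at least that far away, so we never overshoot."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:  # close enough: surface hit
            return t
        t += d       # safe step: radius of the largest empty sphere here
    return None      # no hit within the step budget

# A ray from the origin along +z hits the unit sphere centered at z = 3 at t = 2.
print(sphere_trace(np.zeros(3), np.array([0.0, 0.0, 1.0]), sphere_sdf))  # ~2.0
```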