NxImage.resize memory leak? #6
Comments
Are you queuing a lot of stuff for inference at once? It's possible the job is falling behind and holding a never-ending queue of tensors. |
Yes, it is worth mentioning that EXLA can only execute one operation at a time per device. So if you are using serving, you want to be confident you are batching them together, and you are not racing it with other Nx operations. The best use of this library is if you are batching it together with other parts of your ML graph, so you squeeze the resizing as part of your model inside the serving, otherwise StbImage will be better indeed. |
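As a rough illustration of "squeezing the resizing into the serving", here is a minimal sketch, assuming all images in a batch share one shape (the images in this thread vary in size, so this would not apply directly). `model_fun` is a hypothetical stand-in for a real model, and the dependency versions are copied from the reproduction script in this thread.

```elixir
Mix.install([
  {:nx, "~> 0.9.2"},
  {:exla, "~> 0.9.2"},
  {:nx_image, "~> 0.1.2"}
])

# Hypothetical "model": resize the batch, then reduce each image to one number.
# In a real setup the resize would feed into the actual model graph instead.
model_fun = fn batch ->
  batch
  |> NxImage.resize({224, 224})
  |> Nx.mean(axes: [1, 2, 3])
end

# The resize and the "model" are compiled together as one EXLA computation.
serving = Nx.Serving.jit(model_fun, compiler: EXLA)

# A batch with a single {496, 950, 3} image, as in the reproduction script.
batch = Nx.Batch.stack([Nx.iota({496, 950, 3}, type: :u8)])
result = Nx.Serving.run(serving, batch)
IO.inspect(Nx.shape(result))
```

The point of this arrangement is that the resize no longer runs as a separate Nx operation competing with the serving for the device.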
Hey guys, thanks for the quick responses and the tips! I'm using an Oban queue that is limited to one concurrent job. Observing the logs, I can confirm that only one job is processing at a time. Sorry for the confusion: by heavy load I meant pushing a lot of images through it vs. just testing a handful of images. These images were all still pushed through serially. I should also mention that memory usage continued to grow while using NxImage.resize. On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it. |
Oh, thank you! So I think there is indeed something going wrong here, but if I had to guess, it would be more on Nx.Serving side. Can you provide an example that allows us to reproduce it? You don't need to use Oban. Perhaps a script that starts the serving and sends the same file to it for resizing over and over again? |
In my use case I was performing the NxImage.resize outside of a serving - the serving was provided with Ortex since I was loading an ONNX model. Given that, I put together a livebook which mimics that setup. I can see about setting up a serving if that's still desirable. Just let me know. Here's the livebook: https://gist.github.com/stocks29/4f51df8a1e0dce46505f770faf83fb1d For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application. |
@stocks29 to make sure we are on the same page, you just load one image at a time, do
The notebook is actually using
config: [
-  config: [nx: [default_backend: EXLA.Backend]]
+  nx: [default_backend: EXLA.Backend]
] |
Also, just for more context, did it happen quickly, or over a longer period of time? |
Here's an updated example. I run it for a while and the memory usage does go up slowly, but reliably. I tried explicit GC and running the resize jitted, in both cases it seems to be going up anyway. Interestingly, it doesn't go up every iteration, sometimes it actually takes a while to change (at least as reported by
Mix.install([
  {:exla, "~> 0.9.2"},
  {:nx, "~> 0.9.2"},
  {:nx_image, "~> 0.1.2"}
])

Nx.global_default_backend(EXLA.Backend)

defmodule Test do
  def run() do
    # fun = EXLA.jit(&NxImage.resize(&1, {224, 224}))
    tensor = Nx.iota({496, 950, 3}, type: :u8)

    Enum.each(1..1_000_000, fn i ->
      if rem(i, 100) == 0, do: IO.puts("before: #{get_process_memory()} KB")
      NxImage.resize(tensor, {224, 224})
      # fun.(tensor)
      # :erlang.garbage_collect(self())
      if rem(i, 100) == 0, do: IO.puts("after: #{get_process_memory()} KB")
    end)
  end

  defp get_process_memory() do
    pid = System.pid()
    {result, 0} = System.cmd("ps", ~w(-o rss= -p #{pid}))
    result |> String.trim() |> String.to_integer()
  end
end

Test.run() |
I managed to observe the same behaviour in Jax, so this may be something in XLA. I opened an issue in jax-ml/jax#25184. |
Yes, that is correct: we load one image at a time, do the resize and then call batched_run. The images are of varying sizes. Thanks for pointing out my livebook config issue. I guess I had been staring at the screen too long.
It happened steadily over a long period of time. On average it increased by a few megabytes per image. |
I would keep this open, to track the upstream issue :) |
Oh, sorry, I didn't mean to close this. I merged a related PR in a private repo and GitHub automatically closed this one. |
I'm transforming images to run them through an Nx.Serving. One of the transformation steps is a resize. I was originally using the resize function in this module. When we first put it under heavy load, memory usage shot up quickly and never came back down. Swapping out the NxImage.resize call for StbImage.resize resolved the memory issue. I suspect there is some memory leak in the resize function of this library.
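For readers wanting to try the same workaround, the swap might look roughly like the sketch below. It assumes the image is already an Nx tensor of shape {height, width, channels} with type u8, as in the reproduction script. The `stb_image` version requirement is an assumption, and the actual application code was not shared in this thread.

```elixir
Mix.install([
  {:nx, "~> 0.9.2"},
  # Version requirement is an assumption; adjust to whatever your project uses.
  {:stb_image, "~> 0.6"}
])

# Same synthetic input as the reproduction script in this thread.
tensor = Nx.iota({496, 950, 3}, type: :u8)

# Resize on the CPU through StbImage instead of NxImage.resize.
# StbImage.resize/3 takes the output height first, then the output width.
resized =
  tensor
  |> StbImage.from_nx()
  |> StbImage.resize(224, 224)
  |> StbImage.to_nx()

IO.inspect(Nx.shape(resized))
```

Note that this moves the resize out of the Nx/EXLA graph entirely, which matches the earlier advice that StbImage is the better fit when the resize is not batched together with the rest of the model.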