NxImage.resize memory leak? #6

Open
stocks29 opened this issue Nov 28, 2024 · 12 comments

@stocks29

I'm transforming images to run them through an Nx.Serving. One of the transformation steps is a resize, and I was originally using the resize function in this module. When we first put it under heavy load, memory usage shot up quickly and never came back down. Swapping out the NxImage.resize call for StbImage.resize resolved the memory issue, so I suspect there is a memory leak in the resize function of this library.

@seanmor5

Are you queuing a lot of stuff for inference at once? It's possible the job is falling behind and holding a never-ending queue of tensors.

@josevalim
Contributor

Yes, it is worth mentioning that EXLA can only execute one operation at a time per device. So if you are using a serving, you want to be confident you are batching requests together and that you are not racing it with other Nx operations.

The best use of this library is when you batch it together with other parts of your ML graph, so the resizing runs as part of your model inside the serving; otherwise StbImage will indeed be better.
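A minimal sketch of that suggestion (not from this issue; the module name and shapes are illustrative, and Nx.mean stands in for a real model), where the resize compiles into the same EXLA graph as the inference step:

Mix.install([
  {:exla, "~> 0.9"},
  {:nx, "~> 0.9"},
  {:nx_image, "~> 0.1"}
])

defmodule Pipeline do
  import Nx.Defn

  # The resize is part of the compiled graph, so it runs on the device together
  # with the (placeholder) inference step rather than as a separate eager call
  # per image.
  defn predict(images) do
    images
    |> NxImage.resize({224, 224})
    |> Nx.mean(axes: [1, 2, 3])
  end
end

serving = Nx.Serving.jit(&Pipeline.predict/1, compiler: EXLA)
batch = Nx.Batch.stack([Nx.iota({300, 400, 3}, type: :u8)])
Nx.Serving.run(serving, batch)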

@stocks29
Author

stocks29 commented Nov 28, 2024

Hey guys, thanks for the quick responses and the tips!

I'm using an Oban queue that is limited to one concurrent job. Observing the logs, I can confirm that only one job is processing at a time.

Sorry for the confusion: by heavy load I meant pushing a lot of images through it rather than just testing a handful. The images were all still pushed through serially.

I should also mention that memory usage continued to grow while using NxImage.resize. On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

@josevalim
Contributor

Oh, thank you! So I think there is indeed something going wrong here, but if I had to guess, it would be more on the Nx.Serving side. Can you provide an example that allows us to reproduce it? You don't need to use Oban. Perhaps a script that starts the serving and sends the same file to it for resizing over and over again?

@stocks29
Author

In my use case I was performing the NxImage.resize outside of a serving; the serving was provided by Ortex since I was loading an ONNX model. Given that, I put together a Livebook which mimics that setup (the general flow is sketched below). I can see about setting up a serving if that's still desirable; just let me know.

Here's the livebook:

https://gist.github.com/stocks29/4f51df8a1e0dce46505f770faf83fb1d

For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.
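A rough, hypothetical sketch of that flow (the name MyApp.OnnxServing is a placeholder, and the Ortex-backed serving is assumed to be already started under a supervisor), resizing each image eagerly and only then calling the serving:

image = Nx.iota({300, 400, 3}, type: :u8)

result =
  image
  # Eager resize with the default backend, outside the serving
  |> NxImage.resize({224, 224})
  # Hand the single resized image to the already running serving process
  |> then(fn resized ->
    Nx.Serving.batched_run(MyApp.OnnxServing, Nx.Batch.stack([resized]))
  end)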

@jonatanklosko
Member

@stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?

> For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.

The notebook is actually using Nx.BinaryBackend; it should be this:

  config: [
-    config: [nx: [default_backend: EXLA.Backend]]
+    nx: [default_backend: EXLA.Backend]
  ]
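With that fix applied, the notebook's Mix.install would look roughly like this (package list borrowed from the reproduction script below, not copied from the gist):

Mix.install(
  [
    {:exla, "~> 0.9.2"},
    {:nx, "~> 0.9.2"},
    {:nx_image, "~> 0.1.2"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)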

@jonatanklosko
Member

> On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

Also, just for more context, did it happen quickly, or over a longer period of time?

@jonatanklosko
Member

jonatanklosko commented Nov 29, 2024

Here's an updated example. I ran it for a while and the memory usage does go up slowly, but reliably. I tried explicit GC and running the resize jitted; in both cases it seems to go up anyway. Interestingly, it doesn't go up every iteration; sometimes it actually takes a while to change (at least as reported by ps).

Mix.install([
  {:exla, "~> 0.9.2"},
  {:nx, "~> 0.9.2"},
  {:nx_image, "~> 0.1.2"}
])

Nx.global_default_backend(EXLA.Backend)

defmodule Test do
  def run() do
    # Variant: jit the resize once and call fun.(tensor) in the loop below
    # fun = EXLA.jit(&NxImage.resize(&1, {224, 224}))

    # Fixed-size dummy image tensor
    tensor = Nx.iota({496, 950, 3}, type: :u8)

    Enum.each(1..1_000_000, fn i ->
      if rem(i, 100) == 0, do: IO.puts("before: #{get_process_memory()} KB")

      NxImage.resize(tensor, {224, 224})
      # fun.(tensor)
      # :erlang.garbage_collect(self())

      if rem(i, 100) == 0, do: IO.puts("after:  #{get_process_memory()} KB")
    end)
  end

  # Resident set size (RSS) of the OS process in KB, as reported by ps
  defp get_process_memory() do
    pid = System.pid()
    {result, 0} = System.cmd("ps", ~w(-o rss= -p #{pid}))
    result |> String.trim() |> String.to_integer()
  end
end

Test.run()

@jonatanklosko
Member

I managed to observe the same behaviour in JAX, so this may be something in XLA. I opened an issue at jax-ml/jax#25184.

@stocks29
Author

> @stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?

Yes, that is correct: we load one image at a time, do the resize, and then call batched_run. The images are of varying sizes.

> The notebook is actually using Nx.BinaryBackend, it should be this: [...]

Thanks for pointing out my Livebook config issue. I guess I had been staring at the screen too long.

> Also, just for more context, did it happen quickly, or over a longer period of time?

It happened steadily over a long period of time. On average it increased by a few megabytes per image.

stocks29 closed this as completed Dec 2, 2024
@jonatanklosko
Member

I would keep this open, to track the upstream issue :)

jonatanklosko reopened this Dec 2, 2024
@stocks29
Author

stocks29 commented Dec 2, 2024

Oh, sorry, I didn't mean to close this. I merged a related PR in a private repo and GitHub automatically closed this one.
