NxImage.resize memory leak? #6

Open
stocks29 opened this issue Nov 28, 2024 · 12 comments

@stocks29

I'm transforming images to run them through an Nx.Serving. One of the transformation steps is a resize, and I was originally using the resize function in this module. When we first put it under heavy load, memory usage shot up quickly and never came back down. Swapping out the NxImage.resize call for StbImage.resize resolved the memory issue, so I suspect there is a memory leak in the resize function of this library.

@seanmor5

Are you queuing a lot of stuff for inference at once? It's possible the job is falling behind and holding a never-ending queue of tensors.

@josevalim
Contributor

Yes, it is worth mentioning that EXLA can only execute one operation at a time per device. So if you are using a serving, you want to be confident you are batching requests together and that you are not racing it with other Nx operations.

The best use of this library is when you batch it together with other parts of your ML graph, so the resizing runs as part of your model inside the serving; otherwise StbImage will indeed be better.
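A minimal sketch of that suggestion (not from this issue; the module name and shapes are illustrative, and Nx.mean stands in for a real model), where the resize compiles into the same EXLA graph as the inference step:

Mix.install([
  {:exla, "~> 0.9"},
  {:nx, "~> 0.9"},
  {:nx_image, "~> 0.1"}
])

defmodule Pipeline do
  import Nx.Defn

  # The resize is part of the compiled graph, so it runs on the device together
  # with the (placeholder) inference step rather than as a separate eager call
  # per image.
  defn predict(images) do
    images
    |> NxImage.resize({224, 224})
    |> Nx.mean(axes: [1, 2, 3])
  end
end

serving = Nx.Serving.jit(&Pipeline.predict/1, compiler: EXLA)
batch = Nx.Batch.stack([Nx.iota({300, 400, 3}, type: :u8)])
Nx.Serving.run(serving, batch)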

@stocks29
Author

stocks29 commented Nov 28, 2024

Hey guys, thanks for the quick responses and the tips!

I'm using an Oban queue that is limited to one concurrent job. Observing the logs, I can confirm that only one job is processing at a time.

Sorry for the confusion: by heavy load I meant pushing a lot of images through it rather than just testing a handful. The images were all still pushed through serially.

I should also mention that memory usage continued to grow while using NxImage.resize. On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

@josevalim
Contributor

Oh, thank you! So I think there is indeed something going wrong here, but if I had to guess, it would be more on the Nx.Serving side. Can you provide an example that allows us to reproduce it? You don't need to use Oban. Perhaps a script that starts the serving and sends the same file to it for resizing over and over again?

@stocks29
Author

In my use case I was performing the NxImage.resize outside of a serving; the serving was provided by Ortex since I was loading an ONNX model. Given that, I put together a Livebook which mimics that setup (the general flow is sketched below). I can see about setting up a serving if that's still desirable; just let me know.

Here's the livebook:

https://gist.github.com/stocks29/4f51df8a1e0dce46505f770faf83fb1d

For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.
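A rough, hypothetical sketch of that flow (the name MyApp.OnnxServing is a placeholder, and the Ortex-backed serving is assumed to be already started under a supervisor), resizing each image eagerly and only then calling the serving:

image = Nx.iota({300, 400, 3}, type: :u8)

result =
  image
  # Eager resize with the default backend, outside the serving
  |> NxImage.resize({224, 224})
  # Hand the single resized image to the already running serving process
  |> then(fn resized ->
    Nx.Serving.batched_run(MyApp.OnnxServing, Nx.Batch.stack([resized]))
  end)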

@jonatanklosko
Member

@stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?

> For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.

The notebook is actually using Nx.BinaryBackend; it should be this:

  config: [
-    config: [nx: [default_backend: EXLA.Backend]]
+    nx: [default_backend: EXLA.Backend]
  ]
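With that fix applied, the notebook's Mix.install would look roughly like this (package list borrowed from the reproduction script below, not copied from the gist):

Mix.install(
  [
    {:exla, "~> 0.9.2"},
    {:nx, "~> 0.9.2"},
    {:nx_image, "~> 0.1.2"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)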

@jonatanklosko
Member

> On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

Also, just for more context, did it happen quickly, or over a longer period of time?

@jonatanklosko
Member

jonatanklosko commented Nov 29, 2024

Here's an updated example. I ran it for a while and the memory usage does go up slowly, but reliably. I tried explicit GC and running the resize jitted; in both cases it seems to go up anyway. Interestingly, it doesn't go up every iteration; sometimes it actually takes a while to change (at least as reported by ps).

Mix.install([
  {:exla, "~> 0.9.2"},
  {:nx, "~> 0.9.2"},
  {:nx_image, "~> 0.1.2"}
])

Nx.global_default_backend(EXLA.Backend)

defmodule Test do
  def run() do
    # Variant: jit the resize once and call fun.(tensor) in the loop below
    # fun = EXLA.jit(&NxImage.resize(&1, {224, 224}))

    # Fixed-size dummy image tensor
    tensor = Nx.iota({496, 950, 3}, type: :u8)

    Enum.each(1..1_000_000, fn i ->
      if rem(i, 100) == 0, do: IO.puts("before: #{get_process_memory()} KB")

      NxImage.resize(tensor, {224, 224})
      # fun.(tensor)
      # :erlang.garbage_collect(self())

      if rem(i, 100) == 0, do: IO.puts("after:  #{get_process_memory()} KB")
    end)
  end

  # Resident set size (RSS) of the OS process in KB, as reported by ps
  defp get_process_memory() do
    pid = System.pid()
    {result, 0} = System.cmd("ps", ~w(-o rss= -p #{pid}))
    result |> String.trim() |> String.to_integer()
  end
end

Test.run()

@jonatanklosko
Member

I managed to observe the same behaviour in JAX, so this may be something in XLA. I opened an issue at jax-ml/jax#25184.

@stocks29
Author

> @stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?

Yes, that is correct: we load one image at a time, do the resize, and then call batched_run. The images are of varying sizes.

> The notebook is actually using Nx.BinaryBackend, it should be this: [...]

Thanks for pointing out my Livebook config issue. I guess I had been staring at the screen too long.

> Also, just for more context, did it happen quickly, or over a longer period of time?

It happened steadily over a long period of time. On average it increased by a few megabytes per image.

stocks29 closed this as completed Dec 2, 2024
@jonatanklosko
Member

I would keep this open, to track the upstream issue :)

jonatanklosko reopened this Dec 2, 2024
@stocks29
Author

stocks29 commented Dec 2, 2024

Oh, sorry, I didn't mean to close this. I merged a related PR in a private repo and GitHub automatically closed this one.
