
[VL] Distinct aggregation OOM when getOutput #8025

Open
ccat3z opened this issue Nov 22, 2024 · 19 comments
Labels
bug Something isn't working triage

Comments

@ccat3z
Contributor

ccat3z commented Nov 22, 2024

Backend

VL (Velox)

Bug description

Distinct aggregation merges all sorted spill files in getOutput() (SpillPartition::createOrderedReader). If there are too many spill files, reading the first batch of each file into memory consumes a significant amount of memory. In one of our internal cases, one task generated 300 spill files, which required close to 3 GB of memory (a rough estimate of how this scales is sketched after the workarounds below).


Possible workarounds:

  1. Increase kMaxSpillRunRows; at 1M, inputs of hundreds of millions of rows generate too many spill files. See [GLUTEN-7249][VL] Lower default overhead memory ratio and spill run size #7531.
  2. Reduce kSpillWriteBufferSize to 1M or lower. Why is it set to 4M by default? Is there any performance tuning experience behind that choice?
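Below is a back-of-envelope sketch (not Gluten or Velox code; the struct and function names are made up for illustration) of how the merge-phase memory scales with the number of spill files, following the per-file accounting discussed later in this thread: the FileInputStream read buffer (doubled when read-ahead is enabled) plus the decoded first batch, which is roughly kSpillWriteBufferSize in size.

#include <cstdint>

// Hypothetical helper for illustration only; not part of Velox or Gluten.
struct SpillMergeParams {
  uint64_t numSpillFiles;    // e.g. ~300 in the reported task
  uint64_t readBufferBytes;  // kSpillReadBufferSize, 1 MB by default
  bool readAheadEnabled;     // FileInputStream allocates a second buffer
  uint64_t firstBatchBytes;  // decoded RowVector, roughly kSpillWriteBufferSize
};

inline uint64_t estimateMergeMemory(const SpillMergeParams& p) {
  const uint64_t perFileBuffers =
      p.readBufferBytes * (p.readAheadEnabled ? 2 : 1);
  return p.numSpillFiles * (perFileBuffers + p.firstBatchBytes);
}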

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@ccat3z ccat3z added bug Something isn't working triage labels Nov 22, 2024
@FelixYBW
Contributor

FelixYBW commented Nov 23, 2024

Looks like it's the same issue as shuffle spill. All spill merges should have the same issue; we should solve it in a similar way.

  1. is a tradeoff between the number of spill files and overhead memory.
  2. It was set by PR [VL] Add 3 configs of spill #5088. I didn't do any testing on it, and I'm not sure why I set it to 4M while the Velox default is 1M. Let's decrease the default value to 1M and also check the other configs.

What's vanilla Spark's spill buffer size? Is it configurable? In theory vanilla Spark has the same issue as Gluten. @jinchengchenghh do you know?

@FelixYBW
Contributor

I can only find the configuration spark.shuffle.spill.diskWriteBufferSize.

There is no spill-merge one.

@FelixYBW
Contributor

FelixYBW commented Nov 23, 2024

Thank you, @ccat3z. I encountered the same issue in the orderby operator and spent several days debugging it!

@jinchengchenghh
Contributor

The kSpillReadBufferSize controls the read size for each file; the ordered reader creates a FileInputStream for each file and allocates kSpillReadBufferSize (default 1 MB) per file. Can you try adjusting this value?
Maybe we should add a new config to control the read buffer size across all the files, so that numFiles * bufferSize < threshold.

for (auto& fileInfo : files_) {
  streams.push_back(FileSpillMergeStream::create(
      SpillReadFile::create(fileInfo, bufferSize, pool, spillStats)));
}

input_ = std::make_unique<common::FileInputStream>(
    std::move(file), bufferSize, pool_);

FileInputStream::FileInputStream(
    std::unique_ptr<ReadFile>&& file,
    uint64_t bufferSize,
    memory::MemoryPool* pool)
    : file_(std::move(file)),
      fileSize_(file_->size()),
      bufferSize_(std::min(fileSize_, bufferSize)),
      pool_(pool),
      readAheadEnabled_((bufferSize_ < fileSize_) && file_->hasPreadvAsync()) {
  VELOX_CHECK_NOT_NULL(pool_);
  VELOX_CHECK_GT(fileSize_, 0, "Empty FileInputStream");

  buffers_.push_back(AlignedBuffer::allocate<char>(bufferSize_, pool_)); // allocating this buffer causes the OOM
  if (readAheadEnabled_) {
    buffers_.push_back(AlignedBuffer::allocate<char>(bufferSize_, pool_));
  }
  readNextRange();
}

kSpillWriteBufferSize controls the serialization buffer; once the buffer reaches this threshold, it is flushed and compressed.
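A rough illustration of the proposed numFiles * bufferSize < threshold idea (not existing Velox code; the function name and the total-budget parameter are hypothetical): shrink the per-file read buffer when many spill files have to be merged at once.

#include <algorithm>
#include <cstdint>

// Hypothetical sketch of a global read-buffer budget; all names are made up.
inline uint64_t perFileReadBufferSize(
    uint64_t numSpillFiles,
    uint64_t defaultBufferSize,   // kSpillReadBufferSize, 1 MB by default
    uint64_t totalBudgetBytes) {  // the proposed new threshold
  if (numSpillFiles == 0) {
    return defaultBufferSize;
  }
  // Keep a small floor (64 KB, chosen arbitrarily) so reads stay reasonable.
  constexpr uint64_t kMinBuffer = 64ULL << 10;
  const uint64_t fairShare = totalBudgetBytes / numSpillFiles;
  return std::max(kMinBuffer, std::min(fairShare, defaultBufferSize));
}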

@jinchengchenghh
Contributor

jinchengchenghh commented Nov 25, 2024

Spark also opens all the spill files for reading.

final UnsafeSorterSpillMerger spillMerger = new UnsafeSorterSpillMerger(
    recordComparatorSupplier.get(), prefixComparator, spillWriters.size());
for (UnsafeSorterSpillWriter spillWriter : spillWriters) {
  spillMerger.addSpillIfNotEmpty(spillWriter.getReader(serializerManager));
}
if (inMemSorter != null) {
  readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
  spillMerger.addSpillIfNotEmpty(readingIterator);
}
return spillMerger.getSortedIterator();

Spark uses a PriorityQueue<UnsafeSorterIterator> to pick the next record to merge.

Comparator<UnsafeSorterIterator> comparator = (left, right) -> {
      int prefixComparisonResult =
        prefixComparator.compare(left.getKeyPrefix(), right.getKeyPrefix());
      if (prefixComparisonResult == 0) {
        return recordComparator.compare(
          left.getBaseObject(), left.getBaseOffset(), left.getRecordLength(),
          right.getBaseObject(), right.getBaseOffset(), right.getRecordLength());
      } else {
        return prefixComparisonResult;
      }
    };
    priorityQueue = new PriorityQueue<>(numSpills, comparator);

It has configs to control the read-ahead and the read buffer size (default 1 MB), as follows:

  private[spark] val UNSAFE_SORTER_SPILL_READ_AHEAD_ENABLED =
    ConfigBuilder("spark.unsafe.sorter.spill.read.ahead.enabled")
      .internal()
      .version("2.3.0")
      .booleanConf
      .createWithDefault(true)

  private[spark] val UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE =
    ConfigBuilder("spark.unsafe.sorter.spill.reader.buffer.size")
      .internal()
      .version("2.1.0")
      .bytesConf(ByteUnit.BYTE)
      .checkValue(v => 1024 * 1024 <= v && v <= MAX_BUFFER_SIZE_BYTES,
        s"The value must be in allowed range [1,048,576, ${MAX_BUFFER_SIZE_BYTES}].")
      .createWithDefault(1024 * 1024)

class UnsafeSorterSpillReader

final InputStream bs =
        new NioBufferedFileInputStream(file, bufferSizeBytes);
if (readAheadEnabled) {
        this.in = new ReadAheadInputStream(serializerManager.wrapStream(blockId, bs),
                bufferSizeBytes);
      } else {
        this.in = serializerManager.wrapStream(blockId, bs);
      }
      this.din = new DataInputStream(this.in);

It needs to allocate bufferSizeBytes in NioBufferedFileInputStream, plus 2 × bufferSizeBytes in ReadAheadInputStream. After a record is loaded, the UnsafeSorterIterator reader is put back into the priorityQueue to load the next record. But the buffers are allocated with ByteBuffer and are not tracked by the Spark memory pool. @FelixYBW
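Per the description above, the worst-case untracked read memory on the Spark side is roughly bufferSizeBytes + 2 × bufferSizeBytes = 3 × 1 MB = 3 MB per open spill file with the default buffer size and read-ahead enabled, so a merge over 300 spill files would pin on the order of 900 MB outside the Spark memory pool.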

@jinchengchenghh
Contributor

In that case, we need to respect UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE.
Velox spills RowVectors, so we must read a buffer-sized chunk ahead. But it would be better to add a config for the total buffer size when UNSAFE_SORTER_SPILL_READ_AHEAD_ENABLED is false.

@FelixYBW
Contributor

Thank you, @jinchengchenghh. With the tuning of kMaxSpillRunRows and kSpillWriteBufferSize, one of my tasks succeeded but the other one still fails. Looks like there is still some large memory allocation in getOutput.

@FelixYBW
Contributor

The kSpillReadBufferSize controls the read size for each file; the ordered reader creates a FileInputStream for each file and allocates kSpillReadBufferSize (default 1 MB) per file. Can you try adjusting this value? Maybe we should add a new config to control the read buffer size across all the files, so that numFiles * bufferSize < threshold.

Can you add it as a config in Gluten?

@FelixYBW
Contributor

The kSpillReadBufferSize controls the read size for each file; the ordered reader creates a FileInputStream for each file and allocates kSpillReadBufferSize (default 1 MB) per file. Can you try adjusting this value? Maybe we should add a new config to control the read buffer size across all the files, so that numFiles * bufferSize < threshold.

Should we propose the approach of #7861?

@FelixYBW
Contributor

In that case, we need to respect UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE. Velox spills RowVectors, so we must read a buffer-sized chunk ahead. But it would be better to add a config for the total buffer size when UNSAFE_SORTER_SPILL_READ_AHEAD_ENABLED is false.

So the worst case for vanilla Spark is also a 1 MB buffer per file, right? Let's honor the value of spark.unsafe.sorter.spill.reader.buffer.size then. It may be set per query.

@jinchengchenghh
Contributor

The kSpillReadBufferSize controls the read size for each file; the ordered reader creates a FileInputStream for each file and allocates kSpillReadBufferSize (default 1 MB) per file. Can you try adjusting this value? Maybe we should add a new config to control the read buffer size across all the files, so that numFiles * bufferSize < threshold.

Should we propose the approach of #7861?

#7861 releases the buffer after the read, and the Velox FileInputStream reuses a readBufferSize buffer to read the file, so it works in a similar way.

@jinchengchenghh
Contributor

In that case, we need to respect UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE. Velox spills RowVectors, so we must read a buffer-sized chunk ahead. But it would be better to add a config for the total buffer size when UNSAFE_SORTER_SPILL_READ_AHEAD_ENABLED is false.

So the worst case for vanilla Spark is also a 1 MB buffer per file, right? Let's honor the value of spark.unsafe.sorter.spill.reader.buffer.size then. It may be set per query.

Yes, I will draft a PR to respect this config.

@jinchengchenghh
Contributor

Thank you, @jinchengchenghh. With the tuning of kMaxSpillRunRows and kSpillWriteBufferSize, one of my tasks succeeded but the other one still fails. Looks like there is still some large memory allocation in getOutput.

Maybe it's because the streams hold all the buffers, which are only released after all the files have been read.
In the meantime, compression consumes a lot of buffer memory that is not tracked by the memory pool. https://github.com/facebookincubator/velox/blob/main/velox/serializers/PrestoSerializer.cpp#L4416

I don't see compression in the Spark spill path, so Spark doesn't need to request memory for compression.

I will add a new config to control the Velox spill codec.

Is it still an OOM, or is the task killed by YARN?

@jinchengchenghh
Contributor

Spark closes the reader in loadNext once all of its records have been consumed.
I would prefer to close the input file once it is atEnd: https://github.com/facebookincubator/velox/blob/main/velox/exec/SpillFile.cpp#L314
That would help release the memory as early as possible.
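A minimal sketch of that idea, with hypothetical types rather than the actual SpillFile/FileInputStream API: release the underlying stream, and with it its buffers, as soon as the file is exhausted instead of at the end of the whole merge.

#include <memory>

// Hypothetical stand-in for a per-file spill input; not a Velox class.
struct FileStream {
  virtual ~FileStream() = default;
  virtual bool atEnd() const = 0;
  virtual bool readNextBatch() = 0;  // returns false when no more data
};

// Wrapper that drops the stream (and its read buffers) eagerly.
class EagerlyClosingStream {
 public:
  explicit EagerlyClosingStream(std::unique_ptr<FileStream> input)
      : input_(std::move(input)) {}

  bool next() {
    if (input_ == nullptr) {
      return false;
    }
    const bool hasMore = input_->readNextBatch();
    if (!hasMore || input_->atEnd()) {
      // Release the file and its buffers as early as possible.
      input_.reset();
    }
    return hasMore;
  }

 private:
  std::unique_ptr<FileStream> input_;
};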

@jinchengchenghh
Contributor

UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE

And we should also close the serializer in the serde.

@FelixYBW
Contributor

#7861 releases the buffer after the read, and the Velox FileInputStream reuses a readBufferSize buffer to read the file, so it works in a similar way.

No, #7861 uses mmap; memory is mapped into user space directly. Velox uses file read/write, so data is copied into a buffer.

@FelixYBW
Contributor

Yes, I will draft a PR to respect this config.

I'm adding it in #8026.

@FelixYBW
Contributor

I will add a new config to control the Velox spill codec.

Is it still an OOM, or is the task killed by YARN?

We should already have spill codec and spill buffer configs.
It's an OOM, not a kill by YARN.

@jinchengchenghh
Contributor

jinchengchenghh commented Nov 26, 2024

Now it uses the Spark codec spark.io.compression.codec; I will add a config to control it separately.
PrestoVectorSerde has a header, so it has to read the compressed data and deserialize it into a RowVector, and it holds one RowVector per file. The many RowVectors may cause OOM. Writing flushes the RowVector once it reaches kSpillWriteBufferSize. So for each file there is a kSpillReadBufferSize buffer reserved in FileInputStream, plus one RowVector whose size is approximately kSpillWriteBufferSize.

In SortBuffer::getOutputWithSpill, the std::vector<const RowVector*> spillSources_ and std::vector<vector_size_t> spillSourceRows_ are sized to outputSize; they also hold memory that is not tracked, which may cause a kill by YARN. I will draft a PR to track this memory.
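For scale: assuming vector_size_t is a 32-bit index, each output row costs sizeof(const RowVector*) + sizeof(vector_size_t) = 8 + 4 = 12 bytes across those two vectors on a 64-bit build, so the untracked footprint is roughly outputSize × 12 bytes.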
