Merge commit 'c226821149756c895934eb6412cbc397c7ec053c' into dev
Apollo3zehn committed Sep 18, 2023
2 parents 51981de + c226821 commit 34ce9ce
Showing 14 changed files with 543 additions and 276 deletions.
3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -6,5 +6,6 @@
"python.testing.pytestEnabled": true,
"python.analysis.extraPaths": [
"src/clients/python-client"
]
],
"dotnet.defaultSolution": "Nexus.sln"
}
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,8 @@
## v2.0.0-beta.16 - 2023-09-18

### Bugs fixed:
- Fixed a multithreading bug affecting the aggregation calculations. This bug very likely caused incorrect aggregation data for datasets with many NaN values, probably over a long period of time.

## v2.0.0-beta.15 - 2023-07-13

### Bugs fixed:
24 changes: 24 additions & 0 deletions notes/system-resources.md
@@ -0,0 +1,24 @@
# How to avoid issues with Nexus being killed by the Linux OOM killer?

- The value of `GcInfo.TotalAvailableMemoryBytes` is equal to the available physical memory when the current container the process is running inside has no specific memory limit. If it has a limit, the value of `GcInfo.TotalAvailableMemoryBytes` will be equal to `75%` of that value as long as `System.GC.HeapHardLimitPercent` is not set [[learn.microsoft.com]](https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#heap-limit). This behavior has been confirmed by multiple tests using a simple memory allocation application and Docker Compose.

- When the application allocates memory that is not used immediately, the OS (Linux) commits the memory but physically allocates it only when the associated memory page is accessed for the first time. This behavior avoids underutilizing physical RAM, but the disadvantage is that the system may suddenly run out of memory when the committed memory is actually used. Since this happens when a memory page is accessed for the first time, it is hard to determine which line of code is likely to trigger the Linux OOM killer.

- There is also the `GcInfo.HighMemoryLoadThresholdBytes` setting, which is at 90 % of the physical memory. I don't know exactly why this value is not adapted to the value of `System.GC.HeapHardLimitPercent` [[github.com]](https://github.com/dotnet/runtime/issues/58974). Microsoft writes the following about the limit: *[...] for the dominant process on a machine with 64GB of memory, it's reasonable for GC to start reacting when there's 10% of memory available.* [[learn.microsoft.com]](https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#high-memory-percent). This may explain why Nexus does not always OOM but only when more than 10 % of the memory is available and Nexus tries to allocate an array > 10 % of the available memory. In that case the GC would not become active and the OS would run the OOM killer.

- On Unix-based OS, there is currently no low memory notification [[github.com]](https://github.com/dotnet/runtime/issues/6051).

- With the above-mentioned memory limits set to a value less than the available physical memory (e.g. by using the `DOTNET_GCHeapHardLimit` variable), either the GC runs in time and the allocation succeeds or we get an `OutOfMemoryException` (I don't know why the GC is not always able to free enough memory). Either way, the application now stays alive reliably.

- So the simplest solution to avoid OOM issues might be to
  - configure Docker Compose to apply a memory limit to the container (*Docker resource limits are built on top of cgroups, which is a Linux kernel capability* [[devblogs.microsoft.com]](https://devblogs.microsoft.com/dotnet/using-net-and-docker-together-dockercon-2019-update/)),
  - use `MemoryPool<T>.Shared.Rent` wherever possible to avoid allocations and
  - catch `OutOfMemoryException` in potentially large array allocations to run the GC and retry once
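
The last two points of the list above could look roughly like the following sketch. `LargeBuffer` and `RentWithRetry` are hypothetical names that do not exist in Nexus; only `MemoryPool<T>.Shared.Rent`, `GC.Collect` and `GC.WaitForPendingFinalizers` are real .NET APIs. This illustrates the retry-once idea under those assumptions and is not the implementation used in this commit.

```csharp
using System;
using System.Buffers;

// Hypothetical helper for illustration only, not part of Nexus.
public static class LargeBuffer
{
    // Rents a buffer from the shared memory pool. If the allocation fails with
    // an OutOfMemoryException, a blocking garbage collection is forced and the
    // allocation is retried exactly once. A second failure propagates to the caller.
    public static IMemoryOwner<T> RentWithRetry<T>(int elementCount)
    {
        try
        {
            return MemoryPool<T>.Shared.Rent(elementCount);
        }
        catch (OutOfMemoryException)
        {
            // Give the GC a chance to free memory before the single retry.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            return MemoryPool<T>.Shared.Rent(elementCount);
        }
    }
}
```

The pool may hand out a buffer that is larger than requested, so callers would typically slice it, e.g. `owner.Memory.Slice(0, elementCount)`, and dispose the owner only when the data is no longer needed.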

More resources
- **.NET Core application running in docker gets OOMKilled if swapping is disabled** [[github.com]](https://github.com/dotnet/runtime/issues/851)

- **net5.0 console apps on linux don't show OutOfMemoryExceptions before being OOM-killed** - *It is not possible for .NET runtime to reliably throw OutOfMemoryException on Linux, unless you disable oom killer.
Note that average .NET process uses number of unmanaged libraries. .NET runtime does not have control over allocations
done by these libraries. If the library happens to make an allocation that overshoots the memory killer limit,
the Linux OS will happily make this allocation succeed, only to kill the process later.* [[github.com]](https://github.com/dotnet/runtime/issues/46147#issuecomment-747471498)
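
To make the percentages from the first and third notes concrete, here is a small arithmetic sketch. `GcLimitEstimate` and its methods are hypothetical and only restate the rules as written in these notes (75 % of the container limit, 90 % of the physical memory); they do not query the runtime and ignore any edge cases the runtime may handle differently.

```csharp
using System;

// Hypothetical helper that only restates the rules from the notes above.
public static class GcLimitEstimate
{
    // Without a container limit: the full physical memory.
    // With a limit (and System.GC.HeapHardLimitPercent not set): 75 % of that limit.
    public static long TotalAvailableMemoryBytes(long physicalMemoryBytes, long? containerLimitBytes)
    {
        return containerLimitBytes.HasValue
            ? containerLimitBytes.Value / 4 * 3
            : physicalMemoryBytes;
    }

    // 90 % of the physical memory, as stated in the notes.
    public static long HighMemoryLoadThresholdBytes(long physicalMemoryBytes)
    {
        return physicalMemoryBytes / 10 * 9;
    }
}
```

For a host with 16 GiB of RAM and a container limit of 4 GiB, this gives 3 GiB of total available memory, while the high memory load threshold stays at roughly 14.4 GiB, far above what the container is allowed to use.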
6 changes: 3 additions & 3 deletions src/Nexus/API/JobsController.cs
@@ -234,7 +234,7 @@ public async Task<ActionResult<Job>> ExportAsync(
catch (Exception ex)
{
_logger.LogError(ex, "Unable to export the requested data.");
throw new Exception("Unable to export the requested data.", ex);
throw;
}
});

@@ -267,7 +267,7 @@ public ActionResult<Job> RefreshDatabase()
catch (Exception ex)
{
_logger.LogError(ex, "Unable to reload extensions and reset the resource catalog.");
throw new Exception("Unable to reload extensions and reset the resource catalog.", ex);
throw;
}
});

@@ -307,7 +307,7 @@ public async Task<ActionResult<Job>> ClearCacheAsync(
catch (Exception ex)
{
_logger.LogError(ex, "Unable to clear the cache.");
throw new Exception("Unable to clear the cache.", ex);
throw;
}
});

2 changes: 1 addition & 1 deletion src/Nexus/Core/NexusClaims.cs
@@ -7,4 +7,4 @@ internal static class NexusClaims
public const string CAN_READ_CATALOG_GROUP = "CanReadCatalogGroup";
public const string CAN_WRITE_CATALOG_GROUP = "CanWriteCatalogGroup";
}
}
}
17 changes: 17 additions & 0 deletions src/Nexus/Extensibility/DataSource/DataSourceController.cs
@@ -1,6 +1,7 @@
using System.Buffers;
using System.Collections.Concurrent;
using System.ComponentModel.DataAnnotations;
using System.Diagnostics;
using System.Reflection;
using System.Text.Json;
using System.Text.Json.Nodes;
@@ -485,6 +486,10 @@ await DataSource.ReadAsync(
progress,
cancellationToken);
}
catch (OutOfMemoryException)
{
throw;
}
catch (Exception ex)
{
Logger.LogError(ex, "Read original data period {Begin} to {End} failed", begin, end);
@@ -647,6 +652,10 @@ await _cacheService.UpdateAsync(
cancellationToken);
}
}
catch (OutOfMemoryException)
{
throw;
}
catch (Exception ex)
{
Logger.LogError(ex, "Read aggregation data period {Begin} to {End} failed", begin, end);
@@ -743,6 +752,10 @@ await DataSource.ReadAsync(
blockSize,
offset);
}
catch (OutOfMemoryException)
{
throw;
}
catch (Exception ex)
{
Logger.LogError(ex, "Read resampling data period {Begin} to {End} failed", roundedBegin, roundedEnd);
@@ -1006,6 +1019,10 @@ await controller.ReadAsync(
dataSourceProgress,
cancellationToken);
}
catch (OutOfMemoryException)
{
throw;
}
catch (Exception ex)
{
logger.LogError(ex, "Process period {Begin} to {End} failed", currentBegin, currentEnd);
1 change: 0 additions & 1 deletion src/Nexus/Nexus.csproj
@@ -5,7 +5,6 @@
</PropertyGroup>

<ItemGroup>
<PackageReference Include="MathNet.Numerics" Version="5.0.0" />
<PackageReference Include="Microsoft.AspNetCore.Authentication.Cookies" Version="2.2.0" />
<PackageReference Include="Microsoft.AspNetCore.Authentication.JwtBearer" Version="7.0.5" />
<PackageReference Include="Microsoft.AspNetCore.Authentication.OpenIdConnect" Version="7.0.5" />
8 changes: 7 additions & 1 deletion src/Nexus/Program.cs
@@ -54,10 +54,16 @@
// architecture
if (!BitConverter.IsLittleEndian)
{
Log.Information("This software runs on little-endian systems.");
Log.Information("This software runs only on little-endian systems.");
return;
}

// memory info
var memoryInfo = GC.GetGCMemoryInfo();

Console.WriteLine($"GC: Total available memory: {memoryInfo.TotalAvailableMemoryBytes / 1024 / 1024} MB");
Console.WriteLine($"GC: High memory load threshold: {memoryInfo.HighMemoryLoadThresholdBytes / 1024 / 1024} MB");

Log.Information("Start host");

var builder = WebApplication.CreateBuilder(args);
8 changes: 5 additions & 3 deletions src/Nexus/Services/DataService.cs
@@ -1,8 +1,8 @@
using System.ComponentModel.DataAnnotations;
using System.Buffers;
using System.ComponentModel.DataAnnotations;
using System.IO.Compression;
using System.IO.Pipelines;
using System.Security.Claims;
using Microsoft.Extensions.Options;
using Nexus.Core;
using Nexus.Extensibility;
using Nexus.Utilities;
@@ -140,7 +140,9 @@ public async Task<ReadOnlyMemory<double>> ReadAsDoubleArrayAsync(
end,
cancellationToken);

var result = new double[stream.Length / 8];
var elementCount = (int)(stream.Length / 8); // TODO is this cast safe?
using var memoryOwner = MemoryPool<double>.Shared.Rent(elementCount);
var result = memoryOwner.Memory.Slice(0, elementCount);
var byteBuffer = new CastMemoryManager<double, byte>(result).Memory;

int bytesRead;
43 changes: 33 additions & 10 deletions src/Nexus/Services/MemoryTracker.cs
@@ -43,8 +43,12 @@ public MemoryTracker(IOptions<DataOptions> dataOptions, ILogger<IMemoryTracker>
{
_dataOptions = dataOptions.Value;
_logger = logger;

_ = Task.Run(MonitorFullGC);
}

internal int Factor { get; set; } = 8;

public async Task<AllocationRegistration> RegisterAllocationAsync(long minimumByteCount, long maximumByteCount, CancellationToken cancellationToken)
{
if (minimumByteCount > _dataOptions.TotalBufferMemoryConsumption)
@@ -58,26 +62,26 @@ public async Task<AllocationRegistration> RegisterAllocationAsync(long minimumBy
// get exclusive access to _consumedBytes and _retrySemaphores
lock (this)
{
var halfOfRemainingBytes = _consumedBytes >= _dataOptions.TotalBufferMemoryConsumption
var fractionOfRemainingBytes = _consumedBytes >= _dataOptions.TotalBufferMemoryConsumption
? 0
: (_dataOptions.TotalBufferMemoryConsumption - _consumedBytes) / 2;
: (_dataOptions.TotalBufferMemoryConsumption - _consumedBytes) / Factor /* normal = 8, tests = 2 */;

long actualByteCount = 0;

if (halfOfRemainingBytes >= maximumByteCount)
if (fractionOfRemainingBytes >= maximumByteCount)
actualByteCount = maximumByteCount;

else if (halfOfRemainingBytes >= minimumByteCount)
actualByteCount = halfOfRemainingBytes;
else if (fractionOfRemainingBytes >= minimumByteCount)
actualByteCount = fractionOfRemainingBytes;

// success
if (actualByteCount > 0)
if (actualByteCount >= minimumByteCount)
{
// remove semaphore from list
if (myRetrySemaphore is not null)
_retrySemaphores.Remove(myRetrySemaphore);

_logger.LogTrace("Allocate {ByteCount} bytes", actualByteCount);
_logger.LogTrace("Allocate {ByteCount} bytes ({MegaByteCount} MB)", actualByteCount, actualByteCount / 1024 / 1024);
SetConsumedBytesAndTriggerWaitingTasks(actualByteCount);

return new AllocationRegistration(this, actualByteCount);
@@ -96,7 +100,7 @@ public async Task<AllocationRegistration> RegisterAllocationAsync(long minimumBy
}

// wait until _consumedBytes changes
_logger.LogTrace("Wait until {ByteCount} bytes are available", minimumByteCount);
_logger.LogTrace("Wait until {ByteCount} bytes ({MegaByteCount} MB) are available", minimumByteCount, minimumByteCount / 1024 / 1024);
await myRetrySemaphore.WaitAsync(timeout: TimeSpan.FromMinutes(1), cancellationToken);
}
}
@@ -106,7 +110,7 @@ public void UnregisterAllocation(AllocationRegistration allocationRegistration)
// get exclusive access to _consumedBytes and _retrySemaphores
lock (this)
{
_logger.LogTrace("Release {ByteCount} bytes", allocationRegistration.ActualByteCount);
_logger.LogTrace("Release {ByteCount} bytes ({MegaByteCount} MB)", allocationRegistration.ActualByteCount, allocationRegistration.ActualByteCount / 1024 / 1024);
SetConsumedBytesAndTriggerWaitingTasks(-allocationRegistration.ActualByteCount);
}
}
@@ -122,7 +126,26 @@ private void SetConsumedBytesAndTriggerWaitingTasks(long difference)
retrySemaphore.Release();
}

_logger.LogTrace("{ByteCount} bytes are currently in use", _consumedBytes);
_logger.LogTrace("{ByteCount} bytes ({MegaByteCount} MB) are currently in use", _consumedBytes, _consumedBytes / 1024 / 1024);
}

private void MonitorFullGC()
{
_logger.LogDebug("Register for full GC notifications");
GC.RegisterForFullGCNotification(1, 1);

while (true)
{
var status = GC.WaitForFullGCApproach();

if (status == GCNotificationStatus.Succeeded)
_logger.LogDebug("Full GC is approaching");

status = GC.WaitForFullGCComplete();

if (status == GCNotificationStatus.Succeeded)
_logger.LogDebug("Full GC has completed");
}
}
}
}
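
The sizing rule changed in `RegisterAllocationAsync` above can be modelled in isolation as follows. `AllocationSizing` and `GetActualByteCount` are hypothetical names for this sketch; locking, semaphores and logging from the real `MemoryTracker` are left out, and `totalBytes` stands in for `_dataOptions.TotalBufferMemoryConsumption`.

```csharp
using System;

// Hypothetical standalone model of the sizing rule, not the Nexus MemoryTracker.
public static class AllocationSizing
{
    // Returns the number of bytes that would be granted. A result below
    // minimumByteCount means the request cannot be served yet and has to wait.
    public static long GetActualByteCount(
        long totalBytes,
        long consumedBytes,
        long minimumByteCount,
        long maximumByteCount,
        int factor)
    {
        // Only a fraction (1 / factor) of the remaining budget is offered per request.
        var fractionOfRemainingBytes = consumedBytes >= totalBytes
            ? 0
            : (totalBytes - consumedBytes) / factor;

        if (fractionOfRemainingBytes >= maximumByteCount)
            return maximumByteCount;

        if (fractionOfRemainingBytes >= minimumByteCount)
            return fractionOfRemainingBytes;

        return 0;
    }
}
```

With a budget of 800 bytes, nothing consumed and a factor of 8, each request is offered at most 100 bytes: a request for 10 to 50 bytes receives 50, a request for 10 to 200 bytes receives 100, and a request with a minimum of 150 bytes has to wait. With the factor of 2 used in the tests, the same request for 150 to 500 bytes would receive 400.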