Merge master to pool licensing feature branch #6185

Merged
Changes from all commits (118 commits)
30501ad
CP-51694: Add testing of C# date converter
danilo-delbusso Sep 30, 2024
dc1ef20
CP-51694: Add testing of Java date deserializer
danilo-delbusso Sep 30, 2024
f0b6322
rrdd: avoid constructing intermediate lists, use Seq
edwintorok Apr 26, 2024
8b2b2d2
CA-391651 - rrd: Remove deprecated member of rra struct
last-genius Oct 11, 2024
cacf52a
CA-391651: Make timestamps of data collectors in xcp-rrdd independent
last-genius Oct 18, 2024
a7bc62d
CA-391651: rrdd_server - read plugins' timestamps, don't just ignore …
last-genius Oct 18, 2024
fdcb386
CA-391651: Propagate the timestamp inside RRD.
last-genius Oct 18, 2024
0a6daf2
CA-391651: Rename 'new_domid' parameter to 'new_rrd'
last-genius Oct 18, 2024
8aa596d
CA-391651 - rrd: Carry indices with datasources
last-genius Oct 18, 2024
a897b53
CA-391651: Use per-datasource last_updated timestamp during updating …
last-genius Oct 25, 2024
7052ddd
CA-391651: rrd - don't iterate over lists needlessly
last-genius Oct 21, 2024
65d16f5
CA-391651: rrdd_monitor - Handle missing datasources by resetting the…
last-genius Oct 23, 2024
8d7c057
CA-391651 - rrd protocol: Stop truncating timestamps to seconds
last-genius Oct 25, 2024
f0d252c
CA-391651 - rrdd.py: Stop truncating timestamps to seconds
last-genius Oct 29, 2024
b3da1c9
CA-391651 rrd: Don't truncate timestamps when calculating values
last-genius Oct 25, 2024
76e317c
CA-391651: Update RRD tests to the new interfaces
last-genius Oct 18, 2024
1d7fe35
CA-391651 - docs: Update RRD design pages
last-genius Oct 29, 2024
801dd96
Increase wait-init-complete timeout
tescande Nov 8, 2024
20f4dcc
CP-51694: Add testing of Go date deserialization
danilo-delbusso Oct 1, 2024
ca5d3b6
Set non-UTC timezone for date time unit test runners
danilo-delbusso Nov 12, 2024
5bcef81
Fix parsing of timezone agnostic date strings in Java deserializer
danilo-delbusso Nov 13, 2024
b81d11e
Ensure C# date tests work when running under any timezone
danilo-delbusso Nov 13, 2024
630aead
CA-402326: Fetch SM records from the pool to avoid race
Vincent-lau Nov 18, 2024
21d6773
Minimize xenstore accesses during domid-to-uuid lookups
last-genius Nov 20, 2024
7c62ede
CA-402326: Fetch SM records from the pool in one go to avoid race (#6…
Vincent-lau Nov 21, 2024
228071a
CP-52524 - dbsync_slave: stop calculating boot time ourselves
last-genius Nov 18, 2024
f0c9b4c
CP-52524: Generate an alert when various host kernel taints are set
last-genius Nov 19, 2024
aaabb6c
xenopsd: Optimize lazy evaluation
last-genius Nov 22, 2024
860843f
CA-402654: Partially revert 3e2e970af
contificate Nov 22, 2024
8f49371
CA-402654: Partially revert 3e2e970af (#6131)
contificate Nov 22, 2024
4083494
CA-402263, xapi_sr_operatrions: don't include all API storage operati…
psafont Nov 21, 2024
2994fcd
CA-402263, xapi_sr_operatrions: don't include all API storage operati…
robhoes Nov 25, 2024
3f59ae1
NUMA docs: Fix typos and extend the intro for the best-effort mode
bernhardkaindl Nov 25, 2024
fc6919e
CP-51772: Remove traceparent from Http.Request
contificate Oct 16, 2024
fe66bc4
CP-51772: Remove external usage of traceparent
contificate Oct 16, 2024
7b95bd6
CP-51772: Add TraceContext to Tracing
contificate Oct 16, 2024
e149817
CP-51772: Add Http Request Propagator
contificate Oct 16, 2024
0d996b3
CP-51772: Extract traceparent back out
contificate Oct 16, 2024
069ca95
CP-51772: Remove tracing dependency from http-lib
contificate Oct 16, 2024
c4962a3
CP-51772: Consolidate propagation into tracing lib
contificate Oct 17, 2024
6dad697
CP-51772: Repair xapi-cli-server's tracing
contificate Oct 17, 2024
673525e
CP-51772: Repair tracing in xapi
contificate Oct 17, 2024
8e20e3e
Restructuring
contificate Oct 21, 2024
4eb7185
CP-51772: Forward baggage from xe-cli
contificate Oct 22, 2024
c5fe9ba
CP-51772: Propagate trace context through spans
contificate Oct 22, 2024
252c05e
Apply fix by psafont: [Xenopsd] chooses NUMA nodes purely based on am…
bernhardkaindl Nov 26, 2024
aa54237
Apply fix by psafont: "Future XAPI versions may change `default_polic…
bernhardkaindl Nov 26, 2024
6f64a78
Remove tracing dependency from http-lib and add baggage (#6065)
robhoes Nov 26, 2024
fce648f
CP-51694: Add date deserialization unit tests for C#/Java/Go (#6027)
danilo-delbusso Nov 26, 2024
536db8c
Minimize xenstore accesses during domid-to-uuid lookups (#6129)
last-genius Nov 26, 2024
eeec845
xe-cli completion: Use grep -E instead of egrep
last-genius Nov 28, 2024
7ada734
CA-388210: factor out computing the domain parameter
edwintorok Aug 7, 2024
864d734
CA-388210: SMAPIv3 concurrency safety: send the (unique) datapath arg…
edwintorok Aug 7, 2024
d5cd034
CA-388210: SMAPIv3 debugging: log PID
edwintorok Aug 7, 2024
ce82302
CA-388210: SMAPIv3 concurrency: turn on concurrent operations by default
edwintorok Aug 7, 2024
4e3fd42
CA-388210: enable SMAPIv3 concurrent operations by default (#6141)
edwintorok Dec 2, 2024
4c97bfb
Improve Delay test
freddy77 Dec 1, 2024
78d6df3
CP-42675: add new SM GC message ID
MarkSymsCtx Nov 12, 2024
5007436
CP-42675: add new SM GC message ID (#6145)
robhoes Dec 2, 2024
21d1156
CA-403101: Keep host.last_update_hash for host joined a pool
gangj Nov 29, 2024
562ae78
CA-403101: Keep host.last_update_hash for host joined a pool (#6142)
gangj Dec 3, 2024
0772473
xapi_message: Fix incorrect slow path invocation (and its logs)
last-genius Dec 3, 2024
0cd32dd
Improve Delay test (#6143)
psafont Dec 3, 2024
e2f96bf
xapi_message: Fix incorrect slow path invocation (and its logs) (#6147)
robhoes Dec 3, 2024
b782202
CP-52524: Generate an alert when various host kernel taints are set (…
last-genius Dec 3, 2024
9e3ad1c
xapi: move the 'periodic' scheduler to xapi-stdext-threads
psafont Nov 27, 2024
cd98e9d
Check index before using it removing an element from Imperative prior…
freddy77 Nov 26, 2024
414deef
Fix removing elements from Imperative priority queue
freddy77 Nov 27, 2024
dab8684
Remove possible systematic leak in Imperative priority queue
freddy77 Nov 26, 2024
4f5bb1a
Add test for is_empty for Imperative priority queue
freddy77 Nov 26, 2024
1a8a1f3
Move and improve old test for Imperative priority queue
freddy77 Nov 27, 2024
3202058
Initialise Imperative priority queue array on creation
freddy77 Nov 28, 2024
87ca90e
CA-399757: Add CAS style check for SR scan
Vincent-lau Oct 1, 2024
b12a6b0
CA-399757: Add CAS style check for SR scan (#6113)
robhoes Dec 3, 2024
685ba39
Start moving Xapi periodic scheduler to an independent library (#6139)
robhoes Dec 3, 2024
6d275ae
xapi-stdext-threads: use mtime.clock.os
psafont Dec 3, 2024
95fc08f
xapi-stdext-threads: use mtime.clock.os (#6149)
robhoes Dec 3, 2024
8971c8d
CA-391651: Fix spike in derived RRD metrics (#6086)
last-genius Dec 4, 2024
c85270a
Remove unused ocaml/perftest
robhoes Dec 4, 2024
2b02240
Remove references to perftest
robhoes Dec 4, 2024
c141b6d
`new-docs`/NUMA: Fix typos and simplify some parts (#6134)
psafont Dec 4, 2024
fd011dd
Update quality-gate
robhoes Dec 4, 2024
26b6ed6
Remove ocaml/perftest (#6156)
robhoes Dec 4, 2024
03fccc8
CA-401075: remove misleading logs from HTTP client
robhoes Dec 4, 2024
efff095
CP-52807: No more cluster stack alert
Vincent-lau Dec 4, 2024
9998658
CA-401075: remove misleading logs from HTTP client (#6158)
robhoes Dec 4, 2024
92114cd
Rewrite Delay module
freddy77 Dec 1, 2024
d3c9a50
CA-394851: Update allowed operations on the cloned VBD
last-genius Dec 5, 2024
0b47acd
CP-52807: No more cluster stack alert (#6160)
Vincent-lau Dec 5, 2024
1586c74
CP-51429 Avoid redundant processing when full metadata already exists…
LunfanZhang Nov 25, 2024
84e7394
CP-51429 Avoid redundant processing when full metadata already exists…
minglumlu Dec 6, 2024
7fdaf4f
CA-394851: Update allowed operations on the cloned VBD (#6159)
last-genius Dec 6, 2024
f936acb
Delay: wait a bit more testing the module
freddy77 Dec 6, 2024
87927f1
Rewrite Delay module (#6144)
psafont Dec 6, 2024
4d3c669
Increase wait-init-complete timeout (#6109)
psafont Dec 6, 2024
f14fcdf
Delay: wait a bit more testing the module (#6162)
robhoes Dec 6, 2024
3476a22
Simple test for periodic scheduler
freddy77 Nov 28, 2024
6249261
Limit mutex contention in add_to_queue
freddy77 Dec 6, 2024
f86c076
Compare correctly Mtime.t
freddy77 Dec 6, 2024
2950dd9
Protect queue with mutex in remove_from_queue
freddy77 Dec 6, 2024
529eeaa
Remove signal parameter from add_to_queue
freddy77 Dec 6, 2024
2c192c9
Fix multiple issues in periodic scheduler
freddy77 Nov 26, 2024
935c84f
Add test for removing periodic event in periodic scheduler
freddy77 Nov 28, 2024
60e1257
Add test for handling event if queue was empty in periodic scheduler
freddy77 Nov 28, 2024
21b56b4
xapi_sr: remove commented code from 2009
psafont Dec 5, 2024
88dd4d9
Add a test to check the loop is woken up adding a new event
freddy77 Dec 9, 2024
098546a
CA-390025: do not override SR's client-set metadata on update
psafont Dec 5, 2024
ea46f81
xe-cli completion: Hide COMPREPLY manipulation behind functions
last-genius Dec 10, 2024
2c9c9e7
xe-cli completion bugfixes (#6166)
psafont Dec 10, 2024
d8baca7
CA-390025: do not override SR's client-set metadata on update (#6165)
psafont Dec 10, 2024
3e70a6d
Improve the scan comparison logic
Vincent-lau Dec 10, 2024
9ad4626
CA-402901: Update leaked dp to Sr
changlei-li Dec 10, 2024
8941c9d
Test and improve Xapi periodic scheduler (#6155)
psafont Dec 11, 2024
309e7f6
CA-402901: Update leaked dp to Sr (#6169)
robhoes Dec 11, 2024
4f3f08f
Improve the scan comparison logic (#6168)
Vincent-lau Dec 11, 2024
a540ac8
CA-403633: Keep vPCI devices in the same order
psafont Dec 12, 2024
9aa2baa
CA-403633: Keep vPCI devices in the same order (#6176)
psafont Dec 12, 2024
0b15ba2
Merge branch 'master' into private/mingl/merge_master_to_feature
minglumlu Dec 13, 2024
32 changes: 32 additions & 0 deletions .github/workflows/generate-and-build-sdks.yml
@@ -24,6 +24,14 @@ jobs:
shell: bash
run: opam exec -- make sdk

# sdk-ci runs some Go unit tests.
# This setting ensures that SDK date time
# tests are run on a machine that
# isn't using UTC
- name: Set Timezone to Tokyo for datetime tests
run: |
sudo timedatectl set-timezone Asia/Tokyo

- name: Run CI for SDKs
uses: ./.github/workflows/sdk-ci

@@ -54,6 +62,7 @@ jobs:
path: |
_build/install/default/share/go/*
!_build/install/default/share/go/dune
!_build/install/default/share/go/**/*_test.go

- name: Store Java SDK source
uses: actions/upload-artifact@v4
@@ -110,6 +119,14 @@
java-version: '17'
distribution: 'temurin'

# Java Tests are run at compile time.
# This setting ensures that SDK date time
# tests are run on a machine that
# isn't using UTC
- name: Set Timezone to Tokyo for datetime tests
run: |
sudo timedatectl set-timezone Asia/Tokyo

- name: Build Java SDK
shell: bash
run: |
@@ -138,6 +155,21 @@ jobs:
name: SDK_Source_CSharp
path: source/

# All tests builds and pipelines should
# work on other timezones. This setting ensures that
# SDK date time tests are run on a machine that
# isn't using UTC
- name: Set Timezone to Tokyo for datetime tests
shell: pwsh
run: Set-TimeZone -Id "Tokyo Standard Time"

- name: Test C# SDK
shell: pwsh
run: |
dotnet test source/XenServerTest `
--disable-build-servers `
--verbosity=normal

- name: Build C# SDK
shell: pwsh
run: |
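The motivation for switching the runners to a non-UTC timezone can be sketched outside the SDKs (a stand-alone Python illustration, not the C#/Java/Go code under test; the date format and use of `time.tzset`, which is Unix-only, are assumptions): converting a timezone-agnostic date string to an epoch silently depends on the machine's local timezone, so running the date tests only on UTC runners would mask exactly this class of bug.

```python
import os
import time
from datetime import datetime

def naive_string_to_epoch(s: str) -> float:
    """Parse a timezone-agnostic string; time.mktime interprets the
    result in the *local* timezone of the machine running the test."""
    return time.mktime(datetime.strptime(s, "%Y%m%dT%H:%M:%S").timetuple())

# The same string yields different epochs under different local timezones.
os.environ["TZ"] = "UTC"
time.tzset()
utc_epoch = naive_string_to_epoch("20240930T12:00:00")

os.environ["TZ"] = "Asia/Tokyo"  # UTC+9, as set on the CI runners above
time.tzset()
tokyo_epoch = naive_string_to_epoch("20240930T12:00:00")

# Tokyo local noon is 9 hours earlier in UTC than UTC noon.
assert utc_epoch - tokyo_epoch == 9 * 3600
```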
5 changes: 5 additions & 0 deletions .github/workflows/go-ci/action.yml
@@ -14,6 +14,11 @@ runs:
working-directory: ${{ github.workspace }}/_build/install/default/share/go/src
args: --config=${{ github.workspace }}/.golangci.yml

- name: Run Go Tests
shell: bash
working-directory: ${{ github.workspace }}/_build/install/default/share/go/src
run: go test -v

- name: Run CI for Go SDK
shell: bash
run: |
9 changes: 3 additions & 6 deletions doc/content/design/plugin-protocol-v2.md
@@ -20,7 +20,7 @@ DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
@@ -62,7 +62,7 @@ reported datasources.
### Example
```
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
@@ -96,7 +96,7 @@ Protocol V2
|data checksum |32 |int32 |binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values|
|metadata checksum |32 |int32 |binary-encoded crc32 of the metadata string (see below) |
|number of datasources|32 |int32 |only needed if the metadata has changed - otherwise RRDD can use a cached value |
|timestamp |64 |int64 |Unix epoch |
|timestamp |64 |double|Unix epoch |
|datasource values |n * 64 |int64 \| double |n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64\|float] |
|metadata length |32 |int32 | |
|metadata |(string length)*8|string| |
@@ -193,6 +193,3 @@ This means that for a normal update, RRDD will only have to read the header plus
the first (16 + 16 + 4 + 8 + 8*n) bytes of data, where n is the number of
datasources exported by the plugin. If the metadata changes RRDD will have to
read all the data (and parse the metadata).

n.b. the timestamp reported by plugins is not currently used by RRDD - it uses
its own global timestamp.
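The effect of the `int64` → `double` change on a protocol consumer can be sketched in Python (a hypothetical reader, assuming big-endian encoding and only the fields listed in the table above; the real framing, including the leading header string and checksum validation, follows the full protocol spec):

```python
import struct

def parse_v2_payload(buf: bytes):
    """Sketch of reading the Protocol V2 fields from the table above:
    two CRC32 checksums, a datasource count, a timestamp that is now
    an IEEE-754 double (previously int64), n 64-bit values, then
    length-prefixed metadata. Header magic/framing is omitted."""
    off = 0
    data_crc, meta_crc, n = struct.unpack_from(">iiI", buf, off)
    off += 12
    # Timestamp is a big-endian double since this change, so
    # sub-second precision survives the round trip.
    (timestamp,) = struct.unpack_from(">d", buf, off)
    off += 8
    # Values may be int64 or double depending on value_type in the
    # metadata; int64 is assumed here for simplicity.
    values = struct.unpack_from(">" + "q" * n, buf, off)
    off += 8 * n
    (meta_len,) = struct.unpack_from(">I", buf, off)
    off += 4
    metadata = buf[off:off + meta_len].decode()
    return timestamp, values, metadata
```

With the previous version of the protocol, the same field would have been unpacked as `>q` (int64), truncating the `.245` fractional part of a timestamp such as `1339685573.245`.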
72 changes: 46 additions & 26 deletions doc/content/toolstack/features/NUMA/index.md
@@ -49,7 +49,7 @@ There is also I/O NUMA where a cost is similarly associated to where a PCIe is p

NUMA does have advantages though: if each node accesses only its local memory, then each node can independently achieve maximum throughput.

For best performance we should:
For best performance, we should:
- minimize the amount of interconnect bandwidth we are using
- run code that accesses memory allocated on the closest NUMA node
- maximize the number of NUMA nodes that we use in the system as a whole
@@ -62,39 +62,59 @@ The Xen scheduler supports 2 kinds of constraints:
* hard pinning: a vCPU may only run on the specified set of pCPUs and nowhere else
* soft pinning: a vCPU is *preferably* run on the specified set of pCPUs, but if they are all busy then it may run elsewhere

The former is useful if you want strict separation, but it can potentially leave part of the system idle while another part is bottlenecked with lots of vCPUs all competing for the same limited set of pCPUs.
Hard pinning can be used to partition the system. But, it can potentially leave part of the system idle while another part is bottlenecked by many vCPUs competing for the same limited set of pCPUs.

Xen does not migrate workloads between NUMA nodes on its own (the Linux kernel does), although it is possible to achieve a similar effect with explicit migration.
However migration introduces additional delays and is best avoided for entire VMs.
Xen does not migrate workloads between NUMA nodes on its own (the Linux kernel can). Although, it is possible to achieve a similar effect with explicit migration.
However, migration introduces additional delays and is best avoided for entire VMs.

The latter (soft pinning) is preferred: running a workload now, even on a potentially suboptimal pCPU (higher NUMA latency) is still better than not running it at all and waiting until a pCPU is freed up.
Therefore, soft pinning is preferred: Running on a potentially suboptimal pCPU that uses remote memory could still be better than not running it at all until a pCPU is free to run it.

Xen will also allocate memory for the VM according to the vCPU (soft) pinning: if the vCPUs are pinned only to NUMA nodes A and B, then it will allocate the VM's memory from NUMA nodes A and B (in a round-robin way, resulting in interleaving).
Xen will also allocate memory for the VM according to the vCPU (soft) pinning: If the vCPUs are pinned to NUMA nodes A and B, Xen allocates memory from NUMA nodes A and B in a round-robin way, resulting in interleaving.

By default (no pinning) it will interleave memory from all NUMA nodes, which provides average performance, but individual tasks' performance may be significantly higher or lower depending on which NUMA node the application may have "landed" on.
Furthermore restarting processes will speed them up or slow them down as address space randomization picks different memory regions inside a VM.
### Current default: No vCPU pinning

By default, when no vCPU pinning is used, Xen interleaves memory from all NUMA nodes. This averages the memory performance, but individual tasks' performance may be significantly higher or lower depending on which NUMA node the application may have "landed" on.
As a result, restarting processes will speed them up or slow them down as address space randomization picks different memory regions inside a VM.

This uses the memory bandwidth of all memory controllers and distributes the load across all nodes.
However, the memory latency is higher as the NUMA interconnects are used for most memory accesses and vCPU synchronization within the Domains.

Note that this is not the worst case: the worst case would be for memory to be allocated on one NUMA node, but the vCPU always running on the furthest away NUMA node.

## Best effort NUMA-aware memory allocation for VMs

By default Xen stripes the VM's memory accross all NUMA nodes of the host, which means that every VM has to go through all the interconnects.

### Summary

The best-effort mode attempts to fit Domains into NUMA nodes and to balance memory usage.
It soft-pins Domains on the NUMA node with the most available memory when adding the Domain.
Memory is currently allocated when booting the VM (or while constructing the resuming VM).

Parallel boot issue: Memory is not pre-allocated on creation, but allocated during boot.
The result is that parallel VM creation and boot can exhaust the memory of NUMA nodes.

### Goals

By default, Xen stripes the VM's memory across all NUMA nodes of the host, which means that every VM has to go through all the interconnects.
The goal here is to find a better allocation than the default, not necessarily an optimal allocation.
An optimal allocation would require knowing what VMs you would start/create in the future, and planning across hosts too.
An optimal allocation would require knowing what VMs you would start/create in the future, and planning across hosts.
This allows the host to use all NUMA nodes to take advantage of the full memory bandwidth available on the pool hosts.

Overall we want to balance the VMs across NUMA nodes, such that we use all NUMA nodes to take advantage of the maximum memory bandwidth available on the system.
Overall, we want to balance the VMs across NUMA nodes, such that we use all NUMA nodes to take advantage of the maximum memory bandwidth available on the system.
For now this proposed balancing will be done only by balancing memory usage: always heuristically allocating VMs on the NUMA node that has the most available memory.
Note that this allocation has a race condition for now when multiple VMs are booted in parallel, because we don't wait until Xen has constructed the domain for each one (that'd serialize domain construction, which is currently parallel).
For now, this allocation has a race condition: This happens when multiple VMs are booted in parallel, because we don't wait until Xen has constructed the domain for each one (that'd serialize domain construction, which is currently parallel).
This may be improved in the future by having an API to query Xen where it has allocated the memory, and to explicitly ask it to place memory on a given NUMA node (instead of best_effort).

If a VM doesn't fit into a single node then it is not so clear what the best approach is.
One criteria to consider is minimizing the NUMA distance between the nodes chosen for the VM.
Large NUMA systems may not be fully connected in a mesh requiring multiple hops to each a node, or even have assymetric links, or links with different bitwidth.
These tradeoff should be approximatively reflected in the ACPI SLIT tables, as a matrix of distances between nodes.
Large NUMA systems may not be fully connected in a mesh, requiring multiple hops to reach a node, or even have asymmetric links, or links with different bandwidth.
The specific NUMA topology is provided by the ACPI SLIT table as the matrix of distances between nodes.
It is possible that 3 NUMA nodes have a smaller average/maximum distance than 2, so we need to consider all possibilities.

For N nodes there would be 2^N possibilities, so [Topology.NUMA.candidates] limits the number of choices to 65520+N (full set of 2^N possibilities for 16 NUMA nodes, and a reduced set of choices for larger systems).
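The enumeration can be sketched in Python (a toy stand-in for the OCaml `Topology.NUMA.candidates`, with a made-up 4-node SLIT-style distance matrix; the real code also caps the candidate count as described above):

```python
from itertools import combinations

def candidates(slit):
    """Enumerate every non-empty subset of NUMA nodes and sort by
    (maximum, average) pairwise distance - smaller is better.
    For N nodes this yields 2^N - 1 sets, which is why the real
    implementation limits the search on very large systems."""
    n = len(slit)
    ranked = []
    for k in range(1, n + 1):
        for nodes in combinations(range(n), k):
            dists = [slit[a][b] for a in nodes for b in nodes]
            ranked.append((max(dists), sum(dists) / len(dists), nodes))
    ranked.sort()
    return ranked

# Made-up SLIT-style matrix: nodes 0-2 are close together, node 3 is remote.
slit = [
    [10, 12, 12, 40],
    [12, 10, 12, 40],
    [12, 12, 10, 40],
    [40, 40, 40, 10],
]
ranked = candidates(slit)
best_three = next(c for c in ranked if c[2] == (0, 1, 2))
two_with_remote = next(c for c in ranked if c[2] == (0, 3))
# A 3-node set can rank better than a 2-node set, as noted above.
assert ranked.index(best_three) < ranked.index(two_with_remote)
```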

### Implementation

[Topology.NUMA.candidates] is a sorted sequence of node sets, in ascending order of maximum/average distances.
Once we've eliminated the candidates not suitable for this VM (that do not have enough total memory/pCPUs) we are left with a monotonically increasing sequence of nodes.
There are still multiple possibilities with same average distance.
@@ -110,19 +130,19 @@ See page 13 in [^AMD_numa] for a diagram of an AMD Opteron 6272 system.

* Booting multiple VMs in parallel will result in potentially allocating both on the same NUMA node (race condition)
* When we're about to run out of host memory we'll fall back to striping memory again, but the soft affinity mask won't reflect that (this needs an API to query Xen on where it has actually placed the VM, so we can fix up the mask accordingly)
* XAPI is not aware of NUMA balancing across a pool, and choses hosts purely based on total amount of free memory, even if a better NUMA placement could be found on another host
* XAPI is not aware of NUMA balancing across a pool. Xenopsd chooses NUMA nodes purely based on amount of free memory on the NUMA nodes of the host, even if a better NUMA placement could be found on another host
* Very large (>16 NUMA nodes) systems may only explore a limited number of choices (fit into a single node vs fallback to full interleaving)
* The exact VM placement is not yet controllable
* Microbenchmarks with a single VM on a host show both performance improvements and regressions on memory bandwidth usage: previously a single VM may have been able to take advantage of the bandwidth of both NUMA nodes if it happened to allocate memory from the right places, whereas now it'll be forced to use just a single node.
As soon as you have more than 1 VM that is busy on a system enabling NUMA balancing should almost always be an improvement though.
* it is not supported to combine hard vCPU masks with soft affinity: if hard affinities are used then no NUMA scheduling is done by the toolstack and we obey exactly what the user has asked for with hard affinities.
* It is not supported to combine hard vCPU masks with soft affinity: if hard affinities are used, then no NUMA scheduling is done by the toolstack, and we obey exactly what the user has asked for with hard affinities.
This shouldn't affect other VMs since the memory used by hard-pinned VMs will still be reflected in overall less memory available on individual NUMA nodes.
* Corner case: the ACPI standard allows certain NUMA nodes to be unreachable (distance `0xFF` = `-1` in the Xen bindings).
This is not supported and will cause an exception to be raised.
If this is an issue in practice the NUMA matrix could be pre-filtered to contain only reachable nodes.
NUMA nodes with 0 CPUs *are* accepted (it can result from hard affinity pinnings)
NUMA nodes with 0 CPUs *are* accepted (it can result from hard affinity pinning)
* NUMA balancing is not considered during HA planning
* Dom0 is a single VM that needs to communicate with all other VMs, so NUMA balancing is not applied to it (we'd need to expose NUMA topology to the Dom0 kernel so it can better allocate processes)
* Dom0 is a single VM that needs to communicate with all other VMs, so NUMA balancing is not applied to it (we'd need to expose NUMA topology to the Dom0 kernel, so it can better allocate processes)
* IO NUMA is out of scope for now

## XAPI datamodel design
@@ -139,7 +159,7 @@ Meaning of the policy:
* `best_effort`: the algorithm described in this document, where soft pinning is used to achieve better balancing and lower latency
* `default_policy`: when the admin hasn't expressed a preference

* Currently `default_policy` is treated as `any`, but the admin can change it, and then the system will remember that change across upgrades.
* Currently, `default_policy` is treated as `any`, but the admin can change it, and then the system will remember that change across upgrades.
If we didn't have a `default_policy` then changing the "default" policy on an upgrade would be tricky: we either risk overriding an explicit choice of the admin, or existing installs cannot take advantage of the improved performance from `best_effort`
* Future XAPI versions may change `default_policy` to mean `best_effort`.
Admins can still override it to `any` if they wish on a host by host basis.
@@ -149,7 +169,7 @@ It is not expected that users would have to change `best_effort`, unless they ru
There is also no separate feature flag: this host flag acts as a feature flag that can be set through the API without restarting the toolstack.
Although obviously only new VMs will benefit.

Debugging the allocator is done by running `xl vcpu-list` and investigating the soft pinning masks, and by analyzing xensource.log.
Debugging the allocator is done by running `xl vcpu-list` and investigating the soft pinning masks, and by analyzing `xensource.log`.

### Xenopsd implementation

@@ -166,18 +186,18 @@ This avoids exponential state space explosion on very large systems (>16 NUMA no
* [Topology.NUMA.choose] will choose one NUMA node deterministically, while trying to keep overall NUMA node usage balanced.
* [Domain.numa_placement] builds a [NUMARequest] and uses the above [Topology] and [Softaffinity] functions to compute and apply a plan.

We used to have a `xenopsd.conf` configuration option to enable numa placement, for backwards compatibility this is still supported, but only if the admin hasn't set an explicit policy on the Host.
We used to have a `xenopsd.conf` configuration option to enable NUMA placement, for backwards compatibility this is still supported, but only if the admin hasn't set an explicit policy on the Host.
It is best to remove the experimental `xenopsd.conf` entry though, a future version may completely drop it.

Tests are in [test_topology.ml] which checks balancing properties and whether the plan has improved best/worst/average-case access times in a simulated test based on 2 predefined NUMA distance matrixes (one from Intel and one from an AMD system).

## Future work

* enable 'best_effort' mode by default once more testing has been done
* an API to query Xen where it has actually allocated the VM's memory.
Currently only an `xl debug-keys` interface exists which is not supported in production as it can result in killing the host via the watchdog, and is not a proper API, but a textual debug output with no stability guarantees.
* more host policies (e.g. `strict`).
Requires the XAPI pool scheduler to be NUMA aware and consider it as part of chosing hosts.
* Enable 'best_effort' mode by default once more testing has been done
* Add an API to query Xen for the NUMA node memory placement (where it has actually allocated the VM's memory).
Currently, only the `xl debug-keys` interface exists which is not supported in production as it can result in killing the host via the watchdog, and is not a proper API, but a textual debug output with no stability guarantees.
* More host policies, e.g. `strict`.
Requires the XAPI pool scheduler to be NUMA aware and consider it as part of choosing hosts.
* VM level policy that can set a NUMA affinity index, mapped to a NUMA node modulo NUMA nodes available on the system (this is needed so that after migration we don't end up trying to allocate vCPUs to a non-existent NUMA node)
* VM level anti-affinity rules for NUMA placement (can be achieved by setting unique NUMA affinity indexes)
