
Merge master to pool licensing feature branch #6185

Conversation

minglumlu (Member)

No description provided.

danilo-delbusso and others added 30 commits October 1, 2024 11:25
This enables extending the type without causing performance issues,
and should reduce the work for the garbage collector.

Signed-off-by: Edwin Török <[email protected]>
Instead of getting one timestamp for all collectors, get them closer to the
actual measurement time.

Signed-off-by: Andrii Sultanov <[email protected]>
Instead of timestamps taken at read time in xcp-rrdd, propagate timestamps
taken by plugins (or collectors in xcp-rrdd itself) and use these when
updating values.

Also process datasources without flattening them all into one list.

This allows datasources with the same timestamp (provided by the plugin or the
dom0 collector in xcp-rrdd) to be processed at once.

Co-authored-by: Edwin Török <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
Signed-off-by: Andrii Sultanov <[email protected]>
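The per-plugin timestamp grouping described above could be sketched as follows. This is a minimal stdlib-only illustration; `group_by_timestamp` and the string datasource names are hypothetical, not xapi's actual API:

```ocaml
(* Batch datasources by the timestamp their plugin reported, so each
   batch can be applied to the RRD in one update. *)
let group_by_timestamp dss =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (ts, ds) ->
      let prev = try Hashtbl.find tbl ts with Not_found -> [] in
      Hashtbl.replace tbl ts (ds :: prev))
    dss ;
  Hashtbl.fold (fun ts dss acc -> (ts, List.rev dss) :: acc) tbl []
  |> List.sort compare

let () =
  assert (
    group_by_timestamp [(1.0, "cpu"); (2.0, "mem"); (1.0, "net")]
    = [(1.0, ["cpu"; "net"]); (2.0, ["mem"])] )
```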
RRDs can have different owners, and new ones can show up without new domids
being involved (SRs, VMs are also owner types).

Signed-off-by: Andrii Sultanov <[email protected]>
ds_update* functions previously relied on being called with the entire set of
datasources of a given RRD. Instead, operate on chunks of datasources
provided by the same plugin. These chunks are not contiguous (a more detailed
explanation is in the comments in the code), so indices into rrd_dss and the
associated structures need to be carried with each datasource.

Also turns the value*transform tuple into a type for more explicit accesses.

Signed-off-by: Andrii Sultanov <[email protected]>
…and archiving

This ensures that the interval between the current timestamp and previous
update is calculated correctly.

Keep the same external interface through rrdd_http_handler by adding an
'internal' parameter to xml exporting functions.

Signed-off-by: Andrii Sultanov <[email protected]>
Add an "unsafe" version of rrd_add_ds to avoid checking twice whether the
datasource already exists.

Signed-off-by: Andrii Sultanov <[email protected]>
…m explicitly

If an RRD's datasource disappears and is not reported by the plugin (or collected
by xcp-rrdd), then it's not going to be refreshed, keeping its old value and
timestamp. Previously, this was handled nicely because the whole RRD was
updated at once, and updates were defaulted to VT_Unknown, Identity before
being assigned actually collected values. After RRD collection was changed to
process datasources in chunks per source, we need to explicitly find the ones
that weren't updated on this iteration and reset them.

(This is not just a theoretical exercise: it can happen when a VIF gets
unplugged or destroyed, for example, with its stats not being reset and
continuing to display old values.)

Signed-off-by: Andrii Sultanov <[email protected]>
Instead of writing Int64 (truncated from timestamp floats) into the
memory-mapped files, keep the precision of the timestamp.

Signed-off-by: Andrii Sultanov <[email protected]>
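A small illustration of the precision loss this commit avoids (the timestamp value is arbitrary): truncating a float timestamp to Int64 drops the sub-second component.

```ocaml
(* Int64.of_float truncates toward zero, discarding the fraction. *)
let ts = 1700000000.75
let truncated = Int64.of_float ts

let () =
  assert (Int64.to_float truncated = 1700000000.) ;
  (* the 0.75 s of precision would be lost on the way to the mmap file *)
  assert (ts -. Int64.to_float truncated = 0.75)
```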
RRDD, aside from the OCaml library, exposes two interfaces. This changes the
Python one. The C one lives in rrd-client-lib and needs to be changed at the
same time, with the update coordinated.

Signed-off-by: Andrii Sultanov <[email protected]>
When a host starts, the systemd service xapi-wait-init-complete waiting
on the creation of the xapi init cookie file may fail on timeout for a
matter of seconds. This patch adds 1 minute (300 seconds total) to the
timeout passed to the script xapi-wait-init-complete.

Signed-off-by: Thierry Escande <[email protected]>
Without the change, the tests only pass when running on a runner with UTC as a timezone

Signed-off-by: Danilo Del Busso <[email protected]>
Previously the SM feature check was done in two parts: fetch all the SM
refs from the coordinator, and then fetch their information, such as types
and features, from the coordinator based on the refs. This can cause race
conditions, where the previously fetched refs might have been deleted
by the time we fetch the SM features. The deletion might happen due to the way
SM registration works for shared SRs such as GFS2, where each
`PBD.plug` will trigger a deregister of the existing SM and a re-register
(which will delete existing SMs), and create another (very likely) identical
SM.

To avoid this race, instead of fetching SM refs and their features
separately, do this in one go so we get a consistent snapshot of the db
state.

Also add a bit more debugging messages.

Signed-off-by: Vincent Liu <[email protected]>
Bake in assumptions that have been constant ever since xenstore was created:
getdomainpath always returns "/local/domain/<domid>", and /local/domain/<domid>/vm
returns "/vm/<uuid>", so there's no need to look up that path to get the uuid
again.

This reduces the number of xenstore accesses from 3 to 1 with no functional
change.

As suggested in: xapi-project#6068 (review)

Signed-off-by: Andrii Sultanov <[email protected]>
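The baked-in invariants amount to constructing the paths directly instead of reading them back from xenstore. A sketch of the path shapes only; `domain_path` and `vm_path` are illustrative helpers, not the code in this PR:

```ocaml
(* The two path shapes that have been stable since xenstore was created. *)
let domain_path domid = Printf.sprintf "/local/domain/%d" domid
let vm_path uuid = Printf.sprintf "/vm/%s" uuid

let () =
  assert (domain_path 5 = "/local/domain/5") ;
  assert (vm_path "0af2e15e" = "/vm/0af2e15e")
```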
…pi-project#6127)

Previously the SM feature check was done in two parts: fetch all the SM
refs from the coordinator, and then fetch their information, such as types
and features, from the coordinator based on the refs. This can cause race
conditions, where the previously fetched refs might have been deleted
by the time we fetch the SM features. The deletion might happen due to the way
SM registration works for shared SRs such as GFS2, where each
`PBD.plug` will trigger a deregister of the existing SM and a re-register
(which will delete existing SMs), and create another (very likely)
identical SM.

To avoid this race, instead of fetching SM refs and their features
separately, do this in one go so we get a consistent snapshot of the db
state.
Completes a 15+ year-old TODO, at the expense of removing an ultimate
example of a "not invented here" attitude.

Signed-off-by: Andrii Sultanov <[email protected]>
Issue an alert about a broken host kernel if bits 4, 5, 7, 9, or 14 are set in
/proc/sys/kernel/tainted, indicating that some kind of error was encountered and
the future behaviour of the kernel might no longer be predictable or safe
(though it generally should recover reasonably).

Only one alert per tainted bit per boot can be present (more than one can be
issued if the user dismisses the alerts and restarts the toolstack).

Distinguish between Major (4, 5, 7 - these are all things that might cause a
host crash, but are unlikely to corrupt whatever data has been written out) and
Warning (9, 14 - might be a concern and could be raised to Support, but usually
not severe enough to crash the host) levels of errors, as
suggested by the Foundations team.

This should serve as an indicator during issue investigation to look for the
cause of the taint.

Signed-off-by: Andrii Sultanov <[email protected]>
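The classification described above could be sketched like this. The bit numbers come from the commit message; `classify` and the variant names are illustrative, not the PR's actual code:

```ocaml
(* Test whether bit [b] is set in the /proc/sys/kernel/tainted mask. *)
let bit_set mask b = Int64.logand mask (Int64.shift_left 1L b) <> 0L

(* Map each monitored tainted bit to an alert severity. *)
let classify mask =
  List.filter (bit_set mask) [4; 5; 7; 9; 14]
  |> List.map (fun b -> (b, if b = 9 || b = 14 then `Warning else `Major))

let () =
  (* mask with bits 7 and 9 set: 2^7 + 2^9 = 640 *)
  assert (classify 640L = [(7, `Major); (9, `Warning)])
```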
The manual notes that `Lazy.from_fun` "should only be used if the function f
is already defined. In particular it is always less efficient to write
`from_fun (fun () -> expr)` than `lazy expr`."

So, replace `Lazy.from_fun` with `lazy` in this particular case.

Compare the lambda dump for `lazy (fun () -> [| 1;2;3 |] |> Array.map (fun x -> x+1))`:
```
(seq
  (let
    (l/269 =
       (function param/319[int]
         (apply (field_imm 12 (global Stdlib__Array!))
           (function x/318[int] : int (+ x/318 1)) (makearray[int] 1 2 3))))
    (setfield_ptr(root-init) 0 (global Main!) l/269))
  0)
```

with the lambda dump of the `let x = Lazy.from_fun (fun () -> [| 1;2;3 |] |>
Array.map (fun x -> x+1))`:
```
(seq
  (let
    (x/269 =
       (apply (field_imm 5 (global Stdlib__Lazy!))
         (function param/332[int]
           (apply (field_imm 12 (global Stdlib__Array!))
             (function x/331[int] : int (+ x/331 1)) (makearray[int] 1 2 3)))))
    (setfield_ptr(root-init) 0 (global Main!) x/269))
  0)
```

See: https://patricoferris.github.io/js_of_ocamlopt/#code=bGV0IGwgPSBsYXp5IChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQoKbGV0IHggPSBMYXp5LmZyb21fZnVuIChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQ%3D%3D

Signed-off-by: Andrii Sultanov <[email protected]>
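Both forms build an equivalent suspension; as the lambda dumps show, only the generated code differs. A minimal value-level check (the array example is illustrative):

```ocaml
(* Same observable behaviour; [lazy] avoids the extra closure and the
   call into Stdlib.Lazy visible in the second lambda dump. *)
let a = lazy (Array.map succ [|1; 2; 3|])
let b = Lazy.from_fun (fun () -> Array.map succ [|1; 2; 3|])

let () =
  assert (Lazy.force a = [|2; 3; 4|]) ;
  assert (Lazy.force b = [|2; 3; 4|])
```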
A module binding appeared to be unused but was being evaluated for its
effect alone. We reintroduce it here and don't bind a name.

Signed-off-by: Colin James <[email protected]>
last-genius and others added 27 commits December 6, 2024 08:49
…6159)

Default empty `allowed_operations` on a cloned VBD means that XenCenter
does not display the DVD option in the console tab for VMs cloned from
templates, for example.

Follow the practice in `xapi_vbd`, and `update_allowed_operations`
immediately after `Db.VBD.create`.

---

I've tested this fix and XC displays the ISO selection on the cloned VM
now, and allowed operations on the cloned VBDs are correct as well.
In some environments the time ranges checked are too strict, causing tests to
fail.
Previously the maximum accepted error was 10 ms; increase it to 50 ms.
Also increase timeouts to reduce the error/value ratio.

Signed-off-by: Frediano Ziglio <[email protected]>
The old implementation had various issues.
Just use mutexes and condition variables via C stubs.
The current OCaml Condition module does not support waiting on a condition
with a timeout, so use C stubs.
This implementation does not require opening a pipe, which reduces the
possibility of errors when calling "wait" to zero. Most of the time it
does not require kernel calls.
When a host starts, the systemd service xapi-wait-init-complete waiting
on the creation of the xapi init cookie file may fail on timeout for a
matter of seconds. This patch adds 1 minute (300 seconds total) to the
timeout passed to the script xapi-wait-init-complete.
In some environments the time ranges checked are too strict, causing tests
to fail.
Previously the maximum accepted error was 10 ms; increase it to 50 ms.
Also increase timeouts to reduce the error/value ratio.
Test that the event is correctly executed.

Signed-off-by: Frediano Ziglio <[email protected]>
Do not use ">" or other operators to compare Mtime.t values: they are intended
to be unsigned and should be compared with Int64.unsigned_compare, as the
Mtime functions do.

Signed-off-by: Frediano Ziglio <[email protected]>
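A stdlib-only illustration of why signed operators are wrong here (Mtime.t is documented as an unsigned 64-bit value carried in an Int64):

```ocaml
(* Once the value passes Int64.max_int, signed comparison inverts the
   ordering; Int64.unsigned_compare keeps it correct. *)
let a = Int64.max_int (* a large monotonic timestamp *)
let b = Int64.add a 1L (* the next tick wraps to Int64.min_int *)

let () =
  (* signed comparison claims b < a, which would reorder events *)
  assert (Int64.compare b a < 0) ;
  (* unsigned comparison preserves the intended order b > a *)
  assert (Int64.unsigned_compare b a > 0)
```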
The parameter was false only to support an internal usage; external users
should always alert the thread loop.

Signed-off-by: Frediano Ziglio <[email protected]>
- Do not wait a huge amount of time if the queue is empty;
  use Delay.wait if possible;
- Fix deletion of periodic events. While an event is being processed
  it is removed from the queue; previously remove_from_queue was
  not able to mark such an event as removed;
- Do not race between checking the first event and removing it.
  These 2 actions were done in 2 separate critical sections; now
  they are done in a single one.

Signed-off-by: Frediano Ziglio <[email protected]>
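The single-critical-section fix from the last bullet could be sketched like this. The names are hypothetical and a plain int list stands in for the event queue; the point is that the check of the head and its removal happen under one lock:

```ocaml
let m = Mutex.create ()
let queue = ref [1; 2; 3]

(* Check-and-pop under one lock: no other thread can remove the event
   between the check and the removal. *)
let pop_if_ready ~ready =
  Mutex.lock m ;
  let r =
    match !queue with
    | e :: rest when ready e -> queue := rest ; Some e
    | _ -> None
  in
  Mutex.unlock m ; r

let () =
  assert (pop_if_ready ~ready:(fun e -> e = 1) = Some 1) ;
  (* the head is now 2, which is not ready *)
  assert (pop_if_ready ~ready:(fun e -> e = 1) = None)
```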
Potentially a periodic event can be cancelled while it is being processed.
Simulate this behaviour by removing the event inside the handler.
This was fixed by the previous commit.

Signed-off-by: Frediano Ziglio <[email protected]>
Previously, if the queue was empty and the loop thread was active,
the scheduler took quite some time to pick up a new event.
Check that this is done in a timely fashion to avoid regressions in the code.

Signed-off-by: Frediano Ziglio <[email protected]>
Similar to the test_empty test, but the state of the loop should be different.

Signed-off-by: Frediano Ziglio <[email protected]>
Some plugins may not store the client-set metadata, and return a static value
when replying to the update. This would override the values that a client
used when the SR was created, or set afterwards, which is unexpected.

Now name_label and name_description fields returned by the plugins are ignored
on update.

The current set_name_label and set_name_description rely on the update
mechanism to work. Instead, add a database call at the end of the methods to
ensure both xapi and the SR backend are synchronized, even when the latter
fails to update the values.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Though the majority of completions were already using set_completions and the
like to add completion suggestions, there were two leftovers still needlessly
changing COMPREPLY themselves. This caused bugs, as in the case of

xe vm-import filename=<TAB>

autocompleting all of the filenames into the prompt instead of presenting
a choice. Let only these functions operate on COMPREPLY directly.

Signed-off-by: Andrii Sultanov <[email protected]>
**xe-cli completion: Use grep -E instead of egrep**

Otherwise newer packages in XS9 issue
"egrep: warning: egrep is obsolescent; using grep -E"
warnings all over the place.

---

**xe-cli completion: Hide COMPREPLY manipulation behind functions**

Though the majority of completions were already using set_completions and the
like to add completion suggestions, there were two leftovers still needlessly
changing COMPREPLY themselves. This caused bugs, as in the case of

`xe vm-import filename=<TAB>`

autocompleting all of the filenames into the prompt instead of presenting
the choice. Let only these functions operate on COMPREPLY directly.
…roject#6165)

Some plugins may not store the client-set metadata, and return a static value
when replying to the update. This would override the values that a client
used when the SR was created, or set afterwards, which is unexpected.

Now name_label and name_description fields returned by the plugins are
ignored on update.

Current set_name_label and set_name_description rely on the update mechanism
to work. Instead, add database call at the end of the methods to ensure both
xapi and the SR backend are synchronized, even when the latter fails to
update the values.

Tested on GFS2 tests (JR 4175192), as well as ring3 bvt + bst (209177),
and storage validation tests (SR 209180)
For the scan retry, previously we were comparing the entire VDI data
structure from the database using the (<>) operator. This is a bit
wasteful and not very stable. Instead, let us just compare the VDI refs,
since the race here comes from `VDI.db_{introduce,forget}`, which would
only add/remove VDIs from the db, but not change their actual data
structure.

Also add a bit more logging when retrying, since this should not happen
very often.

Signed-off-by: Vincent Liu <[email protected]>
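The ref-versus-record comparison could be sketched as below. The `vdi` type and field names are illustrative stand-ins, not xapi's actual database types:

```ocaml
(* A ref is a stable key; the rest of the record may legitimately change. *)
type vdi = {ref_: string; name: string}

let refs vdis = List.sort compare (List.map (fun v -> v.ref_) vdis)
let same_vdi_set a b = refs a = refs b

let () =
  let a = [{ref_= "r1"; name= "old"}]
  and b = [{ref_= "r1"; name= "renamed"}] in
  (* structural (<>) would trigger a spurious retry; comparing refs
     only reacts to VDIs being added or removed *)
  assert (a <> b) ;
  assert (same_vdi_set a b)
```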
When adding a leaked datapath:
1. add the leaked datapath to Sr.vdis
2. write it to the db file
3. enhance logging

If there are storage exceptions raised when destroying a datapath,
the procedure fails and the state of the VDI becomes incorrect,
which leads to various abnormal results in subsequent operations.
To handle this, the leaked datapath is designed to re-destroy the
datapath and refresh the state before the next storage operation via
the function remove_datapaths_andthen_nolock. But this mechanism
doesn't take effect in the current code.
This commit fixes this bug: a leaked datapath should be added
to Sr.vdis for the mechanism to really work, and written to the
db file to avoid losing the leaked datapath if xapi restarts.

Signed-off-by: Changlei Li <[email protected]>
There were some issues in the current code, from structure corruption to
thread safety.
Add some tests and fix the discovered issues.

More details in the commit messages.
When adding a leaked datapath:
1. add the leaked datapath to Sr.vdis
2. write it to the db file
3. enhance logging

If there are storage exceptions raised when destroying a datapath,
the procedure fails and the state of the VDI becomes incorrect,
which leads to various abnormal results in subsequent operations.
To handle this, the leaked datapath is designed to re-destroy the
datapath and refresh the state before the next storage operation via
the function remove_datapaths_andthen_nolock. But this mechanism
doesn't take effect in the current code.
This commit fixes this bug: a leaked datapath should be added
to Sr.vdis for the mechanism to really work, and written to the
db file to avoid losing the leaked datapath if xapi restarts.
For the scan retry, previously we were comparing the entire VDI data
structure from the database using the (<>) operator. This is a bit
wasteful and not very stable. Instead, let us just compare the VDI refs,
since the race here comes from `VDI.db_{introduce,forget}`, which would
only add/remove VDIs from the db, but not change their actual data
structure.

Also add a bit more logging when retrying, since this should not happen
very often.
QEMU orders devices by the time of plugging. Parallelizing them introduces
randomness, which breaks the assumption that devices are ordered in a
deterministic way.

Serialize all PCI and VUSB plugs to restore behaviour.

Signed-off-by: Pau Ruiz Safont <[email protected]>
QEMU orders devices by the time of plugging. Parallelizing them
introduces randomness, which breaks the assumption that devices are
ordered in a deterministic way.

Serialize all PCI and VUSB plugs to restore behaviour.

This fix has been manually tested by @freddy77 against an affected VM
relying on the ordering.
@minglumlu minglumlu merged commit a889151 into xapi-project:feature/pool-licensing Dec 13, 2024
15 checks passed