
Dynamic Host Volumes #15489

Open
akamensky opened this issue Dec 7, 2022 · 37 comments · May be fixed by #24479
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/storage type/enhancement

Comments

@akamensky

akamensky commented Dec 7, 2022

Proposal

Currently, to create a host volume one needs to edit the agent configuration to add a volume stanza and restart the agent. This is impractical, as Nomad itself may be provisioned using one of numerous tools (Ansible/Salt/etc.), and restarting an agent that may already have other tasks running is far from a good idea.

CSI volumes can be created/allocated on the fly. However, CSI volumes are often networked storage (like NFS). Host volumes are extremely useful for stateful workloads that require high-performance local storage (think a local SSD array or NVMe for databases like RocksDB, for example).

I think allowing host volumes to be created on agent nodes on the fly using API calls (and perhaps corresponding CLI commands) is a sensible and very practical approach.

Use-cases

  • High-performance local DBs like RocksDB/LevelDB, which can be built into the application itself
  • On-disk cache that would speed up application startup (when storage speed matters)
  • Maybe other use cases too, but the above are our primary use cases for this

Attempted Solutions

  • raw_exec driver with direct access to host filesystems. This, however, is not an ideal solution, as it removes process isolation (a sketch follows below).
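
For context, a minimal sketch of that raw_exec workaround, assuming the client has the raw_exec plugin enabled (binary and path names are illustrative):

  job "rocksdb-app" {
    group "db" {
      task "db" {
        # raw_exec runs without filesystem isolation, so the task can read
        # the host path directly -- no volume or mount configuration needed.
        driver = "raw_exec"

        config {
          command = "/usr/local/bin/rocksdb-app"     # illustrative binary
          args    = ["--db-path", "/data/test_dir"]  # host path used directly
        }
      }
    }
  }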
@tgross
Member

tgross commented Dec 8, 2022

Hi @akamensky! I've been working up a proposal for exactly this for a while now ("dynamic host volumes") but it hasn't quite made it over the line in terms of wrapping that design up and getting it implemented. But this is definitely on my radar!

@tgross tgross added the theme/storage and stage/accepted labels Dec 8, 2022
@akamensky
Author

akamensky commented Dec 8, 2022

@tgross thank you for the response. Not implying any rush with this, just wondering what the possible ETA is on this feature landing in a stable release (how long would it normally take from proposal until the feature is available)? We are evaluating Nomad as a replacement for a homebrewed deployment/orchestration system for a mostly legacy stack, and this may be a show-stopper for us.

@tgross
Member

tgross commented Dec 9, 2022

For reasons of "we're a public company now and I'll get put in Comms Jail" 😀 I can't give an ETA on new features but it's almost certainly not going to get worked on until after 1.5.0 which will go out early in the new year.

@mikenomitch mikenomitch changed the title from "Allow creating host volumes from API and/or CLI" to "Dynamic Host Volumes" Dec 22, 2022
@mikenomitch
Contributor

FYI, I changed the name on this feature to make it easier for us to find internally. We tend to refer to it as "Dynamic Host Volumes" so just updating the title to match that.

@mikenomitch mikenomitch moved this to 1.7 or Later in Nomad Roadmap Dec 22, 2022
@akamensky
Author

akamensky commented Dec 23, 2022

I will ask this here first, as I'm not sure whether this should be bundled in with this FR or raised as a separate ticket. If either of you, @tgross or @mikenomitch, could provide feedback on this, I would appreciate it.

I know that with the docker driver it is possible to provide a simple directory mapping in the config, such as:

    driver = "docker"
    config {
      volumes = ["/data/test_dir:/data"]
    }

This then utilizes Docker's functionality to do the bind mount.

However, not everything is Docker. I feel it may be good to provide a similar configuration option for the exec driver. Under the hood this option could do something as simple as symlinking the source into the chroot space of the task. I think this would be a pretty straightforward option that won't require any additional configuration and could perhaps work outside of dynamic host volumes (which I think are still needed for use cases different from this one).
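
For illustration, a hypothetical exec task config along those lines might look like this; the volumes option shown here does not exist for the exec driver today and simply mirrors the docker driver's syntax:

  task "db" {
    driver = "exec"

    config {
      command = "/usr/local/bin/rocksdb-app"  # illustrative binary
      # Hypothetical option, not supported by the exec driver today: the idea is
      # that Nomad would symlink or bind-mount the host path into the task chroot.
      volumes = ["/data/test_dir:/data"]
    }
  }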

@tgross
Member

tgross commented Jan 3, 2023

@akamensky yeah that'd be simple on its face, with a couple of caveats:

  • We gate access to this feature in docker with the volumes.enabled configuration so that job operators can't bind-mount arbitrary paths from the host unless the cluster administrator allows it. We'd need to do the same for any other task driver we implement that feature for (see the config sketch after this list).
  • We have a long-standing issue around setting permissions on volumes (option to manage volume permissions, #8892). The exec driver doesn't even have an option for user namespaces, so this seems especially important to solve correctly so that a job operator can't break permissions on the whole host.
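
For reference, the gate mentioned in the first point lives in the client's docker plugin configuration:

  plugin "docker" {
    config {
      volumes {
        # When false (the default), job specs cannot bind-mount arbitrary
        # host paths via the docker driver's volumes option.
        enabled = true
      }
    }
  }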

Probably worth splitting that idea out to another issue specific to the exec driver.

@apollo13
Contributor

apollo13 commented Jan 5, 2023

@tgross not going to ask for an ETA ;) but would you mind laying out how you want to implement this? I thought about this a bit already, and I came up with the following crazy solution (I like it because it is self-contained, so to speak):

  • Make raw_exec CSI plugins a first-class citizen (maybe they already are, I didn't check yet)
  • Implement a topology-aware CSI host volume plugin that ships with Nomad

This would imo have quite a few upsides over how host volumes work now, and as an added benefit they would show up automatically in the storage section etc…

@tgross
Member

tgross commented Jan 5, 2023

We definitely don't want to implement any kind of CSI-plugin-like interface; I'm going to be honest and say my opinion on CSI is that we implemented it reluctantly as a shared standard and aren't happy with the resulting interface and failure modes. Whatever we do for dynamic host volumes will be Nomad-native and wired into the scheduler directly.

@blinkinglight

waiting for this feature

@ljb2of3

ljb2of3 commented Apr 13, 2023

This feature would be very useful for my deployment as well.

@tgross
Member

tgross commented Apr 13, 2023

Hi folks, adding a 👍 reaction to the top-level issue description is enough to make sure your enthusiasm gets recorded for the team. You can also see that it's currently on our "Later Release Shortlist" on the Public Roadmap. If you've got new use cases or design ideas, feel free to post them here though!

@1doce8

1doce8 commented Apr 17, 2023

I would also be excited to see this feature implemented. Additionally, I'd like to suggest a design idea: it would be great to have the ability to not only create volumes on the fly (which would be a massive accomplishment), but also to separate the logic of volume creation and binding an allocation to a volume. This concept could be similar to how Kubernetes implements Persistent Volumes (PV) and Persistent Volume Claims (PVC).

@tgross
Member

tgross commented Apr 18, 2023

it would be great to have the ability to not only create volumes on the fly (which would be a massive accomplishment), but also to separate the logic of volume creation and binding an allocation to a volume. This concept could be similar to how Kubernetes implements Persistent Volumes (PV) and Persistent Volume Claims (PVC).

That's how we implemented CSI support, where it makes sense (folks have asked to be able to merge them in the job spec anyways #11195 but I'm not sure it makes sense to have those operations for CSI because the characteristic times for creating CSI volumes are on the scale of minutes for some cloud providers, rather than milliseconds like the rest of the scheduler path). For dynamic host volumes we won't have the same kind of timescale problems so creating them on the fly with the job will be feasible.

Fortunately if we can create them on the fly we can create them separately as well. I think the UX challenge will be how to surface placement of those volumes during creation. Unlike CSI (and k8s) dynamic host volumes are initially associated with a specific client node (because they'd have to be created by the client and not by some third-party cloud storage API). Whereas if the creation is tied to a job the volume would be created at the time of placement so the scheduler is just using all the existing scheduler logic to do so.


For some additional context, here's a snippet from an internal design doc that I'm otherwise not ready to share yet about some of our underlying assumptions about the design:

  • Users want to be able to define volumes in the jobspec, rather than as separate registration/creation commands. (Or maybe “in addition to”?)
  • Users want to be able to move volumes between hosts, but will recognize that this isn’t cheap for non-remote volumes.
  • Remote-attached volumes remain owned by CSI.

I'm not 100% sold yet on the 2nd bullet point. And I'd like to figure out a way to get "except NFS" crammed into that 3rd bullet-point somehow, because it's a widely-supported standard and would cut out a ton of use cases where folks are stuck with CSI when its complexity is unwarranted. But that might be opening a whole other can of worms 😀

@ivantopo

Hey @tgross, thanks for sharing!

Just wanted to ensure that the "in addition to" part of this message gets considered 😄

  • Users want to be able to define volumes in the jobspec, rather than as separate registration/creation commands. (Or maybe “in addition to”?)

Context from my experience: when our jobs require host volumes, scheduling is not dynamic at all. We know up front which hosts should be running these jobs, we create the volumes via config file on those hosts, and the job allocations stay with those hosts practically forever. A bit of manual intervention before scheduling these jobs is not a problem at all and might even be desired in some cases, so we are sure we are placing things where we want them.

If it were possible to have something like nomad volume create -type host ... for volumes, that would be awesome!
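
To make that concrete, here is a purely hypothetical sketch of such a workflow -- no such host volume spec format or create behavior exists at the time of writing, and all names and fields below are illustrative:

  # mysql-volume.hcl -- hypothetical spec, loosely modeled on the CSI volume spec format
  name    = "mysql-data"
  type    = "host"
  node_id = "CLIENT_NODE_ID"  # pin the volume to a specific client node

  parameters {
    path  = "/opt/mysql/data"
    owner = "mysql"
    mode  = "0700"
  }

which an operator would then submit with something like nomad volume create -type host mysql-volume.hcl.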

@apollo13
Contributor

Hi Tim,

I am not really sold on the 2nd bullet point either. I'd really like to hear actual use cases for it. I doubt HA is a use case, because migration can simply take too long, or the node might simply die, in which case you cannot migrate any data at all anyway.

I agree with @ivantopo that host volumes are most likely used for workloads with fixed allocations (like a cluster of Elasticsearch servers, for instance). I honestly don't know how much sense it would make to generate volumes from the jobspec automatically. From an operator perspective I want to be in control of where data ends up. I do not want to allow users to fill my disks… I also don't want the allocations from an Elasticsearch cluster (assume three allocations spread over five client nodes) to suddenly run on a new node with an automatically created empty volume (migration as in bullet point two is imo not an option). HA is provided here by running three allocations after all; I can and will deal with one of them dying.

As for NFS, I am going out on a limb here and please don't read it as critique -- I know you dislike CSI (I do as well) but I don't think special-casing NFS (or more accurately any network filesystem) makes much sense. How would Nomad know that host volume X on node 1 and host volume X on node 2 use the same backing storage? In CSI that is easy, it is just one volume. Honestly it feels like all the complexity we hate about CSI would end up in host volumes if we started to "support" network filesystems there (even if a user is not using a network filesystem, they would probably pay the price for the increased code complexity that Nomad has to carry to support it, i.e. more bugs).

Btw, CSI in Nomad is quite stable nowadays (you fixed most of the ugly issues I think, so massive thanks for that), and running my CSI NFS plugin really is not much of a burden (this is not meant as self-promotion, but I would really hate for host volumes to have more complexity than needed, even if it is to "just" support NFS -- I am also not saying that my driver has no bugs, it probably has some, but so far it works). That said, for safer operations I'd love to see #6554 fixed, especially for CSI.

Long story short, if it were possible to create Nomad host volumes via the CLI and also set stuff like owner/group/mode, it would be a massive improvement over what we have now. I bet it would also be what 99.42% of the user base would love to see.

If you want we can do a call and discuss this further?

@apollo13
Contributor

Regarding

That's how we implemented CSI support, where it makes sense (folks have asked to be able to merge them in the job spec anyways #11195 but I'm not sure it makes sense to have those operations for CSI because the characteristic times for creating CSI volumes are on the scale of minutes for some cloud providers, rather than milliseconds like the rest of the scheduler path).

I think this goes beyond the scope of this ticket, but it is something that I miss as well. Not only in the context of CSI but generally (think Consul intentions for the service mesh, etc.). This is imo something nomad-pack (or something else entirely; nomad-pack still hasn't won me over) should provide, and it is bigger than simply volumes.

@tgross
Member

tgross commented Apr 19, 2023

From an operator perspective I want to be in control of where data ends up. I do not want to allow users to fill my disks…

This is great context! I think that's where we originally wanted to go with CSI and why we didn't have the create workflow in place -- the idea was always that you'd create via Terraform or whatever so that it's the responsibility of the cluster administrator rather than the job submitter. What you're saying here has a similar separation of duties, and that makes a lot more sense for host volumes.

As for NFS, I am going out on a limb here and please don't read it as critique -- I know you dislike CSI (I do as well) but I don't think special-casing NFS (or more accurately any network filesystem) makes much sense. How would Nomad know that host volume X on node 1 and host volume X on node 2 use the same backing storage? In CSI that is easy, it is just one volume. Honestly it feels like all the complexity we hate about CSI would end up in host volumes if we started to "support" network filesystems there (even if a user is not using a network filesystem, they would probably pay the price for the increased code complexity that Nomad has to carry to support it, i.e. more bugs).

Yeah I don't think I disagree with most of what you're saying here. The only way I'd want to be able to support NFS is if we could treat it like an ordinary mount without special casing -- just supporting the right syntax for the mount call we'd need. If that can't be done, totally not worth the effort/complexity when we've got CSI right there!

@apollo13
Contributor

Yeah I don't think I disagree with most of what you're saying here. The only way I'd want to be able to support NFS is if we could treat it like an ordinary mount without special casing -- just supporting the right syntax for the mount call we'd need.

That begs an interesting question :) I guess in the end it all boils down to how far you are willing to go, starting from the existing stanza:

  host_volume "mysql" {
    path      = "/opt/mysql/data"
    read_only = false
  }

and transforming this into something along the lines of

  host_volume "mysql" {
   source = "/dev/sda1"
   type = "ext4"
   options = "rw"
  }

would immediately allow for block devices, normal bind mounts and nfs (and everything else). The main questions now become:

  • How much awareness will the scheduler have about this. Ie NFS would be fine to mount from two allocs on that host (actually all hosts, but I don't assume you want to have cross-host awareness for host (!) volumes). Bind mounts will work just fine as well, but with block devices you want to be able to tell the client that it can only mount this once.

  • How much security do we want here. Currently since the volumes can only be defined in the configuration file we don't need to think much about this since it is solely operator controlled. As soon as it is possible to create them via the API it is probably not just the operator who has access to create them and you most likely want to limit the allowed source paths or mount types (ie to prevent bind mounting / from the host into the container).
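
For illustration, an NFS mount in that hypothetical syntax could then be expressed the same way -- again, none of these fields exist in the current host_volume block:

  host_volume "shared-assets" {
    source  = "nfs.example.com:/exports/assets"  # illustrative NFS export
    type    = "nfs"
    options = "rw,nfsvers=4.1"
  }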

@mberdnikov

(sorry for google translate) I would like to add my own context.

We use Nomad to run multiple instances of our application for automated testing. These launches are initiated by a daemon based on the merge request status. We have a zfs volume cloning daemon on a few nodes to get our databases up and running quickly. And it looks like this:

  1. Send a command to the daemon to clone the volume.
  2. Get the hostname where it happened.
  3. Insert hostname into constraints and run Job.
  4. After testing, stop the Job.
  5. Inform the daemon that the volume can be deleted.

To sum it up: we don't care where the volume is. What matters for us is that the volume can be created and deleted dynamically within the host group. If we were to move this into a prestart task, we would worry that the poststop task might not run and garbage collection would not be performed. Also, mixing different levels of responsibility looks dirty.


The second use case is mounting a subdirectory of a volume declared via the client's host_volume configuration. We currently mount the entire directory, then create subdirectories and update the application configuration in the entrypoint. It would be great to delegate this to Nomad in the volume stanza (mkdir, chmod, chown).

@apollo13
Contributor

These launches are initiated by a daemon based on the merge request status. We have a zfs volume cloning daemon on a few nodes to get our databases up and running quickly.

This imo sounds like a job for a CSI plugin.

@mberdnikov

This imo sounds like a job for a CSI plugin.

I didn't find a working implementation. Making a daemon with two methods turned out to be easier. 🤷‍♂️

@apollo13
Contributor

Ha, yeah I doubt you will find a prewritten CSI plugin for that. What I mainly wanted to say is that something like this is imo out of scope for host volumes.

@mberdnikov

Ha, yeah I doubt you will find a prewritten CSI plugin for that. What I mainly wanted to say is that something like this is imo out of scope for host volumes.

But I would like to register these volumes dynamically in Nomad, rather than inserting constraints and the absolute path to the volume in the job file. 😄

@tgross tgross closed this as completed Apr 21, 2023
@tgross tgross reopened this Apr 21, 2023
@tgross
Member

tgross commented Apr 21, 2023

(oops, clicking around the GitHub interface first thing in the morning is error prone for clumsy people like me 😀 )

@apollo13 wrote:

How much awareness will the scheduler have about this. Ie NFS would be fine to mount from two allocs on that host (actually all hosts, but I don't assume you want to have cross-host awareness for host (!) volumes).

I think this is a problem even for non-NFS volumes and frankly something we've totally ignored for the existing host volumes implementation. There's nothing that currently stops you from mounting a host volume r/w on two allocations. We've left it as the responsibility of the application to avoid corruption. In CSI there's the capability field that lets you restrict that further, and retaining that field for dynamic host volumes would let us reuse some existing scheduler code.
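
For reference, the capability block being referred to looks like this in a CSI volume spec (values are illustrative), and something similar could be reused for dynamic host volumes:

  capability {
    # Restricts how the volume may be claimed, e.g. only one allocation
    # on a single node may mount it read/write.
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }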

How much security do we want here. Currently since the volumes can only be defined in the configuration file we don't need to think much about this since it is solely operator controlled. As soon as it is possible to create them via the API it is probably not just the operator who has access to create them and you most likely want to limit the allowed source paths or mount types (ie to prevent bind mounting / from the host into the container).

Right, definitely something to consider! CSI volumes are "namespaced resources", so they fall under the existing namespace ACL. There's a host_volumes ACL but we'd almost certainly need to expand the logic there.
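
For context, the existing rule looks roughly like this in an ACL policy (the wildcard name is illustrative):

  # Controls whether jobs submitted with a token carrying this policy
  # may claim host volumes whose names match the pattern.
  host_volume "prod-*" {
    policy = "write"
  }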

@mberdnikov wrote:

Send a command to the daemon to clone the volume.

Yeah this part is what makes it a good bet for a CSI plugin. I've built something along these lines for zvols (unfortunately I wasn't able to open source it), so it's definitely feasible.

The second use case is mounting a subdirectory of a volume declared via the client's host_volume configuration. We currently mount the entire directory, then create subdirectories and update the application configuration in the entrypoint. It would be great to delegate this to Nomad in the volume stanza (mkdir, chmod, chown).

👍 This feels like exactly what I'd expect as a minimum dynamic host volumes implementation.

@apollo13
Contributor

There's a host_volumes ACL but we'd almost certainly need to expand the logic there.

Or get rid of them and make dynamic host volumes namespace-aware (though I am not sure of the ups and downs).

@heatzync

heatzync commented Jul 25, 2023

(Edited to prevent confusion - see other comments lower down. This thread isn't really about the docker driver).

nomad.client.hcl

client {
  enabled = true

  # node_pool is a client-level setting, not a docker plugin option
  node_pool = "NODE1"
  #node_pool = "${attr.unique.hostname}" # not sure whether node attributes can be interpolated here - perhaps using a go template?

  host_volume "all-your-base-are-belong-to-us" {
    path = "/path/to/base"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_privileged = true

    volumes {
      enabled = true
    }
  }
}

example.job.hcl

job "example" {
  node_pool = "NODE1"
  #node_pool = var.specific_node # alternatively using a job input variable
  type = "service"

  group "example" {

    volume "all-your-base-are-belong-to-us" {
      type = "host"
      source = "all-your-base-are-belong-to-us"
      read_only = false
    }

    task "use-volume" {
      driver = "docker"

      config {
        image = "your-image:latest"

        # Using volumes instead of a bind mount:
        # https://admantium.medium.com/persisting-data-with-nomad-f98754753c0e
        # https://admantium.com/blog/ic09_persisting_data_with_nomad/
        volumes = ["/path/to/base/${NOMAD_JOB_NAME}/and/then/some:/path/in/container"]
      }
    }
  }
}

Disclaimer: Making use of the node_pool with a single node ensures the job always runs on the same node.

@jdoss

jdoss commented Jul 25, 2023

@heatzync that is a pretty clever workaround. If you create an example that uses an affinity tag, please share it :)

@suikast42
Contributor

@heatzync

The point of having dynamic host volumes is to avoid running Docker in privileged mode. That is a big security gotcha.

I don't understand the lifecycle approach.

Nomad can create Docker host volumes out of the box if you have configured Docker in privileged mode.

@akamensky
Author

akamensky commented Aug 16, 2023

I feel like this thread has lost its course. The request was created specifically for the "exec" driver with Nomad "host volume" functionality, not for Docker with Docker volumes.

I guess it can be expanded to Docker with Nomad host volumes. Although I imagine for Docker it should be easier to use Docker volumes.

@heatzync

Please accept the apology of a Nomad n00b for convoluting this thread @akamensky. I also now realised that @suikast42 tried to put me on the right path with:

I don't understand the lifecycle approach.

Nomad can create Docker host volumes out of the box if you have configured Docker in privileged mode.

I will edit the above comments and remove the prepare-volume task with the lifecycle stuff, in an attempt to spare the next person any confusion, as that's not really necessary.

@jdoss

jdoss commented Aug 16, 2023

@akamensky All drivers would benefit from dynamic host volumes. I think @heatzync's contributions to this conversation are very relevant.

@suikast42
Contributor

Any progress on this?

@apollo13
Contributor

apollo13 commented Dec 8, 2023 via email

@suikast42
Contributor

Thanks. Where can I find the roadmap?

@apollo13
Contributor

apollo13 commented Dec 9, 2023 via email

@tgross
Member

tgross commented Nov 1, 2024

Just FYI folks, we've just kicked off this project and are planning on shipping it in Nomad 1.10.0. We're going through the initial design work and you can see that I've been landing some preliminary PRs into a feature branch to put together the skeleton of the feature. I'm going to attempt to publish our design document here once that's ready to share (with customer stories or other internal data filed-off, of course).

@matifali

matifali commented Nov 2, 2024

@angrycub Do you think this will let us update our Nomad Coder template to use localhost volumes instead of relying on the local HOST CSI driver?

@tgross tgross linked a pull request Nov 18, 2024 that will close this issue
tgross added a commit that referenced this issue Nov 18, 2024
Add several validation steps in the create/register RPCs for dynamic host
volumes. We first check that submitted volumes are self-consistent (ex. max
capacity is more than min capacity), then that any updates we've made are
valid. And we validate against state: preventing claimed volumes from being
updated and preventing placement requests for nodes that don't exist.

Ref: #15489
Projects
Status: 1.10