Thanks for creating this :) cannot get past this error... #7

Open
mannp opened this issue Mar 31, 2022 · 8 comments

Comments

mannp commented Mar 31, 2022

Not sure what I am doing wrong, but I have tried a number of reconfigurations and always end up back at this error.

Thanks for any pointers.

module.k3s.proxmox_vm_qemu.k3s-support: Still creating... [6m10s elapsed]
╷
│ Error: file provisioner error
│ 
│   with module.k3s.proxmox_vm_qemu.k3s-support,
│   on .terraform/modules/k3s/support_node.tf line 72, in resource "proxmox_vm_qemu" "k3s-support":
│   72:   provisioner "file" {
│ 
│ timeout - last error: dial tcp 192.168.90.200:22: i/o timeout
fvumbaca (Owner) commented Apr 1, 2022

That looks like your new support node is unreachable. After the terraform fails, is the node up? Can you ssh into it?

mannp (Author) commented Apr 1, 2022

That looks like your new support node is unreachable. After the terraform fails, is the node up? Can you ssh into it?

Hey, thanks for taking the time to reply :)

Some progress after creating a new base image... I now get:

│ Error: file provisioner error
│ 
│   with module.k3s.proxmox_vm_qemu.k3s-support,
│   on .terraform/modules/k3s/support_node.tf line 72, in resource "proxmox_vm_qemu" "k3s-support":
│   72:   provisioner "file" {
│ 
│ timeout - last error: SSH authentication failed ([email protected]:22): ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey],
│ no supported methods remain

I can ssh into the first support node using my private key, but I am unclear on how your plan knows where the private key is?

I also note:

  1. This may be by design, or just an issue on my end, but I use multiple VLANs and addresses on other VLANs do not appear to be supported. DNS settings of 'host' fail, as the DNS is set to the same address as the gateway by Terraform... also the VLAN tag is not set in the network settings for the node.

     For testing, I am using the base VLAN (192.168.1.x) and I can at least connect to the node.

  2. My base Ubuntu image has qemu-guest-agent installed, but the nodes created from it do not appear to have it installed?

fvumbaca (Owner) commented Apr 1, 2022

Hey @mannp my pleasure! Thanks for taking an interest!

Let me see if I can clear up some of your questions:

[...] But I am unclear on how your plan knows where the private key is?

The module loads your public ssh keys through the authorized_keys_file setting on the module. Right now, the module assumes the matching private key has been added to your ssh agent at the time you run terraform apply. Using an explicit private key file is possible, but it would need to be added to this module as an enhancement (feel free to give it a go!).
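
To illustrate, a rough sketch of how that looks from the calling side (the module source string below is just a placeholder for however you already reference this module; authorized_keys_file is the real setting):

  module "k3s" {
    source = "github.com/fvumbaca/terraform-proxmox-k3s"  # placeholder; keep whatever source you use today

    # path to the PUBLIC key that gets installed on the nodes
    authorized_keys_file = "~/.ssh/id_rsa.pub"

    # ... the rest of your settings ...
  }

and then, before running terraform apply, add the matching private key to your agent with something like ssh-add ~/.ssh/id_rsa.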

This may be by design, or just an issue on my end, but I use multiple VLANs and addresses on other VLANs do not appear to be supported. DNS settings of 'host' fail, as the DNS is set to the same address as the gateway by Terraform... also the VLAN tag is not set in the network settings for the node.

VLANs are not something I have personally worked with extensively, but from what you describe it should still work. As long as your node is routable, has the correct public key set via authorized_keys_file, and your private key is added to your agent, it should work. The fact that you can ssh in manually proves everything except that the private key is available to your ssh agent.

My base Ubuntu image has qemu-guest-agent installed, but the nodes created from it do not appear to have it installed?

This is not something that is part of the Terraform code in this repo, but I do think it is important. It is also something I struggled a bit with myself; building custom cloud-init VM templates is a little tricky sometimes. If I have time, I might write up a guide on setting one up in the documentation section. Until then, Google is your best friend here, I am afraid.

Hope this helps and let me know if you have any additional questions!

mannp (Author) commented Apr 1, 2022

Hi @fvumbaca

Oh dear, a facepalm moment

I was looking for:

ssh_private_key = <<EOF
-----BEGIN RSA PRIVATE KEY-----
private ssh key root
-----END RSA PRIVATE KEY-----
EOF

It's obvious now, but it might be worth stating explicitly for other noobs :) as it cost me a good few hours :-/ (my excuse is that other similar tools either create the keys for you or ask for the private key, lol).

It's busy provisioning on 192.168.1.x, so I will try other subnets if this completes successfully.

I am now getting E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem. on each of the nodes, so I expect that is another issue with my template.

I have since created a new template with Debian 11 instead of Ubuntu and things look much better!

Currently applying with no errors.

For simplicity, I used this for my template -> https://github.com/oytal/proxmox-ubuntu-cloudinit-bash, but with a Debian 11 image.

Is it possible to specify or change the starting ID number used for the VMs, so they are all kept together in the GUI?

I also hit the error below, which appears to be related to where Terraform expects to find the ssh binary, which is hardcoded? It is not there on NixOS :)

module.k3s.proxmox_vm_qemu.k3s-worker["default-1"]: Creation complete after 3m17s [id=pve/qemu/107]
╷
│ Error: External Program Lookup Failed
│ 
│   with module.k3s.data.external.kubeconfig,
│   on .terraform/modules/k3s/master_nodes.tf line 101, in data "external" "kubeconfig":
│  101:   program = [
│  102:     "/usr/bin/ssh",
│  103:     "-o UserKnownHostsFile=/dev/null",
│  104:     "-o StrictHostKeyChecking=no",
│  105:     "${local.master_node_settings.user}@${local.master_node_ips[0]}",
│  106:     "echo '{\"kubeconfig\":\"'$(sudo cat /etc/rancher/k3s/k3s.yaml | base64)'\"}'"
│  107:   ]
│ 
│ The data source received an unexpected error while attempting to find the program.
│ 
│ The program must be accessible according to the platform where Terraform is running.
│ 
│ If the expected program should be automatically found on the platform where Terraform is running, ensure that the program is in an expected directory. On Unix-based platforms, these directories are typically
│ searched based on the '$PATH' environment variable. On Windows-based platforms, these directories are typically searched based on the '%PATH%' environment variable.
│ 
│ If the expected program is relative to the Terraform configuration, it is recommended that the program name includes the interpolated value of 'path.module' before the program name to ensure that it is
│ compatible with varying module usage. For example: "${path.module}/my-program"
│ 
│ The program must also be executable according to the platform where Terraform is running. On Unix-based platforms, the file on the filesystem must have the executable bit set. On Windows-based platforms, no
│ action is typically necessary.
│ 
│ Platform: linux
│ Program: /usr/bin/ssh
│ Error: exec: "/usr/bin/ssh": stat /usr/bin/ssh: no such file or directory
╵
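
I guess one workaround (just an untested sketch on my side) would be to drop the absolute path so the external data source resolves ssh from $PATH, which the error message above says it does for non-absolute program names, e.g. in master_nodes.tf:

  data "external" "kubeconfig" {
    program = [
      "ssh",  # instead of the hardcoded "/usr/bin/ssh"
      "-o UserKnownHostsFile=/dev/null",
      "-o StrictHostKeyChecking=no",
      "${local.master_node_settings.user}@${local.master_node_ips[0]}",
      "echo '{\"kubeconfig\":\"'$(sudo cat /etc/rancher/k3s/k3s.yaml | base64)'\"}'"
    ]
  }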

For now, I sshed into master-01, got the kubeconfig manually, and loaded it successfully in Lens 👍🏻

It seems 'support' is for MariaDB and an nginx load balancer? I am not sure of the terminology used, and 'default' is the worker node pool? [Edit: yes, I changed the name to worker and recreated the cluster]

I will do some more investigation. Thanks for your help, I got there!

Finally, I believe it doesn't work for VLANs because the tag is not being set:

  network {
    bridge    = "vmbr0"
    firewall  = true
    link_down = false
    macaddr   = upper(macaddress.k3s-masters[count.index].address)
    model     = "virtio"
    queues    = 0
    rate      = 0
    tag       = -1
  } 

I have tried changing a git-pulled copy of your repo, setting the tag to 10 as shown below, but it doesn't work... I guess it is pulling your git repo every time.
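
The change looked roughly like this (10 is just my VLAN tag):

  network {
    bridge    = "vmbr0"
    firewall  = true
    link_down = false
    macaddr   = upper(macaddress.k3s-masters[count.index].address)
    model     = "virtio"
    queues    = 0
    rate      = 0
    tag       = 10
  }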

The only subnet that works for me is 192.168.1.x, the others fail.

fvumbaca (Owner) commented Apr 4, 2022

No worries! Happens to me all the time; I'm glad you got it working!

I can see the issues you might hit with the ssh binary not being available, or with wanting to use a load balancer other than nginx. Right now, both nginx and ssh are hardcoded for simplicity, with the idea of just getting something that works, but I agree all the issues you ran into should be mentioned upfront in the docs.

Nginx, I think, is an OK choice to hardcode since it is only being used to balance across the master nodes for the API; personally, once the cluster is up, I use MetalLB for everything else. Because of the way the module sets up your API server, I would recommend using the kubeconfig output variable instead of sshing into a node and pulling it manually. In the module, the kubeconfig is patched to use the load-balanced endpoint, whereas the config taken from a node will always point to that specific node. If for whatever reason you lose that node, you will lose control of your cluster. (Just my 2 cents though!)
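
For example, something roughly like this in your root module (I am writing this from memory, so double-check the module's outputs for the exact name):

  # expose the module's patched kubeconfig so it can be pulled with `terraform output`
  output "kubeconfig" {
    value     = module.k3s.kubeconfig
    sensitive = true
  }

and then something like terraform output -raw kubeconfig > ~/.kube/config to write it locally.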

I honestly did not put much thought into MariaDB beyond it being easy to back up, running externally from the cluster, and (from my non-scientific knowledge) being the most efficient supported DB for a cluster of this size. I also found it would be difficult to support many different databases, especially for newer users.

As for your issue with changing the Terraform module: first, be sure to re-run terraform init after you make changes. I don't remember whether it complains if you forget, but running it twice won't hurt. Also, if you are looking to work on a fork, it might be helpful to source the module from a local path instead of git, but that's personal preference. And please open an MR with your improvements! I would love to incorporate it all here.
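
For example, something like this while you iterate (the path is just a placeholder for wherever your checkout lives):

  module "k3s" {
    # point at your local checkout of the fork instead of the upstream git source
    source = "../terraform-proxmox-k3s"

    # ... the rest of your settings unchanged ...
  }

followed by another terraform init so the local copy gets picked up.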

I think the improvements I can take away from this are:

  • Better documentation on
    • SSH key management
    • building a compatible node template
  • Better support for vlans
  • Support control over VM id numbers

Recently I have become really busy, so I am not sure when I will have time to sit down and work on these. If you're interested, I would be more than happy to make the time to review any MRs! In any case, I really appreciate you reaching out and using this little module of mine! Hope it helped you out some :)

(I'm going to keep this issue open until I get around to making an issue for each improvement here, so I don't forget about them.)

mannp (Author) commented May 13, 2022

I've tried to add support for control over VM ID numbers, and I'd added VLAN support as well (but didn't submit a PR in time :)), along with a couple of other updates I found useful.

fvumbaca (Owner) commented

@mannp thanks for #13! I have left a comment on it; I think I understand where you are going with it, but I don't think the interface being introduced will have the desired effect. After looking it over myself, I think controlling the VM IDs can get very tricky, especially when there are node pool rollovers to consider. In any case, we can continue the conversation about this control on that MR thread.

c-p-b (Contributor) commented Jul 11, 2022

I had quite similar issues to the OP (same errors, etc.), and most of it boiled down to setting up cloud-init and the VM template correctly. Here's what worked for me (a modified version of this guide):

Don't try to do this from the installer ISO; you will almost certainly run into issues (for example, the ISO disables cloud-init networking configuration in the image). Use the cloud images specifically. First, go to the console of one of your Proxmox nodes.

Get the tooling needed to install packages into the image:

apt install libguestfs-tools

Download the .img file (amd64 in my case, as it usually is) from https://cloud-images.ubuntu.com/

For example:

 wget https://cloud-images.ubuntu.com/jammy/20220708/jammy-server-cloudimg-amd64.img

Create the bootable template using cloud-init:

export STORAGE_POOL="local-lvm"
export VM_ID="9000"
export VM_NAME="jammy-server-cloudimg-amd64.img"

virt-customize -a $VM_NAME --install qemu-guest-agent # installs qemu guest agent into the template
# virt-customize -a $VM_NAME --install nfs-common # optional - if you are going to use nfs on the host from within k3s
qm create $VM_ID --name ubuntu --memory 2048 --net0 virtio,bridge=vmbr0
qm importdisk $VM_ID $VM_NAME $STORAGE_POOL -format qcow2
qm set $VM_ID  --scsihw virtio-scsi-pci --scsi0 $STORAGE_POOL:9000/vm-9000-disk-0.qcow2
qm set $VM_ID  --ide2 $STORAGE_POOL:cloudinit
qm set $VM_ID  --boot c --bootdisk scsi0
qm set $VM_ID  --serial0 socket --vga serial0

Enable the guest agent in the UI by going to VM -> Options -> QEMU Guest Agent, then convert the VM into a template:

qm template $VM_ID 

Ensure the private key matching the public key specified in authorized_keys_file has been added to the ssh agent using ssh-add. In other words, if your key pair is id_rsa / id_rsa.pub and id_rsa.pub is the key in that file, make sure you run ssh-add id_rsa.
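
The template then needs to be referenced from the k3s module configuration. In my case that was something like the following (the variable name node_template is from memory and may not be exact, so check the module's inputs):

  module "k3s" {
    # ...
    node_template = "ubuntu"  # the --name given to `qm create` above
    # ...
  }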

Then run terraform.

I'm happy to bundle this up into a docs pull request if it makes sense.
