Setting up Kubernetes clusters in VMware Cloud Director using Talos and Terraform

Introduction


Installing and operating Kubernetes can be daunting due to its complexity. However, this article aims to show how easy it is to spin up a production-ready Kubernetes cluster from scratch in VMware Cloud Director.

Using Talos Linux (an OS designed for Kubernetes) and Terraform's automation capabilities, you can spin up Kubernetes on VMs running in your private cloud in just a few minutes.

In this article, we will build a Kubernetes cluster according to the following specifications:

Control plane nodes: 3
Worker nodes: 3
OS: Talos Linux v1.8.3
Kubernetes version: v1.31.1
Control plane VIP: 172.16.0.10
Host network: 172.16.0.0/24
Pod network: 10.244.0.0/16
Service network: 10.96.0.0/12
CNI: Cilium (configured with Geneve tunnels, as there is a known bug with VXLAN on VMware)
Load balancer: Cilium in L2 mode

Prerequisites


  • A VMware Cloud Director API token.
  • An NSX Edge Gateway with a public IP address and an SNAT rule allowing outbound access from the Kubernetes host network (172.16.0.0/24).
  • A client machine with access to the Kubernetes host network (172.16.0.0/24).
  • Terraform and kubectl installed on the client machine.

Deploying a Talos Kubernetes cluster using Terraform

Step 1 - Initializing Terraform


On the client machine, create a file called main.tf and paste the following configuration into it:

terraform {
  required_providers {
    vcd = {
      source  = "vmware/vcd"
      version = "3.14.0"
    }
    talos = {
      source  = "siderolabs/talos"
      version = "0.6.0"
    }
  }
}

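# authentication uses an API token (auth_type = "api_token"); user, password and token are intentionally left empty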
provider "vcd" {
  user      = ""
  password  = ""
  auth_type = "api_token"
  token     = ""
  api_token = var.vcd_api_token
  url       = "https://${var.vcd_url}/api"
  org       = var.vcd_org
  vdc       = var.vcd_vdc
}

provider "talos" {
}

Next, define variables for your project so the configuration can easily be reused across environments.

Create a file called variables.tf and paste the following configuration:

variable "vcd_url" {
  type        = string
  description = "Cloud Director URL (Example: 'vcd.dc-fbg1.glesys.net')"
}

variable "vcd_org" {
  type        = string
  description = "Tenant Organization (Example: 'vdo-xxxxx')"
}

variable "vcd_api_token" {
  type        = string
  description = "API Token to authenticate to Cloud Director"
}

variable "vcd_vdc" {
  type        = string
  description = "Organization Virtual Datacenter (Example: 'vdc-xxxxx')"
}

variable "vcd_edge" {
  type        = string
  description = "Edge Gateway (Example: 't1-vdc-xxxxx-fbg1-01')"
}

variable "k8s_cluster_name" {
  type        = string
  description = "Kubernetes cluster name"
}

variable "k8s_cluster_vip" {
  type        = string
  description = "VIP to connect to the Kubernetes cluster"
}

variable "k8s_cluster_endpoint" {
  type        = string
  description = "URL for the Kubernetes API e.g. https://server.yourdomain.tld:6443 or https://VIP:6443"
}

variable "k8s_cluster_node_network" {
  type        = string
  description = "The CIDR includes the IP address of the gateway e.g. 192.168.100.1/24 represents the gateway address 192.168.100.1 and subnet mask 255.255.255.0"
}

variable "k8s_cluster_node_network_first_controller_hostnum" {
  type        = number
  description = "The hostnum of the first controller host"
}

variable "k8s_cluster_node_network_first_worker_hostnum" {
  type        = number
  description = "The hostnum of the first worker host"
}

variable "k8s_controller_count" {
  type        = number
  description = "Number of control plane nodes"
  default     = 1
  validation {
    condition     = var.k8s_controller_count >= 1
    error_message = "Must be 1 or more."
  }
}

variable "k8s_worker_count" {
  type        = number
  description = "Number of worker nodes"
  default     = 1
  validation {
    condition     = var.k8s_worker_count >= 1
    error_message = "Must be 1 or more."
  }
}

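If you prefer not to pass every value on the command line with -var flags (as done in Step 3 below), you can optionally collect them in a terraform.tfvars file, which Terraform loads automatically. Below is a sketch using the example values from this article; replace the placeholders with your own organization's values:

vcd_url       = "vcd.dc-fbg1.glesys.net"
vcd_org       = "vdo-#####"
vcd_vdc       = "vdc-#####"
vcd_api_token = "ABC12345678"
vcd_edge      = "t1-vdc-#####-fbg1-01"

k8s_cluster_name         = "demo"
k8s_cluster_vip          = "172.16.0.10"
k8s_cluster_endpoint     = "https://172.16.0.10:6443"
k8s_cluster_node_network = "172.16.0.1/24"

k8s_cluster_node_network_first_controller_hostnum = 11
k8s_cluster_node_network_first_worker_hostnum     = 20

k8s_controller_count = 3
k8s_worker_count     = 3
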
Run terraform init to initialize the project and install the required providers:

$ terraform init

Initializing the backend...

Initializing provider plugins...
- Finding siderolabs/talos versions matching "0.6.0"...
- Finding vmware/vcd versions matching "3.14.0"...
- Installing siderolabs/talos v0.6.0...
- Installed siderolabs/talos v0.6.0 (signed by a HashiCorp partner, key ID AF0815C7E2EC16A8)
- Installing vmware/vcd v3.14.0...
- Installed vmware/vcd v3.14.0 (signed by a HashiCorp partner, key ID 8BF53DB49CDB70B0)

Terraform has been successfully initialized!

Step 2 - Defining Kubernetes resources


On the client machine, create a file called kubernetes.tf and paste the following configuration:

# since the edge gateway is not managed by tf, define a data resource for the edge gateway
data "vcd_nsxt_edgegateway" "my_edge" {
  name = var.vcd_edge
}

# since the catalog is not managed by tf, define a data resource for the glesys templates catalog
data "vcd_catalog" "os_templates" {
  org  = "GleSYS"
  name = "GleSYS Templates"
}

# define a data resource for the talos-v1.8 template
data "vcd_catalog_vapp_template" "talos_v1_8" {
  catalog_id = data.vcd_catalog.os_templates.id
  name       = "talos-v1.8"
}

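# compute per-node names and IP addresses, and derive the gateway address and prefix length from the node network CIDR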
locals {
  controller_nodes = [
    for i in range(var.k8s_controller_count) : {
      name = "k8s-cp-${i}"
      ip   = cidrhost(var.k8s_cluster_node_network, var.k8s_cluster_node_network_first_controller_hostnum + i)
    }
  ]
  worker_nodes = [
    for i in range(var.k8s_worker_count) : {
      name = "k8s-wn-${i}"
      ip   = cidrhost(var.k8s_cluster_node_network, var.k8s_cluster_node_network_first_worker_hostnum + i)
    }
  ]
  gateway       = split("/", var.k8s_cluster_node_network)[0]
  prefix_length = split("/", var.k8s_cluster_node_network)[1]
}

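# generate the cluster-wide Talos secrets (certificates and tokens) shared by all nodes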
resource "talos_machine_secrets" "talos" {}

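# machine configuration for the control plane nodes, patched with the install disk, the shared VIP on eth0 and a custom CNI manifest (Cilium); kube-proxy is disabled since Cilium takes over that role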
data "talos_machine_configuration" "controller" {
  cluster_name     = var.k8s_cluster_name
  cluster_endpoint = var.k8s_cluster_endpoint
  machine_secrets  = talos_machine_secrets.talos.machine_secrets
  machine_type     = "controlplane"
  config_patches = [
    yamlencode({
      machine = {
        install = {
          disk = "/dev/sda"
        }
        network = {
          interfaces = [
            {
              interface = "eth0"
              vip = {
                ip = var.k8s_cluster_vip
              }
            }
          ]
        }
      }
      cluster = {
        network = {
          cni = {
            name = "custom"
            urls = [
              "https://gist.githubusercontent.com/jaymzmac/9f949266783f531dca0b1bfa2d06f0a3/raw/0618788fddf3b45b985eed3dd95ce6d39b87cf12/cilium-install.yaml"
            ]
          }
        }
        proxy = {
          disabled = true
        }
      }
    })
  ]
}

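# machine configuration for the worker nodes, with the same CNI settings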
data "talos_machine_configuration" "worker" {
  cluster_name     = var.k8s_cluster_name
  cluster_endpoint = var.k8s_cluster_endpoint
  machine_secrets  = talos_machine_secrets.talos.machine_secrets
  machine_type     = "worker"
  config_patches = [
    yamlencode({
      machine = {
        install = {
          disk = "/dev/sda"
        }
      }
      cluster = {
        network = {
          cni = {
            name = "custom"
            urls = [
              "https://gist.githubusercontent.com/jaymzmac/9f949266783f531dca0b1bfa2d06f0a3/raw/0618788fddf3b45b985eed3dd95ce6d39b87cf12/cilium-install.yaml"
            ]
          }
        }
        proxy = {
          disabled = true
        }
      }
    })
  ]
}

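# client configuration (talosconfig) with all control plane nodes as endpoints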
data "talos_client_configuration" "talos" {
  cluster_name         = var.k8s_cluster_name
  client_configuration = talos_machine_secrets.talos.client_configuration
  endpoints            = [for node in local.controller_nodes : node.ip]
}

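# routed org VDC network for the Kubernetes nodes, attached to the edge gateway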
resource "vcd_network_routed_v2" "demo_net_1" {
  name            = "demo-net-1"
  edge_gateway_id = data.vcd_nsxt_edgegateway.my_edge.id
  gateway         = local.gateway
  prefix_length   = local.prefix_length
  static_ip_pool {
    start_address = cidrhost(var.k8s_cluster_node_network, 2)
    end_address   = cidrhost(var.k8s_cluster_node_network, -2)
  }
}

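# control plane VMs; the Talos machine configuration is injected via guestinfo.userdata and the static network configuration via guestinfo.metadata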
resource "vcd_vm" "controller" {
  count            = var.k8s_controller_count
  name             = local.controller_nodes[count.index].name
  computer_name    = local.controller_nodes[count.index].name
  vapp_template_id = data.vcd_catalog_vapp_template.talos_v1_8.id
  memory           = 8192
  cpus             = 4
  cpu_cores        = 1
  network {
    type               = "org"
    name               = vcd_network_routed_v2.demo_net_1.name
    ip_allocation_mode = "MANUAL"
    ip                 = local.controller_nodes[count.index].ip
    connected          = true
  }
  set_extra_config {
    key   = "guestinfo.userdata"
    value = base64encode(data.talos_machine_configuration.controller.machine_configuration)
  }
  set_extra_config {
    key = "guestinfo.metadata"
    value = base64encode(yamlencode({
      "local-hostname" : local.controller_nodes[count.index].name,
      "network" : {
        "version" : 2,
        "ethernets" : {
          "eth0" : {
            "addresses" : ["${local.controller_nodes[count.index].ip}/${local.prefix_length}"],
            "nameservers" : {
              "addresses" : ["8.8.8.8", "8.8.4.4"]
            },
            "gateway4" : local.gateway
          }
        }
      }
    }))
  }
  lifecycle {
    ignore_changes = [
      vapp_template_id
    ]
  }
}

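# worker VMs, provisioned in the same way as the control plane VMs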
resource "vcd_vm" "worker" {
  count            = var.k8s_worker_count
  name             = local.worker_nodes[count.index].name
  computer_name    = local.worker_nodes[count.index].name
  vapp_template_id = data.vcd_catalog_vapp_template.talos_v1_8.id
  memory           = 8192
  cpus             = 4
  cpu_cores        = 1
  network {
    type               = "org"
    name               = vcd_network_routed_v2.demo_net_1.name
    ip_allocation_mode = "MANUAL"
    ip                 = local.worker_nodes[count.index].ip
    connected          = true
  }
  set_extra_config {
    key   = "guestinfo.userdata"
    value = base64encode(data.talos_machine_configuration.worker.machine_configuration)
  }
  set_extra_config {
    key = "guestinfo.metadata"
    value = base64encode(yamlencode({
      "local-hostname" : local.worker_nodes[count.index].name,
      "network" : {
        "version" : 2,
        "ethernets" : {
          "eth0" : {
            "addresses" : ["${local.worker_nodes[count.index].ip}/${local.prefix_length}"],
            "nameservers" : {
              "addresses" : ["8.8.8.8", "8.8.4.4"]
            },
            "gateway4" : local.gateway
          }
        }
      }
    }))
  }
  lifecycle {
    ignore_changes = [
      vapp_template_id
    ]
  }
}

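# bootstrap etcd on the first control plane node once the VMs have been created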
resource "talos_machine_bootstrap" "talos" {
  client_configuration = talos_machine_secrets.talos.client_configuration
  endpoint             = local.controller_nodes[0].ip
  node                 = local.controller_nodes[0].ip
  depends_on = [
    vcd_vm.controller
  ]
}

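# retrieve a kubeconfig for the cluster after the bootstrap has completed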
resource "talos_cluster_kubeconfig" "talos" {
  client_configuration = talos_machine_secrets.talos.client_configuration
  endpoint             = local.controller_nodes[0].ip
  node                 = local.controller_nodes[0].ip
  depends_on = [
    talos_machine_bootstrap.talos
  ]
}

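# expose the talosconfig and kubeconfig as sensitive outputs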
output "talosconfig" {
  value     = data.talos_client_configuration.talos.talos_config
  sensitive = true
}

output "kubeconfig" {
  value     = talos_cluster_kubeconfig.talos.kubeconfig_raw
  sensitive = true
}

Step 3 - Applying Terraform configuration


Run terraform apply to apply your configuration and provision your Kubernetes cluster:

$ terraform apply \
-var vcd_url="vcd.dc-fbg1.glesys.net" \
-var vcd_api_token="ABC12345678" \
-var vcd_org="vdo-#####" \
-var vcd_vdc="vdc-#####" \
-var vcd_edge="t1-vdc-#####-fbg1-01" \
-var k8s_cluster_name="demo" \
-var k8s_cluster_vip="172.16.0.10" \
-var k8s_cluster_endpoint="https://172.16.0.10:6443" \
-var k8s_cluster_node_network="172.16.0.1/24" \
-var k8s_cluster_node_network_first_controller_hostnum=11 \
-var k8s_cluster_node_network_first_worker_hostnum=20 \
-var k8s_controller_count=3 \
-var k8s_worker_count=3

Plan: 10 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

talos_machine_secrets.talos: Creating...
vcd_network_routed_v2.demo_net_1: Creating...
talos_machine_secrets.talos: Creation complete after 1s
vcd_network_routed_v2.demo_net_1: Creation complete after 11s
vcd_vm.worker[0]: Creating...
vcd_vm.worker[1]: Creating...
vcd_vm.worker[2]: Creating...
vcd_vm.controller[1]: Creating...
vcd_vm.controller[0]: Creating...
vcd_vm.controller[2]: Creating...
vcd_vm.worker[0]: Creation complete after 2m19s
vcd_vm.worker[1]: Creation complete after 2m20s
vcd_vm.controller[2]: Creation complete after 2m20s
vcd_vm.worker[2]: Creation complete after 2m22s
vcd_vm.controller[0]: Creation complete after 2m23s
vcd_vm.controller[1]: Creation complete after 2m23s
talos_machine_bootstrap.talos: Creating...
talos_machine_bootstrap.talos: Creation complete after 17s
talos_cluster_kubeconfig.talos: Creating...
talos_cluster_kubeconfig.talos: Creation complete after 0s

Apply complete! Resources: 10 added, 0 changed, 0 destroyed.

Extract the kubeconfig and verify that the Kubernetes cluster has been created successfully:

$ terraform output -raw kubeconfig > config-demo
$ export KUBECONFIG=config-demo
$ alias k=kubectl

$ k cluster-info
Kubernetes control plane is running at https://172.16.0.10:6443
CoreDNS is running at https://172.16.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

$ k get nodes -o wide
NAME       STATUS     ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
k8s-cp-0   Ready      control-plane   54s   v1.31.1   172.16.0.11   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0
k8s-cp-1   Ready      control-plane   28s   v1.31.1   172.16.0.12   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0
k8s-cp-2   Ready      control-plane   56s   v1.31.1   172.16.0.13   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0
k8s-wn-0   Ready      <none>          55s   v1.31.1   172.16.0.20   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0
k8s-wn-1   Ready      <none>          22s   v1.31.1   172.16.0.21   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0
k8s-wn-2   Ready      <none>          56s   v1.31.1   172.16.0.22   <none>        Talos (v1.8.3)   6.6.60-talos     containerd://2.0.0

Step 4 - Deploying "hello world" application (optional)


Deploy a "hello world" application to your cluster and create a load balancer for that application:

$ k create deployment hello-world --image us-docker.pkg.dev/google-samples/containers/gke/hello-app:2.0 --replicas 2
$ k expose deployment hello-world --type LoadBalancer --port 80 --target-port 8080

$ k get svc hello-world
NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/hello-world   LoadBalancer   10.100.238.17   <pending>     80:30386/TCP   5s

As you can see, no external IP has been assigned to the load balancer yet.

Create a CiliumL2AnnouncementPolicy and a CiliumLoadBalancerIPPool; the pool defines a block of addresses in the host network from which external IPs are allocated:

$ k apply -f - <<EOF
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: cilium-lb-all-services
  namespace: kube-system
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  loadBalancerIPs: true
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-lb-pool"
spec:
  blocks:
  - cidr: "172.16.0.224/28"
---
EOF

From your client machine, use curl to verify that you can access the application via the external IP that has now been allocated to the load balancer:

$ k get svc hello-world
NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
service/hello-world   LoadBalancer   10.100.238.17   172.16.0.224   80:30386/TCP   1m44s

$ curl http://172.16.0.224
Hello, world!
Version: 2.0.0
Hostname: hello-world-7bb8d9fdcd-4jv2h

To expose this application externally, create a DNAT rule on the Edge Gateway that forwards traffic from a public IP to 172.16.0.224.
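
The rule can be created in the Cloud Director UI, or managed with Terraform alongside the rest of the cluster. Below is a minimal sketch using the vcd_nsxt_nat_rule resource; the rule name and the PUBLIC_IP placeholder are assumptions, and PUBLIC_IP should be replaced with an address allocated to your edge gateway:

# hypothetical DNAT rule forwarding traffic from a public IP to the hello-world load balancer IP
resource "vcd_nsxt_nat_rule" "hello_world_dnat" {
  edge_gateway_id  = data.vcd_nsxt_edgegateway.my_edge.id
  name             = "hello-world-dnat"
  rule_type        = "DNAT"
  external_address = "PUBLIC_IP"
  internal_address = "172.16.0.224"
}

With the rule in place, verify that the application is reachable via the public IP: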

$ curl http://PUBLIC_IP
Hello, world!
Version: 2.0.0
Hostname: hello-world-7bb8d9fdcd-4jv2h

Summary


This concludes our article on quickly spinning up a production-ready Kubernetes cluster from scratch in VMware Cloud Director.

One of the major benefits of building Kubernetes clusters with Talos and Terraform is that the setup is extremely flexible and can be customized as needed. Do you need a different CNI for your cluster? Do you want to use BGP to advertise the load balancer network? Do you want to deploy a CSI using Ceph and Rook? Do you want to deploy a GitOps tool like Flux or ArgoCD? You can achieve all of these things by customizing the talos_machine_configuration data sources in Terraform, for example as sketched below.
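
As an illustration, an additional entry in the config_patches list could point Talos at extra manifests to apply right after bootstrap, for instance a GitOps controller. This is a hypothetical sketch; the Flux install URL is an assumption and should be verified against the Flux documentation before use:

# hypothetical extra patch to append to config_patches in
# data "talos_machine_configuration" "controller"
yamlencode({
  cluster = {
    # manifests that Talos applies to the cluster after bootstrap
    extraManifests = [
      "https://github.com/fluxcd/flux2/releases/latest/download/install.yaml"
    ]
  }
})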

Furthermore, running Talos Kubernetes in VMware Cloud Director specifically offers several additional benefits, such as:

  • Ability to run traditional VM workloads in the same private network as the Talos Kubernetes cluster.
  • Flexibility to expose apps and services only locally or to expose them to the Internet.
  • Access to VMware Cloud Director networking services for configuring VPNs, firewalls, NAT rules, etc.
  • Full insight into the cluster, VMs, storage and network.
  • No additional licenses required (unlike VMware Tanzu).
