RancherOS v2 is an immutable Linux distribution built to run Rancher and its corresponding Kubernetes distributions, RKE2 and k3s.

Architecture

RancherOS v2 is an immutable Linux distribution built to run Rancher and its corresponding Kubernetes distributions, RKE2 and k3s. It is built using the cOS-toolkit and based on openSUSE. Initial node configuration is done using a cloud-init style approach, and all further maintenance is done using Kubernetes operators.

Use Cases

RancherOS is intended to run as the operating system beneath a Rancher Multi-Cluster Management server or as a node in a Kubernetes cluster managed by Rancher. RancherOS also allows you to build standalone Kubernetes clusters that run an embedded, smaller version of Rancher to manage the local cluster. A key attribute of RancherOS is that it is managed by Rancher, so Rancher will exist either locally in the cluster or centrally with the Rancher Multi-Cluster Manager.

OCI Image based

RancherOS v2 is an image-based distribution with an A/B style update mechanism. The system runs from a read-only image A; to upgrade, it pulls a new read-only image B and reboots into it. What is unique about RancherOS v2 is that the runtime images come from OCI images: not OCI images containing special artifacts, but actual Docker-runnable images built using standard Docker build processes. RancherOS is built with a normal docker build, and if you wish to customize the OS image, all you need to do is create a new Dockerfile.
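
As a minimal sketch of that customization flow, a derived OS image can be built from a published base image with a standard Dockerfile. The base image reference and the installed package below are illustrative placeholders, not the actual published image names:

    # hypothetical example: derive a customized OS image from a published base image
    cat > Dockerfile <<'EOF'
    FROM registry.example.com/rancher/os2:latest
    RUN zypper --non-interactive install vim && \
        zypper clean --all
    EOF
    docker build -t registry.example.com/custom/os2:v1 .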

rancherd

RancherOS v2 includes no container runtime, Kubernetes distribution, or Rancher itself. All of these assets are dynamically pulled at runtime. All that is included in RancherOS is rancherd, which is responsible for bootstrapping RKE2/k3s and Rancher from an OCI registry. This means an update to containerd, k3s, RKE2, or Rancher does not require an OS upgrade or node reboot.
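
As a rough sketch of what this looks like on a node, rancherd is driven by a small configuration file read at boot. The path and field names below are as I recall them from rancherd and should be treated as illustrative rather than authoritative:

    # hypothetical rancherd bootstrap configuration for the first (cluster-init) node
    mkdir -p /etc/rancher/rancherd
    cat > /etc/rancher/rancherd/config.yaml <<'EOF'
    role: cluster-init
    token: somethingrandom
    kubernetesVersion: v1.24.8+rke2r1
    rancherVersion: v2.7.0
    EOF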

cloud-init

RancherOS v2 is initially configured using a simple version of cloud-init. It is not expected that one will need to do a lot of customization to RancherOS as the core OS's sole purpose is to run Rancher and Kubernetes and not serve as a generic Linux distribution.
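
A minimal sketch of such a configuration, assuming the common cloud-init keys (hostname, ssh_authorized_keys, write_files) are among those supported and that configuration files are picked up from /oem as in the cOS-toolkit:

    # hypothetical cloud-init style file applied at first boot
    cat > /oem/99_custom.yaml <<'EOF'
    hostname: ros-node-01
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... user@example.com
    write_files:
      - path: /etc/sysctl.d/90-custom.conf
        content: |
          vm.max_map_count=262144
    EOF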

RancherOS Operator

RancherOS v2 includes an operator that is responsible for managing OS upgrades and managing a secure device inventory to assist with zero touch provisioning.
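
For illustration only, an OS upgrade rollout via the operator might be declared with a custom resource along these lines; the API group, version, and field names here are assumptions and may not match the shipped CRDs:

    # hypothetical ManagedOSImage resource pointing managed nodes at a new OS image
    cat <<'EOF' | kubectl apply -f -
    apiVersion: rancheros.cattle.io/v1
    kind: ManagedOSImage
    metadata:
      name: os-upgrade
      namespace: fleet-default
    spec:
      osImage: registry.example.com/custom/os2:v2
    EOF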

openSUSE Leap

RancherOS v2 is based on openSUSE Leap. There is no specific dependency on openSUSE beyond the assumption that the underlying distribution uses systemd. We chose openSUSE for obvious reasons, but beyond that, openSUSE Leap provides a stable, well-tested layer to build upon, with paths to commercial support if one chooses.

Comments
  • Support for secure boot

    I believe we are missing the shim image in our base image, which we need to support secure boot. Some adjustments to the grub2 installation for secure boot might also be required.

  • Hostname is changed upon each reboot

    After installation I set my hostname to ros-node-02 and configured a K3s cluster on it with rancherd/Rancher. All was fine, but after a reboot the hostname had been changed to something like rancher-XXXXX. I tried to reset the hostname, but the same thing happened after the next reboot. I can reproduce this every time.

    I have a DHCP/DNS server that fixes the IP address, so this is not caused by an IP change on reboot. I have attached journalctl logs for this issue: journalctl.log.gz
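
    One way to narrow this down (a sketch, assuming systemd-hostnamed and the usual journal are available on the node) is to check which unit rewrites the hostname after boot and whether any cloud-init file under /oem sets it:

    # hypothetical diagnostic commands, run after a reboot
    hostnamectl status
    journalctl -b | grep -i hostname
    grep -ri hostname /oem/ 2>/dev/null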

  • Teal

    • Don't be scared by the diff, it's mostly drops
    • Splits os2 into ros-installer; the golang code is gone
    • Base image switched to SLE Micro for Rancher, with the elemental binaries included
    • Framework files are now tracked individually and statically (we could go with git submodules, but wanted to keep it simple for now), allowing sandboxed builds
    • Adds a CI workflow which keeps the framework static files mentioned above in sync with cos-toolkit, opening up PRs
    • Should be ready to be built with obs/ibs @kkaempf - it also replaces the os2-framework package with a single Dockerfile that can be built directly from obs
    • Temporarily drops selinux: SLE Micro for Rancher supports it, but as we don't have profiles for it, booting fails
    • Might need a follow-up; the PR pipeline should work, but yeah :)
    • All binaries are expected to be provided as part of the base image. This repo now provides the "end" Dockerfile which just applies the customizations from the framework, so for testing it is enough to provide a different base image with different binaries (e.g. a pinned elemental-cli version)

    It is better to browse it directly: https://github.com/rancher-sandbox/os2/tree/teal as most things were simplified or dropped

    Draft, as I still need to test this locally and wire up the CI

    Supersedes #115. Part of https://github.com/rancher-sandbox/os2/issues/94

  • nomodeset required for booting certain hardware

    What steps did you take and what happened:

    I'm reporting on behalf of a user who is having issues booting Elemental Teal. They were able to work around the issue by pressing e and adding nomodeset to the kernel params.

    They are using an AMD Ryzen-based SimplyNUC (https://simplynuc.com/product/llm2v8cy-full/)

    What did you expect to happen:

    The machine should boot normally without manual intervention.

    Environment: (Asking for details and will fill in)

    • Elemental release version (use cat /etc/os-release):
    • Rancher version:
    • Kubernetes version (use kubectl version):
    • Cloud provider or hardware configuration:
  • Consistent CI environment

    Our current CI / workers / runners setup is somewhat 'spread' across internal and AWS machines. We should try to have it all in one place and properly documented.

    • Paul, Itxaka, Julien, and Loic - phrase this issue correctly and add acceptance criteria
  • Empty `/etc/resolv.conf`

    Since the latest release (11th of September 2022), elemental does not properly set the /etc/resolv.conf file at boot. In fact, sle-micro-for-rancher introduced NetworkManager in this release.
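
    A quick way to check whether NetworkManager is actually managing the file on an affected node (a sketch, assuming standard NetworkManager paths):

    ls -l /etc/resolv.conf                # regular file, symlink, or empty?
    cat /run/NetworkManager/resolv.conf   # what NetworkManager itself has resolved
    nmcli general status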

  • Image names and labels for elemental stack

    In our current built artifacts we have the following image names and labels. <NUM> is the OBS build ID and <registry> is the registry domain and base repository where OBS pushes to.

    HELM CHART:

    <registry>/elemental/elemental-operator:
       * latest
       * 1.0.0
       * 1.0.0-build<NUM>
    

    TEAL IMAGE:

    <registry>/rancher/elemental-teal/5.3:
       * 1.0.0
       * 1.0.0-<NUM>
      
    <registry>/rancher/elemental-node-image/5.3:
       * latest
       * 1.0.0
    

    BUILDER IMAGE:

    <registry>/rancher/elemental-builder-image/5.3:
       * latest
       * 0.1.3
       * 0.1.3-<NUM>
    

    OPERATOR IMAGE:

    <registry>/rancher/elemental-operator/5.3:
       * latest
       * 1.0.0
       * 1.0.0-<NUM>
    

    I see a couple of little issues here, I'd say they should be fixed before the release:

    • The Helm chart is not under the rancher repository like all the others; I doubt this was on purpose.
    • The Teal image has two different repositories, but they do not follow the three-tags approach of the other images. Note that including a tag with the build ID might be relevant for base OS upgrades not coming from us (CVEs in the base image).

    Beyond these two little issues, I am wondering if the /rancher/<image>/5.3 repository schema is what we want. It is also unclear to me why we have two different repositories for the Teal image; it doesn't hurt, but I don't see how it would be used.

    @kkaempf @agracey @rancher/elemental are you fine with the current tags? Anything you think it should be arranged?

  • build-iso still pulls from quay.io

    elemental build-iso should run offline when the base image is available in the local image cache. Currently it still pulls data from quay.io (grub2 files apparently).

    Not sure why this happens. Either elemental looks in the wrong places in the image, or it does not look at all :wink:

    Everything needed for ISO building should be provided by node-image and builder-image.

  • Receiving 503 when reinstalling Elemental Operator

    What steps did you take and what happened: I reinstalled the Elemental Operator to fix a different problem.

    helm uninstall -n cattle-elemental-system elemental-operator
    kubectl delete -f registration.yaml
    kubectl delete -f selector.yaml
    kubectl delete -f cluster.yaml
    

    Then I installed the operator

    helm upgrade --create-namespace -n cattle-elemental-system --install elemental-operator oci://registry.opensuse.org/isv/rancher/elemental/stable/charts/rancher/elemental-operator-chart
    

    Then I installed the selector, cluster, and registration resources. When I ran wget to get the machine registration information, Rancher returned a 503:

    wget --no-check-certificate `kubectl get machineregistration -n fleet-default your-machine -o jsonpath="{.status.registrationURL}"` -O initial-registration.yaml
    --2022-12-20 15:14:50--  https://<omitted>
    Resolving <omitted>
    Connecting to <omitted>|:443... connected.
    WARNING: cannot verify <omitted>'s certificate, issued by ‘CN=dynamiclistener-ca,O=dynamiclistener-org’:
      Unable to locally verify the issuer's authority.
    HTTP request sent, awaiting response... 503 Service Unavailable
    2022-12-20 15:15:05 ERROR 503: Service Unavailable.
    

    I receive a proper URL from kubectl get machineregistration, but the wget does not return the registration information.
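
    One thing worth checking in this situation (a sketch, not a confirmed root cause; the deployment name is assumed) is whether the operator pod backing the registration endpoint is actually ready again after the reinstall:

    kubectl get pods -n cattle-elemental-system
    kubectl logs -n cattle-elemental-system deploy/elemental-operator --tail=50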

    What did you expect to happen: I should receive a yaml file with the registration token and cert.

    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    Environment:

    • Elemental release version (use cat /etc/os-release): N/A
    • Rancher version: 2.6.6
    • Kubernetes version (use kubectl version): RKE2:
    Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.8+rke2r1", GitCommit:"fdc77503e954d1ee641c0e350481f7528e8d068b", GitTreeState:"clean", BuildDate:"2022-11-10T17:56:22Z", GoVersion:"go1.18.8b7", Compiler:"gc", Platform:"linux/amd64"
    
    • Cloud provider or hardware configuration: self-hosted hardware
  • Docs migration to Docusaurus

    Docusaurus installed and initial configuration done:

    • Elemental sidebar created to match the mkdocs sidebar.
    • Header links set; the logo is still missing.
    • Footer created with the same links from mkdocs.

    The docs have been migrated 1:1 without any adaptation, so expect some visual glitches.

  • Split ros-operator into its own repo

    It makes no sense to have all this mixed together with the os2 files. First, everything is clumped together with a ton of files to set up (the CI, the makefile, the scripts, etc.), while ros-operator is very simple and could do with a simple repo in which we control everything and it's much clearer where to change things.

    Plus, we can release it separately instead of in a big release with the installer, the chart, the iso, etc.

    ros-operator and its chart should live in their own repo for ease of releasing, testing, and updating. We also don't need os2 to test the ros-operator, so that makes it simpler.

    A test repo is available at https://github.com/Itxaka/ros-operator with the full package, CI stuff, release stuff.

    Action items

    • [x] Split the operator code to https://github.com/rancher-sandbox/rancheros-operator with all the pipelines, tests, etc.
    • [x] Test releasing and QA
    • [x] Update os2 to drop the ros-operator code counterpart (adapt pipelines and CI accordingly)
  • Improve proxy tests

    Follow up of https://github.com/rancher/elemental/pull/594

    Proxy tests are now live in the CI, but we could improve them by doing the following things:

    • [ ] Use the same IP (172.17.0.1) for the proxy
    • [ ] Check the proxy log to make sure the traffic passes through it
    • [ ] Upload proxy log
    • [ ] Write documentation
  • Wrong shutdown order (on aarch64?)

    What steps did you take and what happened: [A clear and concise description of what the bug is.]

    Run

    # halt
    

    on a node's root prompt.

    This fails to unmount /var/lib/rancher and subsequently /var:

    [  OK  ] Unmounted /var/lib/kubelet…ojected/kube-api-access-g7zn8.
    [  OK  ] Unmounted /var/lib/longhorn.
    [FAILED] Failed unmounting /var/lib/rancher.
    [  OK  ] Stopped Flush Journal to Persistent Storage.
             Unmounting /etc...
             Unmounting /var/lib/kubelet...
             Unmounting /var/log...
    [  OK  ] Unmounted /etc.
    [  OK  ] Unmounted /var/lib/kubelet.
    [  OK  ] Unmounted /var/log.
             Unmounting /usr/local...
             Unmounting /var...
    [  OK  ] Unmounted /usr/local.
    [FAILED] Failed unmounting /var.
             Unmounting /run/overlay...
    [  OK  ] Unmounted /run/overlay.
    [  OK  ] Stopped target Preparation for Local File Systems.
    [  OK  ] Stopped target Swaps.
    [  OK  ] Reached target Unmount All Filesystems.
             Stopping Monitoring of LVM2… dmeventd or progress polling...
    [  OK  ] Stopped Create Static Device Nodes in /dev.
    [  OK  ] Stopped Create System Users.
    [  OK  ] Stopped Remount Root and Kernel File Systems.
    [  OK  ] Stopped Monitoring of LVM2… dmeventd or progress polling.
    

    and then it hangs in stopping a container(?)

    [   ***] A stop job is running for libcontai…c0b054e2b43bc51a3 (29s / 1min 30s)
    [  *** ] A stop job is running for libcontai…c0b054e2b43bc51a3 (29s / 1min 30s)
    [ ***  ] A stop job is running for libcontai…c0b054e2b43bc51a3 (30s / 1min 30s)
    [***   ] A stop job is running for libcontai…c0b054e2b43bc51a3 (30s / 1min 30s)
    [**    ] A stop job is running for libcontai…c0b054e2b43bc51a3 (31s / 1min 30s)
    [*     ] A stop job is running for libcontai…c0b054e2b43bc51a3 (31s / 1min 30s)
    [**    ] A stop job is running for libcontai…c0b054e2b43bc51a3 (32s / 1min 30s)
    [***   ] A stop job is running for libcontai…c0b054e2b43bc51a3 (32s / 1min 30s)
    [ ***  ] A stop job is running for libcontai…c0b054e2b43bc51a3 (33s / 1min 30s)
    [  *** ] A stop job is running for libcontai…c0b054e2b43bc51a3 (33s / 1min 30s)
    [   ***] A stop job is running for libcontai…c0b054e2b43bc51a3 (34s / 1min 30s)
    [    **] A stop job is running for libcontai…c0b054e2b43bc51a3 (34s / 1min 30s)
    [     *] A stop job is running for libcontai…c0b054e2b43bc51a3 (35s / 1min 30s)
    [    **] A stop job is running for libcontai…c0b054e2b43bc51a3 (35s / 1min 30s)
    

    and only continues after systemd's timeout is reached.
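
    To see why the unmount ordering goes wrong, it may help to inspect the systemd ordering of the affected mount units (a sketch; the unit names are derived from the mount points and assumed here):

    # hypothetical inspection of mount unit ordering on an affected node
    systemctl show -p After,Before,RequiredBy var-lib-rancher.mount
    systemctl show -p After,Before,RequiredBy var.mount
    systemd-analyze dump | grep -A5 'var-lib-rancher.mount'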

    What did you expect to happen:

    An orderly and quick shutdown.

    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    Environment:

    • Elemental release version (use cat /etc/os-release): HEAD as of today.
    • Rancher version: 2.7.0
    • Kubernetes version (use kubectl version):
    • Cloud provider or hardware configuration:
  • e2e: test new autogenerated seed for emulated TPM

    elemental-operator v1.0.x allows only one node with emulated TPM per MachineRegistration, but the new operator v1.1.x allows more if emulatedSeed is set to -1.

    A test for this should be added as soon as we no longer need to keep CI tests for operator v1.0.x.
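
    For reference, the registration side of such a test might look roughly like this; the exact field names (emulate-tpm, emulate-tpm-seed) are assumptions based on the emulatedSeed behaviour described above:

    # hypothetical MachineRegistration allowing several nodes with emulated TPM
    cat <<'EOF' | kubectl apply -f -
    apiVersion: elemental.cattle.io/v1beta1
    kind: MachineRegistration
    metadata:
      name: emulated-tpm-registration
      namespace: fleet-default
    spec:
      config:
        elemental:
          registration:
            emulate-tpm: true
            emulate-tpm-seed: -1
    EOF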

  • Revisit elemental release procedures

    A few things to improve and elaborate on around releases:

    • [x] Not all sources are in github (specs, dockerfiles, etc.)
      • I'd suggest adding some sort of dist/obs/ folder within the repos to include those. In dist/obs there could be a README.md explaining that the files in there are OBS-specific and mostly used for SUSE's builds.
    • [x] Rebuild of RPMs on PRs
    • [ ] Release candidate tags are useless, probably we should simply consider Stable as our RC once artifacts are in registry.suse.com
    • [ ] Linked packaged diffs are reversed
    • [ ] Arrange the IBS project to build properly; the SR is accepted, but images are not building
    • [ ] A bot user should be used in OBS instead of my own user
  • Injection script should be able to reuse local iso

    Currently, if you already have the ISO downloaded in the same directory, the script refuses to inject the file. It should be safe to reuse the same ISO over and over to inject the registration file.

    $ .github/elemental-iso-add-registration initial-registration.yaml
    Pulling artifacts from isv:Rancher:Elemental:Stable OBS project
    Error: /home/itxaka/projects/elemental/elemental-teal.x86_64.iso already exists, aborting
    
    