Provides a task runtime implementation that uses pidfd and the eBPF sched_process_exit tracepoint to manage daemonless containers with low overhead.

embedshim

embedshim is a task runtime implementation that can be used as a plugin in containerd.

In the current design, a shim manages the lifecycle of the container process and lets containerd reconnect to it after a restart. A key design element of a small shim is container process monitoring, which matters at least for containers created by a runC-like runtime.

Without pidfd and the eBPF tracepoint, a process that is not the parent is unlikely to receive the exit notification in time, or to read the exit code correctly, after the shim dies. And in Kubernetes infrastructure, even though the containers in a pod can share one shim, the VmRSS of each shim (Go runtime) is still about 8MB.

So this plugin aims to provide a task runtime implementation that uses pidfd and the eBPF sched_process_exit tracepoint to manage daemonless containers with low overhead.

Build/Install

embedshim compiles its BPF programs with clang/llvm, so install clang/llvm first.

$ echo "deb http://apt.llvm.org/focal/ llvm-toolchain-focal main" | sudo tee -a /etc/apt/sources.lis
$ wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
$ sudo apt-get update -y
$ sudo apt-get install -y g++ libelf-dev clang lld llvm

Then clone the repo and build it.

$ git clone https://github.com/fuweid/embedshim.git
$ cd embedshim
$ git submodule update --init --recursive
$ make
$ sudo make install

The binary is named embedshim-containerd and has full functionality on Linux. You can simply replace your local containerd with it.

$ sudo install bin/embedshim-containerd $(command -v containerd)
$ sudo systemctl restart containerd

Then check the plugin with ctr:

$ ctr plugin ls | grep embed
io.containerd.runtime.v1        embed                    linux/amd64    ok

Status

embedshim can run containers headless or with input. It is still a work in progress; do not use it in production.

  • Support Pause/Resume
  • Task Event(Create/Start/Exit/Delete/OOM) support

Requirements

  • raw tracepoint bpf >= kernel v4.18
  • CO-RE BTF vmlinux support >= kernel v5.4
  • pidfd polling >= kernel v5.3
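
A host can be checked against these minimums with a small helper like the one below. It is a sketch under assumptions: the function names are hypothetical, the release string is expected in the `uname -r` format, and only the upstream version numbers from the list are compared, so distribution kernels that backport features (or ship without BTF) may be misjudged.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// requirements mirrors the minimum kernel versions listed above.
var requirements = []struct {
	feature      string
	major, minor int
}{
	{"raw tracepoint bpf", 4, 18},
	{"pidfd polling", 5, 3},
	{"CO-RE BTF vmlinux support", 5, 4},
}

// parseKernelRelease extracts major and minor from a release string such
// as "5.4.0-42-generic".
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	// The minor part may carry a suffix, e.g. "4-rc1" in "5.4-rc1".
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// missingFeatures returns the listed features whose minimum version is
// newer than the given kernel release.
func missingFeatures(release string) ([]string, error) {
	major, minor, err := parseKernelRelease(release)
	if err != nil {
		return nil, err
	}
	var missing []string
	for _, r := range requirements {
		if major < r.major || (major == r.major && minor < r.minor) {
			missing = append(missing, r.feature)
		}
	}
	return missing, nil
}

func main() {
	for _, release := range []string{"5.4.0-42-generic", "5.3.18", "4.18.0"} {
		missing, err := missingFeatures(release)
		fmt.Println(release, "missing:", missing, "err:", err)
	}
}
```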

License

Owner
Fu Wei
a @containerd maintainer
Comments
  • Support ExecProcess in shim

    [PATCH 7] init exec_process support
    
    Unlike runc-init, the exec process needs a runc-exec wrapper that acts
    as a subreaper so that embedshim can use pidfd to watch the process's
    exit event correctly.
    
    Since there is no way to recover the exec process after containerd
    restarts, this commit introduces a new in-memory exitsnoop store to trace
    the exec process, so that no leaked items remain in the map.
    
    It is based on release/1.5's exec_process code base. I made it an
    implementation of runtime.Process. Basically, we don't need to use the
    shim to wrap exec_process like init_process. I think init_process will
    move up to the shim layer in the future.
    
    One more thing: this is an alpha version of exec :).
    
    [PATCH 6] .github: align goversion with matrix
    
    According to the golangci-lint docs [1], the new version of the
    golangci-lint action uses the Go version installed by
    actions/setup-go@v2. Otherwise, it uses the latest version of Go [2].
    
    REF:
    
    [1] https://github.com/golangci/golangci-lint-action
    [2] https://github.com/golangci/golangci-lint-action/issues/435
    
    [PATCH 5] cmd: add a runc-exec wrapper helper commandline
    
    [PATCH 4] pkg/runcext: introduce process sync proto
    
    [PATCH 3] pkg/pidfd: support pidfd_getfd
    
    [PATCH 2] .github: update go to 1.17.x
    
    [PATCH 1] pkg/pidfd: support waitid API
    
    [PATCH 0] pkg/es: support store from non-pinned maps
    
    Since runC currently doesn't support checking the exec process's state,
    tracing the exec process in the same BPF map as the container would make
    recovery more complicated. And runC exec doesn't support the two-step
    fork-execve pattern like init, so it is also hard to recover it after
    restarting containerd.
    
    So we need another exitsnoop.Store to trace the exec process's exit event.
    This exitsnoop is gone when containerd exits.
    
  • bug: fd leaky when delete created container

    critest calls CreateContainer and then deletes the container, after which the FIFOs are leaked.

    The case name is runtime should support removing created container [Conformance].

    To reproduce:

    critest -runtime-endpoint /run/containerd/containerd.sock -ginkgo.focus 'runtime should support removing created container'
    

    The result below is from containerd v1.5.11 (using the runc-v2 shim). It is an upstream issue, but it blocks the v0.1.0 release.

    ➜  testing sudo lsof -p $(pidof containerd)
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
          Output information may be incomplete.
    COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
    container 155110 root  cwd       DIR              259,2     4096        2 /
    container 155110 root  rtd       DIR              259,2     4096        2 /
    container 155110 root  txt       REG              259,2 47675128  8398013 /usr/bin/containerd
    container 155110 root  mem-W     REG              259,2   524288 17566340 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/metadata.db
    container 155110 root  mem-W     REG              259,2  8388608 17575408 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
    container 155110 root  mem       REG              259,2  1983576  8391102 /usr/lib/x86_64-linux-gnu/libc-2.33.so
    container 155110 root  mem       REG              259,2   150720  8391620 /usr/lib/x86_64-linux-gnu/libpthread-2.33.so
    container 155110 root  mem       REG              259,2    22912  8391104 /usr/lib/x86_64-linux-gnu/libdl-2.33.so
    container 155110 root  mem       REG              259,2   216192  8391094 /usr/lib/x86_64-linux-gnu/ld-2.33.so
    container 155110 root    0r      CHR                1,3      0t0        5 /dev/null
    container 155110 root    1u     unix 0xffff9202e6745940      0t0  1957280 type=STREAM
    container 155110 root    2u     unix 0xffff9202e6745940      0t0  1957280 type=STREAM
    container 155110 root    3u  a_inode               0,14        0    12472 [eventpoll]
    container 155110 root    4r     FIFO               0,13      0t0  1955442 pipe
    container 155110 root    5w     FIFO               0,13      0t0  1955442 pipe
    container 155110 root    6uW     REG              259,2  8388608 17575408 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
    container 155110 root    7u  a_inode               0,14        0    12472 [eventpoll]
    container 155110 root    8r  a_inode               0,14        0    12472 inotify
    container 155110 root    9u  a_inode               0,14        0    12472 [eventpoll]
    container 155110 root   10r     FIFO               0,13      0t0  1949683 pipe
    container 155110 root   11w     FIFO               0,13      0t0  1949683 pipe
    container 155110 root   12u     unix 0xffff920220e92a80      0t0  1949684 /run/containerd/debug.sock type=STREAM
    container 155110 root   13u     unix 0xffff920220e96a40      0t0  1949685 /run/containerd/containerd.sock.ttrpc type=STREAM
    container 155110 root   14u     unix 0xffff920220e91980      0t0  1949686 /run/containerd/containerd.sock type=STREAM
    container 155110 root   15uW     REG              259,2   524288 17566340 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/metadata.db
    container 155110 root   16u     IPv4            1952557      0t0      TCP localhost:41601 (LISTEN)
    container 155110 root   23u     FIFO               0,25      0t0    10010 /run/containerd/io.containerd.grpc.v1.cri/containers/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885/io/2766993513/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885-stdout (deleted)
    container 155110 root   24u     FIFO               0,25      0t0    10011 /run/containerd/io.containerd.grpc.v1.cri/containers/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885/io/2766993513/36fc8dc0b6e479999877bf7fdafc83b26766c667df0bfe17f292ffe7ee04a885-stderr (deleted)
    
  • pkg/exitsnoop: hold raw_tp link to prevent from GC

    cilium/ebpf sets a finalizer on sys.FD (via SetFinalizer) that closes the fd when the object is garbage collected. Since the exec process's exit code depends on the in-memory exitsnoop, we should keep a reference to the raw_tp link. Otherwise, the exitsnoop will be gone and the process's exit code will be wrong.

    Signed-off-by: Wei Fu [email protected]

  • LICENSE/README.md: Add LICENSE

    According to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/bpf/bpf_licensing.rst#n87

    Packaging BPF programs with user space applications
    ====================================================
    
    Generally, proprietary-licensed applications and GPL licensed BPF programs
    written for the Linux kernel in the same package can co-exist because they are
    separate executable processes. This applies to both cBPF and eBPF programs.
    

    Signed-off-by: Wei Fu [email protected]

  • fix: exitCode needs to be translated before use

    Current:

    ➜  embedshim git:(unstable) sudo ctr run --rm --runtime io.containerd.runtime.v1.embed docker.io/library/alpine:latest testing sh -c "exit 10"
    ➜  embedshim git:(unstable) echo $?
    0
    

    After:

    ➜  embedshim git:(fix-issue) sudo ctr run --rm --runtime io.containerd.runtime.v1.embed docker.io/library/alpine:latest testing sh -c "exit 10"
    ➜  embedshim git:(fix-issue) echo $?
    10
    

    Signed-off-by: Wei Fu [email protected]
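
One plausible shape for such a translation is sketched below. It assumes the raw value captured at process exit follows the Linux wait-status encoding (termination signal in the low 7 bits, exit status in bits 8 to 15) and that signals are reported shell-style as 128 plus the signal number; whether embedshim's fix does exactly this is not confirmed by the text above, and translateExitCode is a hypothetical name.

```go
package main

import "fmt"

// translateExitCode converts a raw Linux wait status into the exit code
// a shell would report.
func translateExitCode(status uint32) uint32 {
	// A non-zero low 7 bits means the process was killed by that signal.
	if sig := status & 0x7f; sig != 0 {
		return 128 + sig
	}
	// Otherwise the exit status lives in bits 8..15.
	return (status >> 8) & 0xff
}

func main() {
	fmt.Println(translateExitCode(10 << 8)) // process called exit(10): prints 10
	fmt.Println(translateExitCode(9))       // process killed by SIGKILL: prints 137
}
```

Under that encoding, `exit 10` arrives as the raw value 2560, so using the raw value directly, or truncating it to its low byte, would report the wrong code.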

  • Feature: support exec API

    A runC-like command doesn't support the two-step create-start pattern for exec as it does for init. A wrapper is needed to support exec with pidfd and exitsnoop.

    Maybe draft a proposal for the two-step pattern in the runc community.

  • rewrite embedshim's task manager

    [PATCH 10] embedshim: store id and pin bpf in root dir
    
    The host can be restarted. If we stored the id allocator db in tmpfs,
    it wouldn't cause issues, but the containerd log would show id reuse. I
    think the uint64 range is big enough, and storing it in the root dir
    makes it easier to debug the trace ID when containerd restarts.
    
    The BPF pinned dir is simply aligned with the id db.
    
    [PATCH 9] embedshim: rename pid_monitor to exitsnoop
    
    [PATCH 8] embedshim: rename struct embedshim to shim
    
    [PATCH 7] embedshim: add helper function on bundle
    
    * Rootfs() returns the bundleDir/rootfs path
    * IsValid() returns nil if the bundleDir/work dir is still there, which
      is used to check the bundle is valid during restart containerd
    
    [PATCH 6] embedshim: fix linter issue
    
    [PATCH 5] embedshim: fix panic on close nil IO
    
    [PATCH 4] embedshim: rewrite pid monitor
    
    Rewrote the id allocator because we don't have to allocate an id with
    namespace and task ID and release it. The uint64 number range is totally
    enough for us. Just keep it simpler with the nextID() interface. The
    trace event ID can also be used for the exec process in the future.
    
    Added the traceEventId field to initProcess because the trace event ID
    is the identity of the process.
    
    And rewrote the monitor interface:
    
    * subscribe -> traceInitProcess
    * resubscribe -> repollingInitProces
    
    I think it is easier to understand.
    
    [PATCH 3] embedshim: rename utils to runtime_utils
    
    And remove unused code.
    
    [PATCH 2] embedshim: rewrite init process
    
    Basically, we should reuse the upstream pkg/process package. But the IO
    design doesn't work with that. The initProcess should be redesigned to
    work with embedshim.
    
    In this commit, we rename Init to initProcess because we don't need to
    export it. And we create newInitProcess from the bundle, since the bundle
    is the key store dir for us, especially when containerd restarts.
    
    [PATCH 1] embedshim: move bundle.go to pkg/bundle
    
    [PATCH 0] embedshim: rewrite bundle handler
    
    The bundle is the key store for reload. Besides
    rootfs/init.pid/config.json, embedshim needs to store important
    information such as stdio/options/eventID in the bundle.
    
    To manage files in the bundle easily, this commit introduces an option
    design for the newBundle interface and adds helpers to read the files in
    the bundle.
    

    Signed-off-by: Wei Fu [email protected]
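
The nextID() idea from PATCH 4 above, a monotonically increasing uint64 that is never released or reused, can be illustrated with a minimal in-memory sketch. The type and constructor names are hypothetical, and the persistence described in PATCH 10 (an id db stored in the root dir) is only hinted at by seeding the allocator with the last stored value.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// idAllocator hands out monotonically increasing trace event IDs.
// IDs are never released: the uint64 range is assumed to be large
// enough that reuse is not a practical concern.
type idAllocator struct {
	last uint64 // last ID handed out; accessed atomically
}

// newIDAllocator resumes allocation after the given last-used ID, for
// example a value loaded from a persistent store after a restart.
func newIDAllocator(last uint64) *idAllocator {
	return &idAllocator{last: last}
}

// nextID returns a fresh ID and is safe for concurrent use.
func (a *idAllocator) nextID() uint64 {
	return atomic.AddUint64(&a.last, 1)
}

func main() {
	alloc := newIDAllocator(41)
	fmt.Println(alloc.nextID()) // prints 42
	fmt.Println(alloc.nextID()) // prints 43
}
```

A persistent variant would write the new value to the store before returning it, so that a restarted containerd never hands out an ID that is already in use as a trace key.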

  • .github/.golangci.yml/.go: fix linter issue

    • Copy .golangci.yml from containerd/containerd repo

    • .github:

      • Updated Ubuntu version from 18.04 to 20.04
      • Added step to install llvm/clang dependencies
    • Fixed the linter issues created by golangci-linter

    Signed-off-by: Wei Fu [email protected]

  • .github: update ci.yaml

    * remove the working-directory because it is invalid
    * add pull_request trigger event on master
    * update push trigger event on master
    
  • .github/bpf/pkg: fix Linter issue

    bpf: change monitor.bpf.c to pid_monitor.bpf.c and remove example

    pkg/ebpf: update go:generate

    .github: support pull requests

    Signed-off-by: Wei Fu [email protected]

  • Feature: support basic task events

    • [ ] TaskCreateEventTopic for task create "/tasks/create"
    • [ ] TaskStartEventTopic for task start "/tasks/start"
    • [ ] TaskOOMEventTopic for task oom "/tasks/oom"
    • [ ] TaskExitEventTopic for task exit "/tasks/exit"
    • [ ] TaskDeleteEventTopic for task delete "/tasks/delete"
    • [ ] TaskExecAddedEventTopic for task exec create "/tasks/exec-added"
    • [ ] TaskExecStartedEventTopic for task exec start "/tasks/exec-started"