Nomad version
Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)
Operating system and Environment details
3.10.0-327.36.3.el7.x86_64
Issue
A batch job executed and completed successfully; then, several hours later, when its allocation was garbage collected, the job was re-run.
Reproduction steps
Not sure. Seems to be happening frequently on our cluster though.
Nomad logs
2018/05/14 23:30:50.541581 [DEBUG] worker: dequeued evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
2018/05/14 23:30:50.541765 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
2018/05/14 23:30:50.546101 [DEBUG] worker: submitted plan at index 355103 for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
2018/05/14 23:30:50.546140 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
2018/05/14 23:30:50.547618 [DEBUG] worker: updated evaluation <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
2018/05/14 23:30:50.547683 [DEBUG] worker: ack for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
2018/05/14 23:30:52.074437 [DEBUG] client: starting task runners for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
2018/05/14 23:30:52.074769 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '5d0016ac-dd71-6626-f929-6398e80ef28e')
2018-05-14T23:30:52.085-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/5d0016ac-dd71-6626-f929-6398e80ef28e/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
2018/05/14 23:34:32.288406 [INFO] client: task "REDACTED-task" for alloc "5d0016ac-dd71-6626-f929-6398e80ef28e" completed successfully
2018/05/14 23:34:32.288438 [INFO] client: Not restarting task: REDACTED-task for alloc: 5d0016ac-dd71-6626-f929-6398e80ef28e
2018/05/14 23:34:32.289213 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
2018/05/15 01:39:13.888635 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
2018/05/15 01:39:15.389175 [WARN] client: failed to broadcast update to allocation "5d0016ac-dd71-6626-f929-6398e80ef28e"
2018/05/15 01:39:15.389401 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
2018/05/15 01:39:15.390656 [DEBUG] client: terminating runner for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
2018/05/15 01:39:15.390714 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
2018/05/15 04:25:10.541590 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
2018/05/15 04:25:10.541626 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
2018/05/15 05:46:37.119467 [DEBUG] worker: dequeued evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
2018/05/15 05:46:37.139904 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
2018/05/15 05:46:37.169051 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
2018/05/15 05:46:37.169149 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to forced collection
2018/05/15 05:46:37.169194 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
2018/05/15 05:46:37.177470 [DEBUG] worker: submitted plan at index 373181 for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
2018/05/15 05:46:37.177516 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
2018/05/15 05:46:37.179391 [DEBUG] worker: updated evaluation <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
2018/05/15 05:46:37.179783 [DEBUG] worker: ack for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
2018/05/15 05:46:40.218701 [DEBUG] client: starting task runners for alloc '928b0562-b7ed-a3c7-d989-89519edadee9'
2018/05/15 05:46:40.218982 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '928b0562-b7ed-a3c7-d989-89519edadee9')
2018-05-15T05:46:40.230-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/928b0562-b7ed-a3c7-d989-89519edadee9/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
2018/05/15 11:50:17.836313 [INFO] client: task "REDACTED-task" for alloc "928b0562-b7ed-a3c7-d989-89519edadee9" completed successfully
2018/05/15 11:50:17.836336 [INFO] client: Not restarting task: REDACTED-task for alloc: 928b0562-b7ed-a3c7-d989-89519edadee9
2018/05/15 11:50:17.836698 [INFO] client.gc: marking allocation 928b0562-b7ed-a3c7-d989-89519edadee9 for GC
Job file (if appropriate)
{
"Job": {
"AllAtOnce": false,
"Constraints": [
{
"LTarget": "${node.unique.id}",
"Operand": "=",
"RTarget": "52c7e5be-a5a0-3a34-1051-5209a91a0197"
}
],
"CreateIndex": 393646,
"Datacenters": [
"dc1"
],
"ID": "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2",
"JobModifyIndex": 393646,
"Meta": null,
"Migrate": null,
"ModifyIndex": 393673,
"Name": "REDACTED",
"Namespace": "default",
"ParameterizedJob": null,
"ParentID": "REDACTED/dispatch-1526033570-3cdd72d9",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "global",
"Reschedule": null,
"Stable": false,
"Status": "dead",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1526403442162340993,
"TaskGroups": [
{
"Constraints": [
{
"LTarget": "${attr.os.signals}",
"Operand": "set_contains",
"RTarget": "SIGTERM"
}
],
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Migrate": null,
"Name": "REDACTED",
"ReschedulePolicy": {
"Attempts": 1,
"Delay": 5000000000,
"DelayFunction": "constant",
"Interval": 86400000000000,
"MaxDelay": 0,
"Unlimited": false
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 15000000000,
"Interval": 86400000000000,
"Mode": "fail"
},
"Tasks": [
{
"Artifacts": null,
"Config": {
"command": "REDACTED",
"args": [REDACTED]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "raw_exec",
"Env": {REDACTED},
"KillSignal": "SIGTERM",
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "REDACTED",
"Resources": {
"CPU": 100,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 256,
"Networks": null
},
"Services": null,
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null
}
],
"Update": null
}
],
"Type": "batch",
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 0,
"MinHealthyTime": 0,
"Stagger": 0
},
"VaultToken": "",
"Version": 0
}
}
What I can tell you for sure is that the allocation ran to completion and exited successfully.
We're going to try turning off the reschedule and restart policies to see if that has any effect, since we already handle re-running these jobs ourselves on any kind of failure.