Bastionzero's Agent and Daemon!

Last update: Oct 12, 2022

Comments: 17

Bzero

Bastionzero

Bastionzero is a simple-to-use zero-trust access SaaS for dynamic cloud environments. Bastionzero is the most secure way to lock down remote access to servers, containers, clusters, and VMs in any cloud, public or private. For more information go to Bastionzero.

The bzero-agent and bzero-daemon are executables that run on your target and on your local machine, respectively, to communicate with the Bastionzero SaaS.

Install

We bundle our daemon with our CLI tool, zli:

brew tap bastionzero/tap
brew install bastionzero/tap/zli

To install the Agent, you can quickly get started by looking at our helm charts.

Developer processes

We use Go to run and test our code. You can build the agent with the following command:

cd bctl/agent && go build agent.go

And this command for our daemon:

cd bctl/daemon && go build daemon.go

You can then run the agent and daemon by running the executable.

Where {version} is the version that is defined in the package.json file. This means older versions are still accessible, but the latest folder will always be overwritten by the codebuild job.

Owner

Bastion Zero

https://github.com/bastionzero/bzero

Comments

Feat/shell
Description of the change

Adds shell plugin support for bzero targets. See https://github.com/bastionzero/zli/pull/313 PR description for testing details.

Related Feature Branch PRs:

https://github.com/bastionzero/webshell-backend/pull/1012

https://github.com/bastionzero/zli/pull/313

https://github.com/bastionzero/webshell-common-ts/pull/60

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-1417

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Adds interactive shell plugin and tests
Description of the change

This creates an interactive shell plugin for the systemd agent. This PR includes unit tests verifying that the open, close, input, and resize actions work.

To run the unit tests for this feature:

cd bzero/bctl
go test bastionzero.com/bctl/v1/bctl/agent/plugin/shel

I've confirmed it passes tests on OSX and AWS CentOS (see image)

One odd thing that I encountered is that the shell launch fails on both Linux and OSX if NoSetGroups = false, which is how the AWS-SSM-Agent is configured.

cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uid, Gid: gid, Groups: groups, NoSetGroups: true}

I was unable to figure out why this would fail in my code but work fine in the bzero SSM-agent. Open to any ideas about what is going on here.
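As a point of reference for the question above, here is a minimal sketch of how a command can be prepared with explicit credentials in Go. The helper name shellCommand and the /bin/sh command line are hypothetical stand-ins, not the plugin's actual code, and the sketch only builds on Unix-like systems. One thing worth checking, offered as a possibility rather than a diagnosis: when NoSetGroups is false, the Go runtime calls setgroups(2) in the child before switching uid/gid, and on Linux that call needs CAP_SETGID, so it can fail when the test process is not root even though a root-owned agent would succeed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// shellCommand prepares (but does not start) a command that will run as the
// given uid/gid. The command line is a placeholder for illustration; the real
// plugin launches an interactive shell attached to a pty.
func shellCommand(uid, gid uint32, groups []uint32, noSetGroups bool) *exec.Cmd {
	cmd := exec.Command("/bin/sh", "-c", "id")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{
			Uid:         uid,
			Gid:         gid,
			Groups:      groups,
			NoSetGroups: noSetGroups, // true skips the setgroups(2) call in the child
		},
	}
	return cmd
}

func main() {
	// Run as the current user so the example works without privileges.
	uid, gid := uint32(os.Getuid()), uint32(os.Getgid())
	for _, noSetGroups := range []bool{true, false} {
		out, err := shellCommand(uid, gid, nil, noSetGroups).CombinedOutput()
		fmt.Printf("NoSetGroups=%v err=%v\n%s", noSetGroups, err, out)
	}
}
```

Running this as an unprivileged user and then as root is a quick way to see whether the failure depends on the privileges of the launching process.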

Additional work left undone

Create and attach the shell plugin to a datachannel in the agent (out of scope)

Provide unit tests from datachannel to agent (out of scope)

Does not create the local agent user account 'bzuser' a.k.a. the DefaultRunAsUser (out of scope)

Create a mock pty. This was a more complex task than anticipated, so I wrote the unit tests to use the shell account of whoever runs the test

Didn't break action handlers into their own files with receive message. I don't feel particularly strongly one way or the other, but the code was fairly inter-related and it seemed easiest to keep it in the same file for now.

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-1419

Have you considered the security impacts?

Does this PR have any security impact?

[X] Yes

[ ] No

If yes, please explain:

The shell plugin attempts to place the user in a shell under the Linux user account enforced by the bastion. The danger exists that the user could escape from this user account and escalate to the privileges held by the agent. To avoid introducing additional security issues, this code attempts to inherit as much as possible of the shell launching code from the bastion-zero ssm-agent.

In future PRs, when connecting the shell plugin to the agent, we should ensure that the user is not able to override the Linux user account set by the bastion.
Universal Connect
Description of the change

The main change introduced in this PR is to simplify the daemon code now that the zli creates the connection resource and gets connection service auth details (all in a single API call). The zli now passes in the additional CLI arguments to the daemon for all types of connections:

connectionId

connectionServiceUrl

connectionServiceAuthToken

This eliminates the need for the daemon to call bastion to create the connection resource or to get the connection auth details. Instead these parameters are set for all plugins here and used in websocket.go to directly connect to the connection node.
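A minimal sketch of how the daemon side could accept these three arguments with Go's standard flag package. The flag names come from the list above; the struct, the function name parseConnectionArgs, and the rule that all three are required are assumptions made for illustration, not the daemon's actual code.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// connectionArgs holds the three values the zli passes to the daemon.
// The struct itself is hypothetical.
type connectionArgs struct {
	ConnectionID string
	ServiceURL   string
	AuthToken    string
}

// parseConnectionArgs parses the connection flags from a daemon command line.
func parseConnectionArgs(args []string) (connectionArgs, error) {
	var c connectionArgs
	fs := flag.NewFlagSet("daemon", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the logs in this sketch
	fs.StringVar(&c.ConnectionID, "connectionId", "", "id of the connection created by the zli")
	fs.StringVar(&c.ServiceURL, "connectionServiceUrl", "", "url of the connection node to dial")
	fs.StringVar(&c.AuthToken, "connectionServiceAuthToken", "", "auth token for the connection service")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// Assumption: every connection type needs all three values.
	if c.ConnectionID == "" || c.ServiceURL == "" || c.AuthToken == "" {
		return c, errors.New("connectionId, connectionServiceUrl and connectionServiceAuthToken are required")
	}
	return c, nil
}

func main() {
	c, err := parseConnectionArgs([]string{
		"-connectionId=example-connection-id",
		"-connectionServiceUrl=https://connection-service.example.com/",
		"-connectionServiceAuthToken=example-token",
	})
	fmt.Println(c.ConnectionID, c.ServiceURL, err)
}
```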

Shell Connection Optimizations

baa1b2a6bc1d139752e99cbe2130f0f86c185fd2: Removed a 1s sleep statement in defaultshell on agent before we start reading from stdout. I think this sleep was left in accidentally, and in testing I haven't found any negative consequences of removing it. cc @EthanHeilman since I think this was originally added in your code.

7511556e52ad5bcbf7167c9c21b221bf8616f280: Don't try to refresh the id token when the websocket is being created. This should already be called in buildBZcert when constructing the syn message, so doing this in websocket.go was unnecessary.

be3325660bf6c9dd955d588b0058522f12a05069: Added a boolean arg to datachannel so that we can optionally not wait to process incoming input channel messages before closing the datachannel. Previously we were always using waitOrTimeout, which waited for 2s before closing the data channel in order to prevent error messages from appearing in the logs. This only seemed necessary for long-lived plugins (web, db, kube) and not ssh/shell, which are going to exit immediately after the datachannel is closed.

Related PRs:

zli: https://github.com/bastionzero/zli/pull/458

backend: https://github.com/bastionzero/webshell-backend/pull/1140

common-ts: https://github.com/bastionzero/webshell-common-ts/pull/76

Testing

Make sure you are on the related feature branches in backend (also requires db migration) and zli. Connect to bzero (shell), db, web, kube targets or ssh to bzero targets, and everything should work the same as before, only faster. Additionally, exiting on shell and ssh should also be much faster now.

backend branch: feat/universal-connect

zli branch: feat/universal-connect

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Universal connect API updates + Shell/SSH connection optimizations

Related JIRA tickets

Relates to JIRA: CWC-1889

Have you considered the security impacts?

Does this PR have any security impact?

[x] Yes

[ ] No

If yes, please explain:

The one potential security concern I see with this change is that the connection service auth token is now passed as a CLI argument to the daemon. This means that it is potentially exposed if someone can list processes running on the user's machine. Here is example output from ps aux | grep daemon:

sebby 355090 0.0 0.2 1472916 36456 ? Sl 12:45 0:01 /home/sebby/.config/bastionzero-zli-nodejs/daemon -sessionId=833d7d83-0b3c-422a-9706-4cf107c4b876 -sessionToken=ZA%2BHyRY0rvyAAogRQRljRiEi0MMTNX1YDu8Op3jeq9o%3D -serviceURL=sebby.bastionzero.com -authHeader=Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjM4ZjM4ODM0NjhmYzY1OWFiYjQ0NzVmMzYzMTNkMjI1ODVjMmQ3Y2EiLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJodHRwczovL2FjY291bnRzLmdvb2dsZS5jb20iLCJhenAiOiIzMjQ5MzEwMzQyLWhrbXFlN3JuY2dxcnVldGIwaWJkZG9vYjVlc2wxamplLmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tIiwiYXVkIjoiMzI0OTMxMDM0Mi1oa21xZTdybmNncXJ1ZXRiMGliZGRvb2I1ZXNsMWpqZS5hcHBzLmdvb2dsZXVzZXJjb250ZW50LmNvbSIsInN1YiI6IjExMzY3OTc2NTUwMDUwODY1NTU3MiIsImhkIjoiY29tbW9ud2VhbHRoY3J5cHRvLmNvbSIsImVtYWlsIjoic2ViYnlAY29tbW9ud2VhbHRoY3J5cHRvLmNvbSIsImVtYWlsX3ZlcmlmaWVkIjp0cnVlLCJhdF9oYXNoIjoiRmxYaURNVGIxY1JxNC02S0dobG5DZyIsIm5hbWUiOiJTZWJhc3RpZW4gTGlwbWFuIiwicGljdHVyZSI6Imh0dHBzOi8vbGgzLmdvb2dsZXVzZXJjb250ZW50LmNvbS9hL0FBVFhBSnpzb3MzUFhVNWtXaEFjTXdtT1NVZ2NXMWdmVlBvNnNJV19tdWdkPXM5Ni1jIiwiZ2l2ZW5fbmFtZSI6IlNlYmFzdGllbiIsImZhbWlseV9uYW1lIjoiTGlwbWFuIiwibG9jYWxlIjoiZW4iLCJpYXQiOjE2NTQxODgyNjAsImV4cCI6MTY1NDE5MTg2MH0.e3F4im7zsXjkfVK7g6XaysD12FWAJkvrtVfNmnRbpfTWWeCJ4Fl_9JTPFg5f1APyQ97GOsY_Fi62pYKw0zUNchhgcKWpKK20Se01YPfPntjrleKklf9cEOI876hsEWQtoEZnafyYbk3lWG0vZ4ZTqWssfzHCaDZ2y4wQSdNlu9YQaa73AGPhrIeFJooWx-yLID-2HdH3C4xPk8eTk0AvGrIhPzdPUx1JQ28OkzfQDR93uyhv7XJEieGq7U5zDg1e834O2xGQCCoOwgskfe6HdxCy4SEHJkLlIU4_1DNaYzaXvpyTVs-0Lu5cmBSgGly9PywUIBUWIc74UUBXrtcI3g -configPath=/home/sebby/.config/bastionzero-zli-nodejs/dev.json -refreshTokenCommand=/home/sebby/cwc/BastionZero/zli/bin/zli-linux /snapshot/zli/dist/src/index.js refresh -logPath=/home/sebby/.config/bastionzero-logger-nodejs/bastionzero-kube-daemon-dev.log -agentPubKey=kV6XxL+mFYYWweSvXCl18kDLPbjf5Sv23V4bCPThW1E= -connectionId=8d5a36dc-e284-4106-8df9-c6ce092049ae -connectionServiceUrl=https://sebby-connection-service-us-east-1.bastionzero.com/2892f7b6-21ce-4727-98f8-1a546de3e9a3/ 
-connectionServiceAuthToken=08E3A5BB1C2D72E8792A4EFE0724D7426F206FAB454312C44A3BE8050E730747 -localPort=36127 -localHost=localhost -targetId=52963786-0d8c-4bc7-b602-a57613e0d628 -remotePort=8000 -remoteHost=http://localhost -plugin=web

Note the connectionServiceAuthToken=08E3A5BB1C2D72E8792A4EFE0724D7426F206FAB454312C44A3BE8050E730747
Rename Keysplitting to MrTAP
Well, no one said it was going to be easy. In fact, people said it would be "really annoying." But we did it! "Keysplitting" has been relegated to the dustbin of history... mostly.

Bzero-specific changes

The changes to bzero are the most significant of any component, but still manageable. Most backwards compatibility measures are invisible at the application layer:

Custom JSON marshal/unmarshal functions will coerce all legacy type labels to the new ones (i.e. "keysplitting" agent messages will automatically be read in as "mrtap" agent messages; same goes for payloads and validation errors)

All parties will still send legacy messages to accommodate older daemons and agents. Agents/daemons >= this version will be able to receive both legacy and updated messages

Once all agents/daemons are >= this version, we can switch to exclusively sending the new messages (SEND_NEW_MESSAGE_VERSION). We will still need support for receiving legacy messages, though, because all parties between this version and SEND_NEW_MESSAGE_VERSION will still be sending them

Once all agents/daemons are >= SEND_NEW_MESSAGE_VERSION, we can remove the JSON marshal/unmarshal functions that do the coercion. However, we will likely all be dead by then
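The coercion described in the first point can be sketched with a custom UnmarshalJSON on the type label. This is an illustration only: the type names, the JSON field names, and the helper decodeAgentMessage are hypothetical, and the real code also covers payloads and validation errors.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MessageType is the label carried by every agent message.
type MessageType string

const (
	legacyKeysplitting MessageType = "keysplitting" // label sent by older parties
	mrtap              MessageType = "mrtap"        // label used from this version on
)

// UnmarshalJSON reads the label and silently upgrades the legacy value.
func (t *MessageType) UnmarshalJSON(data []byte) error {
	var label string
	if err := json.Unmarshal(data, &label); err != nil {
		return err
	}
	if MessageType(label) == legacyKeysplitting {
		label = string(mrtap)
	}
	*t = MessageType(label)
	return nil
}

// AgentMessage is a simplified envelope; the field names are assumptions.
type AgentMessage struct {
	MessageType MessageType     `json:"messageType"`
	Payload     json.RawMessage `json:"messagePayload"`
}

func decodeAgentMessage(raw string) (AgentMessage, error) {
	var msg AgentMessage
	err := json.Unmarshal([]byte(raw), &msg)
	return msg, err
}

func main() {
	msg, err := decodeAgentMessage(`{"messageType":"keysplitting","messagePayload":{}}`)
	fmt.Println(msg.MessageType, err) // prints "mrtap <nil>"
}
```

Because the coercion lives in the unmarshal step, application code above it only ever sees the new label, which is why the compatibility measures stay invisible at that layer.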

The full family of PRs:

ZLI

backend

bzero

webshell-common

ssm-session (bet you thought we were done with this!)

Testing

This PR demonstrates the backwards compatibility of new agents with "pre-switch" daemons

backend branch: develop

zli branch: develop

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Change "keysplitting" reference to "MrTAP"

Related JIRA tickets

Relates to JIRA: CWC-1374

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Websocket Refactor Part I: Split out SignalR and Websocket Code
Description of the change

This document covers the existing flow.

The goal of this PR is twofold.

Isolate and Separate: This is for basic reasons like code extensibility, easier testing, and more readable code. We also wanted to separate out the logic that transports over the connection (websocket) from the logic that speaks the protocol (SignalR) from the code that manages the connection (websocket.go). This will make each highly interchangeable.

Clean and Clarify: The code was really hard to understand because it was sprawling. We had a lot of logic creep and weird things we'd put in there to equate differently named params, etc. Our param and header creation was all over the place as well.

This PR is half of the full websocket refactor. This splits out our websocket.go into three parts:

websocket.go remains as the controller of the connection, renaming to come in another PR

signalr is now its own package and isolates all signalr logic

websocket: this code isolates all of the interaction with the underlying websocket

I have also added the following packages/helper objects.

A lot of the motivation behind this was that if something needed to be guarded by locks, it should be split out into its own thing so that it is easier to reason about.

Invocator, used to keep track of SignalR invocation messages and the corresponding completion messages

Broker, used to keep track of "subscriber channels". I also created this because we need to be able to Broadcast() and DirectMessage() in multiple directions.

HttpClient, see paragraph below
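To make the Broker idea concrete, here is a minimal sketch of a lock-guarded set of subscriber channels with Broadcast() and DirectMessage(). Only those two method names come from the description above; the string message type, the Subscribe signature, and the choice to skip a subscriber whose buffer is full are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker owns the subscriber channels so that all locking lives in one place.
type Broker struct {
	mu          sync.RWMutex
	subscribers map[string]chan string
}

func NewBroker() *Broker {
	return &Broker{subscribers: make(map[string]chan string)}
}

// Subscribe registers a buffered channel under id and returns its receive side.
func (b *Broker) Subscribe(id string, buffer int) <-chan string {
	ch := make(chan string, buffer)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subscribers[id] = ch
	return ch
}

// Broadcast offers msg to every subscriber and reports how many accepted it.
func (b *Broker) Broadcast(msg string) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	delivered := 0
	for _, ch := range b.subscribers {
		select {
		case ch <- msg:
			delivered++
		default: // buffer full: skip instead of blocking while holding the lock
		}
	}
	return delivered
}

// DirectMessage offers msg to a single subscriber.
func (b *Broker) DirectMessage(id, msg string) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	ch, ok := b.subscribers[id]
	if !ok {
		return false
	}
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

func main() {
	b := NewBroker()
	first := b.Subscribe("first", 4)
	second := b.Subscribe("second", 4)
	b.Broadcast("hello")
	b.DirectMessage("second", "only for second")
	fmt.Println(<-first, <-second, <-second)
}
```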

Another motivator was to prevent logic-leaking. For example, our bzhttp is doing a lot of things and because it's not very general purpose, we had a lot of outside logic leaking in, which required functions like PostNegotiate() and PostRegister() (which wasn't even being used). I refactored the logic from the package into a much more general solution which we can use going forward. I did not entirely replace bzhttp because it touches many many things (e.g. Registration or the web plugin logic), and I wanted to keep the PR concentrated.

Another change I wanted to make is to try to consolidate logic closer to where it was being used. That's why in daemon.go you will see me moving our header and param creation around to be more centralized. This is because it took me forever to understand this and I hope this helps others.

Finally, I made it so that closing the connection will do a best-effort attempt to wait until all messages are sent and any corresponding completion messages are received.

Testing

backend branch: develop

zli branch: refactor/signalr

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: split out the underlying connection logic from the rest of our connection creation and processing code

Related JIRA tickets

Relates to JIRA: CWC-1633

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Pipelining
Description of the change

PIPELINING!!!!!!!

This PR removes the previous requirement for Keysplitting to be synchronous! Now, you can communicate with your agent and continually send messages without having to wait for the ack responses to those messages every time.

This is a complicated PR and removing the previous message RTT requirement has now made it possible to destroy a lot of our existing confusing flows and replace them with nice normal ones.

New Layer Flows in Daemon

Previously, the Keysplitting code was more integrated with different datachannel functions, but now it has more of a true "side-car" design.

Action -> Datachannel Flow

Action -> Plugin -> Datachannel -> Keysplitting -> Datachannel -> Websocket

Plugin creates outboxQueue chan ActionWrapper which it passes to the Action on creation.

Datachannel creates two goroutines:

i. listens to the Plugin's Outbox() <- chan ActionWrapper and passes that to MrZap's Inbox(a ActionWrapper) function

ii. listens to Keysplitting's Outbox() <- chan KeysplittingMessage and sends it to the websocket

When the action pushes anything to the outboxQueue, that is then pushed directly to the Keysplitting side-car, which processes it and puts it in its own outboxQueue, and the datachannel sends it.

Datachannel -> Action Flow

We still just call functions to get the message back to the action (Websocket -> Datachannel -> Plugin -> Action). Keysplitting's Validate(KeysplittingMessage) function is still called from the handleKeysplitting() function.

The major difference is that I've removed the ksInputChan on the daemon. This channel was used previously to put our keysplitting messages into a channel so that it wouldn't impact us processing incoming stream messages, but now that we don't have to wait for a return message before returning from our keysplitting message, we don't need this channel at all.

PIPELINING

Our key data structure is our pipelineMap. This is an OrderedMap where pipelineMap.Newest() is our most recently built Keysplitting message and pipelineMap.Oldest() is the opposite. This is keyed by the hash of the message value: hash(message) -> message.
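A minimal sketch of such an ordered map, assuming string messages and a base64-encoded SHA-256 digest as the key. The hash function, the method names other than Newest() and Oldest(), and the slice-backed ordering are all assumptions for illustration; the real pipelineMap may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// pipelineMap keeps messages retrievable by hash while remembering the order
// in which they were built.
type pipelineMap struct {
	order    []string          // hashes, oldest first
	messages map[string]string // hash(message) -> message
}

func newPipelineMap() *pipelineMap {
	return &pipelineMap{messages: make(map[string]string)}
}

// hashOf is a stand-in for the protocol's message hash.
func hashOf(message string) string {
	sum := sha256.Sum256([]byte(message))
	return base64.StdEncoding.EncodeToString(sum[:])
}

// Add stores a newly built message and returns its key.
func (p *pipelineMap) Add(message string) string {
	h := hashOf(message)
	if _, exists := p.messages[h]; !exists {
		p.order = append(p.order, h)
	}
	p.messages[h] = message
	return h
}

// Delete removes a message, for example once its ack has been validated.
func (p *pipelineMap) Delete(hash string) {
	if _, ok := p.messages[hash]; !ok {
		return
	}
	delete(p.messages, hash)
	for i, h := range p.order {
		if h == hash {
			p.order = append(p.order[:i], p.order[i+1:]...)
			break
		}
	}
}

// Oldest returns the earliest message still waiting for an ack.
func (p *pipelineMap) Oldest() (string, bool) {
	if len(p.order) == 0 {
		return "", false
	}
	return p.messages[p.order[0]], true
}

// Newest returns the most recently built message.
func (p *pipelineMap) Newest() (string, bool) {
	if len(p.order) == 0 {
		return "", false
	}
	return p.messages[p.order[len(p.order)-1]], true
}

func main() {
	p := newPipelineMap()
	first := p.Add("data-1")
	p.Add("data-2")
	p.Delete(first) // ack for data-1 arrived
	oldest, _ := p.Oldest()
	newest, _ := p.Newest()
	fmt.Println(oldest, newest)
}
```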

NOTE: I have completely removed the hpointer and expectedHPointer variables from the daemon side. hpointer is now satisfied by our pipeline keys and expectedHPointer is replaced by lastAck which is equal to the last Ack (either syn/ack or data/ack) message.

Basic Output Pipelining

NOTE: In order to get pipelining to work I had to remove the Timestamp field from the data/ack message because this field meant that we could never predict the object.

Plugin.outboxQueue -> Plugin.Outbox() -> Keysplitting.Inbox() -> Keysplitting.pipeline() -> Keysplitting.Outbox() -> Datachannel.send()

We're now going to explain the steps once the Inbox() function is called until the message reaches the Keysplitting outbox.

Keysplitting takes an ActionWrapper and tries to pipeline it, this eventually results in the Inbox() call.
NOTE: ActionPayload used to be []byte, which meant a lot of marshalling in actions, now we only do it once in BuildResponse() in our Keysplitting code

type ActionWrapper struct {
    Action        string
    ActionPayload interface{}
}

Keysplitting is going to check if there's a previous message that we haven't received an ack for (in which case we'll predict the ack based on the most recently sent message before building our response) OR it will build our new message off our most recent ack (lastAck).

Build Response!

Add it to our pipelineMap!

Add it to our outboxQueue!

Message Validation

This hasn't changed much but I'll cover it since there are small changes. Keysplitting.Validate() is called from handleKeysplitting() whenever we receive a new message.

Validate signature on message

Check that this is a response to a message we've sent

Set our lastAck to whatever we received

Delete the message this is an ack to from our pipelineMap

Error Recovery

We only recover IF

from handleError() in datachannel:

We're not already recovering

We haven't already tried more than the max number

It's a KeysplittingValidationError type message

from Recover() in keysplitting:

The hpointer field (hash of message the error was thrown on) is not empty

The error is pointing to a message we sent

When we call Keysplitting.Recover(), we send a syn. Once we receive the syn/ack, we will grab the nonce. If the nonce corresponds to a message we've received, then we'll send all messages after that message OTHERWISE we'll resend all messages. This works because after our initial syn, syn/ack exchange, the target will respond to any new syns with a syn/ack where the nonce is actually the hash of the last received and correctly validated message. This means that when we recover we're actually syncing the state of the hash chains corresponding to the current state of the Keysplitting hash chain according to the agent and this recovery mechanism allows the daemon to sync its Keysplitting state to that. This was Sebby's idea. sebby mvp.
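The resend decision described above can be sketched as a small pure function over the messages still held in the pipeline. The sentMessage type and the function name messagesToResend are hypothetical; only the rule itself (resend everything after the message the nonce points at, otherwise resend everything) comes from the text.

```go
package main

import "fmt"

// sentMessage is one entry of the pipeline, oldest first.
type sentMessage struct {
	Hash string
	Body string
}

// messagesToResend returns the messages the daemon should send again after it
// receives the syn/ack during recovery. nonce is the hash of the last message
// the agent received and validated.
func messagesToResend(pipeline []sentMessage, nonce string) []sentMessage {
	for i, m := range pipeline {
		if m.Hash == nonce {
			// The agent already has everything up to and including m.
			return pipeline[i+1:]
		}
	}
	// The nonce matches nothing in the pipeline: resend everything.
	return pipeline
}

func main() {
	pipeline := []sentMessage{{"h1", "data-1"}, {"h2", "data-2"}, {"h3", "data-3"}}
	fmt.Println(len(messagesToResend(pipeline, "h1")))      // 2
	fmt.Println(len(messagesToResend(pipeline, "unknown"))) // 3
}
```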

New Plugin Creation and Destruction

Plugin Creation

There is no more Feed() flow, no more Food. Functions for creating new actions and plugins now take explicit arguments!

Server starts up

Server receives a request which results in some communication with the agent

Server is responsible for (in this order):

i. Creating a plugin (explicit args)

ii. Passing that to a new datachannel

iii. Starting the desired action in the plugin

Plugin Destruction

Because the datachannel receives a plugin when it starts up, it can already start listening to that plugin dying (even before the action is started up). All plugins now provide a Done() <- chan struct{} function which the datachannel can listen to and then die when signaled.

After the plugin dies, the datachannel EITHER:

Agent: sends any messages that are still in its send queue and really dies once that queue is silent for 1 second.

Daemon: receives messages until the time between receiving messages reaches 2 seconds and we wait a total, maximum time of 10 seconds.
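The daemon-side behaviour can be sketched as a receive loop with two timers: a silence timer that restarts after every message and an overall deadline. The function name drainUntilSilent, the string message type, and the constants are assumptions for illustration; the 2 second and 10 second values are the ones quoted above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	daemonSilence = 2 * time.Second  // stop once no message arrives for this long
	daemonMaxWait = 10 * time.Second // never wait longer than this in total
)

// drainUntilSilent keeps receiving after the plugin has died. It returns when
// the channel closes, when the gap between messages reaches silence, or when
// maxWait has elapsed overall.
func drainUntilSilent(in <-chan string, silence, maxWait time.Duration) []string {
	var received []string
	deadline := time.After(maxWait)
	for {
		select {
		case msg, ok := <-in:
			if !ok {
				return received
			}
			received = append(received, msg)
		case <-time.After(silence): // a fresh timer on every loop iteration
			return received
		case <-deadline:
			return received
		}
	}
}

func main() {
	done := make(chan struct{}) // stands in for the plugin's Done() channel
	in := make(chan string, 2)
	in <- "last output"
	in <- "exit status"
	close(done)

	<-done // the plugin has died; now drain what is still in flight
	got := drainUntilSilent(in, 50*time.Millisecond, time.Second)
	fmt.Println(len(got), "messages drained")
}
```

The agent side would be the mirror image: keep sending from the queue and stop once it has been silent for 1 second.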

Testing

This PR should be indistinguishable in functionality from the regular agent. Here are some things I like to do when testing the functionality of plugins:

Web

Hitting our grafana dev instance

espn.com

Hit some illegitimate or misconfigured virtual target

DB

Hit the psql db we have locally on our dev bzero-agent machines

iperf

Hit some illegitimate or misconfigured virtual target

Shell

Connecting with a legitimate user

Connecting with an illegitimate user

Kube

https://docs.google.com/document/d/1DkT4Bs10ZakzcBlRLmbHK_E6MXDoIl1g9UD_uE7-FGE

backend branch:

zli branch: pipelining

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Removes the previous requirement for MrZAP to be synchronous! Now, you can communicate with your agent and continually send messages without having to wait for the ack responses to those messages every time.

Related JIRA tickets

Relates to JIRA: CWC-1494, CWC-1644, CWC-1502, CWC-1831, CWC-1832

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[ ] No

If yes, please explain:
Moving control channel to connection node
Description of the change

Moves the control channel websocket from bastion to a connection node as well as implementing a new control channel authentication flow that goes through the connection service.

Below is an overview of changes included but see the design doc for more comprehensive details.

Backend changes implemented in https://github.com/bastionzero/webshell-backend/pull/1199.

Backend Signed Messages

The agent authentication to various backend services now relies on sending EdDSA signed messages. These are separate from the Mr.Zap messages that the agent signs and sends to the daemon and are not of type AgentMessage. We also sign these messages directly with EdDSA on the ed25519 curve without first hashing (unlike Mr.Zap); this is fine because hashing is included in the signature scheme. Instead these messages all embed BackendAgentMessage, which includes a type and a timestamp field that are there to prevent replays of these signed messages.

These messages are sent to the backend via two separate request parameters, message and signature, where message is defined as the base64 encoded json string serialization of the message struct. A future enhancement would be to send the message/signature together in the standardized JWS encoded format (which supports ed25519 sigs); however, due to limitations in library support on the backend side, this was not implemented at this time. See https://commonwealthcrypto.atlassian.net/browse/CWC-2008.
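A minimal sketch of producing and checking the message and signature parameters with Go's crypto/ed25519. The names BackendAgentMessage and GetControlChannel come from the description; the JSON field names, the targetId field, the helper functions, and the decision to sign the raw JSON bytes (rather than the base64 string) are assumptions for illustration.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// BackendAgentMessage carries the fields that guard against replays.
type BackendAgentMessage struct {
	Type      string    `json:"type"`
	Timestamp time.Time `json:"timestamp"`
}

// GetControlChannelMessage embeds the common fields; TargetID is hypothetical.
type GetControlChannelMessage struct {
	BackendAgentMessage
	TargetID string `json:"targetId"`
}

func newControlChannelRequest(targetID string) GetControlChannelMessage {
	return GetControlChannelMessage{
		BackendAgentMessage: BackendAgentMessage{Type: "GetControlChannel", Timestamp: time.Now().UTC()},
		TargetID:            targetID,
	}
}

// newAgentKey generates a throwaway key pair for the example.
func newAgentKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signMessage returns the two request parameters: the base64 encoded JSON
// message and the base64 encoded signature over the JSON bytes.
func signMessage(priv ed25519.PrivateKey, msg interface{}) (message, signature string, err error) {
	raw, err := json.Marshal(msg)
	if err != nil {
		return "", "", err
	}
	sig := ed25519.Sign(priv, raw) // no pre-hashing; Ed25519 hashes internally
	return base64.StdEncoding.EncodeToString(raw), base64.StdEncoding.EncodeToString(sig), nil
}

// verifyMessage is what a backend could do with the two parameters.
func verifyMessage(pub ed25519.PublicKey, message, signature string) bool {
	raw, err := base64.StdEncoding.DecodeString(message)
	if err != nil {
		return false
	}
	sig, err := base64.StdEncoding.DecodeString(signature)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, raw, sig)
}

func main() {
	pub, priv := newAgentKey()
	message, signature, err := signMessage(priv, newControlChannelRequest("example-target-id"))
	fmt.Println(err, verifyMessage(pub, message, signature))
}
```

A real verifier would also reject messages whose timestamp is too old, since the signature alone does not stop a captured message from being replayed.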

Agent Identity Token

We introduce a new JWT AgentIdentityToken that is issued by the bastion and used for agent authentication to various backend components. To receive an AgentIdentityToken the agent will make a request to bastion's new /api/v2/agent/identity/<targetId> endpoint. The agent will send a signed GetAgentIdentityTokenRequest message. After verifying the signature, bastion provides this token, which is also signed using bastion's JWK and expires in 7 days.

The agent now stores this AgentIdentityToken in the vault, and every time it fetches from the vault it checks whether the token is still valid. If it's no longer valid (expired, or bastion's key may have rotated), it will try to call out to bastion to refresh the token. Bastion is OIDC compliant and provides a /.well-known/openid-configuration which includes a JWKS URI that contains its current signing keys, so we can just use standard OIDC libraries to verify this token. This token is included in requests that require it as an HTTP bearer authorization header.

Control Channel Auth Flow

The control channel auth flow is now as follows:

1. Agent gets a valid AgentIdentityToken from bastion

2. Agent gets the connection service url (connection orchestrator) from bastion

3. Agent sends a GET request to /control-channel of the orchestrator in order to get assigned a connection node

   This includes a signed GetControlChannel message as well as the AgentIdentityToken header

   Orchestrator will return a unique control channel ID and connection node url to use to open the control channel websocket to a specific connection node.

4. Agent opens up the control channel websocket to the connection node url returned in 3

   This includes a signed OpenControlChannel message as well as the AgentIdentityToken header

The above protocol steps are run in the connect routine of the controlchannelconnection. If any individual step fails (we get an error from the backend) we restart the protocol using exponential backoff. If at any point the control channel websocket disconnects the agent will again recover by entering the same connect routine which will result in opening a control channel to a new connection node.
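The restart-with-backoff behaviour of the connect routine can be sketched as follows. The function names, the step representation as a slice of funcs, and the maxAttempts cap are assumptions for illustration; the source only says that any failed step restarts the protocol using exponential backoff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBackend stands in for any error returned by a backend call.
var errBackend = errors.New("backend error")

// runSteps executes the protocol steps in order and stops at the first failure.
func runSteps(steps []func() error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

// connect runs the whole protocol, restarting from the first step whenever any
// step fails. The delay doubles after each failure up to maxDelay. It returns
// the number of attempts made.
func connect(steps []func() error, initial, maxDelay time.Duration, maxAttempts int) (int, error) {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = runSteps(steps); err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return maxAttempts, err
}

func main() {
	failures := 2
	getToken := func() error { return nil }
	getNode := func() error { // fails twice, then succeeds
		if failures > 0 {
			failures--
			return errBackend
		}
		return nil
	}
	openWebsocket := func() error { return nil }

	attempts, err := connect([]func() error{getToken, getNode, openWebsocket},
		10*time.Millisecond, time.Second, 5)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

A websocket disconnect would simply call connect again, which is how the agent ends up on a new connection node.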

Open/Close DataChannel

The open/close data channel messages have been moved from the agent control channel to the agent data channel connection (as specific control messages there). This means that these messages can be sent by the same connection node that the agent/daemon data channel websocket connections are made to and don't require an extra hop to the connection node that contains the agent control channel.

Agent Data Channel

The agent data channel websocket authentication has also been changed to use a mechanism similar to the control channel's. When receiving an OpenWebsocket control message, the agent will open a new websocket connection to a connection node and send a signed OpenAgentWebsocketMessage message as well as the AgentIdentityToken header. Because this authentication mechanism is different, we introduced a new versioned hub in the backend, hub/agent/v2, in order to maintain backwards compatibility.

The OpenWebsocket control channel message that triggers the agent to open a new data channel websocket has now been modified in the backend to be sent by the connection orchestrator directly when a daemon initiates a new connection. However this is now completely decoupled from the daemon connect flow (before we only opened the websocket synchronously after the daemon has connected) and the agent can connect right away before the daemon.

Health Checks

Health checks work similarly to how they did before; however, they are now sent to a connection node instead of bastion, and the connection node is directly responsible for disconnecting the control channel websocket if it doesn't receive timely heartbeat messages from the agent. I did, however, make the following adjustments:

Agent heartbeat is now every 2 min instead of 20s (the corresponding timeout on the backend is 10min)

Moved ValidKubeUsers out of the heartbeat message and into a separate control message (since this was specific to kube cluster agents only). The agent now also caches the valid users and will only send a control channel update message when these users change.

The heartbeat message now contains some simple telemetry about agent status. For now this only includes a single new field NumDataChannel which is the number of active data channels opened across all connections in the agent. This is just a start and we plan on enhancing these heartbeat messages to include more agent health telemetry in the future.
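The "only send when the users change" behaviour for ValidKubeUsers can be sketched with a small cache. The type and method names are hypothetical, and treating the user list as an unordered set is an assumption for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// kubeUserCache remembers the last set of valid kube users that was reported.
type kubeUserCache struct {
	last []string
	sent bool
}

// ShouldSend reports whether a control channel update message is needed, and
// records users as the new cached value when it is.
func (c *kubeUserCache) ShouldSend(users []string) bool {
	current := append([]string(nil), users...) // copy so the caller's slice is untouched
	sort.Strings(current)                      // order does not matter for the comparison
	if c.sent && equalStrings(c.last, current) {
		return false
	}
	c.last, c.sent = current, true
	return true
}

func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	cache := &kubeUserCache{}
	fmt.Println(cache.ShouldSend([]string{"admin", "dev"}))          // true: first report
	fmt.Println(cache.ShouldSend([]string{"dev", "admin"}))          // false: same users
	fmt.Println(cache.ShouldSend([]string{"admin", "dev", "guest"})) // true: users changed
}
```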

Testing

Describe how to test this PR....

backend branch: feat/control-channel-to-connection-node

zli branch: feat/control-channel-to-connection-node

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Moves the control channel websocket from bastion to connection nodes as well as implementing a new agent data/control channel authentication flows.

Related JIRA tickets

Relates to JIRA: CWC-1583

Have you considered the security impacts?

Does this PR have any security impact?

[x] Yes

[ ] No

If yes, please explain:

This changes the authentication mechanism that is used for both agent control and data channel websockets.
Lazy User Creation
Description of the change

This code empowers the agent to create a user when someone tries to connect as a user that doesn't exist. This is so that we can support bzero-user and ssm-user without having to create both on every machine, and it allows us to transition elegantly later.

Creates a new sudoers file and adds users to that file

# Created by the BastionZero Agent on 2022-06-02 16:38:50 +0000 UTC
ssm-user ALL=(ALL) NOPASSWD:ALL
bzero-user ALL=(ALL) NOPASSWD:ALL

Testing

This functionality should work for both ssh and shell.

Add one (or both) of the following users to your target connect policy: "ssm-user", "bzero-user"

Connect as one of those users

Neither of those users should exist on the box, but you will be able to connect. Once you've connected, you can verify that they exist and do have sudoer privileges.

backend branch: zli branch:

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Adds the ability to login as a user that doesn't exist on the machine and will create that user (if it's allowed to).

Related JIRA tickets

Relates to JIRA: CWC-1901

Have you considered the security impacts?

Does this PR have any security impact?

[x] Yes

[ ] No

If yes, please explain: The one thing that I want to bring attention to is that when we create a sudoers file in the /etc/sudoers.d folder, we create it with 640 so that we can maintain the ability to write to it. Usually, this file is supposed to have 440 permissions. I don't think this is a very big deal, and I will say that only creating users on an as-needed basis is the better security move as opposed to adding them all to the sudoers file from the get-go and locking the file down.
Fix/web request chunking
Description of the change

This PR fixes some limitations we should be catching in our web plugin

We limit the size of request body (very common practice) to 10MB (very common value)

We limit the size of the request content length to 150MB for those kinds of requests that won't be caught by simply limiting the request body e.g. for multipart/form-data

We read the request body into a buffer of a certain size and send it in arbitrary chunks to the target which stores it in its entirety on the box and only once it's received the entire thing does it create an http request and send it off to the remote target.
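The "read into a buffer and send in chunks, but stop at a size limit" idea can be sketched with the standard io package. The function names, the chunk size, and the error value are assumptions for illustration; the 10MB figure is the body limit quoted above, and this sketch does not cover the separate content-length check.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

const maxBodySize = 10 << 20 // 10MB request body limit

var errBodyTooLarge = errors.New("request body exceeds limit")

// sendInChunks reads body through a fixed-size buffer and hands each chunk to
// send. It fails as soon as more than limit bytes have been read.
func sendInChunks(body io.Reader, limit int64, chunkSize int, send func([]byte)) error {
	// Allow one extra byte so an oversized body can be detected.
	limited := io.LimitReader(body, limit+1)
	buf := make([]byte, chunkSize)
	var total int64
	for {
		n, err := limited.Read(buf)
		if n > 0 {
			total += int64(n)
			if total > limit {
				return errBodyTooLarge
			}
			send(append([]byte(nil), buf[:n]...)) // copy: buf is reused next iteration
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// chunkString is a convenience wrapper that collects the chunks of a string.
func chunkString(s string, limit int64, chunkSize int) ([][]byte, error) {
	var chunks [][]byte
	err := sendInChunks(strings.NewReader(s), limit, chunkSize, func(c []byte) {
		chunks = append(chunks, c)
	})
	return chunks, err
}

func main() {
	chunks, err := chunkString("abcdefghij", maxBodySize, 4)
	fmt.Println(len(chunks), err) // prints "3 <nil>"

	_, err = chunkString("abcdefghij", 5, 4)
	fmt.Println(err) // prints "request body exceeds limit"
}
```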

The code and methodology could be improved; this is captured in CWC-1647

Testing

Create files of different sizes:

This should fail:

$ dd if=/dev/zero of=200MB.txt count=1024 bs=205000
$ ./bin/zli-macos connect <web target>
$ curl -F "[email protected]" http://127.0.0.1:6200
curl: (26) Failed to open/read local data from file/application

This should succeed: If you hit the default web server sid setup on the bzero target you'll see an error (but an error from a successful http request). You need a target configured to hit http://localhost:8000 on the dev bzero agent.

$ dd if=/dev/zero of=1MB.txt count=1024 bs=1025
$ ./bin/zli-macos connect <web target>
$ curl -F "[email protected]" http://127.0.0.1:6200
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Error response</title>
</head>
<body>
<h1>Error response</h1>
<p>Error code: 501</p>
<p>Message: Unsupported method ('POST').</p>
<p>Error code explanation: HTTPStatus.NOT_IMPLEMENTED - Server does not support this operation.</p>
</body>
</html>

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-1641, CWC-1647

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Self registration
Description of the change

This allows us to startup the agent with an activation token or api key and have the agent register with bastion.

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-XXX

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[ ] No

If yes, please explain:
MrZAP Unit Tests (Daemon) Plus Improvements
Description of the change

Adds unit tests for the daemon's keysplitting.go file.

This PR makes the following changes:

Move currentIdToken refresh logic + loading of zli keysplitting config out of keysplitting.go to a separate package: tokenrefresh. In a future PR, this may be refactored further, so that daemon keysplitting doesn't always refresh when it may not have to. CC: @lipmas, @lgmugnier

Abstracting this logic out of keysplitting.go keeps keysplitting logic separate from token refresh logic, and allows us to mock the token refresher in unit tests, so that we don't have to create a real JSON file on disk when we want to test keysplitting logic.

Removes keysplitting code that handles out-of-order DataAcks. See reasons in this comment: https://github.com/bastionzero/bzero/pull/85#discussion_r881063340

It's hard to tell if there is an out-of-order issue. I've been testing this removal manually by using the new daemon and testing different commands.

Another test I might run is to revert to the old daemon and add a debug log statement to see if outOfOrderAcks is ever > 0.

Other than these two methods above--I'm not sure how else to test if we have an out-of-order problem except by tracing the code from daemon-->backend-->agent and seeing that we're not (and SignalR is not) doing anything funky by spawning threads / doing async work that causes messages to be sent out-of-order

Change pipelineLimit from a global, internal package variable to a global, external package constant.

Made it a constant because otherwise two instances of daemon.Keysplitting{} can leak modifications of this limit to one another. I found this issue when running the unit tests in parallel ginkgo -p and in random order --randomize-all --randomize-suites. One of the tests checks the behavior when the daemon is pre-pipelining (which used to set the global variable to pipelineLimit = 1) and that leaked into another test that expected pipelineLimit to be 8 (the default).

I still preserve the constraint that pipelineLimit can change from 8-->1 by adding a pipelineLimit struct-level variable. This is not shared with other daemon keysplitting structs. It is initialized to the default 8 by referring to the global constant.

Made it external, so the unit tests can test the behavior when the max pipelining limit is reached.
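The constant-plus-field arrangement described above could look roughly like this. It is a sketch under assumptions: the struct is reduced to the single relevant field, and New and disablePipelining are hypothetical names.

```go
package main

import "fmt"

// PipelineLimit is the exported default. As a constant it cannot be mutated,
// so one instance (or one test) can no longer leak a change to another.
const PipelineLimit = 8

// Keysplitting is a stand-in for the daemon struct; only the limit is shown.
type Keysplitting struct {
	pipelineLimit int // per-instance copy, may drop from PipelineLimit to 1
}

// New initializes the per-instance limit from the package constant.
func New() *Keysplitting {
	return &Keysplitting{pipelineLimit: PipelineLimit}
}

// disablePipelining is a hypothetical name for the 8-->1 transition.
func (k *Keysplitting) disablePipelining() {
	k.pipelineLimit = 1
}

func main() {
	a, b := New(), New()
	a.disablePipelining()
	fmt.Println("a:", a.pipelineLimit, "b:", b.pipelineLimit)
}
```

Because each instance owns its limit, tests run with ginkgo -p or in random order no longer observe each other's changes.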

Change maxErrorRecoveryTries from a global, internal package constant to a global, external package constant.

Made it external, so the unit tests can test the behavior that recovery has a limit and should not recover again if we reach the max.

Create daemon/keysplitting/errors.go which holds error types for some of the errors that the daemon Keysplitting struct can return in its methods. We use these types to assert that specific errors are returned when testing failure paths in the unit tests (using MatchError). Here is an example:

https://github.com/bastionzero/bzero/blob/eb693e8cdfb1af55c713d2cc2d67178b634b24bf/bctl/daemon/keysplitting/keysplitting_test.go#L177-L180

Return a different error depending on whether bzerolib/keysplitting/bzcert Verify() fails while validating the initial id token or the current id token. When the initial id token fails to verify, it usually indicates that the user must log in again because the IdP rotated their signing key.
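As an illustration of typed errors that tests can assert on, here is a sketch using the standard library's errors.As in place of Gomega's MatchError. The type names InitialIdTokenError and CurrentIdTokenError and the verify helper are hypothetical, not the names used in errors.go.

```go
package main

import (
	"errors"
	"fmt"
)

// InitialIdTokenError is a hypothetical type for an initial id token failure.
type InitialIdTokenError struct{ InnerError error }

func (e *InitialIdTokenError) Error() string {
	return "initial id token failed verification: " + e.InnerError.Error()
}
func (e *InitialIdTokenError) Unwrap() error { return e.InnerError }

// CurrentIdTokenError is a hypothetical type for a current id token failure.
type CurrentIdTokenError struct{ InnerError error }

func (e *CurrentIdTokenError) Error() string {
	return "current id token failed verification: " + e.InnerError.Error()
}
func (e *CurrentIdTokenError) Unwrap() error { return e.InnerError }

// verify stands in for a Verify() call; the booleans fake the two checks.
func verify(initialOK, currentOK bool) error {
	if !initialOK {
		return fmt.Errorf("verify: %w", &InitialIdTokenError{errors.New("unknown signing key")})
	}
	if !currentOK {
		return fmt.Errorf("verify: %w", &CurrentIdTokenError{errors.New("token expired")})
	}
	return nil
}

// needsLogin reports whether err wraps an initial id token failure.
func needsLogin(err error) bool {
	var target *InitialIdTokenError
	return errors.As(err, &target)
}

func main() {
	fmt.Println(needsLogin(verify(false, true)))
	fmt.Println(needsLogin(verify(true, false)))
}
```

Distinct types let a caller decide between prompting for a new login and simply refreshing the current token without parsing error strings.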

Allow console destinations other than os.Stdout in bzerolib/logger. We can make this change because zerolog.ConsoleWriter's Out configuration option takes in any io.Writer.

Remove hard-coded console writer destination of os.Stdout

Replace writeToConsole bool argument in logger.New() with []io.Writer

Create NewWithStdOutConsoleWriter() which is equivalent to calling the old New() with writeToConsole = true

Create NewWithNoConsoleWriters() which is equivalent to calling the old New() with writeToConsole = false

This change is used during the keysplitting unit tests, so we can initialize the SUT's logger with GinkgoWriter allowing us to see logs printed by the SUT if a test fails:

https://github.com/bastionzero/bzero/blob/eb693e8cdfb1af55c713d2cc2d67178b634b24bf/bctl/daemon/keysplitting/keysplitting_test.go#L160-L162
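The shape of this logger change can be sketched with the standard library alone. This stand-in uses log and io.MultiWriter rather than zerolog, and the constructor signatures are simplified assumptions; only the constructor names come from the description above.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"os"
)

// Logger is a stand-in for the bzerolib logger.
type Logger struct{ out *log.Logger }

// New accepts any number of console destinations instead of a bool.
func New(consoleWriters []io.Writer) *Logger {
	// With no writers, io.MultiWriter discards everything written to it.
	return &Logger{out: log.New(io.MultiWriter(consoleWriters...), "", 0)}
}

// NewWithStdOutConsoleWriter mirrors the old writeToConsole = true.
func NewWithStdOutConsoleWriter() *Logger { return New([]io.Writer{os.Stdout}) }

// NewWithNoConsoleWriters mirrors the old writeToConsole = false.
func NewWithNoConsoleWriters() *Logger { return New(nil) }

// Info writes one line to every configured destination.
func (l *Logger) Info(msg string) { l.out.Println(msg) }

// capture shows how a test can collect log output in a buffer,
// much like passing GinkgoWriter as a console destination.
func capture(msg string) string {
	var buf bytes.Buffer
	New([]io.Writer{&buf}).Info(msg)
	return buf.String()
}

func main() {
	NewWithStdOutConsoleWriter().Info("visible on stdout")
	NewWithNoConsoleWriters().Info("discarded")
}
```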

This PR fixes the following bugs:

Agent responds with different schema version in recovery's SynAck:

Previously, resent Data messages would use the schema version from the previous handshake. This has been fixed by setting schema version first before trying to resend.

I've added a unit test to check this behavior.

Data races due to not synchronizing usage of internal state variables like pipelineMap, recovering, errorRecoveryAttempt, and other state variables.

Note: I found this data race by running the tests with the -race flag. We can't enable this flag by default until we fix all the other data races (will create ticket. EDIT: CWC-1913) that the race detector complained about in packages besides daemon/keysplitting.

Fixed by locking stateLock (renamed from pipelineLock) mutex before accessing internal state variables that can be accessed/modified on different goroutines.

In order to fix this, I also had to change the way the mutex was being used in BuildSyn. Previously, BuildSyn() would lock the mutex and not unlock it when the function returned; BuildSyn() now unlocks the mutex when it returns. I had to make this change otherwise I can't call lock again in Validate() (deadlock).

I still preserve the behavior that one cannot send Data (call Inbox()) until handshake is complete by synchronizing on a new boolean isHandshakeComplete:

https://github.com/bastionzero/bzero/blob/eb693e8cdfb1af55c713d2cc2d67178b634b24bf/bctl/daemon/keysplitting/keysplitting.go#L260-L271

and I've added unit tests to check that I haven't broken this behavior.

I've also changed the if to a for loop because that is the recommended behavior when using sync.Cond.Wait() as outlined in the Go documentation:

Because c.L is not locked when Wait first resumes, the caller typically cannot assume that the condition is true when Wait returns. Instead, the caller should Wait in a loop:

Source: https://pkg.go.dev/sync#Cond.Wait

It says "typically", so we might actually be fine not using a for loop, but just in case I've changed it to a loop as recommended.
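The loop form of waiting on a condition variable is sketched below. The field names stateLock and isHandshakeComplete follow the description above; the handshake struct, its methods, and run are hypothetical and much simpler than the linked code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handshake is a reduced stand-in for the daemon's keysplitting state.
type handshake struct {
	stateLock           sync.Mutex
	cond                *sync.Cond
	isHandshakeComplete bool
}

func newHandshake() *handshake {
	h := &handshake{}
	h.cond = sync.NewCond(&h.stateLock)
	return h
}

// waitUntilComplete blocks until the handshake is done. The condition is
// re-checked in a for loop every time Wait returns, as the Go docs recommend.
func (h *handshake) waitUntilComplete() {
	h.stateLock.Lock()
	defer h.stateLock.Unlock()
	for !h.isHandshakeComplete {
		h.cond.Wait()
	}
}

// complete flips the flag under the lock and wakes every waiter.
func (h *handshake) complete() {
	h.stateLock.Lock()
	h.isHandshakeComplete = true
	h.stateLock.Unlock()
	h.cond.Broadcast()
}

// run returns true if a waiting goroutine is released after complete().
func run() bool {
	h := newHandshake()
	done := make(chan struct{})
	go func() {
		h.waitUntilComplete()
		close(done)
	}()
	h.complete()
	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("waiter released:", run())
}
```

Using defer for the unlock also avoids the earlier pattern where a function returned while still holding the mutex.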

Testing

Describe how to test this PR....

backend branch: zli branch:

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Adds keysplitting unit tests for daemon

Related JIRA tickets

Relates to JIRA: CWC-1847, CWC-1929, CWC-1930

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[X] No

If yes, please explain:
Connection Idle Timeout
Description of the change

Every connection will now send an IdleTimeout in the DaemonConnectedMessage so that the agent can close idle connections after this timeout. This timeout is serialized as a number of nanoseconds by the backend so we use a custom json unmarshal function to convert this to a time.Duration.

This timeout is currently hard-coded in the backend to 7 days for every connection, set when the connection is created. In the future, however, we can allow admins to control this value at connection granularity by adding additional context to connection policies or, more simply, by setting an organizational default.

Testing

See backend changes in https://github.com/bastionzero/webshell-backend/pull/1475 and testing instructions there.

backend branch: feat/idle-timeout zli branch: charts branch:

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Adds an IdleTimeout to all connections that will close connections after no client activity is detected for the timeout duration.

Related JIRA tickets

Relates to JIRA: CWC-2257

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Feat/pwdb source
Description of the change

Description here (why is it needed, what does it do)....

Testing

Describe how to test this PR....

backend branch: zli branch: charts branch:

Ready to run system tests?

[ ] Yes

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-XXX

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[ ] No

If yes, please explain:
Pwdb plugin
Description of the change

Description here (why is it needed, what does it do)....

Testing

Describe how to test this PR....

backend branch: zli branch: charts branch:

Ready to run system tests?

[ ] Yes

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-XXX

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[ ] No

If yes, please explain:
Store split private keys in a per-agent configuration file
This is one of those PRs that has no immediate effect, but supports other aspects of the solution and will be easier to review and merge in isolation

Description of the change

As part of the passwordless DB architecture, we need a way for agents to store mappings between key shards and the database targets to which they authenticate. To support this, the agent will use a new configuration object with this structure:

/* oldest key (toy example) */ [{"key":{"associatedPublicKey":{"n":null,"e":0},"d":null,"e":null},"targetIds":["62e7cbbe-f730-4aed-bb77-6f06749066be"]}, /* more recent key (toy example) */ {"key":{"associatedPublicKey":{"n":"2RxHnc7Yo7bAZbSNWrd4Qf1tXWLT0qPBQMNJiOcQdXkw9oRvjcD4LiBEtl3C4mjych/5s1OGwDCV5CNqeZMniqGL53vyEiRGfqAZes+L+1HmlYITzxAhFIISqNraWpTCVpKiSXV9Kd1+tLP7fJrviWPtPg1c86XR1MLdowEfk0xN5V0hc2ZRZqLgDlLCtLOlN3zD8AZF0lHyaVkbbBmsawej1y99o8fJlH56lmFcB3EB4HpQ9D0adg5R+qhH5A/mhevgISdsg+PHTzeGFG7tPRIWOc7b6sVZyXn8kswQcpaXosU8cVyCH91BZXGUEc8HV2Rtnglw1mXBE98uhevHOxGXD/0aA8nM2GnPI2Bb5l3YBTg4Iolt0EFqN52rC01sKqQDtLP18bE5pTrae+BCLzP3QKCl8fYCJdOqNK/9hN4BkDW+a78jdH9o1BB0WP4H+4kW6N20YDV9Z+/63ICl6JH0cSgAl4iEtukzqZKfxb2v5z1q9i7JQZbkc/ZmoIzsRBqv8QCd4DnTuUd4LhZftRGWdT6RKvxDsUFVceU5VK7qfjX/C+7fJuY1MmGI4KlegDh9yhut25LCaXO3In6FBravWuLKD9RDB/A/o9wgG4ZykqSQcvaZnU1yU6U3uWXMUu4KyhZU0G3yAKAGd6k+o9qwdPq6N/4znvp6jbq7n6c=","e":65537},"d":"MTIz","e":"NDU="},"targetIds":["62e7cbbe-f730-4aed-bb77-6f06749066be"]}]

When a new key is added, it goes at the end of the list. When a user is trying to authenticate to a virtual target, the agent will select the most recent key available that maps to that target.
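The append-then-pick-newest rule can be sketched as follows. The types here are hypothetical and reduce each key to a label; only the ordering and the targetIds lookup reflect the description above.

```go
package main

import "fmt"

// keyEntry pairs a key shard with the targets it applies to.
// KeyLabel is a placeholder for the real split key material.
type keyEntry struct {
	KeyLabel  string
	TargetIds []string
}

// keyShardConfig keeps entries ordered from oldest to most recent.
type keyShardConfig []keyEntry

// add appends a new key at the end of the list.
func (c *keyShardConfig) add(e keyEntry) {
	*c = append(*c, e)
}

// latestFor walks the list backwards and returns the most recent
// entry that maps to targetId.
func (c keyShardConfig) latestFor(targetId string) (keyEntry, bool) {
	for i := len(c) - 1; i >= 0; i-- {
		for _, id := range c[i].TargetIds {
			if id == targetId {
				return c[i], true
			}
		}
	}
	return keyEntry{}, false
}

func main() {
	const target = "62e7cbbe-f730-4aed-bb77-6f06749066be"
	var cfg keyShardConfig
	cfg.add(keyEntry{KeyLabel: "oldest", TargetIds: []string{target}})
	cfg.add(keyEntry{KeyLabel: "more recent", TargetIds: []string{target}})

	entry, ok := cfg.latestFor(target)
	fmt.Println(entry.KeyLabel, ok)
}
```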

For the customer beta, we will support customers updating agents manually using a new bzero command, but this is not required for the demo-ready alpha.

Some other notes:

Fixed a broken test in the agentconfig test suite

Removed a ginkgo import from a non-test file that was messing up our help options

Tweaked how the systemd config client was using locks, and added a convenience wrapper for the AcquireLock flow so that we don't have to be so verbose

If the update to go1.18 goes through, I can simplify this code a bit so that we don't need separate methods for fetching KeyShardConfig and AgentConfig data

Testing

Check out the feat/generate-cert branches in ZLI and backend (this will require a database migration and redeploying dev)

Make sure you have a database target set up using your agent as a proxy

Run zlil generate certificate --all

(You should have a cert successfully returned to you)

But more importantly, connect to the agent machine as root and check cat /etc/bzero/bzero-user-keys.yaml

You should see the key shard you just generated in /etc/bzero/keyshards.json

backend branch: develop zli branch: develop charts branch:

Ready to run system tests?

[x] Yes

Relevant release note information

Release Notes: Store split private keys in a per-agent configuration file

Related JIRA tickets

Relates to JIRA: CWC-2130

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[x] No

If yes, please explain:
Allow user to provide orgId+provider flag when registering bzero agent
Description of the change

Previously, the user could only pass in orgId when using the Kubernetes agent. This PR makes the bzero agent pass its orgId flag through to the Registration struct. It also adds an -orgProvider flag, so that the user can pass that in as well.
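A sketch of wiring the two flags into a registration value with the standard flag package. The Registration fields, the help strings, the example values, and parseRegistration are assumptions for illustration; only the flag names -orgId and -orgProvider come from the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// Registration is reduced to the two fields discussed; field names are assumed.
type Registration struct {
	OrgId       string
	OrgProvider string
}

// parseRegistration reads -orgId and -orgProvider from args.
func parseRegistration(args []string) (Registration, error) {
	fs := flag.NewFlagSet("bzero", flag.ContinueOnError)
	orgId := fs.String("orgId", "", "organization id used for BZCert verification")
	orgProvider := fs.String("orgProvider", "", "identity provider of the organization")
	if err := fs.Parse(args); err != nil {
		return Registration{}, err
	}
	return Registration{OrgId: *orgId, OrgProvider: *orgProvider}, nil
}

func main() {
	reg, err := parseRegistration([]string{"-orgId", "example-org", "-orgProvider", "example-idp"})
	fmt.Printf("%+v %v\n", reg, err)
}
```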

Testing

Describe how to test this PR....

backend branch: zli branch: charts branch:

Ready to run system tests?

[X] Yes

Relevant release note information

Release Notes: Fix bug where -orgId flag was not respected. Add -orgProvider flag in case user wants to explicitly state their IdP provider.

Related JIRA tickets

Relates to JIRA: CWC-2263

Have you considered the security impacts?

Does this PR have any security impact?

[X] Yes

[ ] No

If yes, please explain:

We should preserve the security requirement that lets users explicitly set these values that are used for BZCert verification. Otherwise, the agent always accepts whatever the Bastion tells it on initial registration.
Adds JWKS service accounts to bzero agent
Description of the change

Adds JWKS service accounts to bzero agent. This is currently a draft as it likely needs some work to integrate with the daemon and bastion changes. The area of most concern for me right now is this line of code, since the daemon hasn't configured the verifier to know about the service account:

https://github.com/bastionzero/bzero/blob/b573b862831ec086dfe79a6cf28653b3f378a0ab/bctl/daemon/keysplitting/bzcert/bzcert.go#L97

Testing

Describe how to test this PR....

backend branch: zli branch:

Ready to run system tests?

[ ] Yes

Relevant release note information

Release Notes:

Related JIRA tickets

Relates to JIRA: CWC-XXX

Have you considered the security impacts?

Does this PR have any security impact?

[ ] Yes

[ ] No

If yes, please explain:
