Overview
The Vault 1.15.x upgrade guide contains information on deprecations, important or breaking changes, and remediation recommendations for anyone upgrading from Vault 1.14. Please read carefully.
Consul service registration
As of version 1.15, service_tags supplied to Vault for the purpose of Consul service registration will be case-sensitive.
In previous versions of Vault, tags were converted to lowercase, which led to issues, for example when tags contained Traefik rules which use case-sensitive method names such as Host().
If you previously used Consul service registration tags ignoring case, or relied on the lowercase tags created by Vault, then this change may cause unexpected behavior.
Please audit your Consul storage stanza to ensure that you either:
- Manually convert your service_tags to lowercase if required
- Ensure that any system that relies on the tags is aware of the new case-preserving behavior
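If you take the first option, the conversion itself is small. The following is a minimal sketch, assuming your tags are available as a comma-separated string; the helper name lowercase_tags and the sample tag values are hypothetical, and the result still has to be copied into your configuration by hand. Only lowercase tags whose consumers expect lowercase; tags that embed case-sensitive rules such as Host() should be left as they are.

```shell
# Hypothetical helper: lowercase a comma-separated list of service tags,
# mimicking the conversion that Vault versions before 1.15 applied.
lowercase_tags() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Sample tags are made up for illustration.
lowercase_tags 'Vault-Primary,Team-Platform'
```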
Rollback metrics
Vault no longer measures and reports the metrics vault.rollback.attempts.{MOUNTPOINT} and vault.route.rollback.{MOUNTPOINT} by default. The new default metrics are vault.rollback.attempts and vault.route.rollback, which do not contain the mount point in the metric name.
To continue measuring vault.rollback.attempts.{MOUNTPOINT} and vault.route.rollback.{MOUNTPOINT}, you must explicitly enable mount-specific metrics in the telemetry stanza of your Vault configuration with the add_mount_point_rollback_metrics option.
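As a rough illustration of that setting, the sketch below writes a telemetry fragment to a scratch file so it can be reviewed before being merged into the real configuration. The scratch-file approach and the variable name fragment are assumptions made for this example; the option is assumed to take a boolean value.

```shell
# Write the fragment to a scratch file for review. If your configuration
# already has a telemetry stanza, add the option to that stanza instead of
# creating a second one.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
telemetry {
  add_mount_point_rollback_metrics = true
}
EOF

cat "$fragment"
```

Vault reads the telemetry stanza from its configuration file, so the change must be applied to that file on each server for the mount-specific metrics to be reported.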
Application of Sentinel Role Governing Policies (RGPs) via identity groups
As of versions 1.15.0, 1.14.4, and 1.13.8, the Sentinel RGPs derived from membership in identity groups apply only to entities in the same and child namespaces, relative to the identity group.
Also, the group_policy_application_mode only applies to ACL policies. Vault Sentinel Role Governing Policies (RGPs) are not affected by group policy application mode.
Known issues and workarounds
Transit Encryption with Cloud KMS managed keys causes a panic
Affected versions
- 1.13.1+ up to 1.13.8 inclusively
- 1.14.0+ up to 1.14.4 inclusively
- 1.15.0
Issue
Vault panics when it receives a Transit encryption API call that is backed by a Cloud KMS managed key (Azure, GCP, AWS).
Note
The issue does not affect encryption and decryption with the following key types:
- PKCS#11 managed keys
- Transit native keys
Workaround
None at this time
Transit Sign API calls with managed keys fail
Affected versions
- 1.14.0+ up to 1.14.4 inclusively
- 1.15.0
Issue
Vault responds to Transit sign API calls with the following error when the request uses a managed key:
requested version for signing does not contain a private part
Note
The issue does not affect signing with the following key types:
- Transit native keys
Workaround
None at this time
Panic in AWS auth method during IAM-based login
Affected versions
- 1.15.0
Issue
A panic can occur in the AWS auth method during IAM-based login when a client config does not exist.
Workaround
The panic can be avoided by writing an empty client config:
vault write -f auth/aws/config/client
Collapsed navbar does not allow you to click inside the console or namespace picker
Affected versions
The UI issue affects Vault versions 1.14.0+ and 1.15.0+. A fix is expected for Vault 1.16.0.
Issue
The Vault UI currently uses a version of HDS that does not allow users to click within collapsed elements. In particular, the dev console and namespace picker become inaccessible when viewing the components in smaller viewports.
Workaround
Expand the width of the screen until you deactivate the collapsed view. Once the full navbar is displayed, click the desired components.
File audit devices do not honor SIGHUP signal to reload
Affected versions
- 1.15.0
Issue
The new underlying event framework for auditing causes Vault to continue using audit log files instead of reopening the file paths even when you send SIGHUP after log rotation. The issue impacts any Vault cluster with file audit devices enabled.
Not honoring the SIGHUP signal has two key consequences when moving or deleting audit files.
If you move or rename your audit log file locally, Vault continues to log data to the original file. For example, if you archive a file locally:
$ mv /var/log/vault/audit.log /var/log/vault/archive/audit.log.bak
Vault continues to write data to /var/log/vault/archive/audit.log.bak instead of logging audit entries to a newly created file at /var/log/vault/audit.log.
If you delete your audit log file, the OS unlinks the file from the directory structure, but Vault still has the file open. Vault continues to write data to the deleted file, which continues to consume disk space as it grows. When Vault is sealed or restarted, the OS deletes the previously unlinked file, and you will lose all data logged to the audit file after it was tagged for deletion.
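The first consequence follows from how open file descriptors work, and it can be reproduced without Vault. The sketch below is only a stand-in: the shell itself plays the role of the long-running process by holding a file open on descriptor 3, and all paths are temporary files created for the demonstration.

```shell
# Stand-in for Vault: hold a log file open on file descriptor 3.
dir="$(mktemp -d)"
exec 3>>"$dir/audit.log"
echo "entry before rotation" >&3

# Rotate the file by renaming it, as in the mv example above.
mv "$dir/audit.log" "$dir/audit.log.bak"

# The open descriptor still points at the renamed file, so this write
# lands in audit.log.bak and no new audit.log appears.
echo "entry after rotation" >&3
exec 3>&-

grep -c '' "$dir/audit.log.bak"   # number of lines in the renamed file
ls "$dir"                         # files present after rotation
```

Both entries end up in the renamed file because the process never reopened the original path, which is what a SIGHUP reload is normally expected to trigger.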
The issue with file audit devices not honoring SIGHUP signals is fixed in the Vault 1.15.1 patch release.
Workaround
Set the VAULT_AUDIT_DISABLE_EVENTLOGGER environment variable to true to disable the new underlying event framework and restart Vault:
$ export VAULT_AUDIT_DISABLE_EVENTLOGGER=true
On startup, Vault reverts to the audit behavior used in 1.14.x.
Internal error when vault policy in namespace does not exist
If a user is a member of a group that gets a policy from a namespace other than the one they’re trying to log into, and that policy doesn’t exist, Vault returns an internal error. This impacts all auth methods.
Affected versions
- 1.13.8 and 1.13.9
- 1.14.4 and 1.14.5
- 1.15.0 and 1.15.1
A fix has been released in Vault 1.13.10, 1.14.6, and 1.15.2.
Workaround
During authentication, Vault derives inherited policies based on the groups an entity belongs to. Vault returns an internal error when attaching the derived policy to a token when:
- the token belongs to a different namespace than the one handling authentication, and
- the derived policy does not exist under the namespace.
You can resolve the error by adding the policy to the relevant namespace or deleting the group policy mapping that uses the derived policy.
As an example, consider the following userpass auth method failure. The error occurs because Vault expects a group policy that does not exist under the namespace.
# Failed login
$ vault login -method=userpass username=user1 password=123
Error authenticating: Error making API request.
URL: PUT http://127.0.0.1:8200/v1/auth/userpass/login/user1
Code: 500. Errors:
* internal error
To confirm the problem is a missing policy, start by identifying the relevant entity and group IDs:
$ vault read -format=json identity/entity/name/user1 | \
jq '{"entity_id": .data.id, "group_ids": .data.group_ids} '
{
"entity_id": "420c82de-57c3-df2e-2ef6-0690073b1636",
"group_ids": [
"6cb152b7-955d-272b-4dcf-a2ed668ca1ea"
]
}
Use the group ID to fetch the relevant policies for the group under the ns1 namespace:
$ vault read -format=json -namespace=ns1 \
identity/group/id/6cb152b7-955d-272b-4dcf-a2ed668ca1ea | \
jq '.data.policies'
[
"group_policy"
]
Now that we know Vault is looking for a policy called group_policy, we can check whether that policy exists under the ns1 namespace:
$ vault policy list -namespace=ns1
default
The only policy in the ns1 namespace is default, which confirms that the missing policy (group_policy) is causing the error.
To fix the problem, we can either remove the missing policy from the 6cb152b7-955d-272b-4dcf-a2ed668ca1ea group or create the missing policy under the ns1 namespace.
To remove group_policy from group ID 6cb152b7-955d-272b-4dcf-a2ed668ca1ea, use the vault write command to set the applicable policies to just include default:
$ vault write \
-namespace=ns1 \
identity/group/id/6cb152b7-955d-272b-4dcf-a2ed668ca1ea \
name="test" \
policies="default"
Verify the fix by re-running the login command:
$ vault login -method=userpass username=user1 password=123
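The other fix, creating the missing policy under the ns1 namespace, could look roughly like the sketch below. The policy body is purely hypothetical (the path and capabilities are placeholders, not taken from this guide), and the helper name create_group_policy is invented for the example. The function is only defined here; calling it requires a reachable Vault server and a token that is allowed to write policies in ns1.

```shell
# Hypothetical policy body; replace it with the rules the group needs.
policy_file="$(mktemp)"
cat > "$policy_file" <<'EOF'
path "secret/data/example/*" {
  capabilities = ["read"]
}
EOF

# Create the missing policy under the ns1 namespace.
create_group_policy() {
  vault policy write -namespace=ns1 group_policy "$policy_file"
}
```

After running create_group_policy, the vault policy list -namespace=ns1 check shown earlier should include group_policy, and the login can be retried.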
Vault is storing references to ephemeral sub-loggers leading to unbounded memory consumption
Affected versions
This memory consumption bug affects Vault Community and Enterprise versions:
- 1.13.7 - 1.13.9
- 1.14.3 - 1.14.5
- 1.15.0 - 1.15.1
The change that introduced this bug was reverted in 1.13.10, 1.14.6, and 1.15.2.
Issue
Vault unexpectedly stores references to ephemeral sub-loggers, which prevents them from being cleaned up and leads to unbounded memory consumption for loggers. The bug came from a change that addressed a previously known issue where sub-logger levels were not adjusted on reload. It impacts many areas of Vault, but primarily logins in Enterprise.
Workaround
There is no workaround.
Sublogger levels not adjusted on reload
Affected versions
This issue affects all Vault Community and Vault Enterprise versions.
Issue
Vault does not honor a modified log_level configuration for certain subsystem loggers on SIGHUP.
The issue is known to specifically affect resolver.watcher and replication.index.* subloggers.
After modifying the log_level and issuing a reload (SIGHUP), some loggers are updated to reflect the new configuration, while some subsystem logger levels remain unchanged.
For example, after starting a server with log_level: "trace" and modifying it to log_level: "info", the following lines appear after reload:
[TRACE] resolver.watcher: dr mode doesn't have failover support, returning
...
[DEBUG] replication.index.perf: saved checkpoint: num_dirty=5
[DEBUG] replication.index.local: saved checkpoint: num_dirty=0
[DEBUG] replication.index.periodic: starting WAL GC: from=2531280 to=2531280 last=2531536
Workaround
The workaround is to restart the Vault server.
URL change for KV v2 secrets engine
Affected versions
1.15.0+
Issue
Recent improvements to the Vault UI changed the URL structure of the KV v2 secrets engine, which affects existing URLs.
Previously, URLs for KV v2 used the pattern: ui/vault/secrets/hma/show/${secretPath}. With the recent refactor, KV v2 URLs now use the following pattern: ui/vault/secrets/hma/kv/${encodedUriComponent(secretPath)}/details.
Opening older URLs now results in 404 errors.
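To make the difference between the two patterns concrete, the sketch below builds a new-style path from a mount name and a secret path, for example when updating links stored in runbooks. Everything here is an illustration rather than UI code: the helper names are hypothetical, the encoder handles ASCII input only, and it escapes every character outside A-Z, a-z, 0-9, "-", "_", "." and "~", which is assumed to be close enough to the encoding the UI applies.

```shell
# Percent-encode a string, leaving only unreserved characters untouched.
# Handles ASCII input only.
urlencode() {
  s="$1"
  out=""
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character
    case "$c" in
      [A-Za-z0-9_.~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s="$rest"
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: build the new-style UI path for a mount and secret path.
new_kv_ui_path() {
  printf 'ui/vault/secrets/%s/kv/%s/details\n' "$1" "$(urlencode "$2")"
}

new_kv_ui_path hma app/db
```

The notable change is that a nested secret path becomes a single URL segment, so the "/" separators inside it are percent-encoded.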
Workaround
Currently, no workaround exists.
Improvements that include automatic redirects for older URLs are planned for 1.15.4.
Fatal error during expiration metrics gathering causing Vault crash
Affected versions
This issue affects Vault Community and Enterprise versions:
- 1.13.9
- 1.14.5
- 1.15.1
A fix has been issued in Vault 1.13.10, 1.14.6, and 1.15.2.
Issue
A recent change to Vault to improve state change speed (e.g. becoming active or standby) introduced a concurrency issue that can lead to concurrent iteration and write on a map, causing a fatal error and crashing Vault. The error occurs when gathering lease and token metrics from the expiration manager. These metrics originate from the active node in an HA cluster, so if the original active node encounters this bug, a standby node takes over active duties and the cluster remains functional. The new active node is vulnerable to the same bug, but may not encounter it immediately.
Workaround
There is no workaround.
Audit devices could log raw data despite configuration
Affected versions
- 1.15.0 - 1.15.4
Issue
Enabling an audit device which specifies the log_raw option could lead to raw data being logged to other audit devices, regardless of whether they are configured to use log_raw.
The issue with raw data potentially appearing in logs where HMAC data is expected is fixed in the Vault 1.15.5 patch release.
Workaround
Do not enable any audit devices in Vault that use log_raw. If any audit devices are currently enabled with log_raw set to true, they should be disabled.
To view the options for audit devices via the CLI, use the --detailed flag for the vault audit list command:
$ vault audit list --detailed
The output will resemble the following, with log_raw shown under Options on any device which has it enabled:
Example output:
Path      Type    Description    Replication    Options
----      ----    -----------    -----------    -------
file1/    file    n/a            replicated     file_path=/var/log/vault/log1.json
file2/    file    n/a            replicated     file_path=/var/log/vault/log2.json log_raw=true
Disable any device with the log_raw option set to true using the command vault audit disable {path} (file2 in the above output):
$ vault audit disable file2
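On clusters with many audit devices, scanning the table by eye is error-prone. The following sketch filters the detailed listing for devices that have log_raw enabled. It assumes the table layout shown in the example output (a header line, a rule line, then one device per line with the path in the first column); the helper name devices_with_log_raw is hypothetical.

```shell
# Print the path of every audit device whose Options column contains
# log_raw=true. Expects the table printed by `vault audit list --detailed`
# on standard input; the header line and the rule line are skipped.
devices_with_log_raw() {
  awk 'NR > 2 && /log_raw=true/ { print $1 }'
}

# Sample input based on the example output above.
devices_with_log_raw <<'EOF'
Path Type Description Replication Options
---- ---- ----------- ----------- -------
file1/ file n/a replicated file_path=/var/log/vault/log1.json
file2/ file n/a replicated file_path=/var/log/vault/log2.json log_raw=true
EOF
```

Against a live server, the same filter can be fed from the CLI with vault audit list --detailed | devices_with_log_raw, and each printed path can then be passed to vault audit disable.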
See also: Disable audit via API.
Deadlock can occur on performance secondary clusters with many mounts
Affected versions
- 1.15.0 - 1.15.5
- 1.14.5 - 1.14.9
- 1.13.9 - 1.13.13
Issue
Vault 1.15.0, 1.14.5, and 1.13.9 introduced a worker pool to schedule periodic rollback operations on all mounts. This worker pool defaulted to using 256 workers. The worker pool introduced a risk of deadlocking on the active node of performance secondary clusters, leaving that cluster unable to service any requests.
The conditions required to cause the deadlock on the performance secondary:
- Performance replication is enabled
- The performance primary cluster has more than 256 non-local mounts. The more mounts the cluster has, the more likely the deadlock becomes
- One of the following occurs:
  - A replicated mount is unmounted or remounted, OR
  - A replicated namespace is deleted, OR
  - Replication path filters are used to filter at least one mount or namespace
Workaround
Set the VAULT_ROLLBACK_WORKERS environment variable to a number larger than the number of mounts in your Vault cluster and restart Vault:
$ export VAULT_ROLLBACK_WORKERS=1000
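Choosing the value requires knowing how many mounts exist. The sketch below is one rough way to estimate it from the CLI listings. The helper names and the headroom of 100 are hypothetical, counting auth mounts in addition to secrets engines is a conservative assumption, and the list commands only report mounts in the current namespace, so clusters that use namespaces need the counts summed across all of them.

```shell
# Count data rows in the table printed by `vault secrets list` or
# `vault auth list` (the header line and the rule line under it are skipped).
count_mount_rows() {
  awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Hypothetical sizing rule: secrets mounts + auth mounts + fixed headroom.
suggest_rollback_workers() {
  echo $(( $1 + $2 + 100 ))
}

# Usage against a live cluster (one namespace at a time):
#   secrets=$(vault secrets list | count_mount_rows)
#   auths=$(vault auth list | count_mount_rows)
#   export VAULT_ROLLBACK_WORKERS=$(suggest_rollback_workers "$secrets" "$auths")
```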
PKI OCSP GET requests can return HTTP redirect responses
If a base64 encoded OCSP request contains consecutive '/' characters, the GET request will return a 301 permanent redirect response. If the redirection is followed, the request will not decode as it will not be a properly base64 encoded request.
As a workaround, use OCSP POST requests, which are unaffected.
Impacted versions
Affects all current versions of 1.12.x, 1.13.x, 1.14.x, 1.15.x, 1.16.x
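A client that normally uses GET can detect the problematic case up front and switch to POST. The sketch below shows one way to do that. The helper names are hypothetical, the PKI secrets engine is assumed to be mounted at pki, VAULT_ADDR is assumed to be set, and the request file is assumed to contain a DER-encoded OCSP request produced by your own tooling.

```shell
# Return success when a base64-encoded OCSP request contains consecutive
# '/' characters and would therefore trigger the redirect on GET.
has_consecutive_slashes() {
  case "$1" in
    *//*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Hypothetical wrapper: send a DER-encoded OCSP request file with POST.
ocsp_post() {
  curl --silent \
    --header 'Content-Type: application/ocsp-request' \
    --data-binary "@$1" \
    "$VAULT_ADDR/v1/pki/ocsp"
}
```

Since POST requests are unaffected, a simpler policy is to call ocsp_post unconditionally and skip the check.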
Vault Enterprise Performance Standby nodes audit all request headers
Affected versions
- 1.15.0 - 1.15.7
Issue
Due to an issue in the new event framework, Performance Standby nodes in a Vault Enterprise cluster do not correctly receive configuration regarding which request headers should be written to the audit log.
Rather than no headers appearing in the audit logs by default, Vault Enterprise logs all headers on Performance Standby nodes.
The header issue was resolved in 1.15.8.
Workaround
Set the VAULT_AUDIT_DISABLE_EVENTLOGGER environment variable to true to disable the new underlying event framework and restart Vault:
$ export VAULT_AUDIT_DISABLE_EVENTLOGGER=true
On startup, Vault reverts to the audit behavior used in 1.14.x.
Performance Standbys revert to Standby mode on unseal
Affected versions
- 1.14.12
- 1.15.8
- 1.16.2
Issue
If you previously set a value for retention_months via the sys/internal/counters/config endpoint, upgrading to Vault Enterprise version 1.14.12, 1.15.8, or 1.16.2 will cause Performance Standby nodes to revert to Standby mode.
Nodes running Vault Enterprise version 1.14.12, 1.15.8, or 1.16.2 that are added to a cluster with an older-versioned leader will see any previously set retention_months value and attempt to write the new minimum value of 48. The storage write will result in a read-only error:
[ERROR] core: performance standby post-unseal setup failed: error="cannot write to readonly storage"
You can verify the status of your nodes by checking the /sys/health endpoint.
Deployments that rely on scaling across Performance Standbys will now forward all requests to the active node, increasing the utilization of the active node.
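One way to tell the two node states apart is the HTTP status code that the health endpoint returns. The sketch below maps the documented default status codes to a role name. The helper names are hypothetical, VAULT_ADDR is assumed to point at the individual node being inspected, and the mapping only holds if the request does not use query parameters that override the default codes.

```shell
# Map the default HTTP status codes of /sys/health to a node role.
health_role() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "not-initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical wrapper: query one node and print its role.
node_role() {
  health_role "$(curl --silent --output /dev/null \
    --write-out '%{http_code}' "$VAULT_ADDR/v1/sys/health")"
}
```

A node that is expected to be a Performance Standby but reports standby has reverted to Standby mode as described in this issue.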
Post-upgrade cluster membership
During the last step of a full upgrade, the old leader steps down, causing one of the Standby nodes to become leader.
A fix for the read-only storage error has been prioritized and escalated. The fix will be in releases 1.14.13, 1.15.9 and 1.16.3.
Important
If you have already upgraded to versions 1.14.12, 1.15.8, or 1.16.2, please refer to the workaround section for options.
Workaround
Once the leader of the cluster has been upgraded to version 1.14.12, 1.15.8, or 1.16.2, the workaround is to update the retention_months value on the active node via the sys/internal/counters/config endpoint:
$ vault write sys/internal/counters/config retention_months=48
This storage entry will be written to all nodes in the cluster, allowing them to immediately unseal as Performance Standbys.
After the new retention_months value is written to storage on the active node, adding new nodes to the cluster will not cause the read-only error.