Add support for KMS encryption for Amazon S3
smiklosovic committed May 15, 2023
1 parent 9152f11 commit fc25f7f
Showing 157 changed files with 3,081 additions and 1,144 deletions.
160 changes: 160 additions & 0 deletions README.adoc
@@ -1270,6 +1270,166 @@ execution will count down from the time this command was first executed.
You also have the possibility to specify datacenters to remove via the `--dcs` flag (it may be specified multiple times,
once for each datacenter separately)

## Client-side encryption with AWS KMS

To encrypt your SSTables so they are stored in a remote AWS S3 bucket already encrypted,
we leverage AWS KMS client-side encryption via https://github.com/aws/amazon-s3-encryption-client-java[this library].

### s3v2 protocol

The *s3v2* protocol has to be used for KMS client-side encryption.

Historically, Esop used version 1 of the AWS SDK; however, the library which makes client-side encryption possible
uses version 2. The version 1 and version 2 APIs can live in one project simultaneously. As the AWS KMS encryption
feature in Esop is rather new, we decided to code one additional S3 module which uses the V2 API, and
we left the V1 implementation untouched for users who still prefer it for whatever reason. We might eventually switch to
the V2 API completely and drop the code using the V1 API in the future.

To use client-side encryption, the protocol needs to be set to `s3v2` instead of `s3` (which uses the V1 API).
A user also needs to supply a KMS key ID to encrypt data with. The creation of a KMS key is out of scope of this document;
however, keep in mind that such a key has to be symmetric.
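
For illustration only (key management otherwise remains out of scope), a symmetric key can be created with the
AWS CLI; `create-key` produces a symmetric `SYMMETRIC_DEFAULT` key unless told otherwise, and the returned
`KeyMetadata.KeyId` is the value to pass to Esop:

----
aws kms create-key --description "Esop backup encryption key"
----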

An example of an encrypted backup is shown below:

----
java -jar esop.jar backup \
--storage-location=s3v2://instaclustr-oss-esop-bucket \
--data-dir /my/installation/of/cassandra/data/data \
--entities=ks1 \
--snapshot-tag=snapshot-1 \
--kmsKeyId=3bbebd10-7e5f-4fad-997a-89b51040df4c
----

Notice we use `s3v2` as the protocol. We also set `--kmsKeyId`, referencing the ID of the KMS key in AWS to use for encryption.

The KMS key ID is also read from the system property `AWS_KMS_KEY_ID` or the environment variable of the same name.
A key ID from the command line takes precedence over the system property, which takes precedence over the environment variable.
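
For example, the following invocation (re-using the flags from the example above) takes the key ID from the
environment instead of `--kmsKeyId`:

----
export AWS_KMS_KEY_ID=3bbebd10-7e5f-4fad-997a-89b51040df4c

java -jar esop.jar backup \
--storage-location=s3v2://instaclustr-oss-esop-bucket \
--data-dir /my/installation/of/cassandra/data/data \
--entities=ks1 \
--snapshot-tag=snapshot-1
----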

If `--storage-location` is not fully specified, Esop will try to connect to a running node via JMX and resolve
what cluster and datacenter it belongs to and what node ID it has.
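
For reference, a fully specified location carries the bucket, cluster name, datacenter and node ID in its path;
the cluster, datacenter and node names below are illustrative:

----
--storage-location=s3v2://instaclustr-oss-esop-bucket/test-cluster/dc1/node-1
----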

The uploading logic for a particular SSTable file is as follows. First, we need
to refresh the remote object to update its last modification date; the logic which leads
to it is this:

* try to list the tags of the remote object / key in a bucket
** if such a key is not found, we need to upload the file
* if we are doing an encrypting backup (by having `--kmsKeyId` set), we prepare a tag
which has `kmsKey` as its key and the KMS key ID as its value
* if the tags of the remote key are not set, or if they do not contain the `kmsKey` tag,
that means the remote object exists but is not encrypted. Hence, we
will need to upload it again, but encrypted this time
* if we are not skipping the refresh, we copy the file with the `kmsKey` tag

Upon the actual upload, we check whether `kmsKeyId` is set from the command line (or from system / environment
properties) and based on that we use either an encrypting or a non-encrypting S3 client. The encrypting S3
client wraps the non-encrypting one. If the encrypting client is used, everything
it uploads is encrypted on the client side and arrives in the AWS S3 bucket
already encrypted. A minimal sketch of this decision follows this paragraph.
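
The following sketch illustrates the decision described above, calling the V2 SDK directly; the `Decision` enum
and `decide` method are illustrative only and not Esop's actual API (the sketch also covers the
different-KMS-key case discussed further below):

[source,java]
----
import java.util.Optional;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.GetObjectTaggingResponse;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import software.amazon.awssdk.services.s3.model.Tag;

enum Decision { UPLOAD, REFRESH }

class FreshenLogic {
    static Decision decide(S3Client s3, String bucket, String key, String kmsKeyId) {
        GetObjectTaggingResponse tagging;
        try {
            // try to list the tags of the remote object / key
            tagging = s3.getObjectTagging(GetObjectTaggingRequest.builder()
                                                                 .bucket(bucket)
                                                                 .key(key)
                                                                 .build());
        } catch (NoSuchKeyException e) {
            return Decision.UPLOAD; // remote object absent, upload it
        }
        Optional<String> remoteKmsKey = tagging.tagSet().stream()
                                               .filter(tag -> "kmsKey".equals(tag.key()))
                                               .map(Tag::value)
                                               .findFirst();
        if (kmsKeyId != null && !remoteKmsKey.filter(kmsKeyId::equals).isPresent()) {
            // remote object is plaintext, or encrypted with a different key: re-upload encrypted
            return Decision.UPLOAD;
        }
        return Decision.REFRESH; // refresh the last modification date via a copy keeping the kmsKey tag
    }
}
----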

Due to the nature of Esop's directory layout and uploading logic, if there was a backup which
was not encrypted, we may decide later on to start encrypting. Let's cover this logic in the following example:

Let's have a backup consisting of 3 SSTables: S1, S2 and S3.

----
bucket:
S1
S2 - none of these SSTables are encrypted
S3
----

Later, new data was flushed into SSTables S4 and S5, so we have S1 - S5 on disk. However, now we want to encrypt. We might end up having this in a bucket:

----
bucket:
S1
S2 - none of these SSTables are encrypted
S3
S4 - encrypted
S5 - encrypted
----

If we did it like this, we would end up with a partly encrypted backup, which is not desired. For
this reason, if we see that an object is already in the S3 bucket, we need to read its _tags_
to see what key it was encrypted with. If it was not encrypted (it is not tagged), we know
that we need to upload it again, now encrypted. Hence, eventually, all SSTables of a new backup will be encrypted.

If there is a backup which was not encrypted and another backup which was, these two backups may have some
SSTables in common. Imagine this scenario:

----
bucket:
S1 not encrypted, backup 1
S2 not encrypted, backup 1
S3 not encrypted, backup 1
----

As we have started to encrypt and now want to back up, imagine that S1 and S2 were compacted into S4 and that additional encrypted SSTables S5 and S6 were created:

----
bucket:
S1 not encrypted, backup 1, compacted into S4
S2 not encrypted, backup 1, compacted into S4
S3 not encrypted, backup 1
S4 encrypted, backup 2 - compacted S1 and S2
S5 encrypted, backup 2
S6 encrypted, backup 2
----

We see that we are going to back up S3, S4 (S1 and S2 compacted), S5 and S6. S3 is already uploaded,
but it is not encrypted, so S3 will be re-uploaded and encrypted. S4, S5 and S6 are not present remotely yet, so all of them will be encrypted and uploaded.

After doing so, we see this in the bucket:

----
bucket:
S1 not encrypted, backup 1, compacted into S4
S2 not encrypted, backup 1, compacted into S4
S3 encrypted, backup 1 and backup 2 // S3 is encrypted from now on
S4 encrypted, backup 2 - compacted S1 and S2
S5 encrypted, backup 2
S6 encrypted, backup 2
----

Backup no. 1 consists of SSTables S1, S2 (both non-encrypted) and S3 (now encrypted). Backup no. 2 consists of S3 - S6, all of which are encrypted.

Now, if we remove backup 1, only SSTables S1 and S2 will be removed, because S3 is part of
backup 2 as well. As we remove all non-encrypted backups, we will be left only with backups whose SSTables are encrypted. Hence, we have converted a bucket of non-encrypted backups to an encrypted-only one.

This logic introduces these questions:

* What if I already have an encrypted backup and I want to use a different KMS key?
* What would a restore look like when my backup contains SSTables which are both encrypted and in plaintext? What would it look like when I want to restore but different keys were used?

The first question is rather easy to answer. If you want to use a different KMS key, that is the same
situation as if we were going to upload and no key had been used. If we detect that an already uploaded
object was encrypted with a KMS key (read from its tags) different from the key we want to use now,
we just need to re-upload such an SSTable and encrypt it with the new KMS key.
All other logic already explained stays the same.

Restoration reads the tags of a remote object to see what KMS key it was encrypted with. If the remote
object was stored as plaintext, no wrapping S3 encryption client is used. If the KMS key
used is the same as the one supplied on the command line, the already initialized encrypting S3 client is used.
If a particular object was encrypted with a KMS key we do not yet have an encrypting S3 client for,
such a client is dynamically created as part of the restoration process and cached to be re-used
for the decryption of any other object using the same KMS key; a sketch of this caching is shown below.
The net result of this logic is that a backup may consist of SSTables encrypted with
any number of KMS keys, and as long as each such KMS key exists in AWS KMS and we
can reference it, the backup will be decrypted just fine.
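
A minimal sketch of this per-key client cache follows; `EncryptionClientCache` and `clientFor` are illustrative
names rather than Esop's actual API, and the builder calls assume version 3 of the S3 encryption client:

[source,java]
----
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.encryption.s3.S3EncryptionClient;

class EncryptionClientCache {
    private final S3Client plainClient = S3Client.create();
    private final Map<String, S3Client> clientsPerKmsKey = new ConcurrentHashMap<>();

    // kmsKeyIdFromTag is the value of the remote object's kmsKey tag, or null if untagged
    S3Client clientFor(String kmsKeyIdFromTag) {
        if (kmsKeyIdFromTag == null) {
            return plainClient; // object stored as plaintext, no wrapping encryption client needed
        }
        // create the encrypting client lazily, then re-use it for every object with the same key
        return clientsPerKmsKey.computeIfAbsent(kmsKeyIdFromTag,
                                                keyId -> S3EncryptionClient.builder()
                                                                           .wrappedClient(plainClient)
                                                                           .kmsKeyId(keyId)
                                                                           .build());
    }
}
----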

We *do not* encrypt Esop's manifest files. This is purely practical: if we encrypted a manifest as well,
operators would need to decrypt a manifest downloaded from a bucket on their own with some other tool. As a manifest
does not contain any sensitive information and serves solely as a metadata file describing what a particular backup
consists of, we chose not to encrypt it to make life easier for operators. The manifest is the only file
which is not encrypted - all other files are.

We also decided not to store the `kmsKeyId` in a manifest. It is better if a particular object is tagged with the ID
of the key it was encrypted with rather than storing it in a manifest. If we used different KMS keys over time,
manifests would become obsolete and restoration of such a backup would not be possible, as the key would have
already changed. Tags make restoration in this scenario possible.

## Logging

We are using Logback. There is already a `logback.xml` embedded in the built JAR. However, if you
38 changes: 35 additions & 3 deletions pom.xml
@@ -15,7 +15,8 @@
<instaclustr.commons.version>1.5.0</instaclustr.commons.version>
<azure-storage.version>8.6.6</azure-storage.version>
<google-cloud-libraries.version>26.0.0</google-cloud-libraries.version>
<aws-java-sdk.version>1.11.782</aws-java-sdk.version>
<aws-java-sdk.version>1.12.441</aws-java-sdk.version>
<s3.encryption.client.version>3.0.0</s3.encryption.client.version>
<slf4j.version>1.7.30</slf4j.version>
<logback.version>1.2.3</logback.version>

@@ -33,7 +34,7 @@
<maven.javadoc.plugin.version>3.1.1</maven.javadoc.plugin.version>
<maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
<maven.surefire.plugin.version>2.22.2</maven.surefire.plugin.version>
<git.command.plugin.version>2.2.4</git.command.plugin.version>
<git.command.plugin.version>4.9.10</git.command.plugin.version>
<nexus.staging.maven.plugin.version>1.6.8</nexus.staging.maven.plugin.version>
<cassandra.maven.plugin.version>3.6</cassandra.maven.plugin.version>

@@ -101,6 +102,13 @@
<type>pom</type>
<scope>import</scope>
</dependency>
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>bom</artifactId>
<version>2.20.45</version>
<type>pom</type>
<scope>import</scope>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>libraries-bom</artifactId>
@@ -134,6 +142,22 @@
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
</dependency>

<dependency>
<groupId>software.amazon.encryption.s3</groupId>
<artifactId>amazon-s3-encryption-client-java</artifactId>
<version>${s3.encryption.client.version}</version>
</dependency>

<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>kms</artifactId>
</dependency>

<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>apache-client</artifactId>
</dependency>

<dependency>
<groupId>com.microsoft.azure</groupId>
@@ -256,13 +280,21 @@
<version>${git.command.plugin.version}</version>
<executions>
<execution>
<id>get-the-git-infos</id>
<goals>
<goal>revision</goal>
</goals>
<phase>initialize</phase>
</execution>
</executions>
<configuration>
<dotGitDirectory>${project.basedir}/.git</dotGitDirectory>
<generateGitPropertiesFile>true</generateGitPropertiesFile>
<generateGitPropertiesFilename>${project.build.outputDirectory}/git.properties</generateGitPropertiesFilename>
<includeOnlyProperties>
<includeOnlyProperty>^git.build.(time|version)$</includeOnlyProperty>
<includeOnlyProperty>^git.commit.id.(abbrev|full)$</includeOnlyProperty>
</includeOnlyProperties>
<commitIdGenerationMode>full</commitIdGenerationMode>
</configuration>
</plugin>
<plugin>
7 changes: 4 additions & 3 deletions src/main/java/com/instaclustr/esop/azure/AzureBackuper.java
@@ -7,6 +7,7 @@
import com.google.inject.assistedinject.Assisted;
import com.google.inject.assistedinject.AssistedInject;
import com.instaclustr.esop.azure.AzureModule.CloudStorageAccountFactory;
import com.instaclustr.esop.impl.ManifestEntry;
import com.instaclustr.esop.impl.RemoteObjectReference;
import com.instaclustr.esop.impl.backup.BackupCommitLogsOperationRequest;
import com.instaclustr.esop.impl.backup.BackupOperationRequest;
@@ -67,7 +68,7 @@ protected void cleanup() throws Exception {
}

@Override
public FreshenResult freshenRemoteObject(final RemoteObjectReference object) throws Exception {
public FreshenResult freshenRemoteObject(ManifestEntry manifestEntry, final RemoteObjectReference object) throws Exception {
final CloudBlockBlob blob = ((AzureRemoteObjectReference) object).blob;

final Instant now = Instant.now();
@@ -91,11 +92,11 @@ public FreshenResult freshenRemoteObject(final RemoteObjectReference object) throws Exception {
}

@Override
public void uploadFile(final long size,
public void uploadFile(final ManifestEntry manifestEntry,
final InputStream localFileStream,
final RemoteObjectReference objectReference) throws Exception {
final CloudBlockBlob blob = ((AzureRemoteObjectReference) objectReference).blob;
blob.upload(localFileStream, size);
blob.upload(localFileStream, manifestEntry.size);
}

@Override
src/main/java/com/instaclustr/esop/azure/AzureBucketService.java
@@ -1,10 +1,11 @@
package com.instaclustr.esop.azure;

import static java.lang.String.format;

import java.net.URISyntaxException;
import java.util.stream.StreamSupport;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.inject.assistedinject.Assisted;
import com.google.inject.assistedinject.AssistedInject;
import com.instaclustr.esop.azure.AzureModule.CloudStorageAccountFactory;
@@ -21,8 +22,8 @@
import com.microsoft.azure.storage.blob.BlobRequestOptions;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static java.lang.String.format;

public class AzureBucketService extends BucketService {

15 changes: 8 additions & 7 deletions src/main/java/com/instaclustr/esop/azure/AzureModule.java
@@ -1,13 +1,11 @@
package com.instaclustr.esop.azure;

import static com.google.common.base.Strings.isNullOrEmpty;
import static com.instaclustr.esop.guice.BackupRestoreBindings.installBindings;
import static com.instaclustr.kubernetes.KubernetesHelper.isRunningAsClient;
import static java.lang.String.format;

import java.net.URISyntaxException;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.inject.AbstractModule;
import com.google.inject.Provider;
import com.google.inject.Provides;
@@ -18,8 +16,11 @@
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageCredentialsAccountAndKey;
import io.kubernetes.client.apis.CoreV1Api;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static com.google.common.base.Strings.isNullOrEmpty;
import static com.instaclustr.esop.guice.BackupRestoreBindings.installBindings;
import static com.instaclustr.kubernetes.KubernetesHelper.isRunningAsClient;
import static java.lang.String.format;

public class AzureModule extends AbstractModule {

