S3 objects lifecycle management #2721

Open
mzueva opened this issue Jul 7, 2022 · 2 comments
Assignees
Labels
kind/enhancement New feature or request state/has-doc Issues that have documentation

Comments

@mzueva
Collaborator

mzueva commented Jul 7, 2022

Background

Requirements for S3 object lifecycle management:

  • Allow multiple transition rules to be specified for a bucket, based on file prefix and/or glob pattern
  • All available S3 storage archive classes shall be supported (S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive) as well as deletion of objects
  • User shall receive a notification on lifecycle events:
    • User shall receive a notification when data is within N days of transition. The notification shall be sent with a configurable delay
    • User shall have an option to delay data transition using a link from the notification email
    • User shall receive a notification once data transition is initiated
    • A default template shall be available for such notifications; notifications (text, subject, delays) shall also be configurable per bucket
  • User shall be able to restore archived files and folders using CLI and GUI

Approach

  • A new object, LifecyclePolicy, shall be attached to an S3 bucket. It is similar to the AWS BucketLifecycleConfiguration object and contains a set of transition rules, including filters (prefix, tags), storage class and transition days. This object shall also contain notification settings. E.g. in this object we specify that files with the StorageType=bcl tag shall be deleted after 90 days, and files with the StorageType=fastq tag shall be moved to S3 Glacier Deep Archive after 180 days and deleted after 5 years.
  • When files are uploaded to storage after pipeline execution or from GUI/CLI, they shall be automatically tagged. Tagging preferences shall be defined in a System Preference, e.g. [{"*.bcl" : "StorageType=bcl"}], or as a JSON file for a pipeline. Note that no more than 10 tags are allowed. The upload API, including the generate-URL API methods, shall support tagging from the System Preference as well.
  • Pipe CLI shall provide a command for native object tagging using S3 batch requests
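
To make the two ideas above concrete, here is a minimal Python sketch of the example policy and of glob-based tagging. It is an illustration only: the names `TransitionRule`, `LifecyclePolicy`, `notify_days_before` and `tags_for` are hypothetical and not taken from the implementation, the `*.fastq.gz` pattern is an assumed addition to the preference shown above, and 5 years is approximated as 5 * 365 days. `DEEP_ARCHIVE` is the S3 storage class identifier for S3 Glacier Deep Archive.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

MAX_TAGS_PER_OBJECT = 10  # limit mentioned in the issue text


@dataclass
class TransitionRule:
    """One rule: which files it applies to, after how many days, and what happens."""
    tags: Dict[str, str]
    days: int
    storage_class: Optional[str]  # None means "delete the object"
    prefix: str = ""


@dataclass
class LifecyclePolicy:
    """Hypothetical container for rules plus notification settings."""
    rules: List[TransitionRule] = field(default_factory=list)
    notify_days_before: int = 7  # assumed default, not from the issue


# The policy described in the text: bcl deleted after 90 days,
# fastq archived after 180 days and deleted after about 5 years.
EXAMPLE_POLICY = LifecyclePolicy(rules=[
    TransitionRule(tags={"StorageType": "bcl"}, days=90, storage_class=None),
    TransitionRule(tags={"StorageType": "fastq"}, days=180, storage_class="DEEP_ARCHIVE"),
    TransitionRule(tags={"StorageType": "fastq"}, days=5 * 365, storage_class=None),
])

# Same shape as the System Preference example; the second entry is assumed.
TAGGING_PREFERENCE = [
    {"*.bcl": "StorageType=bcl"},
    {"*.fastq.gz": "StorageType=fastq"},
]


def tags_for(file_name: str, preference=TAGGING_PREFERENCE) -> Dict[str, str]:
    """Return the tags a file would receive on upload, based on glob patterns."""
    tags: Dict[str, str] = {}
    for entry in preference:
        for pattern, tag in entry.items():
            if fnmatchcase(file_name, pattern):
                key, _, value = tag.partition("=")
                tags[key] = value
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError("no more than 10 tags are allowed per object")
    return tags
```

With this preference, `tags_for("data/run1/s_1_1101.bcl")` yields a single `StorageType=bcl` tag, and a file that matches no pattern stays untagged.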

Object Lifecycle Monitor

A new daemon service shall monitor buckets with LifecyclePolicy attached using the following algorithm:

  • Monitoring is done for a configured folder in the bucket. Each folder under this prefix is considered to be a dataset. For the file structure below, we have three datasets (run1, run2, run3) for the prefix data/
 bucket/
     data/
         run1/
         run2/
         run3/
  • Each dataset is processed individually. When a new dataset is detected, a new LifecyclePolicy shall be created for it from the default template attached to the storage. A dataset without an assigned LifecyclePolicy is considered new
  • Each file in the dataset is checked against the lifecycle policy. A file is considered eligible when:
    • it matches the tags and prefix from the policy
    • it is not already in the target storage class
    • its age is greater than or equal to the configured days value
    • TBD: how to check a file? Using the glob preference that drives tagging is fast, but files that were never actually tagged may then be treated as eligible for transition. Checking the actual tags in S3 is very slow.
  • If some of the files become eligible for transition in N days and this matches the notification settings (global or bucket), the user shall receive a notification with a link to delay the transition. An API method shall be implemented to change the expiration days for a dataset on user request
  • If some of the files are eligible for transition now, a new AWS BucketLifecycleConfiguration shall be created with a matching path/tags filter and expiration days 0, so that it is applied immediately. Check first that such a policy doesn't exist yet. TBD: all BCLs in a path will be transferred at once, even if some files were uploaded earlier/later
  • Any existing AWS BucketLifecycleConfiguration policy shall be dropped after a configurable number of days
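
The dataset detection and eligibility steps above can be sketched in Python as follows. This is not the monitor's actual code: `StoredFile`, `Rule` and all function names are hypothetical, and the sketch assumes the object keys, tags, storage class and last-modified time have already been fetched by some other means.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Set


@dataclass
class StoredFile:
    """Hypothetical view of one object in the bucket."""
    key: str
    tags: Dict[str, str]
    storage_class: str
    last_modified: datetime


@dataclass
class Rule:
    """Hypothetical single transition rule of a dataset's policy."""
    prefix: str
    tags: Dict[str, str]
    target_class: str
    days: int


def datasets(keys: Iterable[str], prefix: str) -> Set[str]:
    """Each first-level folder under the monitored prefix is one dataset."""
    found: Set[str] = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if "/" in rest:  # skip files lying directly under the prefix
                found.add(rest.split("/", 1)[0])
    return found


def age_days(f: StoredFile, now: datetime) -> int:
    """Whole days since the object was last modified."""
    return (now - f.last_modified).days


def matches(f: StoredFile, rule: Rule) -> bool:
    """Prefix must match and every tag in the rule must be present on the file."""
    return f.key.startswith(rule.prefix) and all(
        f.tags.get(k) == v for k, v in rule.tags.items()
    )


def is_eligible(f: StoredFile, rule: Rule, now: datetime) -> bool:
    """The three conditions from the list above."""
    return (
        matches(f, rule)
        and f.storage_class != rule.target_class
        and age_days(f, now) >= rule.days
    )


def days_until_eligible(f: StoredFile, rule: Rule, now: datetime) -> int:
    """Days left before the file reaches the rule's age threshold (0 if reached)."""
    return max(rule.days - age_days(f, now), 0)
```

A notification step could then compare `days_until_eligible` with the configured N for every file that `matches` the rule and is not yet in the target class.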

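For the "apply immediately" step of the monitor, the rule handed to S3 could look roughly like the dictionary built below. This is a sketch under assumptions: the helper names and the rule ID are hypothetical, and only a storage class transition with `Days` set to 0 is shown (deletion is not covered). The dictionary layout follows the S3 lifecycle rule structure (`ID`, `Status`, `Filter`, `Transitions`). Since S3 replaces the whole bucket lifecycle configuration on each write, the sketch merges the new rule into the existing list and skips it when a rule with the same ID is already present.

```python
from typing import Dict, List


def build_transition_rule(rule_id: str, prefix: str, tags: Dict[str, str],
                          storage_class: str) -> dict:
    """Build one lifecycle rule that transitions matching objects right away."""
    tag_list = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    if tag_list:
        # Prefix and tags together have to be wrapped in an "And" filter.
        rule_filter = {"And": {"Prefix": prefix, "Tags": tag_list}}
    else:
        rule_filter = {"Prefix": prefix}
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": rule_filter,
        "Transitions": [{"Days": 0, "StorageClass": storage_class}],
    }


def add_rule_if_absent(existing_rules: List[dict], new_rule: dict) -> List[dict]:
    """Return the rule list to write back, leaving it unchanged if the ID exists."""
    if any(rule.get("ID") == new_rule["ID"] for rule in existing_rules):
        return list(existing_rules)
    return list(existing_rules) + [new_rule]
```

With boto3, the merged list would be passed as `LifecycleConfiguration={"Rules": rules}` to `put_bucket_lifecycle_configuration`; that call is left out here so the sketch runs without AWS access.
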
Restoring files
API and CLI methods shall be implemented to request restoring objects from archive storage classes and to monitor the restore process.

  • restore folder
  • get restoring status
  • delete from glacier
  • notification on completion TBD
  • permanent restoring (change class back to Standard) TBD
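
For the "get restoring status" item, S3 reports restore progress on a HEAD request through the `x-amz-restore` header (exposed by boto3's `head_object` as the `Restore` field), with a value such as `ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"`. Below is a minimal Python sketch of interpreting it. The function names are hypothetical, the RUNNING/SUCCEEDED names mirror the statuses used later in this thread, and taking the earliest expiry date across files as the overall "restored till" value is an assumption.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Iterable, Optional, Tuple

_RESTORE_RE = re.compile(
    r'ongoing-request="(true|false)"(?:,\s*expiry-date="([^"]+)")?'
)


def parse_restore_header(value: Optional[str]) -> Tuple[bool, Optional[datetime]]:
    """Return (restored, restored_till) for one object's restore header."""
    if not value:
        return False, None  # no restore requested or not yet visible
    match = _RESTORE_RE.search(value)
    if match is None or match.group(1) == "true":
        return False, None  # restore still in progress
    expiry = parsedate_to_datetime(match.group(2)) if match.group(2) else None
    return True, expiry


def overall_status(headers: Iterable[Optional[str]]) -> Tuple[str, Optional[datetime]]:
    """SUCCEEDED only when every file is restored; otherwise still RUNNING."""
    results = [parse_restore_header(h) for h in headers]
    if results and all(done for done, _ in results):
        dates = [d for _, d in results if d is not None]
        # Assumption: the restored copy set is usable until the earliest expiry.
        return "SUCCEEDED", (min(dates) if dates else None)
    return "RUNNING", None
```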
@mzueva mzueva added the kind/enhancement New feature or request label Jul 7, 2022
@mzueva mzueva assigned mzueva, SilinPavel and tcibinan and unassigned tcibinan Jul 7, 2022
mzueva added a commit that referenced this issue Aug 16, 2022
* issue 2721 changing storage lifecycle policy model, now it is possible to fine tune it with rule approach [WIP]

* issue 2721 refactoring

* issue 2721 change lifecyclePolicy type in StoragePolicy from object to String, to be able to work flexible with it in StorageProvider level

* issue 2721 renaming

* issue 2721 changes to be compatible with S3 Lifecycle Policy rules

* issue 2721 changing object model for datastorage lifecycle policy

* issue 2721 changing object model for datastorage lifecycle policy + CRUD

* issue 2721 added verification for storage lifecycle object CRUD operations

* issue 2721 logging

* issue 2721 refactor

* issue 2721 tests for storage objects lifecycle model

* issue 2721 rollback for changes of previous approach

* issue 2721 cleanup

* issue 2721 new approach

* issue 2721 StorageLifecycleRuleProlongation as entity

* issue 2721 refactor, added StorageLifecycleRuleExecution

* issue 2721 CRU operation for StorageLifecycleRuleExecution

* issue 2721  tests are added

* issue 2721 added constraint for lifecycle rule object

* Issue #2721 Move SQL migration to current date

Co-authored-by: mzueva <mariia_zueva@epam.com>
@SilinPavel
Member

Comment on restoring files implementation:

Main aspects:

  • The server only initiates the restore process: a Restore action is created with status INITIATED
  • sls actually starts the process by creating a batch operation job to restore the specified files (AWS cloud); the status changes to RUNNING
  • After the job is created, in its subsequent loops sls checks the restore status by issuing a HEAD request for each involved file and checking whether the Restoring header has changed to false and carries the restored date. If so, it updates the restore status to SUCCEEDED and sets the appropriate restoredTill value
  • If there is already a RUNNING restore process, all related INITIATED actions wait until it is done
  • If there are several related INITIATED actions, only one (the latest) is applied and all others are CANCELLED:

Example 1 of cancellation process:

We have the following object hierarchy:

  • /dataset/
    • file1
    • file2
    • file3

A user initiates a restore of folder /dataset/
Another user initiates a restore of file /dataset/file1

In this case both restores will be applied: first /dataset/ will be restored, and after that /dataset/file1

Example 2 of cancellation process:

The same hierarchy as for Example 1.

But now:
A user initiates a restore of file /dataset/file1
Another user initiates a restore of folder /dataset/

In this case only the restore for /dataset/ will be applied, because /dataset/ includes /dataset/file1 and it is the latest restore action
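
One way to read the two examples is: an INITIATED action is cancelled when a later INITIATED action covers its path. The Python sketch below implements that reading only. `RestoreAction`, `covers` and `resolve` are hypothetical names, `order` stands in for the creation time, and the waiting behaviour for an already RUNNING restore is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RestoreAction:
    path: str          # folder paths end with "/", file paths do not
    order: int         # creation order; a higher value means a later action
    status: str = "INITIATED"


def covers(outer: str, inner: str) -> bool:
    """A folder covers everything beneath it; a file covers only itself."""
    if outer.endswith("/"):
        return inner.startswith(outer)
    return inner == outer


def resolve(actions: List[RestoreAction]) -> List[RestoreAction]:
    """Cancel actions superseded by a later, covering action; return the rest."""
    pending = sorted(
        (a for a in actions if a.status == "INITIATED"),
        key=lambda a: a.order,
    )
    for i, earlier in enumerate(pending):
        if any(covers(later.path, earlier.path) for later in pending[i + 1:]):
            earlier.status = "CANCELLED"
    return [a for a in pending if a.status == "INITIATED"]
```

Under this reading, Example 1 keeps both actions in creation order, while in Example 2 the earlier /dataset/file1 action is marked CANCELLED and only /dataset/ remains.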

NShaforostov added a commit that referenced this issue Nov 7, 2022
NShaforostov added a commit that referenced this issue Nov 7, 2022
- 'Sensitive storages' (#1036)
- 'Tool image history' (#1140)
- 'Custom run capabilities' (#2234, #2295, #2323)
- 'Storage Lifecycle' (#2721, #2759)
@NShaforostov
Collaborator

Docs were added via #2547 and can be found here.

@NShaforostov NShaforostov added the state/has-doc Issues that have documentation label Nov 7, 2022
mzueva added a commit that referenced this issue Jan 17, 2023
* issue 2721 draft for notification templates for DATASTORAGE_LIFECYCLE_ACTION and DATASTORAGE_LIFECYCLE_RESTORE_ACTION

* issue 2721 deployment configuration for SLS

* issue 2721 Deployment: Added configuration of storage.lifecycle.service.cloud.config on deploy

* issue 2721 Deployment: email template change

* issue 2721 Deployment: fix html template

* issue 2721 Deployment: expand deployment

* issue 2721 Deployment: fix yml

* issue 2721 Deployment: update AWS prerequisites with new roles and policies

* issue 2721 Deployment: logging approach

* issue 2721 Deployment: update deployment process with respect to creds from region

* issue 2721 Deployment: log backup days to parameters

* issue 2721 Deployment: fix links in emails

* issue 2721 Deployment: documentation changes

* Issue #2721: Storage Lifecycle Service: Deployment - template cleanups

* Fix SLS notifications wording

Co-authored-by: Ekaterina_Kazachkova <ekaterina_kazachkova@epam.com>
Co-authored-by: mzueva <mariia_zueva@epam.com>
mzueva added a commit that referenced this issue Jan 18, 2023
mzueva added a commit that referenced this issue Jan 18, 2023
mzueva added a commit that referenced this issue Feb 21, 2023
SilinPavel added a commit that referenced this issue Mar 2, 2023
SilinPavel added a commit that referenced this issue Mar 2, 2023
…g lifecycle events, to reduce number of "empty" cycles
SilinPavel added a commit that referenced this issue Mar 2, 2023
SilinPavel added a commit that referenced this issue Mar 2, 2023
SilinPavel added a commit that referenced this issue Mar 6, 2023
SilinPavel added a commit that referenced this issue Mar 6, 2023
…g lifecycle events, to reduce number of "empty" cycles
SilinPavel added a commit that referenced this issue Mar 6, 2023
4 participants