
PDP 22 (BK Based Storage (Tier 2) for Pravega)

Derek Moore edited this page Jul 9, 2021 · 1 revision

Proposal for implementation of BK based storage (Tier-2) for Pravega

Status: Under Discussion

Related issues: 1949, 431

Motivation

There are multiple reasons why users of Pravega may want a Tier-2 based on Apache BookKeeper. Here are a few of them:

  1. Trial deployments: The user just wants a small distributed Pravega deployment without the hassle of deploying and managing a separate Tier-2 storage. Currently Pravega needs a deployment of Tier-1 (Apache BookKeeper) as well as Tier-2. This may be too much effort just to try out Pravega.
  2. Deployment with a small footprint: Users may not want to store data long term. By using Apache BookKeeper as Tier-2, they can use it as a temporary Tier-2 for the short lifetime of the data.
  3. Two BK deployments: The user deploys two Apache BookKeeper clusters: one configured to store a small amount of data on fast disks, used as Tier-1; the other configured with relatively slower disks that can store large amounts of data, used as Tier-2.

Approaches

1. Write data only once to BookKeeper:

Currently data is written to Tier-1 in the form of a durable log and then written permanently to Tier-2. If the user has a single Apache BookKeeper deployment, it may make sense to store data directly in Tier-1 instead, since writing to the same deployment once as a durable log and then coming back to write the same data again as Tier-2 storage is inefficient.

Status: Implementing this involves changing the basic design of Pravega, which needs more discussion. It is not recommended to follow this approach.

2. Use Apache BookKeeper as any other Tier-2

This approach involves implementing the Storage interface using BookKeeper.

We have decided that we will go with option 2.

Storage Ledger:

Apache BookKeeper provides low-level primitives for a write-ahead log. We need to build infrastructure around it, called a "storage ledger", to ensure that it can be used as a general-purpose storage implementation. Irrespective of the approach chosen above, a storage ledger implementation is necessary. A storage ledger ties a number of BookKeeper ledgers together to represent one storage entity, similar to a file. The metadata for this entity is stored in ZooKeeper.
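To make the idea concrete, here is a minimal sketch of what the storage-ledger metadata could look like. All names (`StorageLedgerMetadata`, `LedgerRef`, etc.) are illustrative assumptions, not Pravega's actual code; in the real design this record would be serialized into a ZooKeeper znode rather than held in memory.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical metadata for a "storage ledger": an ordered chain of
 * BookKeeper ledgers that together represent one file-like entity.
 */
class StorageLedgerMetadata {
    /** One BookKeeper ledger in the chain: its id and the byte offset at which it starts. */
    static final class LedgerRef {
        final long ledgerId;
        final long startOffset;
        LedgerRef(long ledgerId, long startOffset) {
            this.ledgerId = ledgerId;
            this.startOffset = startOffset;
        }
    }

    private final List<LedgerRef> ledgers = new ArrayList<>();
    private long length = 0;          // total bytes across all ledgers
    private boolean readOnly = false; // "update read/write permission" becomes a flag here

    /** Add a ledger of the given size to the end of the chain. */
    void addLedger(long ledgerId, long sizeInBytes) {
        ledgers.add(new LedgerRef(ledgerId, length));
        length += sizeInBytes;
    }

    /** Concat as a pure metadata operation: splice another chain onto this one. */
    void concat(StorageLedgerMetadata other) {
        for (LedgerRef ref : other.ledgers) {
            ledgers.add(new LedgerRef(ref.ledgerId, length + ref.startOffset));
        }
        length += other.length;
    }

    long getLength()        { return length; }
    int ledgerCount()       { return ledgers.size(); }
    void setReadOnly(boolean ro) { readOnly = ro; }
    boolean isReadOnly()    { return readOnly; }
}
```

Note that with this layout, concat never moves data: it only rewrites the ledger chain and offsets, which is why it can be a metadata-only operation.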

Here are the operations needed to implement the Storage interface, and how each can be implemented:

  1. Append at an offset / at the end: BookKeeper ledger APIs support this directly.

  2. Read at a given offset (random-access read): BookKeeper allows reading a given entry through the LedgerHandle.read API. We need to build an algorithm to map a byte offset to an entry-id.

  3. Concat (atomic if possible): Concat will be a metadata operation.

  4. Truncate (currently optional): This can be a metadata operation.

  5. Update read/write permission: This will be a metadata operation.

  6. Store metadata: Metadata will be stored in ZooKeeper.
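The offset-to-entry-id mapping in item 2 is the only non-trivial algorithm above. Since BookKeeper entries may have varying sizes, one possible approach is to keep the starting offset of each entry and binary-search it. The sketch below is an assumption about how this could be done, not the actual Pravega implementation, and the name `EntryOffsetIndex` is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index mapping a byte offset to a BookKeeper entry-id. */
class EntryOffsetIndex {
    // Index into this list == entry-id; value == byte offset where that entry starts.
    private final List<Long> entryStartOffsets = new ArrayList<>();
    private long nextOffset = 0;

    /** Record an entry of the given length; returns the entry-id it was assigned. */
    long append(int entryLength) {
        entryStartOffsets.add(nextOffset);
        long entryId = entryStartOffsets.size() - 1;
        nextOffset += entryLength;
        return entryId;
    }

    /** Binary-search for the entry-id whose byte range contains the given offset. */
    long entryIdForOffset(long offset) {
        int lo = 0, hi = entryStartOffsets.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (entryStartOffsets.get(mid) <= offset) {
                lo = mid;   // entry 'mid' starts at or before 'offset'
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }
}
```

A read at offset N would then resolve the entry-id with this index and fetch that entry via LedgerHandle.read, skipping any leading bytes within the entry.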

Pravega already has one implementation of a storage ledger in BookkeeperLog.java. However, it is geared towards the durable log implementation and lacks much of the behavior expected from the Storage interface.

Open Question: Shall we improve the existing BookkeeperLog to implement a generic storage, or implement a separate storage ledger for Tier-2?

Update: It was decided that we will have separate implementations for BookKeeper-based Tier-1 and Tier-2, mainly because the expectations and behavior of the two tiers are different.

Ownership of StorageLedgers and impact of ReadOnlySegmentStore

If ownership of the StorageLedger changes before the previous owner closes it successfully, the LastAddConfirmed (LAC) may not be updated to the latest value, so recent data may not be immediately visible to the new owner. Pravega also uses Tier-2 to store segment state, which must be updated and immediately visible for a correct representation of the segment. To overcome this shortcoming, both openRead and openWrite need to fence and take full ownership of the ledger: read-only as well as read-write access must explicitly own the ledger.
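The ownership rule above can be illustrated with a simplified epoch-based fencing sketch. This is only an analogy: in the real design, BookKeeper's own ledger fencing (opening a ledger for recovery fences previous writers) would do this work, and the class and method names below are invented for illustration.

```java
/**
 * Hypothetical epoch-based fencing: every open (read-only or read-write)
 * bumps an epoch, so any handle from a previous owner becomes invalid.
 */
class FencedLedger {
    private long currentEpoch = 0;

    /** Any open — openRead or openWrite — fences older owners by bumping the epoch. */
    long open() {
        return ++currentEpoch;
    }

    /** An access presented with a stale epoch is rejected. */
    boolean access(long epoch) {
        return epoch == currentEpoch;
    }
}
```

Because even openRead bumps the epoch here, a reader is guaranteed to see the latest state rather than data hidden behind a stale LAC, at the cost of invalidating the previous owner's handle.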
