Compound Operations #946

jshook · 2023-01-24T19:50:19Z

jshook
Jan 24, 2023
Maintainer

Multiple native drivers support batches, some inline and some as a series of accumulated statements against a batch interface. Batches are very relevant for high-throughput scenarios. For example, with CQL, unlogged batches against the same partition achieve the highest row-by-row throughput of any method. It is important to enable testing of these scenarios.

Terms

Cycles and Operations

Historically, a given cycle in NoSQLBench was associated with a single operation comprised of a single statement.

---
title: BEFORE
---
graph LR;
cycle--> |seed for|op;
op --> stmt;

However, with current NB APIs, the following is more accurate:

---
title: With NB5 APIs
---
graph LR;
cycle  ---> |seed for|op1(op1);
cycle ---> |seed for|op2(op2);
cycle ---> |seed for|op3("...");
op1 --> |gen| op2;
op2 --> |gen| op3;

Operations and Statements

As well, previous implementations of statement-oriented protocols like CQL used the cycle per statement paradigm. This breaks down when constructing logical batches of statements, where each interior statement needs its own cycle value as a seed to create meaningful examples.

---
title: CURRENT
---
graph LR;
cycle --> |seed for| statement --> op;

---
title: NEEDED
---
graph LR;
cycle --> |seed for| batch --> |contains| statement1 & statement2 & ... --> op;

Putting these together will not describe the common case, but it will make clear the relationships between each layer, as in the most degenerate case:

---
title: Layered View
---
graph LR;
cycle  ---> |seed for| batch1(batch) --> |contains|statement1(statement) & statement2(statement) & more1("...") --> op1(op1)
cycle ---> |seed for|batch2(batch) --> |contains|statement234(statement) & statement235(statement) & more2("...") --> op2(op2);
cycle ---> |seed for|op3("...");
op1 --> |gen| batch2;
op2 --> |gen| op3;

This very clearly illustrates a case which hasn't been implemented before in the NB5 machinery.

Each cycle can produce multiple operations, each of which can be based on a simple statement or a compound type like a batch statement.

This raises key design questions for how to present this to users:

How do we re-use the templating system used for op templates in a layered way to enable compound statement construction with the same tooling?
Do we allow the use of named blocks or other op template sequencing methods?
How do we bring cycle value semantics into this picture so that users?

Previous Methods

At one time, there was a mechanism to create a batch statement in NB with an older driver and highly specialized logic (within the CQL driver). This used marker fields on the op template to signify the beginning of a batch as well as the end of a batch. While this provided some batch functionality for users with minimal changes, it completely broke the semantics for metrics and further caused operational churn with NB which was wasteful. It would be best to create a proper abstraction for a compound statement template within the core APIs.

stheodosius-sky · 2023-02-20T11:13:20Z

stheodosius-sky
Feb 20, 2023

Hi @jshook,

We have investigated 2 aspects of Couchbase batch operations:

How does Couchbase perform batch operations in their own benchmarking tool (Pillowfight) and how do they track metrics?
What support does the Java SDK have for batch operations?

I appreciate that this is just one approach and doesn't directly address the questions you've raised. However, it hopefully offers some useful insight when discussing a generic approach that works for many databases.

Batch operations in Pillowfight

Pillowfight appears to use a queue to track which operations should be executed as part of a batch. In the snippet below, they use lcb_sched_enter to mark the start of a new queue and lcb_sched_leave to mark the end. Otherwise, operations are built and executed as normal; presumably the libcouchbase library decides whether each operation is executed immediately or deferred. This sounds similar to the approach used for the old CQL driver.

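        lcb_sched_enter(instance);  /* enter a scheduling context: operations are queued */
        for (size_t ii = 0; ii < config.opsPerCycle; ++ii) {
            hasItems = scheduleNextOperation();
        }
        if (hasItems) {
            lcb_sched_leave(instance);  /* leave the context and submit the queued operations */
            /* ... remainder of the excerpt not shown ... */
        }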

For metrics, Pillowfight registers a callback for each operation which increments a counter. These callbacks are called asynchronously once a response is received from Couchbase, which ensures that the benchmark only tracks successful operations.

Batch operations in Java

The Java SDK has no traditional batch-specific API (e.g. batchInsert(List<Document>)) and doesn't expose the same API for queuing requests as libcouchbase. Instead, batch operations are achieved using reactive or asynchronous programming. For example:

// Reactive API: iterate over a list of documents to fetch.
List<GetResult> results = Flux.fromIterable(documents)
    .flatMap(document -> reactiveCollection.get(document.getId()))
    .collectList()
    .block(); // Wait until all operations have completed.

// Asynchronous API: each call returns a CompletableFuture without blocking.
AsyncCollection collection = ...;
CompletableFuture<GetResult> result = collection.get(id);

We would need to perform some further investigation if we want to understand how the SDK batches the requests to the database behind the scenes.

On a related note, we noticed that older versions of NoSQLBench supported async operations. Is this still a supported approach for implementing adapters in NB5? I suspect that database clients which abstract the batching mechanism away from the user like this would not need to use a specific compound syntax in the NoSQLBench workload?

1 reply

jshook Feb 28, 2023
Maintainer Author

Thanks for the notes on Couchbase.

Ideally, the NB API for compound operations allows for an idiomatic usage of any programming patterns you would normally see in an application. We've been able to do this before even when mixing in fluent code, by retaining builders across cycles. This will almost always be limited to per-thread state management, which is trivial in the current request-per-thread model.

The async mode before is still partially in the code base and may be brought back online at some point, when we can provide a suitable wrapper around non-async methods. This mode was effectively put on pause during the significant API updates which were needed to simplify everything else.

The rest of this is about async scheduling.

Ideally, the core scheduling logic is wired for async by default. There are some clear trade-offs for this, however:

Async is not always a boon.
The upcoming features in Java 19 (Virtual Threads) and beyond make juggling concurrency in the request-per-thread model much more trivial for developers and efficient in the runtime. This might make the extra complexity of explicitly juggling ops within threads a net negative. I suspect many concurrent designs will revert to the classic request-per-thread wiring since virtual threads will effectively offer most of what explicit async scheduling designs do now.
This is not always available in client APIs

We have to be careful about how we describe these details, as there are many places that something is called "async". The operations are still async in terms of on-wire messaging and ordering, and the order of operations between threads is still ungoverned (by design) so effectively appear async from the server side. The key difference now is simply that we aren't maintaining a target number of operations in flight per thread, which can slightly affect the maximum throughput of a client. Still, there were some patterns in the old async API which we may re-use if/when we overhaul how we schedule operations.

All that said, there are some specific reasons to have async access patterns above and beyond what is possible now. "Compound Operations" is not specific enough to distinguish between scheduling and batching concerns, and they are both meaningful in my view.

One idea that has come up is to enable async dispatching within a sequence, which more closely aligns what we mean by "async" with how it is used for real world access patterns. In this form, the complexity of it is relatively contained to a given thread. As part of our discussion on linearized operations, we will need to distinguish op chains by dependency on other ops. If those ops are dispatched concurrently where not prohibited by dependencies, then we already have a lot of flexibility in workload design. The features could be leveraged explicitly per op as well as implicitly through data flow, either or both. This can happen whether or not a driver has an async-per-op API, as long as it is pipelined asynchronously on the service side.
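As a rough illustration of dispatching ops concurrently within a sequence where dependencies allow it, here is a minimal sketch using only the JDK's CompletableFuture. The class name, the op names (lookupUser, writeAudit, loadPrefs) and the dispatch helper are hypothetical stand-ins, not NoSQLBench APIs; a real adapter call would perform I/O where the sketch only builds a string.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; none of these names exist in NoSQLBench. */
final class SequenceDispatchSketch {

    /** Stand-in for an adapter call; a real op would do I/O here. */
    static CompletableFuture<String> dispatch(String opName, String input) {
        return CompletableFuture.supplyAsync(() -> opName + "(" + input + ")");
    }

    /** One cycle: lookupUser and writeAudit are independent, loadPrefs depends on lookupUser. */
    static String runCycle(long cycle) {
        String seed = Long.toString(cycle);
        CompletableFuture<String> lookup = dispatch("lookupUser", seed);
        CompletableFuture<String> audit = dispatch("writeAudit", seed); // not ordered against lookup
        CompletableFuture<String> prefs = lookup.thenCompose(r -> dispatch("loadPrefs", r));
        // The cycle is complete only when every op in the sequence has completed.
        CompletableFuture.allOf(audit, prefs).join();
        return audit.join() + " " + prefs.join();
    }

    public static void main(String[] args) {
        System.out.println(runCycle(7L)); // writeAudit(7) loadPrefs(lookupUser(7))
    }
}
```

The only ordering enforced here is the data dependency between the first and third op, and all of the state stays within the thread running the cycle.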

Comments, questions, ideas?

jshook · 2023-02-28T00:48:22Z

jshook
Feb 28, 2023
Maintainer Author

Some sketches for how CQL batches could be described:

simple batch repeat

blocks:
 batchblock1:
  op1a:
   raw: "insert into .... values ({val1}, {val2}, ..."
   repeat: 1000
   mutator: Mul(37L)

The presence of repeat indicates that this should be handled in a batch (in CQL terms), while the value of mutator specifies a perturbation function to use on the input cycle in between each statement. Both would be reserved op template fields. A driver adapter API would need to signify its ability to handle this op template with a decorator interface for batch statements. If provided, it would be called with the verified and resolved repeat and mutator, with all op details otherwise the same as for any other mapper or dispenser. A driver adapter can declare its own semantics for what "batch" means. Other terms may be allowed if there are specific ones which make sense.

While it would already be possible for an op mapper to handle this as a special case, it should really be pushed into the standard NB APIs so that users and developers have specific expectations for how compound operations work. Further this allows NB to do some coherence checking on behalf of both.

Caveats:

This only allows a specific statement to be used repeatedly in a batch, which doesn't fit more meaningful forms for Paxos, etc. Thus it is really an ingest optimizer only, which is still significant.
This aligns an operation with many statements, API calls, mutations, or whatever.
Extant op templates cannot be re-used.
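To make the repeat and mutator fields more concrete, here is a minimal sketch of one possible reading: the first statement in the batch is seeded with the cycle value, and each following statement is seeded by applying the mutator to the previous input. This interpretation is an assumption, as are the class and method names; the lambda stands in for the Mul(37L) binding function.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

/** Hypothetical helper, not part of the NoSQLBench API. */
final class BatchRepeatSketch {

    /**
     * Derive one input value per statement in a batch. The first statement
     * uses the cycle itself; each later one uses the mutator applied to the
     * previous input (an assumed interpretation of "mutator").
     */
    static long[] statementInputs(long cycle, int repeat, LongUnaryOperator mutator) {
        long[] inputs = new long[repeat];
        long current = cycle;
        for (int i = 0; i < repeat; i++) {
            inputs[i] = current;
            current = mutator.applyAsLong(current);
        }
        return inputs;
    }

    public static void main(String[] args) {
        LongUnaryOperator mul37 = v -> v * 37L; // stand-in for Mul(37L)
        System.out.println(Arrays.toString(statementInputs(2L, 4, mul37)));
        // [2, 74, 2738, 101306]
    }
}
```

Under this reading, a purely multiplicative mutator maps cycle 0 to the same input for every statement, so the choice of perturbation function would matter in practice.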

batch boundary markers

blocks:
 batchblock2:
  op1b:
   batch: begin
  op2b:
   prepared: "insert into ..."
   repeat: 300
  op3b:
   prepared: "update ..."
   batch: commit
  op1d:
   batch: commit
   prepared: "insert into ..."
   end-batch: true

In this mode, batch: begin and batch: commit indicate batch boundaries across cycle values. For op1d, an error could be thrown, but a more useful behavior would be to allow batch: commit to lazy init a batch for this cycle.

Caveats:

Many cycles have empty operations, which affects metrics accuracy.
(good) Users can specify arbitrary contents for a batch, using the same cycle-by-cycle values which they may already be familiar with.
No mutator is required here, since each logical statement still corresponds to a specific cycle value.
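The boundary-marker behavior, including the lazy init suggested for op1d, could be sketched as a small per-thread accumulator. Everything here is hypothetical (the class, the Marker enum, the onCycle method); it only illustrates why most cycles yield nothing to execute, which is the metrics caveat above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-thread accumulator; not an NoSQLBench class. */
final class BoundaryMarkerSketch {
    enum Marker { NONE, BEGIN, COMMIT }

    private List<String> pending = null; // batch under construction, if any

    /** Returns the statements to execute on this cycle: empty unless a batch commits. */
    List<String> onCycle(Marker marker, String statement) {
        if (marker == Marker.BEGIN || pending == null) {
            pending = new ArrayList<>(); // explicit begin, or lazy init
        }
        if (statement != null) {
            pending.add(statement);
        }
        if (marker == Marker.COMMIT) {
            List<String> batch = pending;
            pending = null;
            return batch;
        }
        return List.of(); // an "empty" cycle: nothing is sent
    }

    public static void main(String[] args) {
        BoundaryMarkerSketch acc = new BoundaryMarkerSketch();
        System.out.println(acc.onCycle(Marker.BEGIN, null));           // []
        System.out.println(acc.onCycle(Marker.NONE, "insert into ...")); // []
        System.out.println(acc.onCycle(Marker.COMMIT, "update ..."));  // [insert into ..., update ...]
        System.out.println(acc.onCycle(Marker.COMMIT, "insert into ...")); // lazy init: [insert into ...]
    }
}
```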

batch by reference

blocks:
 user-ops:
  ops:
   lookup-user: "select * ..."
   update-user: "insert into ..."
 lookup-and-update:
   op:
    batch:
     use-tags: "block:user-ops"
     repeat: 1000
     mutator: Mul(37L)

In this case, the batch is defined in terms of other op templates. It uses a couple of behaviors together:

Condensed YAML structure, which is already desugared to the more canonical form.
Referencing existing op templates by tag.

This seems to offer the most DRY flexibility, with the caveat that it adds potentially confusing indirection.
This could be extended to includes from other files, which is a double-edged sword: Workload portability is guaranteed when all contents need to be local.

5 replies

jeffbanks Feb 28, 2023
Collaborator

blocks:
 user-ops:
  ops:
   lookup-user: "select * ..."
   update-user: "insert into ..."
   repeat: 2
 user-ops2:
   lookup-user2: "select * ..."
   update-user2: "insert into ..."
 lookup-and-update:
   op:
    batch:
     use-tags: "block:user-ops"
       repeat: 1000
       mutator: Mul(37L)
     use-tags: "block:user-ops2"
       repeat: 4
       mutator: Mul(199L)
  lookup-and-update2:
    op:
      batch:
        use-tags: "block:user-ops1"
          repeat: 1000
          mutator: Mul(100L)

jeffbanks Feb 28, 2023
Collaborator

Our (Mike and Jeff) initial comments:

We seem to like the batch by reference better, but still considering the simple batch repeat.

Question on this one above:

Does inheritance apply to the operations in a batch from the reference use-tag?
Does use-tags allow for its own child properties, for example repeat and mutator?

MikeYaacoubStax Mar 1, 2023
Collaborator

My only questions for deciding whether to go with simple batch repeat or batch by reference have to do with the caveats mentioned:

Do we imagine Extant op templates as a core use for the batch repeat feature now or in the future?
Potentially confusing indirection in batch by reference proved to be the main focus point that Jeff and I discussed. Imagining different ways of using batch by reference as it gets more complicated has risen naturally in the conversation and is anticipated to be the biggest pain point that we believe new users will have with it, so can we imagine a set of clear rules that govern the use of batch by reference to make it easier to implement?

jshook Mar 1, 2023
Maintainer Author

Our (Mike and Jeff) initial comments:

We seem to like the batch by reference better, but still considering the simple batch repeat.
I concur, and also think there may be some complementary yet unambiguous patterns which we can generalize here.
Particularly, the idea of including by reference seems to make sense as a general meta-programming feature. It would be useful in places besides batches or compound statements. I'm going to write up a separate sketch for that.

In terms of immediate utility, the simple version is going to be the easiest to get working and in the hands of users. As long as the include op templates pattern works seamlessly with the set batch boundaries pattern, then users could use either or both.

Question on this one above:

Does inheritance apply to the operations in a batch from the reference use-tag?

I think this should be explicit in either case, but by default we should start with simply including structural elements as they appear in the raw data structure. Thus, if an op template is included in place which has its own local bindings, then they come along as part of the op template, but any block or doc level overrides are evaluated in the target location of the include. So if you had the following:

blocks:
 block1:
  bindings:
   b1: NumberNameToString()
  ops:
   op1: "testing {b1}"
 block2:
  use-ops: tags=block:block1

Then, the (post-processed) structure would be:

blocks:
 block1:
  bindings:
   b1: NumberNameToString()
  ops:
   op1: "testing {b1}"
 block2:
  ops:
   op1: "testing {b1}"

... which would actually be an error, since the bind point {b1} would not be qualified by a binding definition. Yet, this illustrates the quite literal structural substitution based on simple templating methods.
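The literal substitution described above, and the resulting unqualified bind point, can be sketched with plain maps. This is a simplified model under stated assumptions: the Block class, useOps and unresolved are hypothetical names, only block-level bindings are considered, and bind points are assumed to be simple {name} tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of blocks; not the NoSQLBench workload parser. */
final class UseOpsSketch {
    private static final Pattern BIND_POINT = Pattern.compile("\\{(\\w+)\\}");

    /** A block after parsing: bindings and op templates, both keyed by name. */
    static final class Block {
        final Map<String, String> bindings = new LinkedHashMap<>();
        final Map<String, String> ops = new LinkedHashMap<>();
    }

    /** Literal structural include: copies the ops only, not the source block's bindings. */
    static void useOps(Block source, Block target) {
        target.ops.putAll(source.ops);
    }

    /** Bind points used by a block's ops which have no binding defined in that block. */
    static Set<String> unresolved(Block block) {
        Set<String> missing = new TreeSet<>();
        for (String template : block.ops.values()) {
            Matcher m = BIND_POINT.matcher(template);
            while (m.find()) {
                if (!block.bindings.containsKey(m.group(1))) {
                    missing.add(m.group(1));
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Block block1 = new Block();
        block1.bindings.put("b1", "NumberNameToString()");
        block1.ops.put("op1", "testing {b1}");

        Block block2 = new Block();
        useOps(block1, block2);

        System.out.println(unresolved(block1)); // []
        System.out.println(unresolved(block2)); // [b1]
    }
}
```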

If we want a version which uses the structurally normalized view, it would be possible, but they should be distinct cases. When that is supported, the question of precedence comes up, wherein you may want to override or default with respect to the source or target data.

This also shows that we could borrow the use-ops reserved word convention for others, like use-bindings. Perhaps use is simply the version that takes all elements matching a tag, and use-ops, use-bindings, or other reserved words do the same.

In terms of precedence of locally defined or re-used elements, the point of re-use is important, since everything is meant to be ordered already. Wherever use appears is where the source elements should be injected.

Does use-tags allow for its own child properties, for example repeat and mutator?

I think YAML rules will disallow this structure, since the value of a field must be either a scalar or a structure. As in, I think syntax highlighting will make it look valid, but it will throw an error. If we want to accessorize a use-point, then we would have to flatten the properties underneath it like this:

blocks:
 block0:
  use-ops:
    tags: block=foo
    bindings: ...

... which may also be valid. One key concern in all of this is to avoid adding too much, or too much at once, such that it is difficult to cover with the documentation and testing which make it usable and reliable. As long as each case is well defined in its behavior, and provides an incremental slope of complexity AND utility, I'm for making things more flexible in general.

jshook Mar 1, 2023
Maintainer Author

My only questions for deciding whether to go with simple batch repeat or batch by reference have to do with the caveats mentioned:

Do we imagine Extant op templates as a core use for the batch repeat feature now or in the future?

After considering some of the test scenarios we will want to support, it does seem that we will want to formulate access patterns both as a sequence of discrete linearized operations and as a batch of grouped transactions. As long as this doesn't include variant syntactical forms (such as "IF ..." which allows linearized ops to be connected), it would be a simple twist to convert the decomposed form into a composed form by including the extant operations in a batch context.

Potentially confusing indirection in batch by reference proved to be the main focus point that Jeff and I discussed. Imagining different ways of using batch by reference as it gets more complicated has risen naturally in the conversation and is anticipated to be the biggest pain point that we believe new users will have with it, so can we imagine a set of clear rules that govern the use of batch by reference to make it easier to implement?

Yes, I think this is required in every case for new workload templating syntax. Specifically, the cases need to be covered and tested in the uniform workload specification tests.

jshook · 2024-03-12T21:09:48Z

jshook
Mar 12, 2024
Maintainer Author

An update on this work, based on inspiration from recent ad-hoc implementations of batching modes.

In CQL, we have the ability (in nascent form) to do batch operations by extending the internal op templating APIs slightly. The (work-in-progress) user-facing form of this looks like:

ops:
  op1:
    batch:
      repeat: 100
      op_template:
        ( everything that could go under a previous cql op template can go here )

This works, although it has some issues.

There is a clear difference between array-like batch operations and those which allow for non-regular operations, such as a sequence of operations which implement a transaction.
The above format doesn't do either of these very directly.

Going forward, we should maintain as much uniformity of the patterns users are familiar with in both cases. This means that the array-like pattern should be almost exactly like the single op form, and the non-array form should be the same as the general op sequence pattern within a block. Ideally, these would be orthogonal to each other such that they can both be composed into an op template without special restrictions.

Some sketches on how this might look:

# just a bulk array-like operation, in whatever form is appropriate for the adapter
block1:
  ops:
    op1_1:
      stmt: |
         statement syntax goes here {with_binding}
      batch-repeat: 100

# a bulk operation which is based on a sequence of heterogeneous operations
block2:
  ops:
    op2_1:
      batch-sequence:
        subop1: |
          example statement to get userid by username
        subop2: |
          example statement to get preferences by userid

# combining these orthogonally
block3:
  ops:
    op3_1:
      ratio: 3
      batch-repeat: 50
      batch-sequence:
        subop3: |
          example statement to get userid by username
        subop4: |
          example statement to get preferences by userid
    op3_2:
      stmt: |
        some unrelated statement
      ratio: 2

In the block3 case, a mixed-op sequence is used to construct a batch, but this sequence is repeated 50 times within the batch, so it is essentially a 100-op batch consisting of two different op types. Additionally, it occurs at ratio 3 within the active op templates, whereas the unrelated op3_2 occurs at ratio 2, meaning that op ratios operate as expected. The block3 case would have 5 logical NB operations, consisting of a total of 302 actual CQL statements: (3 * (50 * 2)) + (2 * 1) = 302.
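The block3 arithmetic can be checked with a few lines of code. This is only a worked calculation; the class and its fields are hypothetical, and it assumes that one op's statement count is batch-repeat times the length of its batch-sequence (with both equal to 1 for a plain stmt).

```java
import java.util.List;

/** Worked arithmetic for block3; hypothetical names throughout. */
final class Block3Arithmetic {

    /** One op template: its ratio and the number of statements a single op emits. */
    static final class OpTemplate {
        final int ratio;
        final int statementsPerOp;

        OpTemplate(int ratio, int batchRepeat, int sequenceLength) {
            this.ratio = ratio;
            this.statementsPerOp = batchRepeat * sequenceLength;
        }
    }

    /** Logical NB operations in one pass over the op sequence: the sum of ratios. */
    static int logicalOps(List<OpTemplate> ops) {
        int total = 0;
        for (OpTemplate op : ops) total += op.ratio;
        return total;
    }

    /** Statements sent in one pass: each op counts ratio times its per-op statements. */
    static int totalStatements(List<OpTemplate> ops) {
        int total = 0;
        for (OpTemplate op : ops) total += op.ratio * op.statementsPerOp;
        return total;
    }

    static List<OpTemplate> block3() {
        return List.of(
            new OpTemplate(3, 50, 2),  // op3_1: ratio 3, batch-repeat 50, two sub-ops
            new OpTemplate(2, 1, 1));  // op3_2: ratio 2, a single statement
    }

    public static void main(String[] args) {
        System.out.println(logicalOps(block3()));      // 5
        System.out.println(totalStatements(block3())); // 302
    }
}
```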

The batch-sequence form becomes mutually exclusive with other op forms, just as any of the op type selectors for an adapter.


jshook · 2024-04-02T18:24:04Z

jshook
Apr 2, 2024
Maintainer Author

Clarification of Terms

Considering the different forms of compound operations across drivers, it is clear that we need to identify, name, and uniquely describe the categories of operations which cut across different protocols. This will serve two purposes: 1) clarify usage patterns for users when they are designing workloads, and 2) align API and implementation features to support each of these more cohesively.

Modeling Compound Operations

From the usage perspective, these will be described from the workload modeling view first. These high-level terms are NoSQLBench terms, and should map consistently to the supported feature within any given driver directly, where supported.

Batch Array

batch-array-size:

Array structured operations are those which have a uniform record structure over some number of repeated operations, and which are easy to create by extending a given pattern into some native batch or array form. These should be identified very simply by reserved op fields which imply batch array for any driver adapter that can support it, and when used with a driver adapter that can't support it, an error should be thrown.

Example:

ops:
 op1:
  batch-array-size: 12
  stmt: |
   any valid syntax

In the above example, the only thing required to convert a single op template into a batch array template is the inclusion of a reserved batch-array-size op field.

Batch Sequence

Just like the main op sequence in a workload, a given operation should be able to be expressed as a sequence of unique operations with each step having a relative ratio. It should be possible to model batch arrays in batch sequence syntax, although doing so will require extraneous syntax in the op template.

batch-sequence:

Example:

ops:
 op2:
  batch-sequence:
   subop1:
    stmt: |
     any valid syntax
    ratio: 3
   subop2: | # sugared syntax
    more valid syntax

This is very different from the batch array form. In this form, the structure and semantics of a typical op sequence (as modeled within a block, for example) are repeated at a lower level within a single op. A more complex example that shows multiple transactions modeled in different operations looks like this:

ops:
 op3:
  batch-sequence:
   subop1: "INSERT INTO app.users_by_email (user_email,user_id) VALUES ({user_email},{user_id});"
   subop2: "INSERT INTO app.account_log (ts,user_name,user_id) VALUES ({timestamp},{user_name},{user_id});"
 op4:
  batch-sequence:
   ...
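For the first batch-sequence example above (subop1 at ratio 3 and subop2 with no ratio given), the sub-op ratios could be expanded into an ordered list of steps in the same way as a block-level op sequence. The sketch below assumes a default ratio of 1 and a simple concatenated ordering; other interleavings are equally plausible, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical expansion of sub-op ratios; not an NoSQLBench class. */
final class BatchSequenceSketch {

    /** Expand sub-op ratios into a flat, ordered list of sub-op names. */
    static List<String> expand(Map<String, Integer> ratios) {
        List<String> sequence = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : ratios.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                sequence.add(entry.getKey());
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        Map<String, Integer> ratios = new LinkedHashMap<>();
        ratios.put("subop1", 3);
        ratios.put("subop2", 1); // assumed default when no ratio is given
        System.out.println(expand(ratios)); // [subop1, subop1, subop1, subop2]
    }
}
```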

Columnar Batch

columnar-batch-size

In some protocols, it is possible to provide a set of rows or values very similar to that of a batch array, although the internal structure is transposed such that the outer grouping is by field and the values within that field are themselves arrays. This is a distinct pattern which deserves its own category.
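The transposition that distinguishes a columnar batch from a batch array can be sketched as follows. The field names are borrowed from the CQL example above purely for illustration, the class and helper are hypothetical, and the sketch assumes every row has the same fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row-to-column transposition; not tied to any driver API. */
final class ColumnarBatchSketch {

    /** Build one row with a fixed field order. */
    static Map<String, Object> row(int userId, String userName) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("user_id", userId);
        fields.put("user_name", userName);
        return fields;
    }

    /** Group values by field: the outer key is the field, the inner list holds one value per row. */
    static Map<String, List<Object>> toColumns(List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (Map<String, Object> r : rows) {
            for (Map.Entry<String, Object> field : r.entrySet()) {
                columns.computeIfAbsent(field.getKey(), k -> new ArrayList<>()).add(field.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(1, "alice"));
        rows.add(row(2, "bob"));
        System.out.println(toColumns(rows)); // {user_id=[1, 2], user_name=[alice, bob]}
    }
}
```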

Batch Variants

In some protocols, there are different kinds of batch operations. For example, in CQL, you can have unlogged batches and logged batches. When the variant is required to be specified, the reserved op field batch-type is used.

Data Binding Challenges

Historically, binding synthetic data to an op template to yield an executable operation has followed a rigid pattern. Each operation has a mapping between a set of bindings by name, into a flat view of an operation. Each binding can be used zero or more times within an op template, but re-using a binding name in any sense of a collection or sequence is not supported. This is limiting for the purposes of batch operations, forcing users to model their data to cater to an emergent pattern:

users execute the compound operation with a skip interval on cycle values, by prepending their batch bindings with a Mul(15L) function. This basically means that for each cycle value, the input to the batch operation will be at even multiples of 15, and it is expected that each batch has 15 logical operations which can then consume these values in step-wise fashion between the intervals.

This is quite unintuitive unless you already know what is happening inside the core NB machinery. For now, this pattern will likely remain how batch operations are used, but it is clear that a better solution is needed which rationalizes the coupling between bindings and array-style references of them. This should be handled in another design effort.
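The skip-interval pattern described above can be made concrete with a small calculation. In this sketch the multiplication stands in for the Mul(15L) binding function and the batch size of 15 is taken from the example; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustration of the Mul(15L) skip-interval pattern; hypothetical names. */
final class SkipIntervalSketch {
    static final long BATCH_SIZE = 15L;

    /** Equivalent of Mul(15L): the first input value for the batch at this cycle. */
    static long batchBase(long cycle) {
        return cycle * BATCH_SIZE;
    }

    /** Inputs consumed step-wise by the 15 logical operations in one batch. */
    static long[] batchInputs(long cycle) {
        long[] inputs = new long[(int) BATCH_SIZE];
        long base = batchBase(cycle);
        for (int i = 0; i < inputs.length; i++) {
            inputs[i] = base + i;
        }
        return inputs;
    }

    public static void main(String[] args) {
        for (long cycle = 0; cycle < 3; cycle++) {
            System.out.println("cycle " + cycle + " -> " + Arrays.toString(batchInputs(cycle)));
        }
    }
}
```

Cycle 0 covers inputs 0 through 14, cycle 1 covers 15 through 29, and so on, so consecutive cycles neither overlap nor leave gaps as long as the multiplier matches the batch size.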


dave2wave · 2024-11-12T19:33:45Z

dave2wave
Nov 12, 2024
Maintainer

See #257
