-
Hi @jshook, we have investigated 2 aspects of Couchbase batch operations:

I appreciate that this is just one approach and doesn't directly address the questions you've raised. However, it hopefully offers some useful insight when discussing a generic approach that works for many databases.

Batch operations in Pillowfight

Pillowfight appears to use a queue to track which operations should be executed as part of a batch. In the snippet below, they use lcb_sched_enter and lcb_sched_leave to bracket the operations scheduled for each batch:

    lcb_sched_enter(instance);
    for (size_t ii = 0; ii < config.opsPerCycle; ++ii) {
        hasItems = scheduleNextOperation();
    }
    if (hasItems) {
        lcb_sched_leave(instance);
    }

For metrics, Pillowfight registers a callback for each operation that increments a counter. These callbacks are called asynchronously once a response is received from Couchbase, which ensures that the benchmark only tracks successful operations.

Batch operations in Java

The Java SDK has no traditional batch-specific API. The snippets below instead issue the operations through the reactive API and the asynchronous API:

    // Iterate over a list of documents to fetch.
    List<GetResult> results = Flux.fromIterable(documents)
        .flatMap(document -> reactiveCollection.get(document.getId()))
        .collectList()
        .block(); // Wait until all operations have completed.

    AsyncCollection collection = ...;
    CompletableFuture<GetResult> result = collection.get(id);

We would need to investigate further to understand how the SDK batches the requests to the database behind the scenes.

On a related note, we noticed that older versions of NoSQLBench supported async operations. Is this still a supported approach for implementing adapters in NB5?

I suspect that database clients which abstract the batching mechanism away from the user like this would not need to use a specific compound syntax in the NoSQLBench workload?
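The callback-counting approach described for Pillowfight can be sketched in a few lines. This is a minimal illustration in Python rather than the C or Java SDKs: `fake_get`, `SuccessCounter` and `run_batch` are hypothetical names invented for this example, and nothing here reflects how Couchbase batches requests internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SuccessCounter:
    """Thread-safe counter incremented only by successful operations."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def on_done(self, future):
        # Invoked once the operation has completed; failures are not counted.
        if future.exception() is None:
            with self._lock:
                self.count += 1


def fake_get(doc_id):
    # Hypothetical stand-in for a client call: odd ids fail.
    if doc_id % 2:
        raise KeyError(doc_id)
    return {"id": doc_id}


def run_batch(doc_ids):
    counter = SuccessCounter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for doc_id in doc_ids:
            pool.submit(fake_get, doc_id).add_done_callback(counter.on_done)
    # Leaving the with-block joins the workers, so every callback has run.
    return counter.count
```

Counting in the completion callback, rather than at submission time, is what keeps failed operations out of the metric.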
-
Some sketches for how CQL batches could be described:

simple batch repeat

    blocks:
      batchblock1:
        op1a:
          raw: "insert into .... values ({val1}, {val2}, ..."
          repeat: 1000
          mutator: Mul(37L);

The presence of

While it would already be possible for an op mapper to handle this as a special case, it should really be pushed into the standard NB APIs so that users and developers have specific expectations for how compound operations work. Further, this allows NB to do some coherence checking on behalf of both.

Caveats:
batch boundary markers

    blocks:
      batchblock2:
        op1b:
          batch: begin
        op2b:
          prepared: "insert into ..."
          repeat: 300
        op3b:
          prepared: "update ..."
          batch: commit
        op1d:
          batch: commit
          prepared: "insert into ..."
          end-batch: true

In this mode,

Caveats:
batch by reference

    blocks:
      user-ops:
        ops:
          lookup-user: "select * ..."
          update-user: "insert into ..."
      lookup-and-update:
        op:
          batch:
            use-tags: "block:user-ops"
            repeat: 1000
            mutator: Mul(37L)

In this case, the batch is defined in terms of other op templates. It uses a couple of behaviors together:

This seems to offer the most DRY flexibility, with the caveat that it adds potentially confusing indirection.
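As a rough illustration of the batch-by-reference idea, the sketch below resolves a use-tags selector against named blocks and repeats the selected templates. The dictionary layout, the `resolve_batch` helper and the way `block:` selectors are matched are all assumptions made for this example, not NB behavior, and the mutator field is deliberately left out.

```python
def resolve_batch(blocks, use_tags, repeat):
    """Expand a batch defined by reference into a flat list of (name, template)."""
    key, _, value = use_tags.partition(":")
    if key != "block":
        # Only the block:<name> selector form is handled in this sketch.
        raise ValueError(f"unsupported selector: {use_tags}")
    ops = blocks[value]["ops"]
    # One pass over the referenced templates per repetition, in declared order.
    return [(name, tmpl) for _ in range(repeat) for name, tmpl in ops.items()]


blocks = {
    "user-ops": {
        "ops": {
            "lookup-user": "select * ...",
            "update-user": "insert into ...",
        }
    }
}
batch = resolve_batch(blocks, "block:user-ops", repeat=1000)
```

With two referenced templates and a repeat of 1000, the expanded batch holds 2000 statements that alternate between the lookup and the update.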
-
An update on this work, based on inspiration from recent ad-hoc implementations of batching modes. In CQL, we have the ability (in nascent form) to do batch operations by extending the internal op templating APIs slightly. The (work-in-progress) user-facing form of this looks like:

    ops:
      op1:
        batch:
          repeat: 100
          op_template:
            ( everything that could go under a previous cql op template can go here )

This works, although it has some issues.
Going forward, we should maintain as much uniformity of the patterns users are familiar with in both cases. This means that the array-like pattern should be almost exactly like the single op form, and the non-array form should be the same as the general op sequence pattern within a block. Ideally, these would be orthogonal to each other such that they can both be composed into an op template without special restrictions. Some sketches on how this might look:

    # just a bulk array-like operation, in whatever form is appropriate for the adapter
    block1:
      ops:
        op1_1:
          stmt: |
            statement syntax goes here {with_binding}
          batch-repeat: 100
    # a bulk operation which is based on a sequence of heterogeneous operations
    block2:
      ops:
        op2_1:
          batch-sequence:
            subop1: |
              example statement to get userid by username
            subop2: |
              example statement to get preferences by userid
    # combining these orthogonally
    block3:
      ops:
        op3_1:
          ratio: 3
          batch-repeat: 50
          batch-sequence:
            subop3: |
              example statement to get userid by username
            subop4: |
              example statement to get preferences by userid
        op3_2:
          stmt: |
            some unrelated statement
          ratio: 2

In the block3 case, a mixed-op sequence is used to construct a batch, but this sequence is repeated 50 times within the batch, so it is essentially a 100-op batch consisting of two different op types. Additionally, it occurs at ratio 3 within the active op templates, whereas the unrelated op3_2 occurs at ratio 2, meaning that op ratios operate as expected. The block3 case would have 5 logical NB operations, consisting of a total of 302 actual CQL statements: (3 * (50 * 2)) + (2 * 1).
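The block3 arithmetic can be checked with a short sketch. The dictionary below mirrors the block3 op templates; the helper names and the rule that a plain op counts as one statement are assumptions made for this illustration, not NB APIs.

```python
def statements_per_op(op):
    """Statements emitted by one logical op: sequence length times repeats."""
    seq_len = len(op.get("batch-sequence", {})) or 1  # a plain stmt counts as 1
    return seq_len * op.get("batch-repeat", 1)


def totals(ops):
    """Return (logical NB operations, actual statements) for one ratio cycle."""
    logical = sum(op.get("ratio", 1) for op in ops.values())
    statements = sum(
        op.get("ratio", 1) * statements_per_op(op) for op in ops.values()
    )
    return logical, statements


block3 = {
    "op3_1": {
        "ratio": 3,
        "batch-repeat": 50,
        "batch-sequence": {"subop3": "...", "subop4": "..."},
    },
    "op3_2": {"ratio": 2, "stmt": "some unrelated statement"},
}

print(totals(block3))  # (5, 302)
```

Each op3_1 instance contributes 50 * 2 = 100 statements and appears 3 times, while op3_2 contributes 1 statement and appears twice.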
-
Clarification of Terms

Considering the different forms of compound operations across drivers, it is clear that we need to identify, name, and uniquely describe the categories of operations which cut across different protocols. This will serve two purposes: 1) clarify usage patterns for users when they are designing workloads, and 2) align API and implementation features to support each of these more cohesively.

Modeling Compound Operations

From the usage perspective, these will be described from the workload modeling view first. These high-level terms are NoSQLBench terms, and should map consistently and directly to the corresponding feature of any given driver, where supported.

Batch Array

batch-array-size:

Array structured operations are those which have a uniform record structure over some number of repeated operations, and which are easy to create by extending a given pattern into some native batch or array form. These should be identified very simply by reserved op fields which imply batch array for any driver adapter that can support it. When used with a driver adapter that can't support it, an error should be thrown.

Example:

    ops:
      op1:
        batch-array-size: 12
        stmt: |
          any valid syntax

In the above example, the only thing required to convert a single op template into a batch array template is the inclusion of the reserved batch-array-size field.

Batch Sequence

Just like the main op sequence in a workload, a given operation should be able to be expressed as a sequence of unique operations with each step having a relative ratio. It should be possible to model batch arrays in batch sequence syntax, although doing so will require extraneous syntax in the op template.

batch-sequence:

Example:

    ops:
      op2:
        batch-sequence:
          subop1:
            stmt: |
              any valid syntax
            ratio: 3
          subop2: | # sugared syntax
            more valid syntax

This is very different from the batch array form. In this form, the structure and semantics of a typical op sequence (as modeled within a block, for example) are repeated at a lower level within a single op. A more complex example that shows multiple transactions modeled in different operations looks like this:

    ops:
      op3:
        batch-sequence:
          subop1: "INSERT INTO app.users_by_email (user_email,user_id) VALUES ({user_email},{user_id});"
          subop2: "INSERT INTO app.account_log (ts,user_name,user_id) VALUES ({timestamp},{user_name},{user_id});"
      op4:
        batch-sequence:
          ...

Columnar Batch

columnar-batch-size

In some protocols, it is possible to provide a set of rows or values very similar to that of a batch array, although the internal structure is transposed such that the outer grouping is by field and the values within that field are themselves arrays. This is a distinct pattern which deserves its own category.

Batch Variants

In some protocols, there are different kinds of batch operations. For example, in CQL, you can have unlogged batches and logged batches. When the variant is required to be specified, the reserved op field

Data Binding Challenges

Historically, binding synthetic data to an op template to yield an executable operation has followed a rigid pattern. Each operation has a mapping between a set of bindings by name, into a flat view of an operation. Each binding can be used zero or more times within an op template, but re-using a binding name in any sense of a collection or sequence is not supported. This is limiting for the purposes of batch operations, forcing users to model their data to cater to an emergent pattern:

This is quite unintuitive unless you already know what is happening inside the core NB machinery. For now, this pattern will likely remain how batch operations are used, but it is clear that a better solution is needed which rationalizes the coupling between bindings and array-style references of them. This should be handled in another design effort.
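The difference between the batch array and columnar batch shapes described above is easiest to see side by side. The sketch below transposes row-oriented records into the field-oriented layout; `to_columnar` and the sample field names are hypothetical and only illustrate the shape, not any adapter's actual wire format.

```python
def to_columnar(rows):
    """Transpose uniform records into a field -> list-of-values mapping."""
    if not rows:
        return {}
    fields = list(rows[0])
    for row in rows:
        # A columnar batch only makes sense when every record has the same shape.
        if list(row) != fields:
            raise ValueError("rows must share the same fields in the same order")
    return {field: [row[field] for row in rows] for field in fields}


# Batch array shape: outer grouping by record.
rows = [
    {"user_id": 1, "user_name": "ann"},
    {"user_id": 2, "user_name": "bob"},
]

# Columnar shape: outer grouping by field, values are arrays.
columns = to_columnar(rows)
```

Here `columns` groups all user_id values together and all user_name values together, which is the transposed structure the columnar category is meant to name.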
-
See #257
-
Multiple native drivers support batches, some inline and some as a series of accumulated statements against a batch interface. Batches are very relevant for high-throughput scenarios. For example, with CQL, unlogged batches against the same partition achieve the highest row-by-row throughput of any method. It is important to enable testing of these scenarios.
Terms
Cycles and Operations
Historically, a given cycle in NoSQLBench was associated with a single operation comprised of a single statement.
However, with current NB APIs, the following is more accurate:
Operations and Statements
Likewise, previous implementations of statement-oriented protocols like CQL used the cycle-per-statement paradigm. This breaks down when constructing logical batches of statements, where each interior statement needs its own cycle value as a seed to create meaningful examples.
Putting these together will not describe the common case, but it will make clear the relationships between each layer, as in the most degenerate case:
This very clearly illustrates a case which hasn't been implemented before in the NB5 machinery.
Each cycle can produce multiple operations, each of which can be based on a simple statement or a compound type like a batch statement.
This raises key design questions for how to present this to users:
Previous Methods
At one time, there was a mechanism to create a batch statement in NB with an older driver and highly specialized logic (within the CQL driver). This used marker fields on the op template to signify the beginning and the end of a batch. While this provided some batch functionality for users with minimal changes, it completely broke the semantics of metrics and caused wasteful operational churn within NB. It would be best to create a proper abstraction for a compound statement template within the core APIs.
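To make the seeding problem concrete, the sketch below derives a distinct seed for each interior statement of a batch from the outer cycle. The stride-based scheme and the names `interior_seeds` and `render_batch` are assumptions for illustration only; the discussion does not settle on a scheme, and `str.format` merely stands in for NB bindings.

```python
def interior_seeds(cycle, batch_size):
    """Give each statement in a batch its own seed, with no overlap across cycles."""
    base = cycle * batch_size
    return [base + i for i in range(batch_size)]


def render_batch(template, cycle, batch_size):
    # Each interior statement is rendered from a different seed value.
    return [template.format(seed=s) for s in interior_seeds(cycle, batch_size)]
```

With a batch size of 3, cycle 0 would use seeds 0 to 2 and cycle 1 would use seeds 3 to 5, so no two statements in a run share a seed.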