Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3544] Boolean Matrix Support #1849

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

louislepage
Copy link
Contributor

No description provided.

@phaniarnab
Copy link
Contributor

Thanks @louislepage, for starting the project.
Please add the Apache license to the new file, DenseBlockTrueBool.java.
Looking forward to seeing how you use this new denseblock for the binary matrix operations.

@louislepage
Copy link
Contributor Author

First results from25 runs of 10k x 10k sized double matrices show an average speed-up:

Matrix Dimensions: 10000 x 10000
Operator: GreaterThan
Runtime statistics in ms:
FP64:
LongSummaryStatistics{count=25, sum=60166, min=1712, average=2406,640000, max=3633}
BOOLEAN:
LongSummaryStatistics{count=25, sum=54211, min=1742, average=2168,440000, max=2521}
TRUE_BOOLEAN with boolean arithmetics:
LongSummaryStatistics{count=25, sum=33950, min=1009, average=1358,000000, max=1931}

Note, that the BOOLEAN DenseBlock uses Bitset, but simple boolean arrays as in the TRUE_BOOLEAN case, seem to perform better, at least regarding runtime.

Further Steps

  • more tests with different comparison operators
  • test memory efficiency
  • improve implementation of differentiation between boolean arithmetic and numeric/double arithmetics

Comment on lines 1405 to 1417
// if ValueFunction is comparison, return matrix is true boolean and boolean arithmetics are activated, use boolean arithmetics
if (op.fn instanceof ValueComparisonFunction && ret.denseBlock != null && (ret.getDenseBlock() instanceof DenseBlockTrueBool || ret.getDenseBlock() instanceof DenseBlockBool) ){
//TODO: Optimize casting to boolean denseblock types :/
if(ret.getDenseBlock() instanceof DenseBlockTrueBool){
for(int r=rl; r<ru; r++)
for(int c=0; c<clen; c++) {
double v1 = m1.quickGetValue(r, c);
double v2 = m2.quickGetValue(r, c);
boolean vb = ((ValueComparisonFunction) op.fn).compare( v1, v2 );

//react what is happening in appendValuePlain()
ret.allocateDenseBlock(false);
((DenseBlockTrueBool) ret.getDenseBlock()).set(r,c,vb);
lnnz += vb ? 1 : 0;
}
} else {
for(int r=rl; r<ru; r++)
for(int c=0; c<clen; c++) {
double v1 = m1.quickGetValue(r, c);
double v2 = m2.quickGetValue(r, c);
boolean vb = ((ValueComparisonFunction) op.fn).compare( v1, v2 );

//react what is happening in appendValuePlain()
ret.allocateDenseBlock(false);
((DenseBlockBool) ret.getDenseBlock()).set(r,c,vb);
lnnz += vb ? 1 : 0;
}
}

} else {
for(int r=rl; r<ru; r++)
for(int c=0; c<clen; c++) {
double v1 = m1.quickGetValue(r, c);
double v2 = m2.quickGetValue(r, c);
double v = op.fn.execute( v1, v2 );
ret.appendValuePlain(r, c, v);
lnnz += (v!=0) ? 1 : 0;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please take the new code out to a new method and call that method from unsafeBinary?

FP32, FP64, INT32, INT64, BOOLEAN, STRING, UNKNOWN,
FP32, FP64, INT32, INT64, BOOLEAN, TRUE_BOOLEAN, STRING, UNKNOWN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reuse the existing BOOLEAN value type instead of adding a new one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reuse the old BOOLEAN type, yes. The reason for having two boolean types right now is to compare the bitset boolean denseblocks and the "true" boolean array denseblocks.

I would suggest renaming the "old" BOOLEAN, which currently is the bitset implementation, to BITSET and using BOOLEAN for the boolean array. That naming would make things more clear.
What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. You can rename those to BITSET and BOOLEAN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to revert this renaming. This Valuetype is used for so much more than just denseblocks, which I did not get the first time...

I will reuse the existing Valuetype.BOOLEAN and use another mechanism to create BooleanArrays when desired.
In the end, it makes a lot more sense when both DenseBlockBool implementations use the same ValueType, since both store booleans in the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is a better approach and makes integrating the boolean type into the system easier.

@phaniarnab
Copy link
Contributor

Thanks for the commits @louislepage. I have a few questions regarding the experiment setup.
How are you running and benchmarking the experiments? Are you running a dml script with time()?
How did you generate the input matrices? Are they dense or sparse?

@louislepage
Copy link
Contributor Author

Thanks for the feedback @phaniarnab !

Since the boolean denseblocks are in an early stage and not yet usable via DML, I wrote a simple java program that creates the matrices and runs the operators on them. It runs the operators on two randomly generated double matrices of a predefined size. In my example, I took the average of 25 computations with random matrices. The time was taken using javas Instant.now() and then calculated as milli seconds, which were stored in a LongSummaryStatistics object to compute averages etc.:

        Instant start = Instant.now();
        mb1.binaryOperations(op, mb2, ret);
        Instant finish = Instant.now();
        long timeElapsed = Duration.between(start, finish).toMillis();
        timeStatistics.accept(timeElapsed);

As described here, the tool can be found in this repo. It should be public.

The main reason for this approach is that I did not yet want to add full support for boolean matrices since I can not tell if that might break things in other places.
However, I think it should be fine to use boolean matrices whenever running unsafeBinary with operators that are Comparision functions. If the data of dense blocks are only accessed using the get() functions, we can always return the double versions of the boolean values there to maintain compatibility with the FP64 implementation. To access the values as boolean in the future, I added a getBoolean() function to the boolean denseblocks.

Let me know if I should add this support for boolean matrices by using translations to doubles in the get() method to maintain compatibility.

But first I should test which kind of boolean implementation performs better, boolean array or bitset.

@phaniarnab
Copy link
Contributor

Measuring time in Java is fine. However, I recommend using System.nanoTime() instead.
Use boolean/bitset inputs of different sizes.

@louislepage
Copy link
Contributor Author

New file, new missing license, sorry...
Will fix asap!

@louislepage
Copy link
Contributor Author

Experiment:

To analyze and compare the performance of DenseBlockBool as Bitset and as BooleanArray to the current FP64 implementation, a java program was created that uses the classes from systemds (systemds-boolean-matrix-evaluation). For each given matrix size, it does five computations with pseudo-randomly generated matrices for double to boolean operators and for boolean to boolean operators.
Both the total runtime of each operation and the memory used are stored. Finally, the average over the five runs is taken.
The Matrix sizes are chosen to represent up to 16mio total entries, but with different dims. We used square matrices of up to 4000x4000 and non-square ones of up to 100x160000. In the non-square case, one dimension was always fixed to a length of 100.

Evaluation of Double to Bool (GreaterThan)

As input, two dense FP64 Matrices are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type (FP64, Bitset, or BooleanArray) is created.
The timer is started, the operator called with the two input and the output matrices, the timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the result is computed (but before any Matrix blocks were dereferenced).

matrix_operations_DoubeToBool_large

The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices.

As expected, both runtime and memory usage increase linearly with an increased number of entries in the matrix.

The results show a visible performance increase for computation regarding runtime when using BooleanArray. This is most likely due to faster, direct access to the smaller arrays and fewer translations between boolean results and their double representation to store them in the result matrix since the operators can return boolean values, which can then be directly stored in the result matrices.
Interestingly the boolean array outperforms the Bitset here. The Bitset was even slower than the FP64 implementation. Without further investigation, this is probably due to fewer instructions necessary to access the boolean array compared to more when storing a value in a bitset. Also, it is worth noting that the bitset stores the values internally as long and therefore has to access an 8bytes large long whenever accessing a value. This might influence caching.

Regarding memory, both boolean implementations took less space. As expected, the Bitset takes even less space since each boolean in the boolean arrays still takes up one byte, but the bitset stores the booleans as individual bits that are collected into long.

There is no difference in runtime or memory when using non-square matrices as long as the total number of entires is the same.

Evaluation of Boolen To Boolean (Xor)

As input, two dense Matrices of the given type (FP64, Bitset, or BooleanArray) with boolean values are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type is created.
The timer is started, and the operator is called with the two input and the output matrices in a loop for 20 iterations. The timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the loop ended (but before any Matrix blocks were dereferenced).

matrix_operations_BoolToBool_large

The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices.

As expected, an improved runtime for the boolean arrays is also visible in this case and both runtime and memory usage increase linearly with an increased number of entries in the matrix.

We can also see an improvement in memory usage when using a boolean implementation.
The memory used is less than when running the Double to Boolean operator, even when using FP64, but this is probably due to reusing the result matrix as input to the next computation. Therefore after the first iteration, there are only two matrices in memory.

Memory usage for the Boolean to Boolean case, when using actual Booleans as input, has decreased significantly since all matrices in memory are small, not only the output matrix. It is also visible that the bitset again outperforms the boolean array regarding memory usage.

There is no difference in runtime or memory when using non-square matrices as long as the total number of entries is the same.

Conclusion:

Since both boolean implementations improve memory usage, but only the boolean array outperforms the FP64 implementation in runtime, I suggest using the boolean array as the DenseBlock for any operation that results in only boolean data.

Let me know if there are any further open questions.

@Baunsgaard
Copy link
Contributor

Have you tried pop count on the bit set for the sum operation? This should speed up the bit set significantly for sum.

@louislepage
Copy link
Contributor Author

louislepage commented Jul 4, 2023

Have you tried pop count on the bit set for the sum operation? This should speed up the bit set significantly for sum.

I did look into how some operators would probably have a speedup when using bitset because they can be performed on the entire bitset and not value by value.
OR, AND, and XOR are also candidates for this (e.g. BitSet.or(BitSet bitset).
But, it would mean a reimplementation of all of these operators. And a mechanism to decide when to use value by value and when Bitset on Bitset.

However, the GraterThan and other Comparison Functions were already slower for bitset , than when run on FP64, and these have to run value by value. This would mean that one would need a dynamic typing mechanism to decide during runtime which kind of boolean denseblock would be most beneficial for the upcoming operations. I doubt that translating from boolean array to bitset and vice-versa for each operation would be worth it.

@phaniarnab and I discussed this briefly in a meeting but came to the result, that in the end, there should be only one kind of boolean array implementation for now.

Therefore I proposed the boolean array since it seems to have some speedup over all operators while using the same value-by-value operations as the FP64 implementation did. It would provide a speedup while being backward-compatible with the current operator implementation, so to say.

But I agree that there most certainly are situations where bitset could be significantly faster.

I think that having both boolean implementations and deciding at runtime what should be used would be very interesting, but for now, I do not understand systemds well enough to confidently say if this would be feasible and how complex this would be.

@louislepage
Copy link
Contributor Author

Some changes from main were throwing errors for the overflow tests and, following the contributing guidelines, I had to rebase on the main again and ran into conflicts. I resolved them but had to force-push this branch to accommodate these changes.

@louislepage
Copy link
Contributor Author

Can you restart the pipeline once?
I think a couple of errors are due to issues with spark and caching and might not be directly code related.

I cannot replicate the errors locally and do not understand how the changes in this branch could have changed anything regarding privacy handling or spark execution.

If you do see an issue with tre test results that might be code dependent, I would be very grateful for any tips.
Thanks!

@phaniarnab
Copy link
Contributor

Can you restart the pipeline once? I think a couple of errors are due to issues with spark and caching and might not be directly code related.

I cannot replicate the errors locally and do not understand how the changes in this branch could have changed anything regarding privacy handling or spark execution.

If you do see an issue with tre test results that might be code dependent, I would be very grateful for any tips. Thanks!

Yes. They may not be related to your changes.

@louislepage
Copy link
Contributor Author

Thanks, now it was successful!
And with this, I do not have anything else to add to the project.
Let me know if you need further information on the code and experiments.


Anyways, I still see the same

org.apache.sysds.runtime.controlprogram.federated.FederatedWorkerHandlerException: Failed to getVariable

and

Exception : class io.netty.channel.AbstractChannel$AnnotatedConnectException
Message   : Connection refused: localhost/127.0.0.1:35013
LEVEL : 3
Exception : class java.net.ConnectException
Message   : Connection refused

I do see the same errors in other successful jobs: https://github.com/apache/systemds/actions/runs/5478347246/jobs/9978929681?pr=1850

Now I am interested in how these tests are designed... 😅

@phaniarnab
Copy link
Contributor

Thanks @louislepage. The plots conclude this task. Thanks for your contribution. 👍🏽
If there are any scripts that are needed to reproduce the plots, please share those with us.

@louislepage
Copy link
Contributor Author

Great, thanks for your mentoring!

The jupyter notebook and result CSVs with which I created the plots can be found under louislepage/systemds-boolean-matrix-evaluation/python-eval in the repo where you can also find the java code that I used to gather the data.

@j143 j143 requested a review from phaniarnab December 4, 2023 02:57
@j143 j143 added this to the systemds-3.2.0 milestone Dec 4, 2023
@phaniarnab
Copy link
Contributor

This feature needs more thought and testing before it is ready for merging. I will leave it for the next release. @j143

@j143 j143 modified the milestones: systemds-3.2.0, next-release Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

4 participants