More blog post fixes
szarnyasg committed Dec 1, 2024
1 parent e956239 commit 41cafa0
Showing 12 changed files with 44 additions and 44 deletions.
2 changes: 1 addition & 1 deletion _posts/2021-06-25-querying-parquet.md
@@ -122,7 +122,7 @@ Below are the timings for both of these queries.

Pandas takes significantly longer to complete this query. That is because Pandas not only needs to read each of the three Parquet files in their entirety, but also has to concatenate these three separate Pandas DataFrames together.

## Concatenate Into a Single File
## Concatenate into a Single File

We can address the concatenation issue by creating a single big Parquet file from the three smaller parts. We can use the `pyarrow` library for this, which has support for reading multiple Parquet files and streaming them into a single large file. Note that the `pyarrow` parquet reader is the very same parquet reader that is used by Pandas internally.
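
For illustration, a minimal sketch of that approach with `pyarrow` (the file names below are placeholders, not the ones used in the post):

```python
import pyarrow.parquet as pq

# Placeholder names for the three smaller Parquet parts.
parts = ["part1.parquet", "part2.parquet", "part3.parquet"]

# Reuse the schema of the first part and append each part to one big file,
# processing one part at a time.
schema = pq.ParquetFile(parts[0]).schema_arrow
with pq.ParquetWriter("combined.parquet", schema) as writer:
    for path in parts:
        writer.write_table(pq.read_table(path))
```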

2 changes: 1 addition & 1 deletion _posts/2022-05-27-iejoin.md
@@ -12,7 +12,7 @@ and occur when two inequality conditions are present in a join predicate.
Database implementations often rely on slow `O(N^2)` algorithms that compare every pair of rows
for these operations.
Instead, DuckDB leverages its fast sorting logic to implement two highly optimized parallel join operators
for these kinds of range predicates, resulting in 20-30x faster queries.
for these kinds of range predicates, resulting in 20-30× faster queries.
With these operators, DuckDB can be used effectively in more time-series-oriented use cases.

<!--more-->
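
As a hedged illustration (the tables and values here are invented, not taken from the post), this is the query shape, with two inequalities in the join predicate, that these operators accelerate:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE events (id INTEGER, ts TIMESTAMP)")
con.execute("CREATE TABLE windows (name VARCHAR, start_ts TIMESTAMP, end_ts TIMESTAMP)")
con.execute("INSERT INTO events VALUES (1, '2022-05-27 10:05:00'), (2, '2022-05-27 11:30:00')")
con.execute("INSERT INTO windows VALUES ('morning', '2022-05-27 09:00:00', '2022-05-27 11:00:00')")

# Both join conditions are inequalities, so this is a range join rather than
# an equi-join, the kind of predicate the IEJoin operator is designed for.
con.sql("""
    SELECT w.name, count(*) AS events_in_window
    FROM events e, windows w
    WHERE e.ts >= w.start_ts
      AND e.ts <  w.end_ts
    GROUP BY w.name
""").show()
```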
2 changes: 1 addition & 1 deletion _posts/2022-09-30-postgres-scanner.md
@@ -160,7 +160,7 @@ To investigate the performance of the Postgres Scanner, we ran the well-known TP
|21 | 0.09| 1.53| 0.35|
|22 | 0.03| 0.15| 0.15|

Stock Postgres is not able to finish queries 17 and 20 within a one-minute timeout because of correlated subqueries containing a query on the lineitem table. For the other queries, we can see that DuckDB with the Postgres Scanner not only finished all queries, it also was faster than stock Postgres on roughly half of them, which is astonishing given that DuckDB has to read its input data from Postgres through the client/server protocol as described above. Of course, stock DuckDB is still 10x faster with its own storage, but as discussed at the very beginning of this post this requires the data to be imported there first.
Stock Postgres is not able to finish queries 17 and 20 within a one-minute timeout because of correlated subqueries containing a query on the lineitem table. For the other queries, we can see that DuckDB with the Postgres Scanner not only finished all queries, it also was faster than stock Postgres on roughly half of them, which is astonishing given that DuckDB has to read its input data from Postgres through the client/server protocol as described above. Of course, stock DuckDB is still 10× faster with its own storage, but as discussed at the very beginning of this post this requires the data to be imported there first.
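
For reference, a minimal sketch of querying Postgres from DuckDB. It uses the `ATTACH ... (TYPE POSTGRES)` syntax of recent DuckDB versions rather than the exact calls from the original post, and the connection string and table name are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Placeholder connection string; point it at your own Postgres instance.
con.execute("ATTACH 'dbname=tpch host=localhost user=postgres' AS pg (TYPE POSTGRES)")

# The rows are read from Postgres over the client/server protocol,
# while DuckDB executes the query itself.
con.sql("SELECT count(*) FROM pg.public.lineitem").show()
```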

## Other Use Cases

2 changes: 1 addition & 1 deletion _posts/2023-02-24-jupysql.md
@@ -79,7 +79,7 @@ Data frame takes 357 MB

<!-- #endregion -->

As you can see, we're using almost 10x as much memory as the file size. Given this overhead, we must be much more conservative about what *larger-than-memory* means, as "medium" files might not fit into memory once loaded. But this is just the beginning of our memory problems.
As you can see, we're using almost 10× as much memory as the file size. Given this overhead, we must be much more conservative about what *larger-than-memory* means, as "medium" files might not fit into memory once loaded. But this is just the beginning of our memory problems.
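
A quick, hedged way to reproduce this kind of measurement yourself (the file name is a placeholder):

```python
import os
import pandas as pd

path = "data.csv"  # placeholder file
df = pd.read_csv(path)

# Compare the size on disk with the in-memory footprint of the DataFrame.
file_mb = os.path.getsize(path) / 1024**2
mem_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"{file_mb:.0f} MB on disk -> {mem_mb:.0f} MB in memory")
```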

When plotting data, we often need to preprocess it before it's suitable for visualization. However, if we're not careful, these preprocessing steps will copy our data, dramatically increasing memory. Let's show a practical example.

10 changes: 5 additions & 5 deletions _posts/2023-05-17-announcing-duckdb-080.md
@@ -146,12 +146,12 @@ FROM 'data/glob/crawl/stackoverflow/**/*.csv';

## Storage Improvements

[**Lazy-Loading Table Metadata**](https://github.com/duckdb/duckdb/pull/6715). DuckDB’s internal storage format stores metadata for every row group in a table, such as min-max indexes and where in the file every row group is stored. In the past, DuckDB would load this metadata immediately once the database was opened. However, once the data gets very big, the metadata can also get quite large, leading to a noticeable delay on database startup. In this release, we have optimized the metadata handling of DuckDB to only read table metadata as it’s being accessed. As a result, startup is near-instantaneous even for large databases, and metadata is only loaded for columns that are actually used in queries. The benchmarks below are for a database file containing a single large TPC-H `lineitem` table (120x SF1) with ~770 million rows and 16 columns:
[**Lazy-Loading Table Metadata**](https://github.com/duckdb/duckdb/pull/6715). DuckDB’s internal storage format stores metadata for every row group in a table, such as min-max indexes and where in the file every row group is stored. In the past, DuckDB would load this metadata immediately once the database was opened. However, once the data gets very big, the metadata can also get quite large, leading to a noticeable delay on database startup. In this release, we have optimized the metadata handling of DuckDB to only read table metadata as it’s being accessed. As a result, startup is near-instantaneous even for large databases, and metadata is only loaded for columns that are actually used in queries. The benchmarks below are for a database file containing a single large TPC-H `lineitem` table (120× SF1) with ~770 million rows and 16 columns:

| Query | v0.6.1 | v0.7.1 | v0.8.0 | Parquet |
|------------------------|--------|--------|-------|---------|
| SELECT 42 | 1.60s | 0.31s | 0.02s | - |
| FROM lineitem LIMIT 1; | 1.62s | 0.32s | 0.03s | 0.27s |
| Query | v0.6.1 | v0.7.1 | v0.8.0 | Parquet |
|-------------------------|-------:|-------:|--------:|--------:|
| `SELECT 42` | 1.60s | 0.31s | 0.02s | - |
| `FROM lineitem LIMIT 1` | 1.62s | 0.32s | 0.03s | 0.27s |


## Clients
2 changes: 1 addition & 1 deletion _posts/2023-05-26-correlated-subqueries-in-sql.md
@@ -284,7 +284,7 @@ JOIN (
AND distance = min_distance;
```

By performing the de-correlation manually, the performance of SQLite and Postgres improves significantly. However, both systems remain over 30x slower than DuckDB.
By performing the de-correlation manually, the performance of SQLite and Postgres improves significantly. However, both systems remain over 30× slower than DuckDB.

| DuckDB | Postgres | SQLite |
|-------:|---------:|-------:|
2 changes: 1 addition & 1 deletion _posts/2023-07-07-python-udf.md
@@ -303,7 +303,7 @@ exec_external(con)
| UDF | 5.63 | 112.848 |


Here we can see that there is no significant regression in performance when utilizing UDFs. However, you still have the benefits of safer execution and the utilization of SQL. In our example, we can also notice that the external function materializes the entire query, resulting in a 5x higher peak memory consumption compared to the UDF approach.
Here we can see that there is no significant regression in performance when utilizing UDFs. However, you still have the benefits of safer execution and the utilization of SQL. In our example, we can also notice that the external function materializes the entire query, resulting in a 5× higher peak memory consumption compared to the UDF approach.
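
For context, a minimal sketch of registering a scalar Python UDF with DuckDB's Python client (this is not the function benchmarked in the post):

```python
import duckdb
from duckdb.typing import VARCHAR

def reverse_text(s: str) -> str:
    # Ordinary Python code; DuckDB calls it for each input row.
    return s[::-1]

con = duckdb.connect()
con.create_function("reverse_text", reverse_text, [VARCHAR], VARCHAR)

con.sql("SELECT reverse_text('DuckDB') AS reversed").show()
```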

## Conclusions and Further Development

8 changes: 4 additions & 4 deletions _posts/2023-09-15-asof-joins-fuzzy-temporal-lookups.md
@@ -496,10 +496,10 @@ Running the benchmark, we get results like this:
| IEJoin | 3.522 |
| State Join | 192.460 |

The runtime improvement of AsOf over IEJoin here is about 9x.
The runtime improvement of AsOf over IEJoin here is about 9×.
The horrible performance of the Hash Join is caused by the long (100K) bucket chains in the hash table.

The second benchmark tests the case where the probe side is about 10x smaller than the build side:
The second benchmark tests the case where the probe side is about 10× smaller than the build side:

```sql
CREATE OR REPLACE TABLE probe AS
@@ -540,7 +540,7 @@ FROM probe p INNER JOIN state s
| AsOf | 0.077 |
| IEJoin | 49.508 |

Now the runtime improvement of AsOf over IEJoin is huge (~500x)
Now the runtime improvement of AsOf over IEJoin is huge (~500×)
because it can leverage the partitioning to eliminate almost all of the equality mismatches.
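
For readers unfamiliar with the operator, a small hedged example of the `ASOF JOIN` syntax (the tables and values are invented, not the benchmark tables above):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE prices (ticker VARCHAR, ts TIMESTAMP, price DOUBLE)")
con.execute("CREATE TABLE trades (ticker VARCHAR, ts TIMESTAMP, volume INTEGER)")
con.execute("""
    INSERT INTO prices VALUES
        ('DCK', '2023-09-15 10:00:00', 1.00),
        ('DCK', '2023-09-15 10:05:00', 1.10)
""")
con.execute("INSERT INTO trades VALUES ('DCK', '2023-09-15 10:03:00', 42)")

# The equality on ticker partitions the data; the inequality picks, for each
# trade, the latest price at or before the trade's timestamp.
con.sql("""
    SELECT t.ticker, t.ts, t.volume, p.price
    FROM trades t
    ASOF JOIN prices p
      ON t.ticker = p.ticker AND t.ts >= p.ts
""").show()
```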

The Hash Join implementation does much better here because
@@ -596,7 +596,7 @@ CREATE OR REPLACE TABLE probe1 AS
```

The build tables are much larger and have approximately
10/100/1000x the number of entries as the 15 element tables:
10/100/1000× the number of entries as the 15 element tables:

```sql
-- 10:1
28 changes: 14 additions & 14 deletions _posts/2024-03-29-external-aggregation.md
@@ -304,32 +304,32 @@ The following table is a summary of the hardware used.

| Specs | `c6id.metal` | Laptop | Ratio |
|:------------|-------------:|-------:|-------:|
| Memory | 256 GB | 16 GB | 16x |
| CPU cores | 64 | 8 | 8x |
| CPU threads | 128 | 8 | 16x |
| Memory | 256 GB | 16 GB | 16× |
| CPU cores | 64 | 8 | 8× |
| CPU threads | 128 | 8 | 16× |
| Hourly cost | $6.45 | $0.00 | NaN |

Although the CPU cores of the AWS EC2 instance are not directly comparable with those of my laptop, the instance clearly has much more compute power and memory available.
Despite the large differences in hardware, DuckDB can complete all 10 queries without a problem:

| Query | `c6id.metal` | Laptop | Ratio |
|------:|-------------:|-------:|-------:|
| 1 | 0.08 | 0.74 | 9.25x |
| 2 | 0.09 | 0.76 | 8.44x |
| 3 | 8.01 | 156.63 | 19.55x |
| 4 | 0.26 | 2.07 | 7.96x |
| 5 | 6.72 | 145.00 | 21.58x |
| 6 | 17.12 | 19.28 | 1.13x |
| 7 | 6.33 | 124.85 | 19.72x |
| 8 | 6.53 | 126.35 | 19.35x |
| 9 | 0.32 | 1.90 | 5.94x |
| 10 | 8.58 | 264.14 | 30.79x |
| 1 | 0.08 | 0.74 | 9.25× |
| 2 | 0.09 | 0.76 | 8.44× |
| 3 | 8.01 | 156.63 | 19.55× |
| 4 | 0.26 | 2.07 | 7.96× |
| 5 | 6.72 | 145.00 | 21.58× |
| 6 | 17.12 | 19.28 | 1.13× |
| 7 | 6.33 | 124.85 | 19.72× |
| 8 | 6.53 | 126.35 | 19.35× |
| 9 | 0.32 | 1.90 | 5.94× |
| 10 | 8.58 | 264.14 | 30.79× |

The runtime of the queries is reported in seconds, and was obtained by taking the median of 3 runs on my laptop using DuckDB 0.10.1.
The `c6id.metal` instance results were obtained from the [benchmark website](https://duckdblabs.github.io/db-benchmark/).
Despite being unable to _fit_ all unique groups in my laptop's memory, DuckDB can _compute_ all unique groups and return them.
The largest query, query 10, takes almost 4.5 minutes to complete.
This is over 30x longer than with the beefy `c6id.metal` instance.
This is over 30× longer than with the beefy `c6id.metal` instance.
The large difference is, of course, explained by the large differences in hardware.
Interestingly, this is still faster than Spark on the `c6id.metal` instance, which takes 603.05 seconds!

@@ -237,7 +237,7 @@ LIMIT 5;
```

| id | name_short | name_long | country | latitude | longitude |
|----:|------------|-----------------------|---------|----------|-----------|
|----:|------------|-----------------------|---------|---------:|----------:|
| 266 | Den Bosch | 's-Hertogenbosch | NL | 51.69 | 5.29 |
| 269 | Dn Bosch O | 's-Hertogenbosch Oost | NL | 51.70 | 5.32 |
| 227 | 't Harde | 't Harde | NL | 52.41 | 5.89 |
26 changes: 13 additions & 13 deletions _posts/2024-06-26-benchmarks-over-time.md
@@ -4,7 +4,7 @@ title: "Benchmarking Ourselves over Time at DuckDB"
author: Alex Monahan
thumb: "/images/blog/thumbs/benchmarking-duckdb.svg"
image: "/images/blog/thumbs/benchmarking-duckdb.png"
excerpt: "In the last 3 years, DuckDB has become 3-25x faster and can analyze ~10x larger datasets all on the same hardware."
excerpt: "In the last 3 years, DuckDB has become 3-25× faster and can analyze ~10× larger datasets all on the same hardware."
tags: ["benchmark"]
---

@@ -125,7 +125,7 @@ If window functions are filtered out of the chart, we see a smoother trend.
You may also notice that starting with version 0.9 in September 2023, the performance appears to plateau.
What is happening here?
First, don't forget to zoom in!
Over the last year, DuckDB has still improved over 3x!
Over the last year, DuckDB has still improved over 3×!
More recently, the DuckDB Labs team focused on scalability by developing algorithms that support larger-than-memory calculations.
We will see the fruits of those labors in the scale section later on!
In addition, DuckDB focused exclusively on bug fixes in versions 0.10.1, 0.10.2, and 0.10.3 in preparation for an especially robust DuckDB 1.0.
@@ -178,7 +178,7 @@ If you remember the version that you last tested, you can compare how much faste

DuckDB has invested substantially in building a [fast and robust CSV parser]({% post_url 2023-10-27-csv-sniffer %}).
This is often the first task in a data analysis workload, and it tends to be undervalued and underbenchmarked.
DuckDB has **improved CSV reader performance by nearly 3x**, while adding the ability to handle many more CSV dialects automatically.
DuckDB has **improved CSV reader performance by nearly 3×**, while adding the ability to handle many more CSV dialects automatically.
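
A hedged sketch of what that auto-detection looks like from the Python client (the file name is a placeholder):

```python
import duckdb

# read_csv sniffs the dialect (delimiter, quoting, header, column types) automatically.
duckdb.sql("SELECT * FROM read_csv('events.csv') LIMIT 5").show()

# sniff_csv reports what the sniffer decided without you specifying anything.
duckdb.sql("SELECT * FROM sniff_csv('events.csv')").show()
```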

### Group By

@@ -192,7 +192,7 @@ DuckDB has **improved CSV reader performance by nearly 3x**, while adding the ab
});
</script>

Group by or aggregation operations are critical steps in OLAP workloads, and have therefore received substantial focus in DuckDB, **improving over 12x in the last 3 years**.
Group by or aggregation operations are critical steps in OLAP workloads, and have therefore received substantial focus in DuckDB, **improving over 12× in the last 3 years**.

In November 2021, version 0.3.1 enabled multithreaded aggregation by default, providing a significant speedup.

@@ -221,7 +221,7 @@ You can see that this was achieved while continuing to improve performance for t
</script>

Join operations are another area of focus for analytical databases, and DuckDB in particular.
Join speeds have **improved by 4x in the last 3 years**!
Join speeds have **improved by 4× in the last 3 years**!

Version 0.6.1 in December 2022 introduced improvements to the out-of-core hash join that actually improved the smaller-than-memory case as well.
Parallel data loading from 0.6.1 helps in this benchmark as well, as some results are the same size as the input table.
@@ -241,7 +241,7 @@ This focus has also benefitted the smaller-than-memory case and has led to the i
});
</script>

Over the time horizon studied, window functions have **improved a dramatic 25x**!
Over the time horizon studied, window functions have **improved a dramatic 25×**!

Window function performance was improved substantially with the 0.9.0 release in September 2023.
[14 different performance optimizations contributed](https://github.com/duckdb/duckdb/issues/7809#issuecomment-1679387022).
@@ -265,15 +265,15 @@ We leave benchmarking that feature for future work!
</script>

Often DuckDB is not the final step in a workflow, so export performance has an impact.
Exports are **10x faster now!**
Exports are **10× faster now!**
Until recently, the DuckDB format was not backward compatible, so the recommended long term persistence format was Parquet.
Parquet is also critical to interoperability with many other systems, especially data lakes.
DuckDB works well as a workflow engine, so exporting to other in-memory formats is quite common as well.

In the September 2022 release (version 0.5.1) we see significant improvements driven by switching from Pandas to Apache Arrow as the recommended in-memory export format.
DuckDB's underlying data types share many similarities with Arrow, so data transfer is quite quick.
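
A small sketch of the export paths being compared (the query and output path are arbitrary placeholders):

```python
import duckdb

rel = duckdb.sql("SELECT range AS i, range % 10 AS g FROM range(1000000)")

arrow_table = rel.arrow()  # hand the result over as an Apache Arrow table
pandas_df = rel.df()       # materialize the result as a Pandas DataFrame

# Parquet export goes through COPY.
duckdb.execute("COPY (SELECT range AS i FROM range(1000000)) TO 'out.parquet' (FORMAT parquet)")
```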

Parquet export performance has improved by 4–5x over the course of the benchmark, with dramatic improvements in versions 0.8.1 (June 2023) and 0.10.2 (April 2024).
Parquet export performance has improved by 4–5× over the course of the benchmark, with dramatic improvements in versions 0.8.1 (June 2023) and 0.10.2 (April 2024).
Version 0.8.1 added [parallel Parquet writing](https://github.com/duckdb/duckdb/pull/7375) while continuing to preserve insertion order.

The change driving the improvement in 0.10.2 was more subtle.
@@ -318,7 +318,7 @@ This allows DuckDB to fit seamlessly into other workflows.
This benchmark measures how fast DuckDB can scan and aggregate various data formats.

To enable comparisons over time, we switch from Pandas to Arrow at version 0.5.1 as mentioned.
DuckDB is **over 8x faster in this workload**, and the absolute time required is very short.
DuckDB is **over 8× faster in this workload**, and the absolute time required is very short.
DuckDB is a great fit for this type of work!

#### Scanning Apache Arrow vs. Pandas vs. Parquet
@@ -337,7 +337,7 @@ Once again, we examine all three formats over the entire time horizon.

When scanning data, Apache Arrow and Pandas are more comparable in performance.
As a result, while Arrow is clearly preferable for exports, DuckDB will happily read Pandas with similar speed.
However, in this case, the in-memory nature of both Arrow and Pandas allow them to perform 2–3x faster than Parquet.
However, in this case, the in-memory nature of both Arrow and Pandas allow them to perform 2–3× faster than Parquet.
In absolute terms, the time required to complete this operation is a very small fraction of the benchmark, so other operations should be the deciding factor.
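
To make the comparison concrete, a hedged sketch of scanning each format directly from SQL (the data and file name are placeholders):

```python
import duckdb
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"g": [1, 1, 2], "x": [10, 20, 30]})
tbl = pa.Table.from_pandas(df)

# Replacement scans: SQL can refer to in-memory Python objects by variable name.
duckdb.sql("SELECT g, sum(x) FROM df GROUP BY g").show()    # Pandas DataFrame
duckdb.sql("SELECT g, sum(x) FROM tbl GROUP BY g").show()   # Arrow table
# Parquet is read from disk instead:
# duckdb.sql("SELECT g, sum(x) FROM 'data.parquet' GROUP BY g").show()
```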

## Scale tests
Expand All @@ -357,7 +357,7 @@ Analyzing larger-than-memory data is a superpower for DuckDB, allowing it to be
In version 0.9.0, launched in September 2023, [DuckDB's hash aggregate was enhanced to handle out-of-core (larger than memory) intermediates](https://github.com/duckdb/duckdb/pull/7931).
The details of the algorithm, along with some benchmarks, are available in [this blog post]({% post_url 2024-03-29-external-aggregation %}).
This allows for DuckDB to aggregate one billion rows of data (50GB in size) on a MacBook Pro with only 16GB of RAM, even when the number of unique groups in the group by is large.
This represents at least a 10x improvement in aggregate processing scale over the course of the 3 years of the benchmark.
This represents at least a 10× improvement in aggregate processing scale over the course of the 3 years of the benchmark.
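
A hedged sketch of running such an aggregation on a memory-constrained machine (the database name, file, column names, and spill path are placeholders; the settings are standard DuckDB options):

```python
import duckdb

con = duckdb.connect("groupby.db")             # on-disk database (placeholder name)
con.execute("SET memory_limit = '16GB'")       # cap DuckDB's memory use
con.execute("SET temp_directory = './spill'")  # where out-of-core intermediates land

# If the unique groups exceed the memory limit, the hash aggregate spills
# its intermediates to disk instead of erroring out.
con.sql("""
    SELECT id1, id2, count(*) AS n, sum(v1) AS v1
    FROM read_parquet('groupby_1e9.parquet')
    GROUP BY id1, id2
""").show()
```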

DuckDB's hash join operator has supported larger-than-memory joins since version 0.6.1 in December 2022.
However, the scale of this benchmark (coupled with the limited RAM of the benchmarking hardware), meant that this benchmark could still not complete successfully.
@@ -376,7 +376,7 @@ DuckDB's performance on the same hardware has improved dramatically, and at the

Source: [Our World in Data](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?yScale=linear&time=2021..latest&facet=metric&uniformYAxis=0)

The price of RAM has declined by 2.2x and the price of SSD storage has decreased by 2.7x from 2021 to 2023 alone.
The price of RAM has declined by 2.2× and the price of SSD storage has decreased by 2.7× from 2021 to 2023 alone.
Thanks to the combination of DuckDB enhancements and hardware prices, the scale of analysis possible on a single node has increased by substantially more than an order of magnitude in just 3 years!

## Analyzing the Results Yourself
@@ -406,7 +406,7 @@ Additionally, critical workloads like data import, CSV parsing, and data export
The complete developer experience is critical for DuckDB!

Finally, DuckDB now supports larger-than-memory calculations across all operators: sorting, aggregating, joining, and windowing.
The size of problem that you can handle on your current compute resources just got 10x bigger, or more!
The size of problem that you can handle on your current compute resources just got 10× bigger, or more!

If you have made it this far, welcome to the flock! 🦆
[Join us on Discord](https://discord.duckdb.org/), we value your feedback!