Streaming load 60GB csv will OOM #6910

Closed
Xuanwo opened this issue Jul 30, 2022 · 13 comments · Fixed by #6945

Xuanwo (Member) commented Jul 30, 2022

Follow https://databend.rs/doc/learn/analyze-ontime-with-databend-on-ec2-and-s3.

Start databend-query (release build) with the local fs backend.

:( curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" -H "field_delimiter:\t" -F "upload=@t_ontime.csv" -XPUT http://root:@127.0.0.1:8000/v1/streaming_load



curl: (56) Recv failure: Connection reset by peer
curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" -H  -F   0.79s user 25.15s system 16% cpu 2:40.03 total

Databend Query

:) ./before-databend-query
Databend Query

Version: v0.7.146-nightly-94af6a2-simd(rust-1.64.0-nightly-2022-07-30T03:25:03.396932975Z)
Log:
    File: enabled level=INFO dir=./.databend/logs
    Stderr: disabled level=INFO
Meta: embedded at ./.databend/meta_embedded
Storage: fs://root=_data

MySQL
    listened at 127.0.0.1:3307
    connect via: mysql -uroot -h127.0.0.1 -P3307
Clickhouse(native)
    listened at 127.0.0.1:9000
    connect via: clickhouse-client --host 127.0.0.1 --port 9000
Clickhouse(http)
    listened at 127.0.0.1:8124
    usage:  echo 'create table test(foo string)' | curl -u root: '127.0.0.1:8124' --data-binary  @-
echo '{"foo": "bar"}' | curl -u root: '127.0.0.1:8124/?query=INSERT%20INTO%20test%20FORMAT%20JSONEachRow' --data-binary @-
Databend HTTP
    listened at 127.0.0.1:8000
    usage:  curl -u root: --request POST '127.0.0.1:8000/v1/query/' --header 'Content-Type: application/json' --data-raw '{"sql": "SELECT avg(number) FROM numbers(100000000)"}'

[1]    1085010 killed     ./before-databend-query
./before-databend-query  1672.18s user 2811.37s system 1291% cpu 5:47.12 total

Dmesg

[100202.445966] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.wezfurlong.wezterm-6c0843b02ff44de4974b3bf2e82f095a.scope,task=before-databend,pid=1085010,uid=1000
[100202.446149] Out of memory: Killed process 1085010 (before-databend) total-vm:76541232kB, anon-rss:39267488kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:111932kB oom_score_adj:200
[100205.025219] oom_reaper: reaped process 1085010 (before-databend), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Runtime info:

:) neofetch
xuanwo@xuanwo-work
------------------
OS: Arch Linux x86_64
Host: 90RC00EGCP LEGION REN7000P-26AMR
Kernel: 5.18.14-zen1-1-zen
Uptime: 1 day, 4 hours, 2 mins
Packages: 1099 (pacman), 2 (nix-user)
Shell: zsh 5.9
Resolution: 3840x2160
DE: Plasma 5.25.3
WM: KWin
Theme: Breeze Light [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze [GTK2/3]
Terminal: WezTerm
CPU: AMD Ryzen 9 5950X (32) @ 3.400GHz
GPU: AMD ATI Radeon RX 6800/6800 XT / 6900 XT
Memory: 24989MiB / 64163MiB

cc @youngsofun, can you take a look?

youngsofun (Member) commented Aug 1, 2022

@Xuanwo have you tried copy into for this CSV? Does it OOM too?
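For reference, a copy into attempt might look roughly like this (a sketch only: the S3 location, credentials, and FILE_FORMAT options are placeholders based on the ontime tutorial, not a verified command):

# hypothetical sketch: load the same file with COPY INTO via the MySQL handler
mysql -uroot -h127.0.0.1 -P3307 -e \
  "COPY INTO ontime FROM 's3://mybucket/t_ontime.csv' \
     CREDENTIALS = (aws_key_id = '<key>' aws_secret_key = '<secret>') \
     FILE_FORMAT = (type = 'CSV' field_delimiter = '\t' skip_header = 0)"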

The file is too big for my local disk...

Apart from the suspected memory leak, I will start from #6936.

youngsofun (Member) commented:

.zip is not supported?

Xuanwo (Member, Author) commented Aug 1, 2022

@Xuanwo have you tried copy into for this CSV? Does it OOM too?

Yes. This query will OOM on my 64GiB PC.

.zip is not supported?

Yep, zip is an archive format and is more complex than a simple compression format like gz.

We don't support zip and tar so far.
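As a workaround, the archive can be extracted locally before loading; a sketch with a hypothetical archive name:

# hypothetical: extract the zip first, then stream-load the plain CSV inside it
unzip t_ontime.zip
curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" \
  -H "field_delimiter:\t" -F "upload=@t_ontime.csv" \
  -XPUT http://root:@127.0.0.1:8000/v1/streaming_load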

For more information, see apache/opendal#312.

sundy-li (Member) commented Aug 1, 2022

Could you try using lower parallelism, e.g. -H 'max_threads: 16'?

Xuanwo (Member, Author) commented Aug 1, 2022

Could you try using lower parallelism, e.g. -H 'max_threads: 16'?

It works.

After decreasing max_threads from the default value on my PC (32) to 16, I can finish this query:

:) curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" -H "field_delimiter:\t" -H 'max_threads: 16'  -F "upload=@t_ontime.csv" -XPUT http://root:@127.0.0.1:8000/v1/streaming_load
{"id":"59f22406-f666-4f54-91df-aa46f993f16e","state":"SUCCESS","stats":{"rows":202687655,"bytes":147120170524},"error":null}curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" -H  -H   4.99s user 127.40s system 10% cpu 21:51.63 total

During the insert, the highest memory usage recorded was 33.9GiB.

flaneur2020 (Member) commented:

There's a size limit and timeout in Doris's stream load:

Since Stream load is based on the BE initiative to import and distribute data, the recommended amount of imported data is between 1G and 10G. Since the default maximum Stream load import data volume is 10G, the configuration of BE streaming_load_max_mb needs to be modified if files exceeding 10G are to be imported.
Stream load default timeout is 300 seconds, according to Doris currently the largest import speed limit, about more than 3G files need to modify the import task default timeout.

https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual?_highlight=stream#data-volume

PsiACE (Member) commented Aug 1, 2022

I noticed that sled supports the io_uring feature, perhaps we could consider enabling it?

curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:0" -H "field_delimiter:\t"  -F 'upload=@"../../databend/mini_ontime.csv"'  -XPUT http://root:@127.0.0.1:8001/v1/streaming_load
{"id":"8ac244ea-036c-4d81-98d1-c7d24ccd6747","state":"SUCCESS","stats":{"rows":20000000,"bytes":14409762131},"error":null}

before: 9m59s

after: 9m12s

youngsofun (Member) commented:

I tried inserting into a table created with

CREATE TABLE ontime2 ... engine = null;

and the memory did not grow.

So I guess the bug is in the insert pipeline. @sundy-li

sundy-li (Member) commented Aug 2, 2022

OK, it's in TransformBlockCompact.

sundy-li (Member) commented Aug 2, 2022

I tried inserting into a table created with

CREATE TABLE ontime2 ... engine = null;

and the memory did not grow.

So I guess the bug is in the insert pipeline. @sundy-li

I tested the null engine; it caused OOM too.

youngsofun (Member) commented:

This file should be loaded with the TSV format.

Xuanwo (Member, Author) commented Aug 2, 2022

This file should be loaded with the TSV format.

What's the difference? And will this prevent the OOM?

@youngsofun
Copy link
Member

The current implementation assumes:
CSV uses " or ' to enclose strings;
TSV does not use enclosing.
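Assuming the only change needed is the format name in the insert_sql header (a sketch, not verified against the loader), the load command would look roughly like:

curl -H "insert_sql:insert into ontime format TSV" -H "skip_header:0" \
  -H "field_delimiter:\t" -H 'max_threads: 16' \
  -F "upload=@t_ontime.csv" \
  -XPUT http://root:@127.0.0.1:8000/v1/streaming_load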
