Orc performance issue #630

richox · 2024-10-23T06:32:45Z

Describe the bug
We found that the ORC scan performs poorly when running the TPC-DS benchmark:

the same scan operator is times slower than parquet (from tpcds q3).

To Reproduce
Steps to reproduce the behavior:

1. Generate Parquet and ORC datasets using /tpcds/datagen.
2. Run benchmarks on both datasets using /tpcds/benchmark-runner.
3. Compare the performance of NativeParquetScan and NativeOrcScan.

Expected behavior
ORC should have performance similar to Parquet.

Screenshots

Edit
The main reason is that orc-rust reads all data, with neither column pruning nor predicate filtering. After applying column pruning with datafusion-contrib/datafusion-orc#133, the performance will be much better:

Currently ORC is still 20%~30% slower than Parquet, which may be related to unsupported predicate filtering.
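
To illustrate why these two optimizations matter for a scan, here is a minimal toy sketch in Rust. It is not orc-rust's or datafusion-orc's API: every type and function below (`Stripe`, `ColumnChunk`, `GtEq`, `scan`, `sample_stripes`) is hypothetical, and the "cost" is simply the number of values a reader would have to decode. It assumes stripe-level min/max statistics are available for the predicate column.

```rust
// Toy model only: hypothetical types, not the orc-rust API.

/// Min/max statistics for one column within one stripe.
#[derive(Clone, Copy)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// One column chunk: its statistics plus the values a reader would decode.
pub struct ColumnChunk {
    pub stats: ColumnStats,
    pub values: Vec<i64>,
}

/// A stripe holds one chunk per column.
pub struct Stripe {
    pub columns: Vec<ColumnChunk>,
}

/// A single-column predicate of the form `column >= lower`.
pub struct GtEq {
    pub column: usize,
    pub lower: i64,
}

/// Builds a chunk and derives its statistics (expects non-empty values).
pub fn chunk(values: Vec<i64>) -> ColumnChunk {
    let min = values.iter().copied().min().unwrap_or(0);
    let max = values.iter().copied().max().unwrap_or(0);
    ColumnChunk { stats: ColumnStats { min, max }, values }
}

/// Returns how many values would be decoded by the scan.
/// `projection = None` means "read every column" (no column pruning).
/// `predicate = None` means "read every stripe" (no predicate filtering).
pub fn scan(stripes: &[Stripe], projection: Option<&[usize]>, predicate: Option<&GtEq>) -> usize {
    let mut decoded = 0;
    for stripe in stripes {
        // Predicate filtering: skip the whole stripe when its statistics
        // prove that no row can satisfy `column >= lower`.
        if let Some(p) = predicate {
            if stripe.columns[p.column].stats.max < p.lower {
                continue;
            }
        }
        for (idx, col) in stripe.columns.iter().enumerate() {
            // Column pruning: decode only the columns the query asks for.
            let wanted = match projection {
                Some(cols) => cols.contains(&idx),
                None => true,
            };
            if wanted {
                decoded += col.values.len();
            }
        }
    }
    decoded
}

/// Two stripes, four columns each, three rows per stripe.
pub fn sample_stripes() -> Vec<Stripe> {
    vec![
        Stripe {
            columns: vec![
                chunk(vec![1, 2, 3]),
                chunk(vec![10, 20, 30]),
                chunk(vec![7, 7, 7]),
                chunk(vec![0, 0, 0]),
            ],
        },
        Stripe {
            columns: vec![
                chunk(vec![100, 200, 300]),
                chunk(vec![40, 50, 60]),
                chunk(vec![8, 8, 8]),
                chunk(vec![1, 1, 1]),
            ],
        },
    ]
}
```

On this sample data, a scan with no projection and no predicate touches all 24 values. Projecting columns 0 and 1 touches 12. Adding the predicate `column 0 >= 50` skips the first stripe as well and touches 6. A real reader would still have to evaluate the predicate row by row inside the stripes that survive.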

This was referenced Oct 23, 2024

skip reading unused columns in stripes datafusion-contrib/datafusion-orc#132

Closed

improve NativeOrcScan #631

Merged

richox pinned this issue Oct 24, 2024

lihao712 closed this as completed in #631 Oct 28, 2024

richox reopened this Oct 29, 2024
