Orc performance issue #630

Open
richox opened this issue Oct 23, 2024 · 0 comments · Fixed by #631

Collaborator

richox commented Oct 23, 2024

Describe the bug
we found that orc scan has poor performance while running the tpcds benchmark:

the same scan operator is multiple times slower on orc than on parquet (from tpcds q3).

To Reproduce
Steps to reproduce the behavior:

  1. generate parquet and orc datasets using /tpcds/datagen.
  2. run benchmarks on both datasets using /tpcds/benchmark-runner.
  3. compare the performance of NativeParquetScan and NativeOrcScan.

Expected behavior
orc should have similar performance compared to parquet.

Screenshots
image

Edit
the main reason is that orc-rust reads all data without column pruning or predicate filtering. after applying column pruning with datafusion-contrib/datafusion-orc#133, the performance is much better:

image

currently orc is still 20%~30% slower than parquet, which may be related to unsupported predicate filtering.
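
to illustrate why these two optimizations matter, here is a minimal toy sketch of a columnar scan. it is not orc-rust's or datafusion-orc's actual code: `Stripe`, `ScanResult` and `scan` are hypothetical names, columns are plain `i64` vectors, and `values_decoded` is only a stand-in for I/O and decode cost. the assumption is that each stripe carries per-column min/max statistics, so a stripe can be skipped when an equality predicate cannot match.

```rust
/// Toy stripe: column-major data plus per-column (min, max) statistics.
/// Hypothetical layout for illustration only, not orc-rust's real types.
pub struct Stripe {
    pub columns: Vec<Vec<i64>>, // columns[c][row]
    pub stats: Vec<(i64, i64)>, // (min, max) for each column
}

impl Stripe {
    pub fn new(columns: Vec<Vec<i64>>) -> Self {
        let stats = columns
            .iter()
            .map(|c| {
                let min = c.iter().copied().min().unwrap_or(i64::MAX);
                let max = c.iter().copied().max().unwrap_or(i64::MIN);
                (min, max)
            })
            .collect();
        Stripe { columns, stats }
    }
}

pub struct ScanResult {
    pub rows: Vec<Vec<i64>>,   // each row holds only the projected columns
    pub values_decoded: usize, // rough proxy for read/decode cost
}

/// Scan with column pruning (`projection`) and an optional equality
/// predicate `(column, value)` that is checked against stripe statistics.
pub fn scan(
    stripes: &[Stripe],
    projection: &[usize],
    predicate: Option<(usize, i64)>,
) -> ScanResult {
    let mut rows: Vec<Vec<i64>> = Vec::new();
    let mut values_decoded = 0usize;

    for stripe in stripes {
        // Predicate filtering: skip the whole stripe if the value is
        // outside the [min, max] range of the predicate column.
        if let Some((col, value)) = predicate {
            let (min, max) = stripe.stats[col];
            if value < min || value > max {
                continue;
            }
        }

        // Column pruning: decode only the projected columns, plus the
        // predicate column if it is not already projected.
        let mut needed: Vec<usize> = projection.to_vec();
        if let Some((col, _)) = predicate {
            if !needed.contains(&col) {
                needed.push(col);
            }
        }

        let num_rows = stripe.columns.first().map_or(0, |c| c.len());
        values_decoded += num_rows * needed.len();

        for row in 0..num_rows {
            let keep = match predicate {
                Some((col, value)) => stripe.columns[col][row] == value,
                None => true,
            };
            if keep {
                rows.push(projection.iter().map(|&c| stripe.columns[c][row]).collect());
            }
        }
    }

    ScanResult { rows, values_decoded }
}
```

in this toy model, calling `scan` with every column in `projection` and no predicate corresponds to the "reads all data" behavior described above; narrowing `projection` models column pruning, and passing a predicate models the stripe skipping that is still missing.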
