This repository contains the source code of the paper TORN. TORN is mainly oriented to distributed join scenarios: it optimizes the data layout for multi-table and single-table queries and, on that basis, co-partitions blocks. Its main execution process is:
a) train the MLP predictor on historical queries to predict the future workload <workload_predictor.py>;
b) build adaptive partition trees based on the predicted (multi-table and single-table) queries <partition_algorithm.py>;
c) form partition files (Parquet) by routing the data according to the partition trees <data_routing.ipynb>;
d) conduct experiments using the evaluation metrics <experiment.py / experiment.ipynb>;
e) execute the experiments on Spark <query_routing.ipynb>.
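For orientation, a minimal end-to-end driver might look like the sketch below. It assumes the two Python scripts can be run without extra command-line arguments and that the notebooks are executed headlessly via `jupyter nbconvert`; these are assumptions, so check each file's entry point and adjust paths and arguments to your environment.

```python
# Hypothetical driver for the five steps above; script arguments are assumptions.
import subprocess

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["python3", "workload_predictor.py"])        # a) predict the future workload
run(["python3", "partition_algorithm.py"])       # b) build adaptive partition trees
# c) route data into Parquet partition files (notebook, executed headlessly)
run(["jupyter", "nbconvert", "--to", "notebook", "--execute", "data_routing.ipynb"])
run(["python3", "experiment.py"])                # d) evaluation metrics
# e) run the workload on Spark and record execution metrics
run(["jupyter", "nbconvert", "--to", "notebook", "--execute", "query_routing.ipynb"])
```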
Directory
- NORA_experiments includes three subdirectories: a) images, b) queryset, c) datasets. images stores temporary visualization files for data layouts and queries. queryset stores workload files, classified as multi_join / tpcds / tpch. datasets stores data files generated by the TPC-H/TPC-DS kits; they are not uploaded here because of their size.
- SIM_experiments is the test environment for similar-workload prediction; it includes test workloads for the MLP and IFDT predictors.
- PartitionLayout stores the partition tree files generated for the TPC-H, TPC-DS, and distributed-join scenarios.
Python Files
- data_helper supports sampling, reading, computing statistics over, and generating table data and workloads.
- partition_tree.py constructs the (hierarchical) partition tree; it can also serialize a tree to a bytes file and load it back flexibly (a rough structural sketch follows this list).
- partition_node.py defines the structure of a tree node and its functionality.
- partition_algorithm.py constructs the trees for all of the provided baselines.
- join_until.py supports the computation of the hyper cost and provides the QDG algorithm.
- join_depths.py supports determining the best depths for the different trees.
- workload_predictor.py includes the implementations of the IFDT and MLP predictors and their evaluation process (an illustrative MLP sketch follows this list).
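To make the tree/node split concrete, here is a rough sketch of what a partition node and a serializable tree might look like. The class and field names (PartitionNode, boundary, partition_id, etc.) are illustrative assumptions, not the actual API of partition_node.py / partition_tree.py; consult those files for the real structure.

```python
# Illustrative sketch only; names and fields are assumptions, not the real API.
import pickle
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PartitionNode:
    boundary: List[tuple]                 # per-dimension (lower, upper) bounds
    split_dim: Optional[int] = None       # dimension used to split this node
    split_value: Optional[float] = None   # split point on that dimension
    children: List["PartitionNode"] = field(default_factory=list)
    partition_id: Optional[int] = None    # set for leaves that map to a Parquet file

    def is_leaf(self) -> bool:
        return not self.children

@dataclass
class PartitionTree:
    root: PartitionNode

    def save(self, path: str) -> None:    # serialize the tree to a bytes file
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path: str) -> "PartitionTree":
        with open(path, "rb") as f:
            return pickle.load(f)
```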
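As a rough illustration of the MLP-based workload prediction, the sketch below trains a small multi-layer perceptron to map a vectorized historical query (its filter bounds) to the bounds of an expected future query. The feature encoding, the toy data, and the scikit-learn usage are assumptions made for illustration; workload_predictor.py contains the actual IFDT/MLP implementations.

```python
# Illustrative only: predict future query ranges from historical ones with an MLP.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Toy historical workload: each row is a query encoded as [lo_1, hi_1, lo_2, hi_2].
past_queries = rng.uniform(0, 100, size=(500, 4))
# Assume the "next" query drifts slightly from the previous one (training target).
next_queries = past_queries + rng.normal(0, 2, size=past_queries.shape)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
model.fit(past_queries, next_queries)

predicted = model.predict(past_queries[:5])     # predicted future query ranges
print(predicted.shape)                          # (5, 4)
```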
Jupyter Files
- data_routing.ipynb / data_routing_for_join.ipynb use the partition tree constructed on the sampled data to write all of the data into the corresponding Parquet partition files according to the filter criteria of each node in the tree; the two notebooks target the single-table and multi-table environments, respectively (see the routing sketch after this list).
- query_routing.ipynb / query_routing_for_join.ipynb: after building the partitioned environment on each Spark node, we use Spark SQL to execute the workloads in the single-table and multi-table scenarios, respectively, and record the execution metrics (see the Spark sketch after this list).
- experiment.ipynb contains the experiments related to scalability.
- experiment_join.ipynb contains experiments related to distributed joins; it evaluates the performance of TORN and AdaptDB on multi-table queries.
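The data-routing step can be pictured as follows: every leaf of the partition tree carries a filter predicate, and each record is written into the Parquet file of the leaf whose predicate it satisfies. The sketch below uses pandas (with a Parquet engine such as pyarrow installed) and a hypothetical list of (partition_id, predicate) pairs; the notebooks implement this against the real tree and the full dataset.

```python
# Illustrative routing of rows into per-leaf Parquet files (hypothetical predicates).
import pandas as pd

df = pd.DataFrame({"a": range(100), "b": range(100, 200)})

# Hypothetical leaf predicates taken from a partition tree: (partition_id, predicate).
leaves = [
    (0, lambda r: r["a"] < 50),
    (1, lambda r: r["a"] >= 50),
]

for pid, pred in leaves:
    part = df[df.apply(pred, axis=1)]
    # One Parquet file per leaf; on HDFS this would be a path under par_data/...
    part.to_parquet(f"partition_{pid}.parquet", index=False)
```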
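Query routing on Spark can be sketched as: read the routed Parquet partitions, register them as a temporary view, run each workload query with Spark SQL, and record a timing metric. The path, view name, and query below are placeholders, not the repository's actual configuration; the notebooks record the full set of execution metrics.

```python
# Minimal PySpark sketch; the path, view name, and query are placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query_routing_sketch").getOrCreate()

# Read the Parquet partition files produced by the data-routing step.
df = spark.read.parquet("hdfs:///par_data/tpch/TORN/scale1")
df.createOrReplaceTempView("lineitem")

start = time.time()
result = spark.sql("SELECT COUNT(*) FROM lineitem WHERE l_quantity < 24").collect()
print("rows:", result, "latency(s):", time.time() - start)
```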
Users need to set up their own Hadoop and Spark environments. The following is a template of the paths that need to be created in the HDFS file system.
PATHS Template
- par_data
  - tpch
    - QdTree
      - scale1
      - scale10
      - ...
    - PAW
    - TORN
    - AdaptDB
    - QdTree
  - tpcds
  - multi_join
    - tpch
    - ...
    - tpcds
    - tpch
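To create such a directory skeleton on HDFS, something like the following can be used; the root prefix and the exact set of method/scale directories are assumptions to adapt to your cluster.

```python
# Create the HDFS directory skeleton with `hdfs dfs -mkdir -p` (paths are examples).
import subprocess

base = "/par_data"  # assumed HDFS root for the partitioned data
paths = [
    f"{base}/tpch/QdTree/scale1",
    f"{base}/tpch/QdTree/scale10",
    f"{base}/tpch/PAW/scale1",
    f"{base}/tpch/TORN/scale1",
    f"{base}/tpch/AdaptDB/scale1",
    f"{base}/tpcds",
    f"{base}/multi_join/tpch",
    f"{base}/multi_join/tpcds",
]
for p in paths:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", p], check=True)
```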
The project's dependencies are listed in requirements.txt and can be installed with the following command:
python3 -m pip install -r requirements.txt
experiment.py:compare_hyper_join_with_multitable
Assigning reasonable depths to the two levels of the join tree has an important impact on the quality of the join tree. TORN uses a heuristic algorithm (implemented in join_depths.py) to assign the join depths of the top and bottom layers.
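As a purely illustrative sketch of this kind of depth assignment (not TORN's actual heuristic, which lives in join_depths.py), one simple approach is to enumerate the candidate splits of a total depth budget between the top and bottom layers and keep the split with the lowest estimated cost:

```python
# Illustrative depth-assignment search; the cost model is a placeholder, not TORN's.
def choose_depths(total_depth, estimate_cost):
    """Try every (top, bottom) split of the depth budget and keep the cheapest."""
    best = None
    for top in range(1, total_depth):
        bottom = total_depth - top
        cost = estimate_cost(top, bottom)
        if best is None or cost < best[0]:
            best = (cost, top, bottom)
    return best[1], best[2]

# Toy cost model (placeholder): pretend specific depths are optimal.
toy_cost = lambda top, bottom: abs(top - 3) + abs(bottom - 5)
print(choose_depths(8, toy_cost))   # -> (3, 5)
```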