External Sorting algorithm for Databases having constrained storage hierarchy
- Divy Patel (9085310937) dspatel6@wisc.edu
- Sahil Naphade (9085746619) smnaphade@wisc.edu
- Devaki Kulkarni (9086222321) dgkulkarni2@wisc.edu
- Manaswini Gogineni (9085432699) mgogineni@wisc.edu
- Divy: Cache-size mini runs, Device-optimized page sizes, Spilling memory-to-SSD, Spilling from SSD to disk, Graceful degradation, Optimized merge patterns, Testing and Memory Leak Check
- Sahil: Tournament trees, Offset-value coding, Minimum count of row & column comparisons, Optimized merge patterns, Large-size records, Testing and Memory Leak Check
- Devaki: Tournament trees, Offset-value coding, Large-size records
- Manaswini: Verification
-
Tournament trees:
File Tree.cpp @ Line 196
-
Offset-value coding:
File DataRecord.cpp @ Line 122
-
Minimum count of row & column comparisons
-
Cache-size mini runs:
File SortRecords.cpp @ Line 26
-
Device-optimized page sizes:
File SortRecords.cpp @ Line 81 and Line 136
-
Spilling memory-to-SSD:
File SortRecords.cpp @ Line 65
-
Spilling from SSD to disk:
File SortRecords.cpp @ Line 69 and Line 125
-
Graceful degradation:
File SortRecords.cpp @ Line 72, Line 74 and Line 151
- Into merging
- Beyond one merge step
-
Optimized merge patterns:
File SortRecords.cpp @ Line 150 and Line 151
-
Verifying:
File Iterator.cpp @ Line 69 and Line 84
- sets of rows & values:
File Iterator.cpp @ Line 84
- sort order:
File Iterator.cpp @ Line 69
- sets of rows & values:
-
Replacement selection?
-
Run size > memory size?
-
Variable-size records?
-
Compression?
-
Prefix truncation?
-
Quicksort
Tournament-tree priority queue
was used in order to achievehigh fan-in
for merging our sorted run inputs of records and less number of comparisons than a standard tree-of-winnersOffset-value coding
was used to achieveminimum column value comparisons
Cache-size mini runs
were used to be able to fit the sort inputs, for tournament-tree, in the cache. This enabled us to leverage the low-latency accesses when there arecache hits
Device-optimized page sizes
were used in order to being cognizant about theaccess-profile(latency, bandwidth)
of various devices in the storage hierarchy. ForSSD
, we used8KB(100 MB/s * 0.1 ms ~ 10KB)
and forHDD
, we used1MB(100 MB/s * 10 ms ~ 1MB)
- We achieved graceful-degradation by spilling
cache-size runs from cache to memory
,spilling memory-size runs from memory to SSD
andspilling SSD-size runs from SSD to HDD
- Also
HDD-page size(1MB)
sorted runs were written toSSD
prior to actually merging runs on theHDD
. This is to leverage low-latency accesses of flash drives(SSD) Sort-order
,set of rows
and theirvalues
were verified as part of sorting the input records. This is to verify thecorrectness
andintegrity
of our sort algorihthm
- The implementation of the
External-Sort
is complete with all of the techniques which were expected from us as part of the course project - The sort was tested against
1KB
size records and with12M
number of records(although it takes ~1hr to complete the sort, for this particular test-case) - The sort algorithm was tested against
valgrind
to check for any memory leaks introduced while developing. The codebase does not have any memory leaks, from the latest leak-report on the most recent code version
- To run our program, first compile the source code using following command, under
External-Sort
directory
$ cd External-Sort
$ make ExternalSort.exe
- After compiling the source code, to execute the External Sort with custom arguments, run following command inside
External-Sort
directory
# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 10 -o trace0.txt
-
The program creates three directories on the completion of the sort algorithm:
input
: This directory consist of the input table which has records generated by the random-generator in arbitrary orderoutput
: This directory consist of the output table which has records from input table but in a sorted order, sorted using our sort algorithmtrace
: This directory consists of trace files generated from the sort. The trace file consists of logs related SSD and HDD device accesses. And the logs related to sort state machine
-
In order to remove all the generated binaries, executables, and the utility directories mentioned above, run the following command
$ make clean
$ docker run -it --privileged -v $pwd/External-Sort:/External-Sort ubuntu bash
$ apt-get update
$ apt-get install build-essential make g++ vim sudo valgrind -y
$ cd ./External-Sort
$ make ExternalSort.exe
# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt
# Creates a log file `valgrind` inside the External-Sort directory
$ valgrind --track-origins=yes --log-file="/External-Sort/valgrind" --leak-check=yes ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt