External Sorting algorithm for Databases having constrained storage hierarchy
- Divy Patel (9085310937) dspatel6@wisc.edu
- Sahil Naphade (9085746619) smnaphade@wisc.edu
- Devaki Kulkarni (9086222321) dgkulkarni2@wisc.edu
- Manaswini Gogineni (9085432699) mgogineni@wisc.edu
- Divy: Cache-size mini runs, Device-optimized page sizes, Spilling memory-to-SSD, Spilling from SSD to disk, Graceful degradation, Optimized merge patterns, Testing and Memory Leak Check
- Sahil: Tournament trees, Offset-value coding, Minimum count of row & column comparisons, Optimized merge patterns, Large-size records, Testing and Memory Leak Check
- Devaki: Tournament trees, Offset-value coding, Large-size records
- Manaswini: Verification
-
Tournament trees:
File Tree.cpp @ Line 196 -
Offset-value coding:
File DataRecord.cpp @ Line 122 -
Minimum count of row & column comparisons
-
Cache-size mini runs:
File SortRecords.cpp @ Line 26 -
Device-optimized page sizes:
File SortRecords.cpp @ Line 81 and Line 136 -
Spilling memory-to-SSD:
File SortRecords.cpp @ Line 65 -
Spilling from SSD to disk:
File SortRecords.cpp @ Line 69 and Line 125 -
Graceful degradation:
File SortRecords.cpp @ Line 72, Line 74 and Line 151- Into merging
- Beyond one merge step
-
Optimized merge patterns:
File SortRecords.cpp @ Line 150 and Line 151 -
Verifying:
File Iterator.cpp @ Line 69 and Line 84- sets of rows & values:
File Iterator.cpp @ Line 84 - sort order:
File Iterator.cpp @ Line 69
- sets of rows & values:
-
Replacement selection?
-
Run size > memory size?
-
Variable-size records?
-
Compression?
-
Prefix truncation?
-
Quicksort
Tournament-tree priority queuewas used in order to achievehigh fan-infor merging our sorted run inputs of records and less number of comparisons than a standard tree-of-winnersOffset-value codingwas used to achieveminimum column value comparisonsCache-size mini runswere used to be able to fit the sort inputs, for tournament-tree, in the cache. This enabled us to leverage the low-latency accesses when there arecache hitsDevice-optimized page sizeswere used in order to being cognizant about theaccess-profile(latency, bandwidth)of various devices in the storage hierarchy. ForSSD, we used8KB(100 MB/s * 0.1 ms ~ 10KB)and forHDD, we used1MB(100 MB/s * 10 ms ~ 1MB)- We achieved graceful-degradation by spilling
cache-size runs from cache to memory,spilling memory-size runs from memory to SSDandspilling SSD-size runs from SSD to HDD - Also
HDD-page size(1MB)sorted runs were written toSSDprior to actually merging runs on theHDD. This is to leverage low-latency accesses of flash drives(SSD) Sort-order,set of rowsand theirvalueswere verified as part of sorting the input records. This is to verify thecorrectnessandintegrityof our sort algorihthm
- The implementation of the
External-Sortis complete with all of the techniques which were expected from us as part of the course project - The sort was tested against
1KBsize records and with12Mnumber of records(although it takes ~1hr to complete the sort, for this particular test-case) - The sort algorithm was tested against
valgrindto check for any memory leaks introduced while developing. The codebase does not have any memory leaks, from the latest leak-report on the most recent code version
- To run our program, first compile the source code using following command, under
External-Sortdirectory
$ cd External-Sort $ make ExternalSort.exe - After compiling the source code, to execute the External Sort with custom arguments, run following command inside
External-Sortdirectory
# Where, # "-c" gives the total number of records # "-s" is the individual record size # "-o" is the trace of your program run $ ./ExternalSort.exe -c 120 -s 10 -o trace0.txt -
The program creates three directories on the completion of the sort algorithm:
input: This directory consist of the input table which has records generated by the random-generator in arbitrary orderoutput: This directory consist of the output table which has records from input table but in a sorted order, sorted using our sort algorithmtrace: This directory consists of trace files generated from the sort. The trace file consists of logs related SSD and HDD device accesses. And the logs related to sort state machine
-
In order to remove all the generated binaries, executables, and the utility directories mentioned above, run the following command
$ make clean $ docker run -it --privileged -v $pwd/External-Sort:/External-Sort ubuntu bash $ apt-get update $ apt-get install build-essential make g++ vim sudo valgrind -y $ cd ./External-Sort $ make ExternalSort.exe # Where, # "-c" gives the total number of records # "-s" is the individual record size # "-o" is the trace of your program run $ ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt # Creates a log file `valgrind` inside the External-Sort directory $ valgrind --track-origins=yes --log-file="/External-Sort/valgrind" --leak-check=yes ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt