Big Sort – Sorting Files in Terabytes
We have all sorted data many times; it is a simple task: enter the data in Excel and a single button click will sort the entries in a fraction of a second. But what if we need to sort a huge amount of data that cannot be opened in Excel or any other spreadsheet application? Scientists have studied sorting algorithms extensively; it is an important subject in computer science, and it has been proved that comparison-based sorting cannot do better than O(n log n) comparisons. For our task of sorting huge files, the Merge Sort algorithm can be used.
In merge sort we have multiple individually sorted lists; while merging, elements are picked from the head of each list so that the new list stays sorted. The process can start with lists as small as one element: merging two 1-element lists gives a 2-element list, merging two of those gives a 4-element list, then an 8-element list, and the process repeats until the complete data is sorted.
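To make the merge step concrete, here is a minimal Python sketch (the function name and the sample lists are only for illustration): repeatedly take the smaller head element from two already-sorted lists.

def merge(left, right):
    # Pick the smaller head element from either list until one is exhausted.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge([1, 4, 8], [2, 3, 9]))  # [1, 2, 3, 4, 8, 9]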
One of the major problems in sorting is holding all the data in memory, so let us read the file serially, i.e. line by line. Once a considerable number of lines has been collected, sort them in memory and write them to a separate file. Continue reading the main file from where we left off and repeat this process until the file is exhausted. Now we have multiple small files whose contents are individually sorted. To merge them into a single sorted file, create a read pointer to each small file, compare the lines at the head of each, and write the smallest to the final sorted file. The Python code below implements the stated algorithm and here is the GitHub link.
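As a rough sketch of that flow (the chunk size, the file names, and the use of heapq.merge for the final k-way merge are assumptions made for this illustration; the linked code is the authoritative implementation):

import heapq
import os

CHUNK_LINES = 1_000_000  # lines held in memory at a time (illustrative value)

def split_and_sort(input_path):
    # Read the big file serially, sort each chunk in memory, write it to its own file.
    chunk_paths = []
    with open(input_path) as big_file:
        while True:
            lines = [line for _, line in zip(range(CHUNK_LINES), big_file)]
            if not lines:
                break
            lines.sort()
            chunk_path = f"chunk_{len(chunk_paths)}.txt"
            with open(chunk_path, "w") as chunk:
                chunk.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths

def merge_chunks(chunk_paths, output_path):
    # Open a read pointer to every small file and merge them into one sorted file.
    files = [open(p) for p in chunk_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge compares the current head line of each file
            # and yields lines in globally sorted order, without loading everything.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)

if __name__ == "__main__":
    chunks = split_and_sort("big_input.txt")   # hypothetical input file
    merge_chunks(chunks, "sorted_output.txt")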