Compare Two Large Text Files by Key Columns

DONG Yuxuan https://www.dyx.name

29 Jun 2019 (+0800)

Background

There’re two text files storing records and a record is a line separated into columns by delimiters. Some columns are key columns which identify a record and the rest of the columns are called the value columns. We want to compare these two files to see which records are inserted, deleted and modified. An inserted record is a record that its key appears only in the second file. A deleted record is a record that its key appears only in the first file. A modified record is a record that its key appears in both files but value columns are different. The following example is two files recording systems with versions, and the system column (column 1) is the key column.

before.txt

macOS 10.12
Windows 10
Ubuntu 16

after.txt

macOS 10.14
Windows 10
Manjaro 17

In this example, Manjaro with version 17 is inserted, Ubuntu is deleted, and macOS has a version modification. A diff-like syntax could be used to describe it:

* macOS 10.12
> macOS 10.14
+ Manjaro 17
- Ubuntu 16

The problem is simple if files are small. All we need to do is loading files into a hash table in the memory and compare. However, Files can be too large to load into the memory, thus we need a fast algorithm which reads only a part of a file at one time.

Algorithm

The algorithm first runs external sort on files by key columns. This can be done on most UNIX systems by running the following commands.

$ sort -k1,1 before.txt > sorted_before.txt
$ sort -k1,1 after.txt > sorted_after.txt

After the external sorting, the algorithm reads two sorted files line by line to run the following iteration.

If there’re remaining records in the first file after the iteration, the algorithm asserts all these records are deleted. If there’re remaining records in the second file after the iteration, the algorithm asserts all these records are inserted.

Analysis

The iteration is very similar to the merging algorithm in mergesort and obviously has the time complexity of \(\Theta(n)\) where \(n\) is the number of lines. External sort can be done in \(\Theta(n\log n)\). Thus the time complexity of the algorithm is \(\Theta(n\log n)\).

Implementation

Based on the algorithm, a UNIX utility KVDiff is developed. It can be installed by running pip install kvdiff and the usage can be got by running kvdiff ---help.