Speeding Up Grep Log Queries with GNU Parallel
Sometimes you come across a tool that everyone but you seems to have known about. I hit a wall recently when I wanted to query a massive 10GB text file for a list of terms kept in another file. Usually a simple grep command would do the trick, but I quickly learned the limitations of grep when I let the command run overnight and came back in the morning to my system still churning away.
Grep, in all its utility, has long been a powerful tool in the arsenal of many an IT professional, or anyone using a shell for that matter. Grep was created in 1973, before I was born, by Ken Thompson as an offshoot of the regular expression search in the "ed" editor. It is such an integral tool that it ships with pretty much every Unix-based system.
Being an elder utility, it is a little stuck in its ways about how it goes about its work, and although it gets the job done, it is not particularly efficient. Like many command line tools, grep was not designed to take advantage of processors with multiple cores. Back in the day there was only one core, that's the way it was, and we liked it!
Enter GNU Parallel, a shell tool designed for executing tasks in parallel using one or more computers. For my purposes I ran it on a single system, but wanted to take advantage of multiple cores. Having enough memory on my system, I read the entire massive file and piped it to GNU Parallel, along with another file, PATTERNFILE, containing the thousands of strings I wanted to search for:
cat BIGFILE | parallel --pipe grep -f PATTERNFILE
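A few of GNU Parallel's options can squeeze more out of the same idea. The block size and job count below are illustrative, not prescriptive, and the -F flag only applies if your patterns are plain strings rather than regexes; tune for your own hardware:

```shell
# Split stdin into ~10 MB chunks and run one grep per chunk,
# with one job per CPU core (-j).
# -F tells grep to treat each pattern as a fixed string, which is
# typically much faster than regex matching for plain search terms.
# -k keeps the output in the same order as the input blocks.
cat BIGFILE | parallel --pipe --block 10M -j "$(nproc)" -k grep -F -f PATTERNFILE
```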
A process that would have taken almost a day ran in a few hours. Almost immediately after I fired off the command, the fan in my laptop kicked into overdrive, a good sign that it was being put to work. To really leverage the power of the tool you can farm processes out to multiple systems, but for now I am just happy to be able to run shell commands using multiple cores.
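Farming the work out is a small step from the single-machine command. The hostnames below are placeholders, and this sketch assumes passwordless ssh to each box with GNU Parallel installed and PATTERNFILE present at the same path on every one:

```shell
# -S lists the workers; the special name ':' also includes the local machine.
# GNU Parallel ships each --pipe block over ssh to whichever worker is free.
cat BIGFILE | parallel -S server1,server2,: --pipe grep -f PATTERNFILE
```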