Linux: Text Processing: grep, cat, awk, uniq

By Xah Lee.

This page is a basic tutorial on using the Linux shell's text processing tools. They are especially useful for processing text line by line.

Get Lines: grep

grep is the most important command. You should master it.

Show Matching Lines

# show lines containing xyz in myFile
grep 'xyz' myFile
# show lines containing xyz in all files whose name ends in html, in the top level of the current dir
grep 'xyz' *html

Grep for All Files in a Dir

# show matching lines in dir and subdir, file name ending in html
grep -r 'xyz' --include='*html' ~/web

Here's what the options mean:

-r
Search recursively, including all subdirectories.
--include='*html'
Search only files whose name matches a glob pattern. (* is a wildcard that matches zero or more of any character.)
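
To see the two options working together, here is a small sketch, assuming GNU grep. The temp directory and the file names in it are made up for the demo.

```shell
# Hypothetical sandbox: a temp dir with two html files and one txt file.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf 'abc xyz\n' > "$demo_dir/index.html"
printf 'xyz again\n' > "$demo_dir/sub/page.html"
printf 'xyz in a text file\n' > "$demo_dir/notes.txt"

# -r descends into sub/; --include keeps only file names matching the glob,
# so notes.txt is skipped even though it contains xyz.
matched_lines=$(grep -r 'xyz' --include='*html' "$demo_dir")
printf '%s\n' "$matched_lines"

# number of matching lines found
match_count=$(printf '%s\n' "$matched_lines" | wc -l)
echo "$match_count"

rm -rf "$demo_dir"
```

Each output line is prefixed with the path of the file it came from, because more than one file was searched.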

grep without regex

Use the option -F. (F means “Fixed string”)

# search Ruby source files for lines that contain .* literally
grep -F '.*' *rb

This is useful when you want to search for a complicated string in source code, such as *@$.*#+-/\|`.

If your string is really complicated, you can put it in a file, and use the option --file=my_pattern_filename for the search text. Example:

# search js source code in dir and all subdirs. The pattern is stored in a file named myPattern.txt
grep -r --file=myPattern.txt --include='*js' .
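
Here is a sketch of the pattern-file approach combined with -F, assuming GNU grep. The file app.js and its content are made up for the demo. grep reads one pattern per line from the pattern file.

```shell
# Hypothetical sandbox with one js file and the pattern file.
work=$(mktemp -d)
cd "$work" || exit 1

# Single quotes stop the shell from touching the special characters.
printf '%s\n' '*@$.*#+-/\|`' > myPattern.txt
printf '%s\n' 'var a = "*@$.*#+-/\|`";' 'var b = 1;' > app.js

# -F makes grep treat the line in myPattern.txt as a fixed string, not a regex.
# --include='*js' also keeps myPattern.txt itself out of the search.
hits=$(grep -r -F --file=myPattern.txt --include='*js' .)
printf '%s\n' "$hits"

cd - > /dev/null && rm -rf "$work"
```

Putting the string in a file means you never have to work out how to quote it on the command line.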

Most Useful Grep Options

Options for Pattern String

-F
Use fixed string. (no regex)
-P
Use Perl's regex syntax. (Perl and Python's regex are basically compatible.)
-i
Ignore case.
-v
Print lines NOT containing the pattern.

Examples:

# print lines not matching a string, for all files ending in “log”
grep -v 'html HTTP' *log
# print lines containing “png HTTP” or “jpg HTTP”
grep -P 'png HTTP|jpg HTTP' *log
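
To try these options without a real log, here is a small sketch on a made-up 3-line log. The log lines are invented for the demo, and -P needs GNU grep.

```shell
# Hypothetical log file, invented for this demo.
log=$(mktemp)
printf '%s\n' \
  'GET /a.html HTTP/1.1' \
  'GET /b.PNG HTTP/1.1' \
  'GET /c.jpg HTTP/1.1' > "$log"

# -v: keep lines that do NOT contain the string (the PNG and jpg lines)
not_html=$(grep -v 'html HTTP' "$log")

# -i: ignore case, so the pattern png also matches PNG
png_any_case=$(grep -i 'png HTTP' "$log")

# -P: Perl regex, where | means "or". Combined with -i here.
images=$(grep -i -P 'png HTTP|jpg HTTP' "$log")

printf '%s\n' "$not_html" "--" "$png_any_case" "--" "$images"
rm -f "$log"
```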

Options for File Selection

Output Options

-H
Include file name in the result.
-h
Do NOT print file name.
-l
Print just file name; do NOT print the matched lines.
-L
Print just file name that does NOT match.
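
A quick sketch of the four output options side by side. The files a.txt and b.txt are made up for the demo.

```shell
# Hypothetical sandbox: a.txt contains xyz, b.txt does not.
d=$(mktemp -d)
cd "$d" || exit 1
printf 'xyz\n' > a.txt
printf 'nothing here\n' > b.txt

# -H: prefix the file name, even when only one file is searched
with_name=$(grep -H 'xyz' a.txt)              # a.txt:xyz

# -h: no file name, even when several files are searched
without_name=$(grep -h 'xyz' a.txt b.txt)     # xyz

# -l: names of files that contain a match
matching_files=$(grep -l 'xyz' a.txt b.txt)   # a.txt

# -L: names of files that contain no match
non_matching=$(grep -L 'xyz' a.txt b.txt)     # b.txt

printf '%s\n' "$with_name" "$without_name" "$matching_files" "$non_matching"
cd - > /dev/null && rm -rf "$d"
```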

More Grep Examples

# print lines containing “html HTTP” in a log file, show only the 12th and 7th columns, keep only certain lines, then sort, then condense repetition with a count, then sort that by the count.
grep 'html HTTP' apache.log | awk '{print $12 , $7}' | grep -i -P "livejournal|blogspot" | sort | uniq -c | sort -n

# print all lines containing http:// links in all html files of a dir, except certain links. Output to xx.txt
grep -r --include='*html' -F 'http://' ~/web | grep -v -P 'google.com|twitter.com|reddit.com|wikipedia.org' > xx.txt

text columns, awk, sort, unique, sum column …

show only nth column in a text file

# print the 7th column. (columns are separated by runs of spaces or tabs by default.)
cat myFile | awk '{print $7}'

For a delimiter other than space, for example tab, use the -F option. Example:

# print 12th and 7th column, Tab is the separator
cat myFile | awk -F'\t' '{print $12 , $7}'

An alternative is the cut utility, but it does not accept a regex as delimiter. So, if your columns are separated by a varying number of spaces, “cut” cannot handle it.
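
The difference is easy to see on a line with uneven spacing. A small sketch, using a made-up input line:

```shell
# Made-up input: three spaces between the first two words, one before the third.
line='alpha   beta gamma'

# awk treats any run of spaces as a single separator, so column 2 is beta
awk_col2=$(printf '%s\n' "$line" | awk '{print $2}')

# cut splits at every single space, so field 2 is the empty string
# between the first and second space
cut_col2=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# with a Tab separator, quote the \t so the shell leaves the backslash alone
tab_col2=$(printf 'a b\tc d\te f\n' | awk -F'\t' '{print $2}')

printf '[%s] [%s] [%s]\n' "$awk_col2" "$cut_col2" "$tab_col2"
# prints: [beta] [] [c d]
```

cut is still fine when the file uses exactly one delimiter character between columns, such as a tab or a comma.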

remove duplicate lines

sort -u myFile

or

sort myFile | uniq

To prefix each line with the number of times it occurs, use sort myFile | uniq -c
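
The sort before uniq matters, because uniq only collapses duplicate lines that are next to each other. A small sketch on made-up input:

```shell
# Without sort, the two "a" lines are not adjacent, so uniq keeps both.
unsorted_count=$(printf '%s\n' a b a | uniq | wc -l)          # 3 lines
sorted_count=$(printf '%s\n' a b a | sort | uniq | wc -l)     # 2 lines

# Count occurrences, then order by count, smallest first.
# The final awk only normalizes the padding that uniq -c puts before the count.
counts=$(printf '%s\n' b a b c b a | sort | uniq -c | sort -n | awk '{print $1, $2}')
printf '%s\n' "$counts"
# prints:
# 1 c
# 2 a
# 3 b
```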

sum up 2nd column

awk '{sum += $2} END {print sum}' filename
Sum the 2nd column in a file.
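
A worked example on made-up data, to show what the awk program does. The block in braces runs once per line and adds column 2 to sum; the END block runs once after the last line.

```shell
# Made-up data: a name in column 1, a number in column 2.
total=$(printf '%s\n' 'apple 3' 'pear 4' 'fig 5' | awk '{sum += $2} END {print sum}')
echo "$total"
# prints: 12

# Same idea, also dividing by NR (the number of lines read) for the average.
average=$(printf '%s\n' 'apple 3' 'pear 4' 'fig 5' | awk '{sum += $2} END {print sum / NR}')
echo "$average"
# prints: 4
```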

show only first few lines of a huge file

head filename
Show the first 10 lines of a file (the default).
head -n 100 filename
Show first 100 lines of a file.
tail filename
Show the last 10 lines of a file (the default).
tail -n 100 filename
Show last 100 lines of a file.
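
A quick way to check these on a file where the line number is the line content. This sketch assumes the seq command is available (it is on typical Linux systems).

```shell
# Made-up test file: 1000 lines containing the numbers 1 to 1000.
f=$(mktemp)
seq 1 1000 > "$f"

# with no -n option, head prints 10 lines
default_count=$(head "$f" | wc -l)

first3=$(head -n 3 "$f")   # lines 1, 2, 3
last3=$(tail -n 3 "$f")    # lines 998, 999, 1000

printf '%s\n' "$first3" "..." "$last3"
rm -f "$f"
```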

Sort by Number, by Field

Linux: Sort Lines

Processing Multiple Files

Linux: Traverse Directory: find, xargs

Count Char, Word, Lines

wc
Count the number of chars, words, and lines. Useful at the end of a pipe from cat or grep.
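
A small sketch of wc on a made-up 2-line file. Note that -c counts bytes, which is the same as chars only for plain ASCII text.

```shell
# Made-up file: "one two" on line 1, "three" on line 2.
f=$(mktemp)
printf 'one two\nthree\n' > "$f"

# Reading from stdin (< file) keeps the file name out of the output.
line_count=$(wc -l < "$f")   # 2
word_count=$(wc -w < "$f")   # 3
byte_count=$(wc -c < "$f")   # 14: 8 bytes for "one two\n" plus 6 for "three\n"

# At the end of a pipe: how many lines contain the letter o?
lines_with_o=$(grep 'o' "$f" | wc -l)   # 1, only "one two"

echo "$line_count $word_count $byte_count $lines_with_o"
rm -f "$f"
```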