
Linux Shell Text Processing Tutorial: grep, cat, awk, sort, uniq, find, xargs, …

Xah Lee

This page is a basic tutorial on the Linux shell's text processing tools, for example grep, cat, awk, sort, and uniq. They are especially useful for processing text line by line.

Get Lines: grep

grep is the most important command. You should master it.

How to show only the lines that contain a text pattern?

Use grep. Example: grep 'xyz' myFile will print only lines containing the text “xyz” in file named “myFile”. grep 'xyz' *html will apply grep to all files whose name ends with “html”.
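A minimal sketch of the basic usage, assuming a throwaway sample file (the file content and the variable names are made up for illustration). It also uses -c, a standard grep option that prints the number of matching lines instead of the lines themselves.

```shell
# Create a temporary directory with a hypothetical sample file.
demo_dir=$(mktemp -d)
printf 'abc\nxyz one\ntwo xyz\nnothing\n' > "$demo_dir/myFile"

# Print only the lines that contain "xyz".
matches=$(grep 'xyz' "$demo_dir/myFile")
echo "$matches"

# -c prints how many lines matched instead of the lines.
match_count=$(grep -c 'xyz' "$demo_dir/myFile")
echo "$match_count"

rm -r "$demo_dir"
```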

How to use grep for all files in a dir?

Use -r to search a directory and all its subdirectories. Use --include='*html' to restrict the search to files whose names match the pattern. Example:

grep -r 'xyz' --include='*html' ‹dirname›

This will apply grep to all files ending in “html” in a directory ‹dirname› and all subdirectories.
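The following sketch tries the recursive search on a small made-up directory tree (all file names here are hypothetical). It adds -l, which prints only the names of files that contain a match, so the effect of --include is easy to see.

```shell
# Build a small hypothetical directory tree.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf 'xyz in top page\n' > "$demo_dir/index.html"
printf 'xyz in nested page\n' > "$demo_dir/sub/page.html"
printf 'xyz in a text file\n' > "$demo_dir/notes.txt"

# -r descends into subdirectories.
# --include limits the search to file names matching the glob.
# -l prints only the names of matching files.
found=$(grep -r -l --include='*html' 'xyz' "$demo_dir" | sort)
echo "$found"

rm -r "$demo_dir"
```

The file notes.txt also contains “xyz”, but it is skipped because its name does not end in “html”.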

How to use grep for exact string? (that is, how to turn off regex.)

Use the option -F. Example:

# search perl source files for the string “href\s*=\s*"([^"]+)".*>” literally
grep -F 'href\s*=\s*"([^"]+)".*>' *pl

This is useful when you want to search for a complicated string in source code.
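To see what -F changes, here is a small sketch with a made-up file. The same search string is used twice, once as a regex and once as a literal string.

```shell
# Hypothetical sample file with three similar lines.
demo_dir=$(mktemp -d)
printf 'a.c\nabc\naxc\n' > "$demo_dir/names.txt"

# As a regex, "." matches any character, so all three lines match.
regex_count=$(grep -c 'a.c' "$demo_dir/names.txt")

# With -F the dot is literal, so only the line "a.c" matches.
fixed_count=$(grep -c -F 'a.c' "$demo_dir/names.txt")

echo "regex: $regex_count, fixed: $fixed_count"

rm -r "$demo_dir"
```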

If your string is really complicated, you can put it in a file, and use the option --file=‹pattern filename› for the search text. Example:

# search emacs lisp source code in current dir and all subdirs. The search pattern is stored in file named myPattern.txt
grep -r --file=myPattern.txt --include='*el' .
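A sketch of the pattern-file approach, assuming a made-up Emacs Lisp file and pattern. Each line of the pattern file is one pattern. Combining --file with -F makes grep take each line literally, which is likely what you want when the string is full of regex characters.

```shell
# Hypothetical source file and pattern file.
demo_dir=$(mktemp -d)
printf '(setq x "a.*b")\n(setq y "ab")\n' > "$demo_dir/init.el"
printf '%s\n' '"a.*b"' > "$demo_dir/myPattern.txt"

# Without -F the pattern is a regex and matches both lines.
regex_hits=$(grep -r --file="$demo_dir/myPattern.txt" --include='*el' "$demo_dir" | wc -l | tr -d ' ')

# With -F the pattern is a literal string and matches only the first line.
fixed_hits=$(grep -r -F --file="$demo_dir/myPattern.txt" --include='*el' "$demo_dir" | wc -l | tr -d ' ')

echo "regex: $regex_hits, fixed: $fixed_hits"

rm -r "$demo_dir"
```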

Most Useful Grep Options

Options for Pattern String

Examples:

# print lines of log files not matching a string.
grep -v -F 'html HTTP' *log
# print lines containing either “png HTTP” or “jpg HTTP”
grep -P 'png HTTP|jpg HTTP' *log
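The two options can be tried on a tiny made-up log file. This sketch uses a simplified, hypothetical log format, and it uses -E (extended regex) in place of -P, since plain alternation with “|” does not need Perl regex and -E is available in every grep.

```shell
# Hypothetical, simplified log file.
demo_dir=$(mktemp -d)
cat > "$demo_dir/access.log" <<'EOF'
GET /a.html HTTP/1.1
GET /b.png HTTP/1.1
GET /c.jpg HTTP/1.1
GET /d.css HTTP/1.1
EOF

# -v inverts the match: print lines that do NOT contain "html HTTP".
not_html=$(grep -v -F 'html HTTP' "$demo_dir/access.log")
echo "$not_html"

# -E enables "|" as alternation: print lines containing either string.
images=$(grep -E 'png HTTP|jpg HTTP' "$demo_dir/access.log")
echo "$images"

rm -r "$demo_dir"
```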

Options for File Selection

Output Options

More Grep Examples

# print lines containing “html HTTP” in a log file, show only the 12th and 7th columns, keep only lines mentioning certain sites, then sort, then condense repetition with a count, then sort that by the count.

grep 'html HTTP' apache.log | awk '{print $12 , $7}' | grep -i -P "livejournal|blogspot" | sort | uniq -c | sort -n
# print all links in all html files of a dir, except certain links. Output to xx.txt

grep -r --include='*html' -F 'http://' ~/web | grep -v -P 'gaq.push|www.google.com|googlesyndication.com|twitter.com|apis.google.com|www.reddit.com/submit|xahlee.disqus.com|amazon_ad_tag|wikipedia.org/wiki|www.youtube.com/embed|www.assoc-amazon.com|maps.google.com/maps|class="amz"|class="deadurl"|GitCafe 中文|xahlee.blogspot.com|class="sorc"|xahlee.info|xahlee.org|ergoemacs.org|wordyenglish.com|xahmusic.org|xahsl.org' > xx.txt

text columns, awk, sort, unique, sum column …

How to show only nth column in a text file?

# print the 7th column. (columns are separated by spaces by default.)
awk '{print $7}' myFile

For a delimiter other than space, for example tab, use the 「-F」 option. Example:

# print the 12th and 7th columns, Tab is the separator
awk -F'\t' '{print $12 , $7}' myFile

An alternative is the “cut” utility, but it does not accept a regex as delimiter. So, if your columns are separated by a varying number of spaces, “cut” cannot split them correctly.
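The difference between awk and cut shows up as soon as columns are padded with more than one space. A small sketch with made-up input lines:

```shell
# A hypothetical line whose columns are separated by runs of spaces.
line='alpha    beta  gamma'

# awk treats any run of whitespace as one separator.
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

# cut treats every single space as a separator,
# so its field 2 is the empty string between the first two spaces.
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

echo "awk: [$awk_field] cut: [$cut_field]"

# Tab-separated input: quote the separator so the shell passes \t to awk as is.
tab_fields=$(printf 'one\ttwo\tthree\n' | awk -F'\t' '{print $3, $1}')
echo "$tab_fields"
```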

How to show only unique lines in a file?

sort myFile | uniq. To prepend the line with a count of repetition, use sort myFile | uniq -c
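The sort step is needed because uniq only collapses duplicate lines that are next to each other. A sketch with made-up input that shows this, and the count-then-sort-by-count idiom:

```shell
# Without sort, the two "a" lines are not adjacent, so uniq keeps both.
adjacent_only=$(printf 'a\nb\na\n' | uniq | wc -l | tr -d ' ')

# With sort first, duplicates become adjacent and collapse.
sorted_first=$(printf 'a\nb\na\n' | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $adjacent_only lines, with sort: $sorted_first lines"

# Count repetitions, then order by the count (smallest first).
# The final awk only strips the padding that uniq -c puts before each count.
counts=$(printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -n | awk '{print $1, $2}')
echo "$counts"
```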

How to sum up the 2nd column in a file?

awk '{sum += $2} END {print sum}' myFile.
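A quick check of the sum on made-up data. The second command is a small variation, not part of the original one-liner: printing “sum + 0” forces a number, so empty input prints 0 instead of an empty line.

```shell
# Sum the second column of three hypothetical lines.
total=$(printf 'apple 3\npear 4.5\nfig 10\n' | awk '{sum += $2} END {print sum}')
echo "$total"

# With no input lines, sum is never set; "sum + 0" prints 0 in that case.
empty_total=$(printf '' | awk '{sum += $2} END {print sum + 0}')
echo "$empty_total"
```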

How to show only first few lines of a huge file?

head myFile. If you want to see first n lines, use head -n 100 myFile. If you want to see the bottom of a file, use “tail”.
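A short sketch of head and tail, assuming a made-up file of 1000 numbered lines generated with seq:

```shell
# Create a hypothetical file with the numbers 1 to 1000, one per line.
demo_dir=$(mktemp -d)
seq 1 1000 > "$demo_dir/huge.txt"

# First 3 lines.
first_three=$(head -n 3 "$demo_dir/huge.txt")
echo "$first_three"

# Last 2 lines.
last_two=$(tail -n 2 "$demo_dir/huge.txt")
echo "$last_two"

rm -r "$demo_dir"
```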

For complex text processing, you need a full language. See: Perl & Python Tutorial, Emacs Lisp Tutorial.

sort, by string, by number, by field

See: Linux: sort Examples.

Processing Multiple Files

How to list only files whose name matches a text pattern?

find myDir -name "*.html" will show just files ending with “.html”.

How to list only files larger than n bytes?

find myDir -size +900000c will list files in “myDir” larger than 900000 bytes (about 900 kilobytes). The “c” suffix means the size is given in bytes.

To list files smaller than a given size, use a minus sign “-” instead of the plus. To list files of exactly a given size, don't use the plus or minus.
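The three forms of -size can be compared on two made-up files of known size (2000 bytes and 10 bytes). The file names and the 1000 byte threshold are only for illustration.

```shell
# Create two hypothetical files of known size.
demo_dir=$(mktemp -d)
head -c 2000 /dev/zero > "$demo_dir/big.bin"
printf '0123456789' > "$demo_dir/small.bin"

# Larger than 1000 bytes.
larger=$(find "$demo_dir" -type f -size +1000c)
# Smaller than 1000 bytes.
smaller=$(find "$demo_dir" -type f -size -1000c)
# Exactly 10 bytes.
exact=$(find "$demo_dir" -type f -size 10c)

echo "larger:  $(basename "$larger")"
echo "smaller: $(basename "$smaller")"
echo "exact:   $(basename "$exact")"

rm -r "$demo_dir"
```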

How to delete all files whose name matches a text pattern?

# delete all files whose name ends with “~”.
find . -name "*~" -exec rm {} \;

How to delete empty files?

# list all empty files
find . -type f -empty

# delete all empty files
find . -type f -empty -exec rm {} \;

How to delete all empty dirs?

# list all empty dirs
find . -depth -empty -type d

# delete all empty dirs
find . -depth -empty -type d -exec rmdir {} ';' 
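Since deletion cannot be undone, it may help to try these commands on a scratch directory first. This sketch builds a made-up tree with one empty file and one empty dir, runs the delete commands, then lists what is left.

```shell
# Build a hypothetical tree: one empty file, one empty dir, one dir with content.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/empty_dir" "$demo_dir/full_dir"
: > "$demo_dir/empty.txt"
printf 'data\n' > "$demo_dir/full_dir/keep.txt"

# List first, to see what would be removed.
find "$demo_dir" -type f -empty
find "$demo_dir" -depth -type d -empty

# Delete empty files, then empty dirs.
find "$demo_dir" -type f -empty -exec rm {} \;
find "$demo_dir" -depth -type d -empty -exec rmdir {} \;

# Show what remains, as paths relative to the scratch directory.
remaining=$(cd "$demo_dir" && find . -mindepth 1 | sort)
echo "$remaining"

rm -r "$demo_dir"
```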

Using “find” with “xargs”

How to use “find” on file names that may contain spaces or dashes?

# print file names that may contain spaces
find . -print0 | xargs -0 -l -i echo "{}";

The “-print0” tells “find” to print the file names separated by a null char (ASCII 0), as opposed to the newline char used by “-print”. The “-0” tells xargs to parse input using the null char as separator and to take any special char in a file name literally.

The “-l” tells “xargs” to pass just one file name at a time. The “-i” allows you to use “{}” as the file name. xargs substitutes the whole file name for “{}” as a single argument, so “echo” (or another program) will see it as one argument instead of several. The shell removes the quotes in “"{}"” before xargs runs, so they are harmless but optional. (Note: the “-i” must come after “-l”. In current GNU xargs both options are deprecated; “-I {}” is the documented replacement, and it already passes one input item per command.)
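A sketch that shows why the null separator matters, assuming a scratch directory with two made-up file names, one containing a space and one starting with a dash. It uses “-I {}”, the non-deprecated GNU xargs spelling of the “-l -i” pair.

```shell
# Two hypothetical files with awkward names.
demo_dir=$(mktemp -d)
: > "$demo_dir/my file.txt"
: > "$demo_dir/-dash.txt"

# Plain xargs splits input on whitespace, so "my file.txt" becomes two arguments.
naive_count=$(cd "$demo_dir" && find . -type f | xargs -n 1 echo | wc -l | tr -d ' ')

# With -print0 and -0, each file name stays one argument.
safe_count=$(cd "$demo_dir" && find . -type f -print0 | xargs -0 -I {} echo {} | wc -l | tr -d ' ')

echo "naive: $naive_count arguments, safe: $safe_count arguments"

rm -r "$demo_dir"
```

Because find prints each name with its leading path (here “./-dash.txt”), a name that starts with a dash is not mistaken for an option by the command that receives it.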

# convert all bmp files to png in a dir and its subdirs. Requires “convert” from ImageMagick
# “${1%.bmp}” strips the “.bmp” suffix and keeps the directory part of the path
find . -name "*.bmp" -print0 | xargs -0 -I {} sh -c 'convert "$1" "${1%.bmp}.png"' _ {}

Use GNU Parallel for xargs

Note: a modern replacement for xargs is GNU Parallel. The syntax is almost identical to xargs, except it runs the commands in parallel. It also doesn't have problems with file names containing quotes or apostrophes.

Thanks to Ole Tange for telling me about GNU Parallel. (Ole is the author)
