This page is a basic tutorial on the Linux shell's text processing tools, such as grep, cat, awk, sort, and uniq. They are especially useful for processing text line by line.
grep is the most important command. You should master it.
How to show only the lines that contain a text pattern?
Use grep. Example: grep 'xyz' myFile will print only lines containing the text “xyz” in the file named “myFile”. grep 'xyz' *html will apply grep to all files whose name ends with “html”.
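To try this out without touching real files, here is a minimal sketch. The temporary directory and the sample lines are made up for illustration.

```shell
# Hypothetical sample file, created in a temporary directory.
tmp=$(mktemp -d)
printf 'abc\nxyz one\nno match\ntwo xyz\n' > "$tmp/myFile"

# Print only the lines containing "xyz".
matches=$(grep 'xyz' "$tmp/myFile")
echo "$matches"

rm -r "$tmp"
```

Only the two lines that contain “xyz” are printed; the other two are dropped.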
How to use grep for all files in a dir?
Use -r for all subdirectories. Use --include='*html' to match file name. Example:
grep -r 'xyz' --include='*html' ‹dirname›
This will apply grep to all files ending in “html” in a directory ‹dirname› and all subdirectories.
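A small sketch of the recursive search, assuming a made-up directory named “site” with one subdirectory. It adds the -l option so that only the names of matching files are printed.

```shell
# Hypothetical directory tree for illustration.
tmp=$(mktemp -d)
mkdir -p "$tmp/site/sub"
printf 'has xyz\n'     > "$tmp/site/a.html"
printf 'also xyz\n'    > "$tmp/site/sub/b.html"
printf 'xyz in text\n' > "$tmp/site/sub/c.txt"

# -r descends into subdirectories, --include limits the search to
# file names ending in "html", -l prints file names instead of lines.
found=$(cd "$tmp" && grep -r -l 'xyz' --include='*html' site | LC_ALL=C sort)
echo "$found"

rm -r "$tmp"
```

The file “c.txt” also contains “xyz”, but it is skipped because its name does not match the --include pattern.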
How to use grep for exact string? (that is, how to turn off regex.)
Use the option -F. Example:
# search perl source files for the string “href\s*=\s*"([^"]+)".*>” literally
grep -F 'href\s*=\s*"([^"]+)".*>' *pl
This is useful when you want to search for a complicated string in source code.
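To see the difference -F makes, here is a minimal sketch with a made-up two-line file. It uses grep's -c option, which prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample file: one line with a literal dot, one without.
tmp=$(mktemp -d)
printf 'a.b\naxb\n' > "$tmp/sample.txt"

# As a regex, "." matches any character, so both lines match.
regex_count=$(grep -c 'a.b' "$tmp/sample.txt")

# With -F the dot is a literal dot, so only "a.b" matches.
fixed_count=$(grep -c -F 'a.b' "$tmp/sample.txt")

echo "regex: $regex_count, fixed: $fixed_count"
rm -r "$tmp"
```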
If your string is really complicated, you can put it in a file, and use the option --file=‹pattern filename› for the search text. Example:
# search emacs lisp source code in dir and all subdirs. The search pattern is stored in file named myPattern.txt
grep -r --file=myPattern.txt --include='*el' .
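Here is a sketch of the pattern-file approach combined with -F. The file names and the sample Perl line are made up for illustration.

```shell
tmp=$(mktemp -d)

# Store the search string in a file. printf '%s\n' writes it unchanged,
# so the backslashes and quotes need no further escaping.
printf '%s\n' 'href\s*=\s*"([^"]+)"' > "$tmp/myPattern.txt"

# Hypothetical source file that contains the string literally.
printf '%s\n' 'my $re = qr/href\s*=\s*"([^"]+)"/;' 'print "hello";' > "$tmp/a.pl"

# Each line of the pattern file is one pattern; -F makes it a fixed string.
hits=$(grep -F --file="$tmp/myPattern.txt" "$tmp/a.pl")
echo "$hits"

rm -r "$tmp"
```

Only the first line of “a.pl” is printed, because only that line contains the stored string.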
-F = use fixed string. (not regex)
-P = use Perl's regex syntax. (Perl and Python's regex are basically compatible.)
-i = ignore case.
-v = print lines NOT containing the pattern.
Examples:
# print lines of log files not matching a string.
grep -v -F 'html HTTP' *log
# print lines containing either “png HTTP” or “jpg HTTP”
grep -P 'png HTTP|jpg HTTP' *log
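The two examples can be tried on a small made-up log file. This sketch swaps -P for -E (extended regex), which handles the same “|” alternation and is available in grep builds that lack Perl regex support. The log lines are hypothetical.

```shell
# Hypothetical four-line access log.
tmp=$(mktemp -d)
printf '%s\n' \
  'GET /a.html HTTP/1.1' \
  'GET /b.png HTTP/1.1' \
  'GET /c.jpg HTTP/1.1' \
  'GET /d.css HTTP/1.1' > "$tmp/access.log"

# -v inverts the match; -c counts lines instead of printing them.
not_html=$(grep -c -v -F 'html HTTP' "$tmp/access.log")

# Lines containing either "png HTTP" or "jpg HTTP".
images=$(grep -E 'png HTTP|jpg HTTP' "$tmp/access.log")

echo "$not_html"
echo "$images"
rm -r "$tmp"
```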
*.html = search all files ending in “.html”, in current dir. (files in subdir are ignored)
grep -r --include='*html' ‹pattern› ‹dirname› = search files for ‹pattern› in ‹dirname› including subdirs, but only files ending in “html”.
-H = include file name in the result.
-h = do NOT print file name.
-l = print just file name; do NOT print the matched lines.
-L = print just the names of files that do NOT contain a match.
# print lines containing “html HTTP” in a log file, show only the 12th and 7th columns, show only certain lines, then sort, then condense repetition with count, then sort that by the count.
grep 'html HTTP' apache.log | awk '{print $12 , $7}' | grep -i -P "livejournal|blogspot" | sort | uniq -c | sort -n
# print all links in all html files of a dir, except certain links. Output to xx.txt
grep -r --include='*html' -F 'http://' ~/web | grep -v -P 'gaq.push|www.google.com|googlesyndication.com|twitter.com|apis.google.com|www.reddit.com/submit|xahlee.disqus.com|amazon_ad_tag|wikipedia.org/wiki|www.youtube.com/embed|www.assoc-amazon.com|maps.google.com/maps|class="amz"|class="deadurl"|GitCafe 中文|xahlee.blogspot.com|class="sorc"|xahlee.info|xahlee.org|ergoemacs.org|wordyenglish.com|xahmusic.org|xahsl.org' > xx.txt
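The filter, sort, count, and rank pattern used in the log example can be sketched on a tiny made-up file. The two-column layout here is a simplification and not the real Apache log format, and -E stands in for -P.

```shell
# Hypothetical simplified log: column 1 is a referrer name, column 2 a page.
tmp=$(mktemp -d)
printf '%s\n' \
  'blogspot /a.html' \
  'livejournal /b.html' \
  'blogspot /a.html' \
  'example /c.html' \
  'blogspot /a.html' > "$tmp/mini.log"

# Keep matching lines, group identical lines, count them, rank by count.
report=$(grep 'html' "$tmp/mini.log" \
  | grep -i -E 'livejournal|blogspot' \
  | sort | uniq -c | sort -n)
echo "$report"

# The last line of the report is the most frequent entry.
top=$(echo "$report" | tail -n 1 | awk '{print $1, $2}')
echo "$top"

rm -r "$tmp"
```

The first sort is needed because uniq only merges identical lines that are next to each other.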
How to show only nth column in a text file?
# print the 7th column. (columns are separated by spaces by default.)
awk '{print $7}' myFile
For a delimiter other than space, for example tab, use the 「-F」 option. Example:
# print the 12th and 7th columns. Tab is the separator. (the quotes are needed, otherwise the shell turns \t into a plain “t”)
awk -F'\t' '{print $12 , $7}' myFile
An alternative is the “cut” utility, but it does not accept a regex as the delimiter. So, if your columns are separated by a varying number of spaces, “cut” cannot split them correctly.
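Here is a minimal sketch comparing awk and cut on a line with uneven spacing, plus a tab-separated case. The sample files are made up.

```shell
tmp=$(mktemp -d)
printf 'a   b  c\n'   > "$tmp/spaces.txt"   # columns separated by runs of spaces
printf 'x y\tz w\n'   > "$tmp/tabs.txt"     # two columns separated by one tab

# awk treats any run of spaces as one separator, so column 2 is "b".
awk_col=$(awk '{print $2}' "$tmp/spaces.txt")

# cut treats every single space as a separator, so field 2 is empty here.
cut_col=$(cut -d ' ' -f 2 "$tmp/spaces.txt")

# With tab as the separator, spaces inside a column are kept.
tab_col=$(awk -F'\t' '{print $2}' "$tmp/tabs.txt")

echo "awk: [$awk_col] cut: [$cut_col] tab: [$tab_col]"
rm -r "$tmp"
```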
How to show only unique lines in a file?
Use sort myFile | uniq. To prepend each line with a count of its repetitions, use sort myFile | uniq -c
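The sort step matters because uniq only removes duplicates that are adjacent. A small sketch with a made-up three-line file:

```shell
tmp=$(mktemp -d)
printf 'b\na\nb\n' > "$tmp/myFile"

# Without sort, the two "b" lines are not adjacent, so nothing is removed.
without_sort=$(uniq "$tmp/myFile" | awk 'END {print NR}')

# With sort, duplicates become adjacent and are merged.
with_sort=$(sort "$tmp/myFile" | uniq | awk 'END {print NR}')

# uniq -c pads the count with spaces; awk tidies it to "count:line".
counts=$(sort "$tmp/myFile" | uniq -c | awk '{print $1 ":" $2}')

echo "$without_sort $with_sort"
echo "$counts"
rm -r "$tmp"
```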
How to sum up the 2nd column in a file?
awk '{sum += $2} END {print sum}' myFile
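A quick way to check the sum on known numbers, using a made-up file:

```shell
tmp=$(mktemp -d)
printf 'apple 3\npear 4\nplum 5\n' > "$tmp/myFile"

# The first block runs once per line; the END block runs after the last line.
total=$(awk '{sum += $2} END {print sum}' "$tmp/myFile")
echo "$total"

rm -r "$tmp"
```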
How to show only first few lines of a huge file?
Use head myFile, which shows the first 10 lines. If you want to see the first n lines, give the count with -n, for example head -n 100 myFile. If you want to see the bottom of a file, use “tail” in the same way.
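A small sketch of head and tail on a generated file. The file holds the numbers 1 to 1000, one per line, so the output is easy to check.

```shell
tmp=$(mktemp -d)

# Generate a 1000-line file for illustration.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp/big.txt"

first=$(head -n 3 "$tmp/big.txt")   # the first 3 lines
last=$(tail -n 2 "$tmp/big.txt")    # the last 2 lines

echo "$first"
echo "$last"
rm -r "$tmp"
```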
For complex text processing, you need a full language. See: Perl & Python Tutorial ◇ Emacs Lisp Tutorial.
See: Linux: sort Examples.
How to list only files whose name matches a text pattern?
find myDir -name "*.html" will show just files ending with “.html”.
How to list only files larger than n bytes?
find myDir -size +900000c will list files in “myDir” larger than 900,000 bytes (about 0.9 megabytes). The “c” suffix means the size is given in bytes.
To list files smaller than a given size, use a minus sign “-” instead of the plus. To list files of exactly a given size, don't use the plus or minus.
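Here is a sketch of the three forms on files of known size. The file names and sizes are made up; head -c is used to write an exact number of bytes.

```shell
tmp=$(mktemp -d)

# head -c N /dev/zero writes exactly N zero bytes.
head -c 100  /dev/zero > "$tmp/small.bin"
head -c 5000 /dev/zero > "$tmp/big.bin"

larger=$(cd "$tmp" && find . -type f -size +1000c)   # more than 1000 bytes
smaller=$(cd "$tmp" && find . -type f -size -1000c)  # fewer than 1000 bytes
exact=$(cd "$tmp" && find . -type f -size 100c)      # exactly 100 bytes

echo "larger: $larger, smaller: $smaller, exact: $exact"
rm -r "$tmp"
```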
How to delete all files whose name matches a text pattern?
# delete all files whose name ends with “~”.
find . -name "*~" -exec rm {} \;
How to delete empty files?
# list all empty files
find . -type f -empty
# delete all empty files
find . -type f -empty -exec rm {} \;
How to delete all empty dirs?
# list all empty dirs
find . -depth -empty -type d
# delete all empty dirs
find . -depth -empty -type d -exec rmdir {} ';'
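The cleanup commands can be tried safely inside a temporary directory. In this sketch the directory layout is made up; it shows that with -depth a directory that contains only an empty directory is removed too, because the inner one is deleted first.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/work/outer/inner" "$tmp/work/keep"
: > "$tmp/work/empty.txt"                     # an empty file
printf 'data\n' > "$tmp/work/keep/full.txt"   # a non-empty file

# Delete empty regular files.
find "$tmp/work" -type f -empty -exec rm {} \;

# -depth visits children before parents: "inner" is removed first,
# which leaves "outer" empty by the time it is tested.
find "$tmp/work" -depth -type d -empty -exec rmdir {} \;

remaining=$(cd "$tmp/work" && find . | LC_ALL=C sort)
echo "$remaining"
rm -r "$tmp"
```

Only the “keep” directory and its file are left.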
How to use “find” on file names that may contain spaces or dash?
# print file names that may contain spaces
find . -print0 | xargs -0 -l -i echo "{}";
The “-print0” tells “find” to print the file names separated by a null char (ASCII 0). (as opposed to a newline char by “-print”) The “-0” tells xargs to parse input using null char as separators and take any special char in file name as literal.
The “-l” tells “xargs” to pass just one file name at a time. The “-i” allows you to use “{}” as the file name. The “"{}"” creates quoting around the entire file name, so that “echo” (or another program) will see it as one argument instead of several. (Note: the “-i” must come after “-l”) The GNU xargs manual marks “-l” and “-i” as deprecated; “-I {}” is the documented replacement and already passes one input item per command.
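To see why the null separator matters, here is a sketch with two made-up file names, one containing a space and one starting with a dash. It uses “-I {}” in place of “-l -i”.

```shell
tmp=$(mktemp -d)
: > "$tmp/my file.txt"
: > "$tmp/-dash.txt"

# Plain xargs splits its input on spaces, so "my file.txt" becomes two items.
naive_count=$(cd "$tmp" && find . -type f | xargs -n 1 echo | awk 'END {print NR}')

# With -print0 and -0, each file name stays in one piece.
safe_list=$(cd "$tmp" && find . -type f -print0 | xargs -0 -I {} echo "{}" | LC_ALL=C sort)

echo "$naive_count"
echo "$safe_list"
rm -r "$tmp"
```

The naive pipeline reports three items for two files, while the null-separated one lists exactly the two names.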
# convert all bmp files to png in a dir. Requires “convert” from ImageMagick
# “${1%.bmp}” is the file name with the trailing “.bmp” removed
find . -name "*.bmp" -print0 | xargs -0 -I {} sh -c 'convert "$1" "${1%.bmp}.png"' sh {}
Note: a modern replacement for xargs is GNU Parallel. The syntax is almost identical to xargs, except it runs the commands in parallel. It also doesn't have problems with file names containing quotes or apostrophes.
Thanks to Ole Tange for telling me about GNU Parallel. (Ole is the author)