UNIX Tar Problem: File Length Truncation, Unicode Name Support

By Xah Lee. Date: . Last updated: .

Discovered that GNU tar now has a --help option. So, instead of typing man tar, you can type tar --help. Not sure how long it has been there.

Much better. I always hated the “man” faak. You can never be sure the man page corresponds to the version you are using, and because the doc is separate from the program, it's also a pain for devs to maintain and tends to get out of sync.

Another thing about tar is that i never figured out why its syntax doesn't use the dash. You use tar xvf myfile.tar instead of tar -xvf myfile.tar. Many years ago, the form with the dash would not work. Not sure whether all tar programs support it today.

Also, you can't talk about tar without talking about the unix line truncation problem. Tar used to truncate your file names if the path was long (for example, ~120 chars). See: Unix, RFC, Line Truncation. What's the max length of a file path before it gets truncated, and is it done silently?

Something i have wanted to test but never got to: does the current version of tar preserve file names that have lots of Unicode characters? (For example, Chinese, math symbols.)

According to the Wikipedia article tar (file format), there seems to be a new spec in “POSIX.1-2001” that addresses file name length and charset encoding, and it was implemented by GNU tar in 2004.
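The long-name and Unicode-name questions can be tested without touching the disk. Here is a minimal sketch using Python's standard tarfile module (not GNU tar itself, so it shows what the archive formats allow, not what any particular tar binary does). The helper name roundtrip and the sample file names are made up for illustration.

```python
import io
import tarfile

def roundtrip(name, fmt):
    """Write one empty member called `name`, then return the name read back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getnames()[0]

# Hypothetical sample names: one with path components far over 100 bytes,
# one with Chinese characters and math symbols.
long_name = "d" * 150 + "/" + "f" * 150 + ".txt"
unicode_name = "文件/∑∫√-数学.txt"

# The POSIX.1-2001 (pax) format keeps both names intact.
for name in (long_name, unicode_name):
    assert roundtrip(name, tarfile.PAX_FORMAT) == name

# The older ustar format cannot hold the long name. Python's tarfile
# refuses to write it instead of truncating silently.
try:
    roundtrip(long_name, tarfile.USTAR_FORMAT)
    ustar_result = "stored"
except ValueError as err:
    ustar_result = f"rejected: {err}"
print(ustar_result)
```

Whether a given tar program truncates, refuses, or switches format when it meets such names is a separate question that this sketch does not answer.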

Wikipedia on tar, 2021-06-12

The Wikipedia article turns out to be quite informative. One thing it mentions is the “tarbomb”. That is, when you untar, the files get scattered all over your dir, or even into parent dirs, and OVERWRITE your files. This is an extreme pain in the ass, and still happens today.
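One way to guard against a tarbomb is to look at the member names before extracting anything. The sketch below uses Python's standard tarfile module. The function name tarbomb_report and the sample archive are invented for illustration, and the check only looks at names (it does not cover symlink tricks).

```python
import io
import posixpath
import tarfile

def tarbomb_report(tar):
    """Return (names that escape the target dir, sorted top-level entries)."""
    escaping, tops = [], set()
    for member in tar.getmembers():
        name = posixpath.normpath(member.name)
        if name.startswith("/") or name == ".." or name.startswith("../"):
            escaping.append(member.name)
        else:
            tops.add(name.split("/")[0])
    return escaping, sorted(tops)

def make_archive(names):
    """Build a small in-memory archive of empty files (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return buf

with tarfile.open(fileobj=make_archive(["a.txt", "b.txt", "../evil.txt"])) as tar:
    escaping, tops = tarbomb_report(tar)

# Any escaping name, or more than one top-level entry, is a warning sign.
print("escapes target dir:", escaping)
print("top-level entries:", tops)
```

If the report shows more than one top-level entry, extracting into a fresh empty directory keeps the mess contained. Recent Python versions also offer extraction filters in tarfile that reject paths outside the destination.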

Another interesting problem is that tar doesn't have a table of contents, so there is no random access. If you need to list files or extract one file, you need to read through the archive from the beginning.
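The missing table of contents can be worked around by building one yourself: a single sequential pass over the headers records where each file's data starts, and later reads can seek straight there. A rough sketch with Python's standard tarfile module follows. It assumes an uncompressed archive (seeking by offset does not work through gzip), and the names build_index and read_member are made up.

```python
import io
import tarfile

def build_index(fileobj):
    """One sequential pass over all headers: name -> (data offset, size)."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}

def read_member(fileobj, index, name):
    """Random access: jump directly to a member's data using the index."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)

# Demo data: a small in-memory archive with three files.
files = {"a.txt": b"first", "b.txt": b"second", "c.txt": b"third"}
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

index = build_index(buf)
print(read_member(buf, index, "b.txt"))
```

The index lives outside the archive, so it has to be rebuilt or stored separately. Zip avoids this by keeping a central directory inside the file.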

Here's another good resource discussing tar's problems.

[2021-06-12 from New file format? http://duplicity.nongnu.org/new_format.html]

In recent months i read that Google still uses tape drives as one of their backups. I wonder if they use tar as the file format.

Alright, today, i'm deprecating tar for any personal use. If you are making decisions for yourself, i suggest zip as a replacement. Zip is open source and well supported. It is adopted by Java (in its jar files) and others. Gzip is also well supported by the industry. (For example, it is adopted in Sitemap. See Creating A Sitemap With Emacs Lisp.)

See also: ZIP, Open Source, Mother-Son Relationship.

Apparently, the reason tar doesn't use a dash in its option syntax is that in Seventh Edition Unix (released in 1979), the tar command did not use a dash. Instead, a single “key” argument holds one function letter that specifies what to do, plus any modifier letters that specify options.

Here's the Seventh Edition Unix man page of tar:

TAR(1)							      General Commands Manual							    TAR(1)

NAME

       tar  -  tape archiver

SYNOPSIS

       tar [ key ] [ name ... ]

DESCRIPTION

       Tar  saves and restores files on magtape.  Its actions are controlled by the key argument.  The key is a string of characters containing at
       most one function letter and possibly one or more function modifiers.  Other arguments to the command are file or directory names  specify-
       ing which files are to be dumped or restored.  In all cases, appearance of a directory name refers to the files and (recursively) subdirec-
       tories of that directory.

       The function portion of the key is specified by one of the following letters:

       r       The named files are written on the end of the tape.  The c function implies this.

       x       The named files are extracted from the tape.  If the named file matches a directory whose contents had been written onto the  tape,
	       this directory is (recursively) extracted.  The owner, modification time, and mode are restored (if possible).  If no file argument
	       is given, the entire content of the tape is extracted.  Note that if multiple entries specifying the same file are on the tape, the
	       last one overwrites all earlier.

       t       The  names  of  the specified files are listed each time they occur on the tape.  If no file argument is given, all of the names on
	       the tape are listed.

       u       The named files are added to the tape if either they are not already there or have been modified since last put on the tape.

       c       Create a new tape; writing begins on the beginning of the tape instead of after the last file.  This command implies r.

       The following characters may be used in addition to the letter which selects the function desired.

       0,...,7	 This modifier selects the drive on which the tape is mounted.	The default is 1.

       v	 Normally tar does its work silently.  The v (verbose) option causes it to type the name of each file it treats  preceded  by  the
		 function letter.  With the t function, v gives more information about the tape entries than just the name.

       w	 causes  tar  to print the action to be taken followed by file name, then wait for user confirmation. If a word beginning with `y'
		 is given, the action is performed. Any other input means don't do it.

       f	 causes tar to use the next argument as the name of the archive instead of /dev/mt?.  If the name of the file is `-',  tar  writes
		 to  standard output or reads from standard input, whichever is appropriate. Thus, tar can be used as the head or tail of a filter
		 chain Tar can also be used to move hierarchies with the command
							   cd fromdir; tar cf - . | (cd todir; tar xf -)

       b	 causes tar to use the next argument as the blocking factor for tape records. The default is 1, the maximum  is  20.  This  option
		 should only be used with raw magnetic tape archives (See f above).  The block size is determined automatically when reading tapes
		 (key letters `x' and `t').

       l	 tells tar to complain if it cannot resolve all of the links to the files dumped. If this is not specified, no error messages  are
		 printed.

       m	 tells tar to not restore the modification times.  The mod time will be the time of extraction.

FILES

       /dev/mt?
       /tmp/tar*

DIAGNOSTICS

       Complaints about bad key characters and tape read/write errors.
       Complaints if enough memory is not available to hold the link tables.

BUGS

       There is no way to ask for the n-th occurrence of a file.
       Tape errors are handled ungracefully.
       The u option can be slow.
       The  b  option  should not be used with archives that are going to be updated. The current magtape driver cannot backspace raw magtape.	If
       the archive is on a disk file the b option should not be used at all, as updating an archive stored in this manner can destroy it.
       The current limit on file name length is 100 characters.
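The key syntax described in the man page above can be sketched as a small parser. This is an illustration written from the man page text only, not the historical implementation. The name parse_key is made up, and the assumption that f and b take their arguments in the order they appear in the key is mine.

```python
FUNCTIONS = set("rxtuc")            # function letters from the man page
MODIFIERS = set("01234567vwfblm")   # function modifiers from the man page
TAKES_ARGUMENT = set("fb")          # modifiers that use "the next argument"

def parse_key(key, args):
    """Split a dashless tar key into (function, modifiers, remaining names)."""
    function = None
    modifiers = {}
    rest = list(args)
    for ch in key:
        if ch in FUNCTIONS:
            if function is not None:
                raise ValueError(f"more than one function letter: {function!r}, {ch!r}")
            function = ch
        elif ch in MODIFIERS:
            if ch in TAKES_ARGUMENT:
                if not rest:
                    raise ValueError(f"modifier {ch!r} needs an argument")
                # Assumption: arguments are consumed in key order.
                modifiers[ch] = rest.pop(0)
            else:
                modifiers[ch] = True
        else:
            raise ValueError(f"bad key character: {ch!r}")
    return function, modifiers, rest

# The "tar cf - ." half of the man page's example pipeline:
print(parse_key("cf", ["-", "."]))
# A key with two argument-taking modifiers:
print(parse_key("cvfb", ["out.tar", "20", "mydir"]))
```

This also shows why the dash is not needed: the whole first argument is the key, so there is nothing to tell apart from file names.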