Python Doc Problem: gzip

By Xah Lee.

Today i need to use Python to compress/decompress gzip files. Since i read the official Python tutorial 8 months ago, have spent 1 hour with Python every other day since, and have 14 years of computing experience, 8 years in mathematical computing and 4 years in unix admin and Perl, i quickly found the official doc: http://python.org/doc/2.4.1/lib/module-gzip.html

I'd imagine it being a function something like:

fileContent = GzipFile(filePath, compress/decompress)

However, after 20 seconds of scanning the doc, i find not a single example showing how it is used.

Instead, the doc starts with some arcane info about compatibility with some other compression module and other software. Then it talks about the main function GzipFile in a very haphazard way, with confused writing. There is no perspective whatsoever on using it to solve a problem, nor a concrete description of how to use it. Instead, jargon such as Class, Constructor, and Object is thrown together, presuming the reader's expertise in Python IO programming and gzip compression arcana.

Having understood nothing, and not being a Python expert, i wanted to read about file objects, but there is no link.

I eventually located the file object's doc page: http://python.org/doc/2.4.1/lib/bltin-file-objects.html, but it is itself written and organized in a very unhelpful way.

Problem Detail

Here's the detail of the problems of its documentation. It starts with:

The data compression provided by the zlib module is compatible with that used by the GNU compression program gzip. Accordingly, the gzip module provides the GzipFile class to read and write gzip-format files, automatically compressing or decompressing the data so it looks like an ordinary file object. Note that additional file formats which can be decompressed by the gzip and gunzip programs, such as those produced by compress and pack, are not supported by this module.

This intro paragraph is about 3 things: (1) the purpose of this gzip module. (2) its relation to the zlib module. (3) a gratuitous piece of arcana: files produced by “compress” and “pack”, which the gzip program can decompress, are not supported by Python's gzip module. Item (3) needs mentioning only because of how the paragraph is phrased. The writing itself is a jumble.

Of the people using the gzip module, the vast majority really just need to decompress a gzip file. They don't need to know (2) and (3) in a preamble. The worst aspect here is the jumbled writing.

The doc continues on to the main function:

class GzipFile( [filename[, mode[, compresslevel[, fileobj]]]])
Constructor for the GzipFile class, which simulates most of the methods of a file object, with the exception of the readinto() and truncate() methods. At least one of fileobj and filename must be given a non-trivial value. The new class instance is based on fileobj, which can be a regular file, a StringIO object, or any other object which simulates a file. It defaults to None, in which case filename is opened to provide a file object.

This paragraph assumes that its readers are thoroughly familiar with Python's File Objects and their methods. The writing is haphazard and extremely confusing. Instead of explicitness and clarity, it tries to convey its meanings by side effects. On the whole it is incomprehensible.

• The word “simulate” is used twice, inanely. The sentence “…GzipFile class, which simulates…” is better said as “GzipFile is modeled after Python's File Objects class.”

• The intention, to state that it has all of Python's File Object methods except two of them, is ambiguously phrased. The way it is written seems to suggest that it has all methods of Python's File Object, but that two of the methods work in a different way than the File Object's methods.

• The use of the phrase “non-trivial value” is inane. What does a non-trivial value mean here? Does it mean Null? Nil? Undefined? Zero? Empty? Does “non-trivial value” have a technical meaning in Python? Or, is it meant with generic English interpretation? If the latter, then what does it mean to say: “… must be given a non-trivial value”? Does it simply mean one of these parameters must be given?

• The rest of the paragraph is just incomprehensible.
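
For readers trying to decode that paragraph, here is a minimal sketch of what it appears to describe. It assumes Python 3, where io.BytesIO plays the role the 2.4 doc assigns to StringIO. The file name demo.txt.gz and the sample bytes are made up for illustration.

```python
import gzip
import io
import os
import tempfile

data = b"hello gzip\n"

# Case 1: only a filename is given, so GzipFile opens that file itself.
path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")  # hypothetical file name
with gzip.GzipFile(path, "wb") as gz:
    gz.write(data)
with gzip.GzipFile(path, "rb") as gz:
    from_path = gz.read()

# Case 2: only fileobj is given, so GzipFile reads from and writes to
# that already-open file-like object instead of opening anything.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(data)
buf.seek(0)  # rewind before reading the compressed bytes back
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    from_buffer = gz.read()

print(from_path == data, from_buffer == data)
```

Read this way, “at least one of fileobj and filename must be given” seems to mean only that GzipFile needs to be told, one way or the other, where the compressed bytes live.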

When fileobj is not None, the filename argument is only used to be included in the gzip file header, which may includes the original filename of the uncompressed file. It defaults to the filename of fileobj, if discernible; otherwise, it defaults to the empty string, and in this case the original filename is not included in the header.

“discernible”? What discernible? How?
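
One plausible reading, which is an assumption and not something the quoted text states, is that the name is “discernible” when the file object carries a name attribute. The sketch below (Python 3, with an in-memory io.BytesIO that has no name, and a made-up file name notes.txt) checks the gzip header flag that records whether an original file name was stored.

```python
import gzip
import io

FNAME = 0x08  # gzip header flag bit: "original file name is present"

def gzip_bytes(**kwargs):
    """Compress a fixed payload into memory and return the raw gzip bytes."""
    buf = io.BytesIO()  # an in-memory file object with no .name attribute
    with gzip.GzipFile(fileobj=buf, mode="wb", **kwargs) as gz:
        gz.write(b"payload")
    return buf.getvalue()

anonymous = gzip_bytes()                  # no name for GzipFile to pick up
named = gzip_bytes(filename="notes.txt")  # hypothetical name, given explicitly

# Byte 3 of a gzip stream is the flags byte (per the gzip format, RFC 1952).
print(bool(anonymous[3] & FNAME))  # was a file name stored?
print(bool(named[3] & FNAME))
print(b"notes.txt" in named)
```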

The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', or 'wb', depending on whether the file will be read or written. The default is the mode of fileobj if discernible; otherwise, the default is 'rb'. If not given, the 'b' flag will be added to the mode to ensure the file is opened in binary mode for cross-platform portability.

“discernible”? Again, familiarity with the working of Python's file object is implicitly assumed. For people who do not have expertise with working with files using Python, it necessitates the reading of Python's file objects documentation.
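
For readers who have not met these mode strings before, a short sketch of the three families may help: 'w' writes a new file, 'a' appends to an existing one, 'r' reads, and the 'b' suffix means binary. This assumes Python 3, and the file name log.txt.gz is hypothetical.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt.gz")  # hypothetical file name

with gzip.GzipFile(path, "wb") as gz:   # 'w' = write, replacing any old file
    gz.write(b"first line\n")

with gzip.GzipFile(path, "ab") as gz:   # 'a' = append to the end of the file
    gz.write(b"second line\n")

with gzip.GzipFile(path, "rb") as gz:   # 'r' = read (decompress)
    text = gz.read()

print(text)
```

Reading the file back yields both lines, in the order they were written.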

The compresslevel argument is an integer from 1 to 9 controlling the level of compression; 1 is fastest and produces the least compression, and 9 is slowest and produces the most compression. The default is 9.
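
A quick way to see the effect of compresslevel is sketched below. It assumes Python 3 and uses the one-call helpers gzip.compress and gzip.decompress, which are not part of the 2.4 module quoted here but take the same compresslevel argument. The sample text is arbitrary.

```python
import gzip

data = b"To be, or not to be, that is the question.\n" * 1000

fast = gzip.compress(data, compresslevel=1)   # fastest, least compression
small = gzip.compress(data, compresslevel=9)  # slowest, most compression

# Compare the original size with the two compressed sizes.
print(len(data), len(fast), len(small))

# Either way, decompressing gives back the original bytes.
assert gzip.decompress(fast) == data
assert gzip.decompress(small) == data
```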

Calling a GzipFile object's close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object's getvalue() method.

huh? append more material? pass a StringIO? and memory buffer?

I just want to fscking decompress a gzip file.

Here, expertise in IO programming is assumed of the reader. Meanwhile, the writing is not clear about what exactly it is trying to say about the close() method.
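
What the close() paragraph is probably getting at can be sketched as follows. This assumes Python 3, with io.BytesIO standing in for the StringIO object the 2.4 doc mentions. The payload is made up.

```python
import gzip
import io

buf = io.BytesIO()                      # in-memory stand-in for a real file
gz = gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(b"some text")
gz.close()                              # finishes the gzip data...

still_open = not buf.closed             # ...but leaves buf itself open
compressed = buf.getvalue()             # the gzip bytes, straight from memory
buf.write(b"extra bytes after the gzip data")  # buf is still writable
buf.close()                             # closing buf is the caller's job

print(still_open)
print(gzip.decompress(compressed))
```

In other words, GzipFile closes only itself, and whoever supplied fileobj decides when that object is done.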

Suggestions

Quality documentation should be clear, succinct, and precise. And the less it assumes of the reader's expertise to achieve these qualities, the better it is.

The vast majority of programmers using this module really just want to compress or decompress a file. They do not need to know any more details about the technicalities of this module, nor about the gzip compression specification. Here's what Python documentation writers should do to improve it:

• Rewrite the intro paragraph. Example: “This module provides a simple interface to compress and decompress files using the GNU compression format gzip. For detailed working with gzip format, use the zlib module.” The “zlib module” phrase should be linked to its documentation.

• Near the top of the documentation, add an example of usage. An example is worth a thousand words:

# decompressing a file
import gzip
fileObj = gzip.GzipFile("war_and_peace.txt.gz", 'rb')
fileContent = fileObj.read()
fileObj.close()
# compressing a file
import gzip
fileObj = gzip.GzipFile("hamlet.txt.gz", 'wb')
fileObj.write(fileContent)
fileObj.close()

• Add at the beginning of the documentation an explicit statement that GzipFile() is modeled after Python's File Objects, and provide a link to the File Object's documentation.

• Rephrase the writing so as to not assume that the reader is thoroughly familiar with Python's IO. For example, when speaking of the modes {'r', 'rb', …}, add a brief statement on what they mean. This way, readers do not have to take an extra step to look for their meaning in the File Object's documentation.

• Move arcane technical details about gzip compression to the bottom as footnote.

• General advice on the writing: The goal of writing on this module is to document its behavior, and to effectively indicate how to use it. Keep this in mind when writing the documentation. Make it clear what you are trying to say in each itemized paragraph. Make it precise, but without overdoing it. Assume the readers are familiar with the Python language and with gzip compression in general. For example, assume that the readers understand what classes and objects are in Python, and that they know what compression is, what compression levels are, and the file name suffix convention. However, do not assume that the readers are experts in Python IO, the gzip specification, or the compression technology and software of the industry. If arcane details or warnings are necessary, move them to footnotes.
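
The usage example suggested in the list above was written against Python 2.4. A present-day variant might look like the sketch below, assuming Python 3, where gzip.open and the with statement close the file automatically. The temporary directory and sample text are made up so that the sketch runs on its own.

```python
import gzip
import os
import tempfile

# Hypothetical file, created in a temporary directory so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "hamlet.txt.gz")
original = b"To be, or not to be, that is the question.\n"

# compressing a file
with gzip.open(path, "wb") as f:
    f.write(original)

# decompressing a file
with gzip.open(path, "rb") as f:
    file_content = f.read()

print(file_content == original)
```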

Official Python Doc Fix

M A Darche (aka madarche) filed a bug report on this issue, with acknowledgment. See bugs.python.org/issue2406. The fix is reflected in the Python 2.6 doc. See: http://www.python.org/doc/2.6/library/gzip.html. Thanks to M A Darche.

See also: Why Open Source Documentation is of Low Quality