Fetching Webpage Content in Python & Perl


Suppose you want to fetch a webpage. The following code does it:

# -*- coding: utf-8 -*-
# python 3

from urllib.request import urlopen

# read() returns bytes; decode to text, assuming the page is UTF-8
print(urlopen("http://wordyenglish.com/flatland/index.html").read().decode("utf-8"))
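
For anything beyond a quick one-liner, it can help to wrap the fetch in a function that sets a timeout and decodes the bytes using the charset the server reports. The following is a minimal Python 3 sketch, not a definitive implementation: the name fetch_text, the 30-second default, and the UTF-8 fallback are all assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=30, fallback_charset="utf-8"):
    """Fetch url and return its body as text.

    fetch_text is a hypothetical helper; the timeout and the
    fallback charset are assumed defaults, not fixed rules.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header when the server sends one.
        charset = response.headers.get_content_charset() or fallback_charset
        # Undecodable bytes are replaced rather than raising an error.
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text("http://wordyenglish.com/flatland/index.html") would return the page as a string. If the host cannot be reached, urlopen raises urllib.error.URLError, which the caller can catch.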

Encoding URL

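Sometimes when working with HTML pages, you need to create links. In a URL, certain chars need to be percent-encoded. For example, a space is written as %20. Basically, any reserved char ! * ' ( ) ; : @ & = + $ , / ? # [ ], when not used for its special purpose (such as separating CGI parameters), needs to be encoded as % followed by the two hexadecimal digits of its byte value. For example, ? has hexadecimal 3F, so it is encoded as %3F. Older software also encodes ~ (hexadecimal 7E), turning 〔http://example.com/~xah〕 into 〔http://example.com/%7Exah〕; the current URI standard (RFC 3986) treats ~ as unreserved, so it may be left as is.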
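
To make the encoding rule concrete, here is a small Python 3 sketch that percent-encodes text by hand. It is for illustration only: the name percent_encode is hypothetical, and the set of chars left untouched is an assumption that follows the RFC 3986 unreserved set (letters, digits, and - . _ ~).

```python
import string

# Chars that never need encoding (RFC 3986 "unreserved" set).
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def percent_encode(text):
    """Percent-encode every char of text that is not unreserved."""
    pieces = []
    for ch in text:
        if ch in UNRESERVED:
            pieces.append(ch)
        else:
            # A non-ASCII char is several bytes in UTF-8; each byte gets its own %XX.
            pieces.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(pieces)
```

This version encodes every reserved char, so it suits a single path segment or parameter value, not a whole URL whose / and ? must keep their meaning.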

In Python, the “quote” function does it. “unquote” reverses it.

# -*- coding: utf-8 -*-
# python 3

from urllib.parse import quote

print(quote("~joe's home page"))  # the space and the apostrophe get percent-encoded
print('http://www.google.com/search?q=' + quote("ménage à trois"))
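
As a further sketch, the following Python 3 snippet shows unquote reversing quote, the effect of the safe parameter, and urlencode for building a query string. The sample strings are only illustrative, and the variable names are assumptions.

```python
from urllib.parse import quote, unquote, urlencode

original = "ménage à trois"

encoded = quote(original)   # non-ASCII chars are encoded as their UTF-8 bytes
decoded = unquote(encoded)  # unquote reverses quote

# By default quote leaves "/" alone; pass safe="" to encode it too.
path_piece = quote("a/b c", safe="")

# urlencode builds a query string from name/value pairs (spaces become "+").
query = urlencode({"q": original})

print(encoded)     # m%C3%A9nage%20%C3%A0%20trois
print(decoded)     # ménage à trois
print(path_piece)  # a%2Fb%20c
print(query)       # q=m%C3%A9nage+%C3%A0+trois
```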


See also: URL Percent Encoding and Unicode; URL Percent Encoding and Ampersand Char.


In Perl, there are several ways to get a webpage's content. The easiest is to use the Perl programs HEAD or GET, usually installed at /usr/bin. For example, in a shell, type:

GET 'http://yahoo.com/'

HEAD returns a summary of the page info, such as file size. GET returns the full HTML file. (HEAD and GET are two request methods of the HTTP protocol; the Perl scripts are named after them.)
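
For comparison, the same HEAD versus GET distinction can be sketched in Python 3 by setting the request method explicitly. The helper names make_request and fetch_headers, and the 30-second timeout, are assumptions for illustration.

```python
from urllib.request import Request, urlopen


def make_request(url, method="GET"):
    """Build a request object that uses the given HTTP method."""
    return Request(url, method=method)


def fetch_headers(url, timeout=30):
    """Send a HEAD request and return the response headers as a dict.

    Repeated header names collapse to a single entry in the dict.
    """
    with urlopen(make_request(url, "HEAD"), timeout=timeout) as response:
        return dict(response.headers.items())
```

A call such as fetch_headers("http://yahoo.com/") asks the server for the headers only, so no page body is downloaded.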

If you need more control, Perl has “LWP::Simple” and “LWP::UserAgent” (there are many others). You need to install both of these.

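# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
# use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->timeout(120);    # give up if the server does not answer within 120 seconds
my $url = 'http://yahoo.com/';
my $request = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);
die $response->status_line, "\n" unless $response->is_success;    # stop on HTTP errors
my $content = $response->content();
print $content;

In the above, $ua->timeout(120); uses object-oriented syntax: it calls the method timeout on the object $ua, here setting the request timeout to 120 seconds.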