Fetching Webpage Content in Python ＆ Perl
Suppose you want to fetch a webpage. The following code does it:
    # -*- coding: utf-8 -*-
    # python
    from urllib import urlopen
    print urlopen("http://wordyenglish.com/flatland/index.html").read()
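The snippet above is written for Python 2 (print statement, `from urllib import urlopen`). As a hedged sketch of the same idea in Python 3, where `urlopen` lives in `urllib.request`, the function below fetches a URL and decodes the body to text. The name `fetch_text`, the 30-second timeout, and the UTF-8 fallback are assumptions chosen for this example.

```python
# Python 3 sketch of fetching a page; fetch_text is a hypothetical helper.
from urllib.request import urlopen

def fetch_text(url, timeout=30):
    """Fetch url and return the response body decoded to a str."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")

# Example call (needs network access):
# print(fetch_text("http://wordyenglish.com/flatland/index.html"))
```

Unlike the Python 2 version, `read()` returns bytes here, so the sketch decodes them before returning.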
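Sometimes when working with HTML pages, you need to create links. In a URL, certain characters need to be percent-encoded. For example, 〔http://example.com/~xah〕 can be written as 〔http://example.com/%7Exah〕. Basically, any reserved character
! * ' ( ) ; : @ & = + $ , / ? # [ ]
when not used for its special purpose (such as delimiting CGI parameters) needs to be encoded as a percent sign followed by its two-digit hexadecimal code. For example, ~ has hexadecimal code 7E, so it is encoded as %7E. (Strictly speaking, ~ is not in the reserved list above: the older RFC 1738 classed it as unsafe, while the current RFC 3986 classes it as unreserved, so encoding it is optional today.)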
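As a rough illustration of the hexadecimal rule, the following Python 3 sketch percent-encodes a string by hand. The function name `percent_encode` and the choice to leave only letters, digits, and `-`, `.`, `_` unescaped are assumptions made for this example; real code would normally call the standard library instead.

```python
# Python 3 sketch: percent-encode by hand to show where %7E comes from.
import string

# Characters left as-is in this sketch (an assumption for illustration;
# "~" is deliberately left out so that it gets encoded, as in the text).
SAFE = set(string.ascii_letters + string.digits + "-._")

def percent_encode(text):
    """Return text with every byte outside SAFE written as %XX."""
    out = []
    for byte in text.encode("utf-8"):   # non-ASCII chars become several bytes
        char = chr(byte)
        if char in SAFE:
            out.append(char)
        else:
            out.append("%{:02X}".format(byte))  # two uppercase hex digits
    return "".join(out)

print(percent_encode("~xah"))    # %7Exah
print(percent_encode("a&b=c"))   # a%26b%3Dc
```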
In Python, the “quote” function does it. “unquote” reverses it.
    # -*- coding: utf-8 -*-
    # python
    from urllib import quote
    print quote("~joe's home page")
    print 'http://www.google.com/search?q=' + quote("ménage à trois")
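In Python 3 the same functions are in `urllib.parse`. The sketch below, which assumes Python 3 and uses a hypothetical helper name `search_url`, builds the search link from the example above and shows that `unquote` reverses `quote`.

```python
# Python 3 sketch: quote and unquote live in urllib.parse.
from urllib.parse import quote, unquote

def search_url(query):
    """Build a search URL with the query percent-encoded (as UTF-8 bytes)."""
    return "http://www.google.com/search?q=" + quote(query)

encoded = quote("ménage à trois")
print(encoded)                        # m%C3%A9nage%20%C3%A0%20trois
print(unquote(encoded))               # ménage à trois
print(search_url("ménage à trois"))
```

By default `quote` leaves `/` unescaped; pass `safe=""` to encode it too. Python 3.7 and later also leave `~` unescaped, following RFC 3986.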
See also: URL Percent Encoding and Unicode • URL Percent Encoding and Ampersand Char.
In Perl, there are several ways to get a webpage's content. The easiest is to use the Perl programs HEAD or GET, usually installed in /usr/bin. For example, in a shell, type HEAD or GET followed by the URL.
HEAD returns a summary of the page info, such as file size. GET returns the full HTML file. (HEAD and GET are two request methods of the HTTP protocol, which is why the Perl scripts are named that way.)
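To make the HEAD versus GET distinction concrete, here is a small Python 3 sketch that sends a HEAD request with the standard library and returns only the headers. The helper name `fetch_headers` and the 30-second timeout are assumptions for illustration.

```python
# Python 3 sketch: a HEAD request asks for the headers only, no body.
from urllib.request import Request, urlopen

def fetch_headers(url, timeout=30):
    """Send a HEAD request and return the response headers."""
    request = Request(url, method="HEAD")   # without method=, this would be GET
    with urlopen(request, timeout=timeout) as response:
        return response.headers             # case-insensitive header lookup

# Example call (needs network access):
# headers = fetch_headers("http://wordyenglish.com/flatland/index.html")
# print(headers.get("Content-Type"), headers.get("Content-Length"))
```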
If you need more control, Perl has “LWP::Simple” and “LWP::UserAgent” (there are many others). Both need to be installed; they ship together in the libwww-perl distribution.
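    # -*- coding: utf-8 -*-
    # perl
    use strict;
    use warnings;
    # use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(120);    # give up after 120 seconds

    my $url = 'http://yahoo.com/';
    my $request = HTTP::Request->new('GET', $url);
    my $response = $ua->request($request);

    # stop with the HTTP status line if the request failed
    die $response->status_line unless $response->is_success;

    my $content = $response->content();
    print $content;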
In the above, $ua->timeout(120); uses Perl's object-oriented syntax: it calls the timeout method on the $ua object, setting the request timeout to 120 seconds.
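For comparison, roughly the same steps can be sketched in Python 3, where the closest standard-library counterpart to a user-agent object is an opener from `urllib.request.build_opener`. The helper name `fetch_bytes` and the `User-Agent` string are assumptions of this example, not something the Perl code sets.

```python
# Python 3 sketch mirroring the Perl steps: make a client object,
# set a timeout, send a GET request, and read the body.
from urllib.request import build_opener
from urllib.error import URLError

def fetch_bytes(url, timeout=120, user_agent="example-fetcher/0.1"):
    """Return the raw body of url, or None if the request fails."""
    opener = build_opener()                        # plays the role of $ua
    opener.addheaders = [("User-Agent", user_agent)]
    try:
        # Method-call syntax, like $ua->request(...) in Perl.
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except URLError as err:                        # also covers HTTPError
        print("request failed:", err)
        return None

# Example call (needs network access):
# content = fetch_bytes("http://yahoo.com/")
```

Returning None on failure is a design choice of this sketch; raising the exception to the caller would be just as reasonable.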