Linux: Download Website by Command: wget, curl, HEAD, GET

By Xah Lee. Date: . Last updated: .

wget and curl are command line tools that let you download websites.

On Ubuntu Linux, you also have GET and HEAD, usually installed at /usr/bin/. They let you fetch a URL's HTTP header or the whole page.

wget

How to download a single file from a website?

# download a file
wget http://example.org/somedir/largeMovie.mov
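
If the download is interrupted, wget can resume it, and you can also save the file under a different name. A quick sketch (the filename movie.mov is just an example):

# resume a partially downloaded file
wget -c http://example.org/somedir/largeMovie.mov

# save the download under a different filename
wget -O movie.mov http://example.org/somedir/largeMovie.mov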

How to download an entire website?

# download website, 2 levels deep, wait 9 sec per page
wget --wait=9 --recursive --level=2 http://example.org/
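
If you want a copy you can browse offline, wget can also fetch each page's images and CSS and rewrite the links to point to the local copies. A sketch using standard wget options (adjust the depth and wait time as you see fit):

# download website for offline browsing: also get images/CSS, rewrite links to local copies
wget --wait=9 --recursive --level=2 --page-requisites --convert-links --no-parent http://example.org/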

Some sites check the user agent, so you might add this option:

wget http://example.org/ --user-agent='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'

How to find my user agent string?

You can see your browser's user agent string in the browser's developer tools (it is sent as the User-Agent request header), or by visiting a page that echoes it back.
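
To see what user agent wget or curl itself sends, you can watch the request headers with curl's verbose mode. A sketch (verbose output prefixes request headers with "> ", and the grep just picks out the User-Agent line):

# show the User-Agent header that curl sends
curl -v -o /dev/null http://example.org/ 2>&1 | grep -i '^> user-agent'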

curl

How to download a numbered file sequence from a website?

# download all jpg files named cat01.jpg to cat20.jpg
curl -O http://example.org/xyz/cat[01-20].jpg
# download all jpg files named cat1.jpg to cat20.jpg
curl -O http://example.org/xyz/cat[1-20].jpg

Other useful curl options:

-L → follow redirects.
-o name → save output to the named file.
-O → save output using the remote file's name.
-A 'string' → set the user agent string.
-s → silent mode; no progress meter.
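
For example (the URL and output filename are placeholders):

# follow redirects, set a user agent, save to a named file, no progress meter
curl -L -s -A 'Mozilla/5.0' -o page.html http://example.org/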

Note: curl cannot be used to download an entire website recursively. Use wget for that.

What is the difference between wget and curl?

The major difference between wget and curl is that wget lets you download a whole site by crawling its links, while curl fetches a specific URL or a list of URLs.

For details, see the curl author's explanation at http://daniel.haxx.se/docs/curl-vs-wget.html
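
For example, to download a list of URLs with curl, you can feed them through xargs. A sketch, assuming urls.txt holds one URL per line:

# download each URL listed in urls.txt, saving each with its remote file name
xargs -n 1 curl -O < urls.txt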

Get URL Headers with HEAD

You can use HEAD to fetch just the HTTP response headers of a URL.

HEAD is a Perl script. On Ubuntu, by default it's installed at /usr/bin/HEAD.

Here's a sample session:

~/web/xahlee_info/linux $ HEAD example.org
200 OK
Cache-Control: max-age=604800
Connection: close
Date: Fri, 19 Sep 2014 07:07:48 GMT
Accept-Ranges: bytes
ETag: "359670651"
Server: ECS (rhv/818F)
Content-Length: 1270
Content-Type: text/html
Expires: Fri, 26 Sep 2014 07:07:48 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Client-Date: Fri, 19 Sep 2014 07:07:48 GMT
Client-Peer: 93.184.216.119:80
Client-Response-Num: 1
X-Cache: HIT
X-Ec-Custom-Error: 1

Use GET to retrieve the entire content of a URL.

# fetch a url and save as xyzfilename
GET example.org > xyzfilename
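
If GET and HEAD are not available (on Ubuntu they come with the libwww-perl package), curl can do the same jobs:

# fetch just the response headers
curl -I http://example.org/

# fetch a url and save as xyzfilename
curl -o xyzfilename http://example.org/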