Linux: Download Website by Command: wget, curl, HEAD, GET
wget and curl are command line tools that let you download websites.
On Ubuntu Linux, you also have GET and HEAD, usually installed at /usr/bin/. They let you fetch a URL's HTTP header or the whole page.
How to download just one single file from a website?
# download a file
wget http://example.org/somedir/largeMovie.mov
How to download an entire website?
# download website, 2 levels deep, wait 9 sec per page
wget --wait=9 --recursive --level=2 http://example.org/
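The recursive download above can be wrapped in a small shell function. This is only a sketch: the name mirror_site, its argument order, and the DRY_RUN switch are made up for illustration. The flags --no-parent, --page-requisites, and --convert-links are additions not used in the command above; they keep wget below the starting directory, fetch the images and CSS each page needs, and rewrite links for offline viewing.

```shell
#!/bin/sh
# mirror_site URL [DEPTH] [WAIT_SECONDS]
# Hypothetical helper around the wget call shown above.
# Defaults match that call: 2 levels deep, 9 seconds between requests.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
  url=$1
  depth=${2:-2}
  wait_s=${3:-9}
  # Build the command line as the positional parameters.
  set -- wget --recursive --level="$depth" --wait="$wait_s" \
    --no-parent --page-requisites --convert-links "$url"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running DRY_RUN=1 mirror_site http://example.org/ first shows the exact wget command before any crawling starts.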
Some sites check the user agent, the string a client sends to identify itself (normally your browser). In that case, add the --user-agent= option:
wget http://example.org/ --user-agent='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'
How to download a numbered file sequence from a website?
# download all jpg files named cat01.jpg to cat20.jpg
# (the URL is quoted so the shell does not treat [ ] as a glob pattern)
curl -O 'http://example.org/xyz/cat[01-20].jpg'
# download all jpg files named cat1.jpg to cat20.jpg
curl -O 'http://example.org/xyz/cat[1-20].jpg'
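To see which URLs such a range stands for, or to feed the same sequence to wget, the list can be generated in the shell. The function below is a rough sketch, not part of curl; the name url_range and its argument order are invented for this example.

```shell
#!/bin/sh
# url_range PREFIX FIRST LAST WIDTH SUFFIX
# Hypothetical helper: prints PREFIX + zero-padded number + SUFFIX,
# one URL per line. WIDTH is the number of digits (2 gives 01, 02, ...).
url_range() {
  fmt="%s%0${4}d%s\n"
  i=$2
  while [ "$i" -le "$3" ]; do
    printf "$fmt" "$1" "$i" "$5"
    i=$((i + 1))
  done
}

# Example use (commented out because it contacts the network):
# url_range http://example.org/xyz/cat 1 20 2 .jpg > urls.txt
# wget --input-file=urls.txt
```

With a width of 2 this mirrors cat[01-20].jpg; with a width of 1 it mirrors cat[1-20].jpg.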
Other useful options are:
--referer http://example.org/ → set a referer (that is, the URL of the page you came from)
--user-agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)" → set the user agent, in case the site needs that.
Note: curl cannot download an entire website recursively. Use wget for that.
What is the difference between wget and curl?
The major difference between wget and curl is that wget lets you download a site by crawling links, while curl is for a specific URL or a list of URLs.
- wget allows recursive fetching. curl is for one or more URLs.
- wget is a command line tool. curl is powered by libcurl, which is available as an API for programming languages.
- wget is mostly for end users getting web content. curl supports many more protocols and is more flexible in programming use.
For details, see the curl author's explanation at http://daniel.haxx.se/docs/curl-vs-wget.html
Get URL Headers with HEAD
You can use HEAD to get the headers of the HTTP response for a URL.
HEAD is a Perl script. On Ubuntu, by default it's installed at
Here's a sample session:
~/web/xahlee_info/linux $ HEAD example.org
200 OK
Cache-Control: max-age=604800
Connection: close
Date: Fri, 19 Sep 2014 07:07:48 GMT
Accept-Ranges: bytes
ETag: "359670651"
Server: ECS (rhv/818F)
Content-Length: 1270
Content-Type: text/html
Expires: Fri, 26 Sep 2014 07:07:48 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Client-Date: Fri, 19 Sep 2014 07:07:48 GMT
Client-Peer: 184.108.40.206:80
Client-Response-Num: 1
X-Cache: HIT
X-Ec-Custom-Error: 1
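Output like the session above is easy to filter when only one header is of interest. The following is a minimal sketch using awk; the function name header_value is made up for this example, and it assumes one "Name: value" pair per line, as in the sample.

```shell
#!/bin/sh
# header_value NAME
# Hypothetical helper: reads HTTP headers on stdin and prints the value
# of the first header called NAME (case-insensitive match).
header_value() {
  awk -v name="$1" '
    BEGIN { prefix = tolower(name) ":" }
    tolower(substr($0, 1, length(prefix))) == prefix {
      value = substr($0, length(prefix) + 1)
      # Trim leading blanks and any trailing blanks or carriage return.
      gsub(/^[ \t]+|[ \t\r]+$/, "", value)
      print value
      exit
    }'
}

# Example use (commented out because it contacts the network):
# HEAD example.org | header_value Content-Type
```

If HEAD is not installed, curl --head http://example.org/ prints the response headers too and can be piped into the same function.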
Use GET to retrieve the entire content of a URL.
# fetch a url and save as xyzfilename
GET example.org > xyzfilename
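Not every machine has GET installed, so a script that saves a page to a file may want to fall back on whichever tool is present. The function below is one possible sketch; the name fetch_to_file and the order in which the tools are tried are arbitrary choices for this example.

```shell
#!/bin/sh
# fetch_to_file URL FILE
# Hypothetical helper: saves the content of URL into FILE using the
# first available tool among curl, wget, and GET.
fetch_to_file() {
  if command -v curl >/dev/null 2>&1; then
    # --location follows redirects; --output names the file to write.
    curl --silent --show-error --location --output "$2" "$1"
  elif command -v wget >/dev/null 2>&1; then
    wget --quiet --output-document="$2" "$1"
  elif command -v GET >/dev/null 2>&1; then
    GET "$1" > "$2"
  else
    echo "fetch_to_file: need curl, wget, or GET" >&2
    return 127
  fi
}

# Example use (commented out because it contacts the network):
# fetch_to_file http://example.org/ xyzfilename
```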