CHAPTER 5

Get More with Wget

I frequently find myself needing to download a tarball, or a config file, to a server from a web page. Or, within a script, I need to know that I can query both local and remote web servers over HTTP(S) to check uptime or the presence of certain content. One of the most popular feature-filled tools is Wget, which, I hazard a guess, stands for “web get”. Previously known as Geturl, Wget is described as “the non-interactive network downloader”. It boasts several supported protocols, namely HTTP, HTTPS, and FTP, and it can also retrieve content through HTTP proxies; a powerful feature set indeed. But what can you use it for? In this chapter, you will explore some of its features, beginning with a few simple examples and continuing with some scenarios in which you might find Wget the most useful.

To install Wget on Debian and Ubuntu-based systems, run the following command. There is a significant chance that it’s already installed thanks to its popularity:

# apt-get install wget

To install Wget on Red Hat derivatives, use the following command:

# yum install wget

Long- and Shorthand Support

The fact that Wget follows the “GNU getopt” standard means that it will process its command-line options in both longhand and shorthand; the --no-clobber option is the same as the -nc option, for example. Let’s see how to use Wget to download something using that option. Should a download have been interrupted previously, you can ask it to use the humorously named no-clobber option to avoid overwriting a file you have already downloaded:

# wget --no-clobber http://www.chrisbinnie.tld

The same command can be run from the command line using:

# wget -nc http://www.chrisbinnie.tld

I mention this because each of the two formats has its place. For example, when you come to grips with the package, you will likely soon be abbreviating everything and may end up inserting shorthand into a script that you’ve written.

However, as you find your way through a package’s options, it’s initially much easier to write everything out in longhand. Also, if you’re preparing a script that other people will be using and administering in the future, then looking up each and every command-line argument to decipher a complex command takes a reasonable amount of time; longhand might be better suited to that situation too. Throughout this chapter, I will try to present each command-line option in both formats, for clarity and to give you the choice of either.

As you might have guessed from the functionality that I’ve alluded to, Wget also supports “regetting”, or resuming, a download. This makes a world of difference with servers that offer such functionality: you simply pick up from the point in the file that you had already successfully downloaded, wasting no time at all in getting straight back to the task at hand. And, with a 10GB file, that’s usually more of a necessity than a nicety.

Logging Errors

But what if the marvelous Wget encounters errors during its travels? Whether it runs from within a script or from the command line, you can collect errors in a separate log file. A very painless solution is all that’s needed, this time with the -o switch (--output-file in longhand):

# wget --output-file=logfile_for_errors http://www.chris.tld/chrisbinnie.tar.gz

Back to the command line. If you’re not running Wget from within a script, you can drop Wget into background mode straight away after launching it with the -b or --background option. In this case, if there’s no -o log file declared, then Wget will create a file called wget-log to catch the output.
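
For example, the following (reusing the hypothetical tarball URL from earlier) kicks the download off in the background; you can then keep an eye on proceedings with tail -f wget-log:

# wget --background http://www.chris.tld/chrisbinnie.tar.gz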

Another logging option is to append output to a log file that already exists. If you’re executing a few automated scripts overnight, using Cron jobs for example, a universal Wget log file might live in the /var/log directory and get rotated by an automated log-rotation tool such as logrotate.

You can achieve this functionality with the -a option, as demonstrated here:

# wget --append-output=/var/log/logfile_for_errors http://www.chris.tld/chrisbinnie.tar.gz

While on the subject of errors and spotting what has been actioned after running Wget, you can also choose -d to enable debugging. This is simply --debug in longhand.

A minor caveat would be that some sysadmins disable this functionality by not compiling the option into Wget. If it doesn’t work for you, that is likely to be the reason why.
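
Assuming debugging support was compiled in, the following prints the extra detail (including the request and response headers) to your terminal:

# wget --debug http://www.chrisbinnie.tld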

Rather than redirect your Wget output to /dev/null all the time, you can enable -q to suppress the output. As you might have guessed, this can be enabled with the following:

# wget --quiet ftp://ftp.chrisbinnie.tld/largefile.tar.gz

There are even more options than that. Without enabling debugging, you can always use the Linux standard -v to enable the --verbose output option, which should help out most of the time. It helps diagnose simple issues at least, if things aren’t quite working as you would hope.

Using the fully considered approach to which you’re becoming accustomed, the ever-trusty Wget even caters for an -nv option, which stands for --no-verbose. This allows you to disable verbose mode without switching all noise off. As a result, you still see some simple transactional information and any useful, generated errors.

Automating Tasks

The previous features are not the only ones that make Wget so popular; it also boasts some options that, I suppose, are mainly designed for automation.

For example, -i (--input-file) lets you grab not just one URL but as many as you can throw at it. You can specify a file brimming to the top with URLs, as follows:

# wget --input-file=loads_of_links
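
The input file itself is simply plain text with one URL per line; a hypothetical loads_of_links might look like this:

http://www.chrisbinnie.tld/index.html
http://www.chris.tld/chrisbinnie.tar.gz
ftp://ftp.chrisbinnie.tld/largefile.tar.gz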

You can also opt for what has been referred to in the past as “downloading recursively”. In other words, when Wget discovers a link to another page from the resource that you’ve pointed it at, it will aim to collate a local copy of those images, files, and web pages too; really, that means anything attached to that page. The --recursive parameter is as simple as -r when abbreviated:

# wget -r http://www.chrisbinnie.tld/inner-directory/files

It will pick up both HTML and XHTML. In addition, Wget thoughtfully respects the directory structures, which means, in a very handy fashion, you can essentially click through your local copy of the content as if you’re surfing online.

And, a little like creating your own web content–harvesting spider, Wget will even pay attention to the standard robots.txt file found in the root of web sites. The robots file can instruct any automated tool (one that pays attention, at least) about which content is public and allowed to be downloaded and which should be left alone. There’s a whole host of other instructions, which are for another day.

As you’re grabbing the content (more commonly called “scraping”), using the -i option, from within a script and saving it locally, it can also be useful to prepend a URL to the start of the resource (and any subsequent resources) with the -B option. This allows you to keep track of which web site the resource came from, by adding the top-level URL to the directory structure.

This can be achieved with:

--base=http://www.chrisbinnie.tld

A feature that I’ve used many times in scripts (where the content is scrutinized by grep, awk, and sed once downloaded) is the -O option, or --output-document in longhand (not to be confused with the lowercase -o flag).

Using that option, you can drop the downloaded content into a single file; any additional downloaded content will simply be appended to that file, rather than mimicking a downloaded web page with lots of small files in a directory.
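
As a quick sketch, reusing the hypothetical loads_of_links file from earlier, every document listed ends up concatenated into one file, ready for grep, awk, and sed to pick over:

# wget --output-document=combined_output --input-file=loads_of_links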

Fulfilling Requests with Retries

One key consideration of automating any task online is accounting for the fact that there will inevitably be problems, which will cause varying degrees of pain. You should be expecting connectivity issues, outages, downtime, and general Internet malaise such as spam, bots, and malware.

You can ask that Wget dutifully keep trying to fulfil a request by using the -t switch; or, in longhand, you can specify the number of retries a resource will have before Wget gives up, such as --tries=10. This configurable parameter really helps with poor connections or unreliable targets.
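
For instance, to allow up to ten attempts at the hypothetical tarball from earlier:

# wget --tries=10 http://www.chris.tld/chrisbinnie.tar.gz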

Consider a scenario where you want to pick up links and download content from HTML that’s saved in a file locally. The -F or --force-html switch will allow this. Use it in conjunction with the --base switch mentioned earlier so that relative, not absolute, links in your target are resolved correctly.
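
A minimal sketch, assuming a locally saved page called saved_page.html whose relative links belong to the hypothetical domain used throughout this chapter:

# wget --force-html --input-file=saved_page.html --base=http://www.chrisbinnie.tld/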

Having touched on connectivity issues, consider this for a second. What if you have to suddenly drop your laptop into suspend mode because you’ve just caught sight of the time? The -c or --continue argument lets you carry on later, making the reliable Wget completely unaware that you were late for your dental appointment at 2:30.

You don’t even have to add the --continue switch for retries within the same run, as resuming is the default behavior there. However, you do need it if you want to resume a download begun by a prior instance of Wget.
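
With the partially downloaded file still sitting in your current directory, picking up where you left off is as simple as:

# wget --continue http://www.chris.tld/chrisbinnie.tar.gz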

A super quick word of warning, however. In later versions of Wget, if you ask it to kick off a “resume” on a file that already contains some content and the remote server that you connect to does not support the resume functionality, it will overwrite the existing content in error!

To save that existing content, you should change the file name before beginning the download and then try to merge the files later.

Showing Progress

Another nice touch is a progress bar. How many times have programs just sat, looking busy even if they’re not, while you wait what seems like forever?

There’s no shorthand for changing the default progress bar format, but fear not, because adding --progress=dot will give you a counter in dots, whereas bar instead of dot will draw a perfectly plausible progress bar using ever-faithful ASCII art.
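
For example, to watch the dotted counter tick along while fetching the hypothetical tarball:

# wget --progress=dot http://www.chris.tld/chrisbinnie.tar.gz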

Speaking of download limits, how about adding -Q? The longhand --quota option caps the total amount of data retrieved during recursive or multi-file downloads; the value is given in bytes, or in kilobytes or megabytes with a k or m suffix (kilobytes in this case):

# wget --quota=500k --input-file=loads_of_links

DNS Considerations

One issue that I’ve encountered in the past, caused by underlying infrastructure problems and not Wget I’d like to point out, has been DNS caching.

Back in the day, the relatively small number of domestic-market orientated ISPs were much less concerned about proxying web content to save their bandwidth. They didn’t feel the need to cache popular content nearly as readily (caching means the content is downloaded once rather than several times, cutting down on the ISP’s bandwidth usage and therefore the cost).

The same applied to Domain Name System lookups (despite DNS lookups generating a minuscule amount of traffic, there are still server-load implications). In these modern Internet times, ISPs that offer access to millions of subscribers tend to have expiry times set to much higher values (a day or two, as opposed to the hour or two that was adequate in the past).

With Wget you can, somewhat surprisingly, disable its reliance on localized caching (which, I should point out, unfortunately doesn’t affect the underlying infrastructure) to at least prevent the thoughtful Wget from adding to the same caching issues every now and again. The fact that Wget is generous enough an application to provide DNS caching in the first place is far from commonplace. Cached lookups are kept in memory only, and rerunning Wget with the --no-dns-cache option enabled means that name servers are sent fresh queries.
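
For example, this forces a fresh lookup for every new connection made during the run:

# wget --no-dns-cache http://www.chrisbinnie.tld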

Using Authentication

What if you can’t get access to a site because it’s password protected? As you might expect, Wget accounts for that common scenario too. Yet again there is more than one option for you to explore.

The most obvious way is simply entering a login and password to the command line as follows:

# wget --user=chris --password=binnie http://assets.chrisbinnie.tld/logo.jpg

But hold on for a moment. If you’re aware of Bash history, never mind prying eyes, this approach is far from ideal from a security perspective.

Instead, Wget lets you offer this approach to the command line, where you are forced to type the password in at the prompt:

# wget --user=chris --ask-password http://assets.chrisbinnie.tld/logo.jpg

For the other main protocols, you can also throw --ftp-user, --ftp-password, --http-user, and --http-password respectively into the ring.

Denying and Crafting Cookies

If you’ve thought about an obvious HTTP side-effect that causes you headaches, then rest assured that Wget has it covered.

You can either deny all cookies (which collect statistical information and session data for the server) using the --no-cookies option, or alternatively craft them in any which way you want.

Sometimes it’s necessary to carefully craft very specific cookies so that Wget can navigate a web site correctly, imitating a user for all intents and purposes. You can achieve this effect with the --load-cookies option. Simply add a file name as the suffix to that option and you can then effectively spoof being logged into a site by using previously generated cookie data. That means you can move around that web site with ease.
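
As a sketch, assuming you have previously exported session cookies to a hypothetical file called cookies.txt (for example, with --save-cookies during an earlier, authenticated run), something like this would reuse them:

# wget --load-cookies cookies.txt http://www.chrisbinnie.tld/members-area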

Creating Directory Structures

In and among your downloading-and-saving options is the -nd option, otherwise known as --no-directories, which saves everything into a single directory. Its opposite is -x (--force-directories), which always creates a directory structure while pulling in all the followed links; and there’s also -X (--exclude-directories), which lets you skip certain directories via a comma-separated list.

This might be useful for segmenting different data captures, even if it’s only one splash page of a web site, harvested from commands that you have run from a script.

Then let’s not forget the hostname being added to the beginning of the saved directories. The -nH or --no-host-directories flag lets you avoid including that host-named directory altogether.
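
For instance, a recursive fetch like the following (against the hypothetical URL from earlier) saves straight into inner-directory/ rather than www.chrisbinnie.tld/inner-directory/:

# wget -r -nH http://www.chrisbinnie.tld/inner-directory/files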

Precision

And for the pedants amongst you, how about the protocol at the start of those saved directory names? By default the directory is simply the hostname, with no http, https, or ftp mentioned; adding --protocol-directories puts the protocol name in as a directory component too, which keeps content fetched over different protocols neatly apart.

I’ve had a problem in the past relating to how e-mail clients handled the character sets of content that had been created by scripts. You can use the excellent Wget to craft headers to your heart’s content; essentially you end up creating a fake web browser header. This can be achieved as follows:

# wget --header='Accept-Charset: iso-2057' --header='Accept-Language: en-gb' http://www.chrisbinnie.tld

Anyone who has seen their browser complain about the maximum number of redirects being reached will immediately see the point of this option: --max-redirect=. If you add a number to the end, web server redirects will generate an error once that maximum limit is exceeded. The default sits at 20.

Security

A very basic form of security is possible using the HTTP referrer field. How might it be used? A common scenario might be when a webmaster doesn’t want to enable authentication to permit access to a resource, but instead wants a simple level of security added so that people can’t just link directly to that end resource.

One way to do this is to enforce from which page the request came. In other words, you can only access PageTwo.html, for example, through a link clicked on PageOne.html. This loose form of security can be bypassed easily (maliciously or otherwise) by supplying a suitable referrer yourself, with the --referer option (note the spelling, which matches the HTTP header):

# wget --referer=http://www.chrisbinnie.tld/PageOne.html http://www.chrisbinnie.tld/PageTwo.html

Posting Data

Forging onward, you can pass more complex data along the URL as if you were using a dynamic scripting language directly, such as PHP or ASP.

The --post-data= parameter lets you send a string of data, such as a variable name or two with their values set, in the body of the request rather than tacked onto the URL:

# wget --post-data='firstname=chris&secondname=binnie' http://www.chrisbinnie.tld/page.php

There are times, however, when you’ll need to pass on more data, as the Internet appears to wield lengthier URLs by the day. You can do this simply with --post-file= followed by a file name full of HTTP POST data. Very handy, methinks.
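
A brief sketch, assuming a hypothetical file called post_payload holding the URL-encoded form fields:

# wget --post-file=post_payload http://www.chrisbinnie.tld/page.php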

Ciphers

Faced with encrypted connections, you can add the --secure-protocol=name_of_protocol parameter. You can use SSLv2, SSLv3, or the more modern TLSv1 (the successor to SSL).

You can also request “auto” instead, which enables the bundled SSL library to make an informed choice. A headache that I’ve had in the past has been with expired SSL certificates (from badly administered servers) or self-signed certificates occasionally causing unusual responses. The simple fix --no-check-certificate is excellent for scripts and saves the day.
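
Bear in mind that this skips certificate verification entirely, so reserve it for hosts you already trust:

# wget --no-check-certificate https://chris.binnie.tld/anotherfile.zip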

Initialisms

The largest files tend to live on FTP servers so it’s important that Wget handles the access required to download them correctly. I’m sure you won’t fall off your seat if I tell you that it surmounts this obstacle beautifully.

In addition to the --ftp-user=chris and --ftp-password=binnie parameters, the intelligent Wget will default to the password -wget@, which will satisfy anonymous FTP logins (anonymous logins usually expect anonymous as the login and an e-mail address as the password, if you haven’t come across them before).
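
For a password-protected FTP server, the longhand looks like this, using the hypothetical host from earlier:

# wget --ftp-user=chris --ftp-password=binnie ftp://ftp.chrisbinnie.tld/largefile.tar.gz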

You should be aware that you can also incorporate passwords within URLs. However, by embedding passwords inside URLs, along with the simple user and password parameters, any user on the system who runs the ps command to see the process table will sadly, but as if by magic, see your password. Hardly secure, I think you will agree.

You will shortly look at the predetermined config files: .wgetrc or, alternatively, .netrc.

A word of warning before you get there, however. If passwords are anything but of the lowest importance, you should not leave them in plain text in either of the two config files that I just mentioned.

You can always delete them after a file download has initiated. I’m sure with some creative scripting you could decrypt a password, drop it into the Wget configuration file, and then dutifully remove it after the transfer has begun.

If you’ve come across firewalling-meets-FTP headaches in the past, I’m sure that you have heard of passive mode. You can work around some of these headaches by chucking the --no-passive-ftp option at an FTP transfer to force active mode, although client-side firewalls usually prefer that passive mode be left enabled.

If your Wget client encounters filesystem symlinks (symbolic links are a little like shortcuts on Windows, for those who don’t know), then sometimes it won’t retrieve the actual file but will instead recreate the symlink only. As you might expect by now, however, the mighty Wget lets you decide the fate of these shortcuts with --retr-symlinks, which pulls down the real file that the link points to.

Directories that are symlinked, however, are currently too painful for Wget to recurse through, so take heed in case that causes eyestrain.

I have already mentioned the recursive -r option for harvesting subdirectories. A caveat is that sometimes, by following links, you inadvertently end up with gigabyte upon gigabyte of data. Alternatively, there’s a chance that you will end up with tens of thousands of tiny files (images, text files, stylesheets, and HTML) that cause administration headaches. You can explicitly specify how deep to trawl into the depths of the subdirectories with -l or --level=5; set at five levels in this case.
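
Revisiting the recursive example from earlier, capped at five levels deep:

# wget -r --level=5 http://www.chrisbinnie.tld/inner-directory/files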

Proxies

Another scenario might be as follows. You have built a proxy server that sits in between the web and your hundreds of users. You have tested it thoroughly and it’s working fine but you want to fill the cache with popular web pages before hundreds of users start using the proxy server and stressing it with their sudden increase in load.

You have a whole weekend before Monday morning arrives so you decide to start copying and pasting a long list of popular URLs that will almost definitely be viewed by your users (as well as banning a few inappropriate ones). You know that you can fill a file and feed it into Wget in a variety of ways, but if you just want to visit the web sites to populate the proxy’s cache, you definitely don’t want to keep all of the data that you have downloaded on the machine visiting them, just within the proxy.

Wget handles this with perfect simplicity using the --delete-after flag. It visits the sites in a truly normal fashion (set up carefully, almost like a fake web browser if required, passing the correct User Agent parameters) and then, once each page has been pulled through the proxy, simply purges the data that it collected locally.

Fear not—this won’t incur dressing-downs thanks to the accidental deletion of files on any FTP servers you visit. It is quite safe.
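
A hedged sketch of that weekend job, assuming the proxy settings live in your .wgetrc (as shown later in this chapter) and that a hypothetical file called popular_urls holds the list, one URL per line; the -nd flag avoids littering the filesystem with throwaway directories:

# wget -r -nd --delete-after --input-file=popular_urls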

Mirroring

If you wanted to keep a full-fledged backup of your own web site locally, you could use Wget to scrape the content via a daily Cron job and use the -m or --mirror option.

You immediately have recursive downloading enabled by taking this approach, as well as the directory depth set to “infinite” and timestamping enabled. For future reference, should it help, according to the man page for Wget the mirroring feature turns on -r, -N, and -l inf along with --no-remove-listing in one fell swoop.
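
A minimal example suitable for that daily Cron job, once again against the hypothetical domain:

# wget --quiet --mirror http://www.chrisbinnie.tld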

Downloading Assets

Sometimes when you’re scraping a web page, the results can be disappointing. As discussed, you are effectively creating a fake browser and Wget can’t possibly be expected to live up to a fully functional browser, especially with its tiny filesize footprint.

To grab all the elements that comprise a web page, you might need to enable -p or --page-requisites to fix this.

Within reason it makes perfect sense because the reality of today’s web is one of stylesheets and highly graphical content.
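
For instance, this pulls in the inline images, stylesheets, and scripts needed to render the page properly offline:

# wget --page-requisites http://www.chrisbinnie.tld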

One other useful feature that’s worth mentioning means that (because Windows isn’t usually case sensitive to file names but Unix-like operating systems are) you can choose to not pay attention to somewhat challenging, okay then irritating, file names with --ignore-case.

Persistent Config

As I mentioned, you can specify many of these parameters at runtime or from within your local user’s config file. This example uses .wgetrc and not .netrc.

You might always want to enforce a download quota in kilobytes, for instance. And, you know that your remote file target is on an unreliable server, so you always need a large number of retries before you get it right. Here’s a sample .wgetrc which, in my case, would live in /home/chrisbinnie/.wgetrc.

Sample .wgetrc
============

quota = 500k # Download quota in kilobytes, not megabytes
tries = 50 # Lots of retries
reclevel = 2 # Directory levels to delve down into
use_proxy = on # Enable the proxy
https_proxy = http://proxy.chris.tld:65530/ # Set up a different proxy for each protocol
http_proxy = http://proxy.chris.tld:65531/
ftp_proxy = http://proxy.chris.tld:65532/
wait = 2 # Wait two seconds before hammering the bandwidth again
httpsonly = off # Set to on so that recursive mode only follows HTTPS links, for security
secureprotocol = auto # Finely tweak your preferred encryption protocol version

Clearly, there is a massive number of other options that can be included in this file. I will let you use your imagination and guile to pick and choose the ones relevant to your precise requirements. Trial and error should help greatly. One thing I love about small Linux utilities is not having to wait for a full iteration. You just press Ctrl+C if you want to interrupt a launch mid-flow.

A Note on Criminal Activity

On to a more serious note now.

I once spent a great deal of time securing a public-facing web server and, to my frustration, found it very difficult to avoid installing an alternative to Wget called Curl.

Why did I put so much effort into avoiding such a fantastic tool being installed on my server? For the simple reason that if a small breach of my server occurred, I knew that I might be in serious trouble.

I was most concerned with the Apache web server user on the system (www-data, httpd, or even apache, as the Apache username has been known in the past) being exploited. I knew that Wget and Curl could be used to pull nefarious data down onto the server, which could almost certainly then be executed to take control of my server: the worst possible scenario.

You might ask why I was so sure of this threat. The simple answer is that I had witnessed precisely that type of attack in the past. If you’re familiar with Apache, then you’ll know that as standard, without adjusting configuration settings, any hits to your web server are logged to the file /var/log/apache2/access.log. This file name changes a little, but only subtly, depending on the distribution.

While all the hits to your server end up there, any errors or informational messages are written to the /var/log/apache2/error_log file. Obviously it makes administration much easier to split the two elements into distinct files.

Having been asked to help recover one compromised server, I found, to my absolute horror, a line beginning with Wget in Apache’s error log.

It was quite easy to spot because there was an extraordinary lack of timestamps at the start of each line. There were also some nonsensical characters thrown in on the lines around the handful of entries of Wget output.

It showed that this server had connected remotely to a dodgy-looking web site and the appended file name was seemingly a few random characters to disguise it. Lo and behold, this server had been compromised through a popular PHP forum application and the automated Apache user had unsuspectingly logged its activities within the error log. The sloppy attacker had not cleaned up the log files, despite gaining superuser access (so they had full permissions to tidy up and disguise their break-in route).

After some digging, I soon discovered a very nasty rootkit, which was configured to send any users’ logins and passwords back to a foreign country at midnight every night. And, just in case the sysadmin for that server got wise and tried to clean the infection, there was a secondary SSH daemon running on a hidden port that didn’t show up in netstat (but it did in lsof), thanks to the netstat binary being overwritten with an infected version.

Don’t get me wrong—this attack was a success because of a bug in a PHP application and was far from Wget’s fault. It’s a word of warning for those new to security, however.

The more tools such as compilers and innocuous downloaders you leave on a server, the easier that server is to take control of even after a relatively minor breach has taken place. And, that could easily mean the difference between no extra work and three full days of rebuilding a server from scratch. If you’re like me, you don’t like making extra work for yourself.

Summary

A final word of warning: don’t try to scrape content illegally. As I have demonstrated, it is a very straightforward process with Wget (and Curl, come to that) to take a copy of public-facing content and save it to your local drive, without breaking a sweat. However, it should go without saying that you should be aware of potential copyright headaches. At the risk of repeating a commonly spoken phrase in sysadmin circles: “With great power...”.

The aim of this chapter was to stimulate some interest in the superhero that is Wget; an exceptional tool that is sometimes taken for granted. Put to good use for both diagnostic purposes and a multitudinous variety of scripts, Wget is a fantastic weapon that should be included in any sysadmin’s arsenal. With some creative thinking, I trust that you will enjoy experimenting with Wget as much as I have.
