Protecting Moodle from unwanted search bots

Every request to our Moodle website generates traffic from the requesting computer to the server and back. The web server consumes CPU, memory, and other machine resources to generate and deliver the requested content to the user. As with anything else in the physical world, these resources are finite and using them costs money. We want to spend our platform's resources only on desirable, legitimate requests. Since there is no functional difference between page requests made by automated software and those made by live users, we need to be aware that giving free access to anybody will bring higher maintenance costs, open a potential security hole, and most likely degrade the performance and availability of the website. This means that we must know what kind of requests legitimate and illegitimate bots can make on our Moodle platform.

Search engines

Publicly available websites used for searching the World Wide Web are generally referred to as search engines. The most widely known example is Google (there are, of course, others such as Bing, Yahoo, Baidu, Yandex, and AltaVista). Most search engines employ Internet bots that scrape keywords from raw HTML content and build a data cache, which is later used to generate the list of results for a user's search request. To determine whether our site is indexed by a search engine, we can use the site: keyword. For example, typing site:www.packtpub.com in Google will produce a list of all indexed pages for that site.

That is the easiest way to determine the level of exposure of your website on Google, and the process is similar with the other search engines.
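The same query can also be prepared from a script. The sketch below only builds the search URL and sends nothing; it assumes Google's public search address and its q parameter, and the helper name site_query_url is made up for this illustration.

```python
from urllib.parse import urlencode


def site_query_url(domain: str) -> str:
    """Build a Google search URL that lists the indexed pages of one domain."""
    # The site: operator is followed directly by the domain, with no space.
    query = urlencode({"q": f"site:{domain}"})
    return "https://www.google.com/search?" + query


print(site_query_url("www.packtpub.com"))
```

Opening the printed URL in a browser shows the same result list as typing the query by hand.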

Moodle and search engines

Moodle has basic support for several public search engines: Google, Yahoo, AltaVista, and Bing. For intranet use, Moodle also supports the Zoom search engine from WrenSoft.

This means that if we configure Moodle to permit external indexing, these search engines will be able to scan the publicly available parts of the site without needing to log in.

To let search engines index the front page and any other publicly available resource, go to Administration | Security | Site policies and check the option Open to Google.

Even though the option refers to Google only, it actually applies to all supported search engines.

Moodle access check

Moodle has a well-defined process for granting or denying access to a user. It is not well documented, and since it is important to understand how the access check actually works, we will explain it in depth here. As a result, you will know how to configure your system properly and open just the parts you want to expose to the general public.

By default, a Moodle instance permits anybody to enter the front page and any activity, resource, or course that permits free entrance to guest users.

Note

To open a specific course to guests, visit the course settings page, <course short name> | Edit course settings, and in the Availability section set Guest access to Allow guests without key. Anybody will then be able to enter that course and browse its content as a guest user.

Guest access in Moodle is permitted by default. You can configure this feature under Administration | Users | Authentication | Manage authentication, where the Guest login button can be set to Show or Hide.

One final option completes the set that controls how open your Moodle is: Force users to login. With that option enabled, nobody can access anything within the platform without logging in.

And here is the decision process that Moodle makes when a user wants to access the platform:

Force users to log in | Guest login button | Open to Google | Security level
Yes | Hide | No | This is the highest security level. Nobody can enter the site without valid credentials, and no bot will be able to enter the site. This is the recommended setting for any Moodle instance hosted on the Internet.
Yes | Hide | Yes | With this configuration, only users with valid credentials and allowed search engines can enter the website. A search engine bot will be able to see anything on the front page and any course marked for guest access. This is the recommended configuration when we want search engines to index publicly available content. (There are some additional security concerns that should be considered; more on that a bit later.)
Yes | Show | No | This is a misleading configuration because guest access is enabled while search engines are disabled. It does not actually stop any bot from activating the Guest login button and entering the site as a guest user. This is therefore not a recommended setting.
Yes | Show | Yes | This configuration still asks users to log in but enables both guest access and permitted search engines. Not recommended if you do not want to expose your site to the public eye.
No | Show | Yes | This row and the next two rows share one description: these configurations leave the front page of the site open for anybody to see, together with any content open for guest access. They are generally not recommended for public websites and are more appropriate for internal intranet use.
No | Show | No | (same as No | Show | Yes)
No | Hide | Yes | (same as No | Show | Yes)
No | Hide | No | This is the most appropriate configuration for opening the front page to search engines without forcing users to log in and without giving any additional access within the site.
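To make the table easier to consult, the eight combinations can be written as a small lookup. This is only a reading aid for the table, not Moodle's code; the function name security_note and the shortened descriptions are our own.

```python
# Key: (force users to log in, guest login button shown, open to Google).
# Values are shortened paraphrases of the table rows, not Moodle output.
INTRANET_NOTE = "front page and guest content open to anybody; better suited to intranets"

SECURITY_NOTES = {
    (True, False, False): "highest security; recommended for Internet-hosted sites",
    (True, False, True): "recommended when search engines should index public content",
    (True, True, False): "misleading; bots can still use the Guest login button",
    (True, True, True): "not recommended unless the site is meant to be public",
    (False, True, True): INTRANET_NOTE,
    (False, True, False): INTRANET_NOTE,
    (False, False, True): INTRANET_NOTE,
    (False, False, False): "front page open without login, no additional access",
}


def security_note(force_login: bool, guest_button_shown: bool, open_to_google: bool) -> str:
    """Return the table's verdict for one combination of the three settings."""
    return SECURITY_NOTES[(force_login, guest_button_shown, open_to_google)]


print(security_note(True, False, False))
```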

There are serious security concerns related to the Open to Google setting. The reason for this is the way Moodle checks whether a permitted search engine bot is trying to access the site. Every HTTP request contains some information about the client. This information is formed through so-called HTTP headers.

Note

In the HTTP protocol, header fields carry the operating parameters of a request or response. Through headers we tell the web server which resource (page) we want, how we identify ourselves, and so on.

The HTTP header used for identifying the source of the request is called User-Agent. It is intended to identify the client software that is making the request. All web browsers and well-behaved bots send standardized User-Agent values.
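To see where this value lives, here is a minimal sketch that pulls the headers out of a raw HTTP request. The request text, the host moodle.example.com, and the client name ExampleBrowser/1.0 are invented for the illustration, and the parser is deliberately simplistic.

```python
# A made-up request as it would travel over the wire.
RAW_REQUEST = (
    "GET /index.php HTTP/1.1\r\n"
    "Host: moodle.example.com\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)


def parse_headers(raw: str) -> dict:
    """Return the request headers as a dict keyed by lowercase header name."""
    head = raw.split("\r\n\r\n", 1)[0]   # everything before the blank line
    header_lines = head.split("\r\n")[1:]  # skip the "GET ..." request line
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


print(parse_headers(RAW_REQUEST)["user-agent"])
```

The server sees nothing more than this text, so the User-Agent value is whatever the client chose to write there.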

The problem, of course, is that it is extremely easy to modify this value within your browser or web bot and thus impersonate another browser or bot. For example, to let the Google bot into the site, Moodle checks whether the User-Agent header contains the value Googlebot. If it does, the request is allowed and the bot is automatically logged into the platform as a guest user.
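The weakness of such a check is easy to see in a small model. The following is not Moodle's code, only a Python sketch of a case-sensitive substring test like the one described; the function name and the forged string are hypothetical, and Googlebot is the only token taken from the text.

```python
# Tokens that mark a request as coming from an allowed search engine bot.
ALLOWED_BOT_TOKENS = ("Googlebot",)


def looks_like_allowed_bot(user_agent: str) -> bool:
    """Case-sensitive substring test on the User-Agent header value."""
    return any(token in user_agent for token in ALLOWED_BOT_TOKENS)


genuine = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
forged = "Mozilla/5.0 (Windows; U; Windows NT 6.1) Firefox/3.6 Googlebot"

# Both strings pass, because the check only looks at text the client supplies.
print(looks_like_allowed_bot(genuine), looks_like_allowed_bot(forged))
```

Nothing in the test ties the request to Google itself, which is why a forged value is treated exactly like a genuine one.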

To demonstrate how easy it is to get past this protection, I will use Mozilla Firefox 3.6, which lets you customize the User-Agent value. To do that, type about:config in the location bar and enter agent in the filter field.

Locate the key general.useragent.extra.firefox and double-click it to set the value Googlebot (it is case-sensitive). Restart the browser and open a Moodle site that is configured to allow search engine bots. You will notice that you are logged in as a guest user just by opening the front page.
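A browser is not even needed for this. As a rough sketch, the same header can be set on a request prepared with Python's standard urllib; the address moodle.example.com is a placeholder for a test site you control, and the request is only built here, not sent.

```python
from urllib.request import Request


def spoofed_request(url: str, user_agent: str = "Googlebot") -> Request:
    """Prepare (but do not send) a request carrying a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = spoofed_request("http://moodle.example.com/")

# urllib stores header names capitalized, so the key reads "User-agent".
print(req.get_header("User-agent"))
```

Passing the prepared request to urllib.request.urlopen would send it with the altered header.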

This clearly demonstrates how easy it is to bypass this protection. Therefore, whenever possible, configure your Moodle with the recommended highest security settings (see table).
