Designing the clone

Cloning TinyURL is relatively simple but there is some thought behind the design of the application. We will be building a clone of TinyURL called Tinyclone, which will be hosted at the domain http://tinyclone.saush.com.

Creating a short URL for each long URL

The domain of the short URL is fixed. What's left is the file pathname. We need to represent the long URL with a unique file pathname (a key), one for each long URL. This means we need to persist the relationship between the key and the URL.

One of the ways we can associate the long URL with a unique key is to hash the long URL and use the resulting hash as the unique key. However, the resulting hash might be long and hashing functions could be slow.

The faster and easier way is to use a relational database's auto-incremented row ID as the unique key. The database will help ensure the uniqueness of the ID. However, the running row ID number is base 10. To represent a million URLs would already require seven characters, to represent 1 billion would take up nine characters. In order to keep the number of characters smaller, we will need a larger base numbering system.

In this clone we will use base 36, which is 26 characters of the alphabet (case insensitive) and 10 numbers. Using this system, we will only need five characters to represent 1 million URLs:

1,000,000 base 36 = lfls

And 1 billion URLs can be represented in just six characters:

1,000,000,000 base 36 = gjdgxs

Automatically redirecting from a short URL to a long URL

HTTP has a built-in mechanism for redirection. In fact it has a whole class of HTTP status codes to do this. HTTP status codes that start with 3 (such as 301, 302) tell the browser to go look for that resource in another location. This is used in the case where a web page has moved to another location or is no longer at the original location. The two most commonly used redirection status codes are 301 Move Permanently and 302 Found.

301 tells the browser that the resource has moved away permanently and that it should look at that location as the permanent location of the resource. 302 on the other hand (despite its name) tells the browser that the resource it is looking for has moved away temporarily.

While the difference seems trivial, search engines (as user agents) treat the status codes differently. 301 tells the search engines that the short URL's location has moved permanently away to the long URL, so credit for the long URL goes to the long URL. However, because 302 only tells the search engine that the location has moved temporarily, the credit goes to the short URL. This can cause issues with search engine marketing accounting.

Obviously in our design we will need to use the 301 Moved Permanently HTTP status code to do the redirection. When the short URL http://tinyclone.saush.com/singapore-flyer is requested, we need to send a HTTP response:

HTTP/1.1 301 Moved Permanently
Location: http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=singapore+flyer&vps=1&jsv=169c&sll=1.352083,103.819836&sspn=0.68645,1.382904&g=singapore&ie=UTF8&latlng=8354962237652576151&ei=Shh3SsSRDpb4vAPsxLS3BQ&cd=1&usq=Singapore+Flyer
Content-Type: text/html
Content-Length: 235
<html>
<head>
<title>Moved</title>
</head>
<body>
<h1>Moved</h1>
<p>This page has moved to <a href="http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=singapore+flyer&vps=1&jsv=169c&sll=1.352083,103.819836&sspn=0.68645,1.382904&g=singapore&ie=UTF8&latlng=8354962237652576151&ei=Shh3SsSRDpb4vAPsxLS3BQ&cd=1&usq=Singapore+Flyer">Singapore Flyer</a>.</p>
</body>
</html>

Providing a customized short URL

Providing a customized short URL with the above design we had in mind before makes things less straightforward. Remember that our design uses the database row ID in base 36 as the unique key. To customize the short URL we cannot use this database row ID, so the customized short URL needs to be stored separately.

In Tinyclone we store the customized short URL in a separate secondary table called Links, which in turn points to the actual data in a table called Url. When a short URL is created and the user doesn't request a customized URL, we store the database record ID from the Url table as a base-36 string into the Links table. If the user requests a customized URL, we store the customized URL instead of the record ID.

A record in the Links table therefore maps a string to the actual record ID in the URL table. When a short URL is requested, we first look into the secondary table, which in turn points us to the actual record in the primary table.

Filtering undesirable words out

While we could use more complex filtering mechanisms, URL shorteners are simple web applications, so we stick to a simpler filtering mechanism. When we create the secondary table record, we compare the key with a list of banned words loaded in memory on startup.

If it is a customized short URL and the word is in the list, we prevent the user from using it. If the key was the actual record ID in the primary table, we create another record (therefore using another record ID) to store the URL.

What happens if the new record ID is also coincidentally in the banned words list? We'll just have to create another one recursively until we find one that is not in the list. There is a probability that two or more consequent record IDs are in the banned words list, but the frequency of it happening is low enough that we don't need to worry about it.

Previewing the long URL

This feature is simple to implement. When the short URL preview function is called, we will show a page that displays the long URL instead of redirecting to that page. In Tinyclone we lump the preview long URL page together with the statistics information page.

Providing statistics

To provide statistics for the usage of the short URL, we need to store the number of times the short URL has been clicked and where the user is coming from. To do this, we create a Visits table to store the number of times the short URL has been visited. Each record in the Visits table stores the information about a particular visit, including its date and where the visitor comes from.

We use the environment variable from the server called REMOTE_ADDR to find out where the visitor comes. REMOTE_ADDR provides the remote IP address of the client that is accessing the short URL. We use this IP address with an IP geocoding provider to find the country that the visitor comes from then store the country code as well as the IP address.

Collecting the data is only the first step though. We will need to display it properly. There are plenty of visualization APIs and tools in the market; many of them are freely available. For the purpose of this chapter, we have chosen to use the Google Charts API to generate the following charts:

  • A bar chart showing the daily number of visits
  • A bar chart showing the total number of visits from individual countries
  • A map of the world visualizing the number of visits from individual countries

Note

You might notice that in our design the user does not need to log into this application to use it. Tinyclone is the only clone in the book that does not have any access control on its pages. Most URL shorteners have a public and main feature that redirects short URLs to their original, long URLs. In addition to that some URL shorteners have user-specific access controlled pages that provide information to the users such as the statistics and reporting feature shown above. However, in this clone we will not be implementing any access controlled pages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.220.22