First, let's define what a CRON job is. The term cron originally referred to a time-based job scheduler in Unix that lets you schedule jobs or scripts to run periodically at specific times. The same concept can be applied to web requests, and in our case, the goal is to run our web scraper and update the data in our database periodically, without any intervention on our part. Another reason GAE is so convenient is how easy the platform makes scheduling CRON jobs. To do so, we simply need to create a cron.xml file in the /war/WEB-INF/ directory of our GAE project. In this XML file, we add the following code:
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/videoGameScrapeServlet</url>
    <description>Scrape video games from Blockbuster</description>
    <schedule>every day 00:50</schedule>
    <timezone>America/Los_Angeles</timezone>
  </cron>
</cronentries>
This is pretty self-explanatory. First, we define a root tag named <cronentries>, and within it, we can insert any number of <cron> tags, each one denoting a scheduled process. In each <cron> tag, we need to tell the scheduler which URL we want to hit (this is relative to the root URL, of course), as well as the schedule itself (in our case, every day at 12:50 A.M.). Optional tags include a description tag, a timezone tag, and a target tag that lets you specify which version of your GAE project should handle the specified URL.
Now, in my case, I asked the scheduler to run the job every day at 12:50 A.M. (Pacific time), but examples of other schedule formats are as follows:
every 12 hours
every 5 minutes from 10:00 to 14:00
2nd,third mon,wed,thu of march 17:00
every monday 09:00
1st monday of sep,oct,nov 17:00
every day 00:00
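To see how several entries, an interval-style schedule, and the optional target tag might fit together in one file, here is a minimal sketch. The first entry is the one from our example; the second is purely illustrative, and its URL (/cleanupServlet) and version name (version-2) are placeholders I made up, not part of this project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <!-- The entry from our example: runs the scraper once a day -->
  <cron>
    <url>/videoGameScrapeServlet</url>
    <description>Scrape video games from Blockbuster</description>
    <schedule>every day 00:50</schedule>
    <timezone>America/Los_Angeles</timezone>
  </cron>
  <!-- Hypothetical second entry: the URL and version name are placeholders -->
  <cron>
    <url>/cleanupServlet</url>
    <description>Illustrative job sent to a specific app version</description>
    <schedule>every 12 hours</schedule>
    <target>version-2</target>
  </cron>
</cronentries>
```

Each <cron> entry is scheduled independently, so adding a second job does not affect the first one.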
I won't go into the exact syntax of the schedule format, but you can see that it's pretty intuitive. For those of you who would like to learn more about CRON jobs in GAE or look at some of the less commonly used features, the following URL gives a comprehensive overview:
http://code.google.com/appengine/docs/java/config/cron.html
But as far as our example goes, what we did previously will suffice and so we'll stop here!