Running jobs in a distributed environment

As you should be more than familiar with jobs by now, you might have implemented them in your own application. Take cleanup jobs, for example: weekly or hourly jobs that clear leftover or unneeded data that has accumulated in the database over time. You have developed your application as usual, added this neat cleanup job, and deployed it to a multi-node system. Now your job runs on every node at the same time, which is most likely not what you wanted. At best this wastes application resources; at worst it leads to locking problems, as several nodes try to access the same database resources at the same time.

This example uses the standard cache functionality to overcome this problem.

The source code of the example is available at examples/chapter7/distributed-jobs.

Getting ready

All you need is an application running on two nodes, where both nodes use the same memcached instance for caching. The memcached configuration is already present, commented out, in the standard application.conf file.
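For reference, enabling the shared cache in conf/application.conf looks like the following; the host and port are placeholders for your own memcached instance, and both nodes must point at the same server:

```properties
# Switch from the local in-JVM cache to memcached
memcached=enabled
# Address of the shared memcached instance (placeholder host/port)
memcached.host=127.0.0.1:11211
```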

How to do it...

An easy approach is to define an abstract job that makes sure it runs on only one node, but leaves the actual logic to the concrete implementation of the job:

import play.Logger;
import play.Play;
import play.cache.Cache;
import play.jobs.Job;

public abstract class MultiNodeJob extends Job {

   protected final String cacheName = this.getClass().getSimpleName();

   public abstract void execute();

   public void doJob() {
      boolean isRunning = Cache.get(cacheName) != null;
      if (isRunning) {
         Logger.debug("Job %s is already running", cacheName);
         return;
      }
      // Possible race condition here
      Cache.safeSet(cacheName, Play.id, "10min");

      try {
         execute();
      } finally {
         // This node set the entry above, so it removes it again
         Cache.safeDelete(cacheName);
      }
   }
}

The concrete implementation boils down to a standard job now:

@Every("10s")
public class CleanUpJob extends MultiNodeJob {
   
   @Override
   public void execute() {
      Logger.debug("Doing my job of all greatness");
      // do something long running here
   }
}

How it works...

In order to create jobs that run on only one host out of an arbitrary number of application nodes, a helper class is needed. The MultiNodeJob class checks whether the cache contains an element named after the job's short class name. If it does, no execution happens. If it does not, the element is set so that no other node will execute this job.

The concrete implementation of any job consists only of the code to execute and the definition of how often it should run; it does not include any check whether another node is already running the job. It only needs to inherit from MultiNodeJob.

As seen in the preceding code, Cache.safeSet() is used because it waits until the set has been executed before going on; you could change this to set(). Far more important is the expiration time of the key: if a node crashes while running the job, the job might otherwise never be executed again, because no node would explicitly remove the key from the cache.

As the comment in the code already indicates, this solution is not one hundred percent correct if you really need to make sure the job runs exactly once at a time. If two nodes query the cache for the non-existent element at the same moment, both nodes will execute the job. This is a classical locking problem, which is hard to solve in multi-node systems and cannot be solved with the current setup.
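The race exists because the check and the set are two separate cache calls. It disappears if the cache offers an atomic "add only if absent" operation, as memcached does. The semantics can be illustrated in plain Java with ConcurrentHashMap.putIfAbsent; note that this is only a single-JVM sketch of the idea, not a distributed implementation:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AtomicLockSketch {

    // Stands in for the shared cache; in production this would be memcached
    static final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    // Returns true only for the single caller that acquired the lock:
    // putIfAbsent is atomic, so exactly one concurrent caller sees null
    static boolean tryAcquire(String jobName, String nodeId) {
        return cache.putIfAbsent(jobName, nodeId) == null;
    }

    public static void main(String[] args) {
        boolean first = tryAcquire("CleanUpJob", "node-1");
        boolean second = tryAcquire("CleanUpJob", "node-2");
        System.out.println(first + " " + second); // true false
    }
}
```

With the two-step get/set replaced by a single atomic operation, only one node can ever win the race for a given job name.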

This problem gets more pronounced if you run jobs at a specific time via the @On annotation. Most likely the clocks of your servers are synchronized with an NTP time server, which means the jobs will start at pretty much the same time on every node.

With the @Every annotation the problem blurs a little, because the interval between two runs is actually the waiting time defined in the annotation plus the execution time. So if one node executes the job and another does not, their schedules will drift apart significantly over time. If you do not want this behavior at all, use the @On annotation instead.

There's more...

Though this is a nice solution, it is not perfect. Here are some possible areas for improvement.

Solving the locking problem

You cannot solve the locking problem explained above with the current cache implementation, but you can work around it. The easiest solution is to allow only one node to execute jobs: by setting the configuration property play.jobs.pool=0, no thread pool for jobs is created and no jobs, except the synchronous ones on application startup, are executed. Do not forget to configure exactly one node to execute jobs; this is easily done with a specialized framework ID on that node. Another possible, but more complex, solution is to go event-based: a job triggers some event-based software such as ActiveMQ, which in turn decides which node should execute the job. This carries considerable engineering overhead, so it might be preferable to live with the problem, or to integrate a message queue only if really needed. You might also create a better cache implementation (or a special implementation just for locking) that acts as a single locking authority.
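Framework-ID-specific settings in application.conf are written with a %id. prefix, so the one-node-runs-jobs setup can be sketched like this (jobrunner is an example ID; the node that should execute jobs is started with that framework ID, for example via play run --%jobrunner):

```properties
# Default for all nodes: no job thread pool, scheduled jobs are not executed
play.jobs.pool=0
# Only the node started with the 'jobrunner' framework ID runs jobs
%jobrunner.play.jobs.pool=10
```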

Changing cache clear times

It might not be a good idea to hard-code the cache expiry time in MultiNodeJob. If a job is scheduled every few seconds but stalls until the hard-coded entry expires because one node crashed without clearing it, you might see unexpected results. Instead, you could take the time duration used in the @Every annotation and double or triple it. This would make sure the job is blocked for only a short period of time.
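Deriving the expiry from the job interval can be sketched as follows. Play itself parses duration strings such as "10s" with its own utilities, but to keep the example runnable without the framework, a tiny parser for the s/mn/h suffixes is included here; it is an illustrative assumption, not Play's actual implementation:

```java
public class ExpiryFromInterval {

    // Minimal parser for Play-style durations like "10s", "5mn", "2h";
    // returns the value in seconds (hypothetical helper for this sketch)
    static int parseSeconds(String duration) {
        if (duration.endsWith("mn")) {
            return Integer.parseInt(duration.substring(0, duration.length() - 2)) * 60;
        } else if (duration.endsWith("h")) {
            return Integer.parseInt(duration.substring(0, duration.length() - 1)) * 3600;
        } else if (duration.endsWith("s")) {
            return Integer.parseInt(duration.substring(0, duration.length() - 1));
        }
        throw new IllegalArgumentException("Unsupported duration: " + duration);
    }

    // Cache expiry as three times the job interval, e.g. "10s" -> "30s",
    // so a crashed node blocks the job for at most three missed runs
    static String expiryFor(String interval) {
        return (3 * parseSeconds(interval)) + "s";
    }

    public static void main(String[] args) {
        System.out.println(expiryFor("10s")); // 30s
        System.out.println(expiryFor("5mn")); // 900s
    }
}
```

The expiry string computed this way would then be passed to Cache.safeSet() instead of the hard-coded "10min".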
