Friday, August 1, 2014

APScheduler 3.0 released

The first final version of APScheduler's 3.0 branch has been released.

For the uninitiated, APScheduler is a task scheduling and management system written in Python. Thinking of it as a cron/at daemon running inside your application is not far off, but APScheduler also provides management and monitoring of jobs, and much more. And of course it runs Python code instead of shell commands.

If one were to compare APScheduler with Celery, the difference could be summarized like this: Celery is a distributed task queue with basic scheduling capabilities, while APScheduler is a full-featured scheduler with basic task queuing capabilities. Users tell me that APScheduler is easier to set up. I haven't personally used Celery, so I can't comment on that.

The 3.0 update brings many new features and enhancements, albeit at the cost of a backward-incompatible API. Virtually all of the feature requests from 2.x have been fulfilled. A migration guide is also provided to help 2.x users move to 3.0.

Performance improvements

Probably the most important change in 3.0 concerns the job stores. In previous versions, all the job stores cached all their jobs in memory. This was to eliminate the overhead of fetching them from the backend (file or database). That would've been fine with a small number of jobs, but when use cases started popping up that required thousands upon thousands of jobs, it became a severe problem. So starting with 3.0, persistent job stores no longer keep the jobs in memory, but instead rely on backend-specific mechanisms (such as indexes) to efficiently fetch due jobs. This will greatly help reduce the memory footprint of applications that need to handle large numbers of jobs.
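
To illustrate the idea of letting the backend pick out the due jobs, here is a minimal sketch using the standard library's sqlite3 module. It is not APScheduler's actual job store code; the table layout and the function names (create_store, add_job, get_due_jobs) are made up for this example.

```python
import sqlite3


def create_store():
    # In-memory database standing in for a persistent backend (illustration only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, next_run_time REAL)")
    # The index lets the backend locate due jobs without scanning every row.
    conn.execute("CREATE INDEX ix_jobs_next_run_time ON jobs (next_run_time)")
    return conn


def add_job(conn, job_id, next_run_time):
    conn.execute("INSERT INTO jobs VALUES (?, ?)", (job_id, next_run_time))


def get_due_jobs(conn, now):
    # Only the jobs that are actually due get loaded into memory.
    rows = conn.execute(
        "SELECT id FROM jobs WHERE next_run_time <= ? ORDER BY next_run_time",
        (now,),
    )
    return [row[0] for row in rows]


conn = create_store()
add_job(conn, "cleanup", 100.0)
add_job(conn, "report", 200.0)
add_job(conn, "backup", 300.0)
print(get_due_jobs(conn, now=250.0))  # ['cleanup', 'report']
```

The application never holds the full job list; it asks the backend for the due jobs on each wakeup.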

Time zone support

One of the most frequent complaints about APScheduler was that it always operated in the host's local time. Many users would've preferred it to always use the UTC timezone instead. Now, in 3.0, all datetimes are timezone aware. The scheduler has a set timezone which defaults to the local timezone, but can easily be set to, say, UTC instead. Individual jobs can also be scheduled with different timezones if necessary.
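
As a rough illustration of what timezone-aware scheduling means, the sketch below normalizes a job's run time to a scheduler-wide timezone using only the standard library. The helper to_scheduler_time and the fixed UTC+3 "host" zone are assumptions made for the example, not part of APScheduler.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical host timezone (fixed UTC+3 offset), used purely for illustration.
host_tz = timezone(timedelta(hours=3))


def to_scheduler_time(run_time, scheduler_tz=timezone.utc):
    # A naive datetime carries no zone, so it cannot be converted reliably.
    if run_time.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return run_time.astimezone(scheduler_tz)


local_run = datetime(2014, 8, 1, 12, 0, tzinfo=host_tz)
utc_run = to_scheduler_time(local_run)
print(utc_run.isoformat())  # 2014-08-01T09:00:00+00:00
```

Both values refer to the same instant; only the zone they are expressed in differs, which is what lets individual jobs use different timezones safely.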

Integration with asynchronous event loops

APScheduler now integrates with several widely used asynchronous application frameworks. The integration involves, at a minimum, the use of the event loop's built-in delayed execution mechanism. This avoids the use of a dedicated thread for the scheduler. With some frameworks, the integration can even provide a custom default executor (more on those in the next section) that runs the jobs in a built-in thread pool or similar.
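
The "delayed execution mechanism" idea can be sketched with asyncio's call_later(). This is only an illustration of the principle under my own assumptions; run_once_after and greet are hypothetical names, and the real scheduler classes do considerably more.

```python
import asyncio


async def run_once_after(delay, func, *args):
    loop = asyncio.get_running_loop()
    done = loop.create_future()

    def fire():
        # Runs inside the event loop, so no dedicated scheduler thread is needed.
        try:
            done.set_result(func(*args))
        except Exception as exc:
            done.set_exception(exc)

    # Ask the loop itself to wake us up when the job is due.
    loop.call_later(delay, fire)
    return await done


def greet(name):
    return "hello " + name


result = asyncio.run(run_once_after(0.01, greet, "scheduler"))
print(result)  # hello scheduler
```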

Pluggable executor system

The built-in thread pool from the previous versions has been replaced with a pluggable executor system. Each scheduler subclass can specify its own default executor. For example, GeventScheduler uses a gevent-specific executor that spawns jobs as greenlets. An executor based on the PEP 3148 (concurrent.futures) thread pool is used as the default executor on most scheduler subclasses.

While the thread pool in APScheduler 2.x was supposedly replaceable, using a process pool as a replacement didn't work in practice. This has been rectified by providing an officially supported ProcessPoolExecutor.
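
Here is a small sketch of what "pluggable" means in practice: the scheduler only needs something with a submit() method, so different execution strategies can be swapped in by name. TinyScheduler and InlineExecutor are hypothetical classes written for this post, not APScheduler's own executor API.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class InlineExecutor:
    """Hypothetical executor that runs the job in the calling thread."""

    def submit(self, func, *args):
        future = Future()
        try:
            future.set_result(func(*args))
        except Exception as exc:
            future.set_exception(exc)
        return future


class TinyScheduler:
    """Toy scheduler that hands jobs to whichever executor is named."""

    def __init__(self, executors):
        self.executors = executors

    def run_job(self, func, *args, executor="default"):
        return self.executors[executor].submit(func, *args)


def square(x):
    return x * x


pool = ThreadPoolExecutor(max_workers=2)
scheduler = TinyScheduler({"default": pool, "inline": InlineExecutor()})
threaded = scheduler.run_job(square, 4).result()
inline = scheduler.run_job(square, 5, executor="inline").result()
pool.shutdown()
print(threaded, inline)  # 16 25
```

concurrent.futures.ProcessPoolExecutor exposes the same submit() method, so it could be dropped into the dictionary as well, provided the job functions and their arguments can be pickled.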

Although no such executors are yet provided, this API allows for remote execution, much like Celery does.

Scheduler API improvements

With 3.0, all the parameters of the job (except for its ID) can now be modified. In the previous versions, you had to remove and recreate the job from scratch to change anything about it. You can also pause, resume or completely reschedule jobs. This avoids having to keep the job parameters around in order to recreate the job.

The scheduler API now operates on jobs based on their IDs. This removes a lot of pain when implementing a remote scheduler service based on APScheduler. All the job related methods are also proxied on the job instances returned by add_job(). You can also now retrieve a particular job instance from the scheduler based on its ID – something that was painful to do with older versions of APScheduler.

The scheduler now allows you to schedule callables based on a text reference consisting of the fully qualified module name and a variable lookup path (for example, x.y.z:func_name). This is handy for when you need to schedule a function for which a reference can't be automatically determined, like static methods.
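
Resolving such a text reference boils down to importing the module and walking the lookup path. The following is a minimal sketch of that idea, assuming the module:path format described above; resolve_ref is a name I made up and is not necessarily how APScheduler implements it internally.

```python
from importlib import import_module


def resolve_ref(ref):
    # Expected format: "package.module:attr.path"
    if ":" not in ref:
        raise ValueError("expected 'module.path:attr.path', got %r" % ref)
    module_name, attr_path = ref.split(":", 1)
    obj = import_module(module_name)
    # Follow the variable lookup path one attribute at a time.
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    return obj


dumps = resolve_ref("json:dumps")
fromkeys = resolve_ref("builtins:dict.fromkeys")
print(dumps([1, 2]))  # [1, 2]
```

The second lookup shows why the path part matters: a method defined on a class is reached through the class name, which a bare function name cannot express.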

Finally, add_job() can now optionally replace an existing job (by its ID). This fixes a long-standing design flaw in APScheduler 2.x, in which adding a job in a persistent job store at application startup (usually using the scheduling decorators) would always add a new instance of the job without removing the old one. By supplying a static ID for the job, the user can ensure that there will be no duplicates of the job.
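
The effect of a static ID combined with replacement can be shown with a toy in-memory store. ToyJobStore and DuplicateJobError are hypothetical stand-ins written for this example; only the general behavior (an existing ID is either rejected or replaced) is taken from the description above.

```python
class DuplicateJobError(Exception):
    """Raised by this toy store when a job ID is already taken."""


class ToyJobStore:
    def __init__(self):
        self._jobs = {}

    def add_job(self, func, job_id, replace_existing=False):
        # Without replacement, a second add with the same ID is an error.
        if job_id in self._jobs and not replace_existing:
            raise DuplicateJobError(job_id)
        self._jobs[job_id] = func

    def get_job(self, job_id):
        # Jobs are looked up by ID, returning None if the ID is unknown.
        return self._jobs.get(job_id)

    def __len__(self):
        return len(self._jobs)


def tick():
    return "tick"


def tock():
    return "tock"


store = ToyJobStore()
store.add_job(tick, "heartbeat")
# Simulates an application restart that registers the same job again.
store.add_job(tock, "heartbeat", replace_existing=True)
print(len(store), store.get_job("heartbeat")())  # 1 tock
```

No matter how many times the application starts up, the store ends up with exactly one job under that ID.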

What's next?

A couple of new features, contributed by other people, didn't make it into the 3.0 release. For one, there is a job store class for RethinkDB. Then there is support for getting the current number of running instances for each job. These will likely debut in the 3.1 release.

The rest will depend on user requirements and feedback. Happy scheduling :)
