Synchronous Node.JS

Disclaimer: This is just a proof of concept.

Intro: Getters and Setters

In JavaScript, you can bind an object’s property to a getter and/or setter function to be called when the property is accessed or set. This is best explained with an example:

var obj = {foo: 'bar'};

Object.defineProperty(obj, 'json', {
    get: function () {
        return JSON.stringify(this);
    },
    set: function (data) {
        data = JSON.parse(data);
        for (var i in data) {
            this[i] = data[i];
        }
    }
});

//When accessing the property 'json', our getter function is called
console.log(obj.json);// => {"foo":"bar"}

//When setting the property, our setter function is called
obj.json = '{"hey":"cool"}';
console.log(obj.hey); // => "cool"

Getters and setters are strictly synchronous, but what if we could enable an asynchronous version of each?

Asynchronous getters

Many Node.JS frameworks rely on the concept of middleware. For example, when retrieving session data, Express/Connect passes each request through a chain of user-specified middleware, one of which might retrieve the session data. What if only some of our requests need the session data? Conditional middleware is ugly. Wouldn’t it be nice if we could just define this:

Object.defineProperty(request, 'session', {
    get: function (callback) {
        //<something async to retrieve session data>
        callback(null, {foo: 'bar'}); //(err, session)
    }
});

My node.js synchronous library uses node-fibers and some trickery to enable asynchronous getters and setters. One use for the synchronous library is to convert an asynchronous callback to an asynchronous getter.

Asynchronous getters mostly negate the need for middleware as resources don’t need to be parsed or retrieved until they’re needed. For example, we could define an asynchronous getter to retrieve session data, another to parse request parameters, etc.

How do I use it?

npm install -g synchronous

Say we have a method which parses a request body and extracts parameters:

var http = require('http')
  , query = require('querystring')
  , req = http.IncomingRequest.prototype;

req.params = function (callback) {
     var data = '';
     this.on('data', function (chunk) { data += chunk; });
     this.on('end', function () {
         try {
             callback(null, query.parse(data));
         } catch (e) {
             callback(e, null);
         }
     });
}

You’d typically use the method like this:

req.params(function (err, params) {
    console.log(params);
});

With the synchronous library, we can convert the method to a getter using the syncGetter method. When run in the same context, these two examples will output the exact same thing

require('synchronous');
req.syncGetter('params');

console.log(req.params); //No callbacks!

The syncGetter method is available to all objects and has the added benefit of caching the result so that the async function is only called once. There’s also the syncMethod and syncSetter methods available.

Doesn’t this block and defeat the purpose of Node.JS?

No. The asynchronous getters block the currently running fiber, not the entire event loop. Node is still non-blocking and fast.

So what’s the catch?

  • This is non-standard Node.JS. To use async getters, you need to run your app with `node-fibers` rather than `node`
  • The resulting synchronous code needs to be run in its own fiber. For convenience, I’ve proxied the `{http,https,net}.createServer` methods so that each request has it’s own fiber.
  • Converting async => sync code is ok if the callbacks were strictly nested (i.e. executed sequentially), but you lose the ability to use other async patterns (parallel, etc.).
  • Fibers are generally opposed by the node community.

Further reading

 

Rate limiting with Redis

tl;dr Solution & benchmarks, although I’d recommend reading the post ;)

Rate limiting can be an effective way of conserving resources and preventing automated or nefarious activities on your site.

A common use case is to limit the amount of requests an IP can make over a certain time frame. For example, you might want to restrict users from using an expensive search utility on your site. If the user attempts to search more than 5 times a minute, you can redirect them to another page informing them that they need to wait.

IP based rate limiting is already in use on larger sites. Google and Yahoo both employ the technique to prevent (or at least complicate) automated requests to their services.

The general problem

How can we count the number of times a subject performs an action over a certain interval in the immediate past?

The key issues to address when designing a solution are:

  1. How do we incorporate time given that it’s a continuous variable?
  2. How can we efficiently expire old data?
  3. How can we scale the solution so that it can handle many hundreds of subjects and/or actions per second?

The obvious (and suboptimal) solution

The easiest way to solve the problem is to log each subject + action + timestamp in a linear fashion. To find the number of times a certain subject performs an action over a set interval, we can iterate over each row and count the number of times the subject appears.

This solution has drawbacks:

  • It’s O(n) – we need to iterate over all rows in the interval in order to get a count – this won’t scale.
  • We need to manually clean up old data.

Thankfully, there is a better approach.

A short introduction to Redis

From the Redis home page: “Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

Redis’ data types allow you to store and query data in new and efficient ways. Redis also has a client for most of the popular web development languages.

We’re going to exploit some useful features in Redis when designing a solution:

  • A key is automatically created the first time it’s used.
  • We can create very space-efficient key/value pairs using the built-in hash type.
  • Redis has a method that atomically increments a hash field (HINCRBY).
  • We can chain commands together using transactions. Redis guarantees that the commands will be run sequentially.

Handling the time dimension

As we’ve seen in the suboptimal solution, storing and querying time as a continuous variable requires an O(n) range query.

An alternative way of handling the time dimension is to convert it to a discrete variable by breaking it up into small time “buckets”. Each bucket represents a configurable amount of time, e.g. two seconds, and carries an associated count. We can (approximately) obtain a count for a larger interval by working out which buckets overlap and summing their counts.

This method of storing time-series data is well suited to Redis and allows us to finely tune the trade-off between space, accuracy and speed. For example, say we’d like to obtain a count for the last 20 seconds; very small time buckets (<1 second) would give us accurate counts but would take up more space and require more computation to add up the counts of the ~20 buckets. On the other hand, larger buckets (10 seconds) would use less space but at the cost of accuracy.

Each action + subject should have a collection of buckets. We can store this in a hash:

<action>:<subject> = hash(bucket1 => count, bucketN => count)

Expiration of old data

Redis allows you to set a time-to-live (TTL) on a per-key basis. We can use this to expire our Redis keys (<action>:<subject>) at a point in the future, e.g. in 120 seconds. Redis has a lazy expiration algorithm to ensure that expiring keys does not affect performance. This method of key expiration works well for removing old subjects after a certain time frame. We could log information regarding an IP, and then efficiently purge the information after 120 seconds (or some other timeframe) of inactivity.

There’s still one issue that needs to be addressed. What if a subject remains active? How do we purge time-series data (the counts stored in our buckets) when they’re far enough in the past to be irrelevant to our queries?

My solution (inspired by the consistent hashing technique) is to think of each bucket as a point on a circle. We can map the timestamp to a point on the circle (and one of our buckets) by performing a modulo operation, e.g. `time % 3600`. When we increment the current bucket, we can also send two additional commands; a command to empty the next bucket (or two), and another to renew the key’s TTL.

The final solution

You can find benchmarks for the solution here

Options – each can be used as a trade-off for space, accuracy and speed.

  • Bucket interval – how many seconds each bucket represents
  • Bucket span – in our circle analogy, the bucket span is the total size of the circle (in seconds)
  • Subject expiry – the amount of (inactive) seconds before a subject’s time buckets expire
  • (derived) Bucket count = Bucket span / Bucket interval

Incrementing the count for a particular subject & action:

  1. Get the Redis key: <action>:<subject>, e.g. “pagerequest:143.34.200.18″
  2. Get the bucket associated with the current time `floor((timestamp % bucket span) / bucket interval)`
  3. Start a redis transaction
  4. a) Increment the count of the current bucket (HINCRBY)
  5. b) Clear the next 2 buckets (HDEL)
  6. c) Renew the key ttl (EXPIRE)
  7. End the redis transation

Checking the count over a certain interval in the immediate past:

  1. Get the Redis key: <action>:<subject>, e.g. “pagerequest:143.34.200.18″
  2. Get the bucket associated with the current time `floor((timestamp % bucket span) / bucket interval)`
  3. Work out how many N buckets the interval represents, e.g. a 30 second interval would be 10x 3 second buckets
  4. Start a redis transaction
  5. a) Get the count of the current bucket (HGET)
  6. b) Get the count of each of the N buckets from step 3 (HGET)
  7. End the redis transaction
  8. Add up each count and return

Get the code

I’ve released the rate limiting structure as part of my Redback library for Node.JS. It contains other advanced Redis structures such as the Social Graph or Full-Text Index.

To use the rate limiter:

var redback = require('redback').createClient(),
    ratelimit = redback.createRateLimit('requests');

//Increment the count for the specified IP
ratelimit.add('127.0.0.1');

//Count the number of requests in the last 20 seconds
ratelimit.count('127.0.0.1', 20, function (err, requests) {
     if (requests > 30) {
         //Throttle the user in some way..
     }
});

If you’d like to port the RateLimit structure or Redback library, let me know and I’ll create a link.

Edit: Henrik Westphal has ported the structure to PHP as part of his RedisBundle library. You can see the class here.

Conclusion

Rate limiting is a useful feature for sites with heavy traffic or computationally expensive areas. The easy but suboptimal solution to the rate limiting problem is to store the data in a linear fashion and resort to O(n) range queries, however this solution does not scale. Redis exposes fast and efficient in-memory data structures and provides some useful atomic commands. The Redis hash structure can be used to create a flexible structure where interval-related queries are O(1) amortized. The solution is fast, space-efficient, and gracefully handles the expiration of old data.

Like this post? Follow me on twitter.