This past weekend I wanted to get statistics from some chat logs, just for fun. More specifically, I wanted to count the top hundred (or so) one-word phrases, two-word phrases, three-word phrases and so on.
My initial approach was to do the simplest thing possible: add each phrase to a hash map with the key being the phrase and the value the number of occurrences.
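In Go, that naive counter is only a few lines. Something like this, where `phrases` stands in for however the logs actually get tokenised:

```go
package main

import "fmt"

func main() {
	phrases := []string{"hello there", "hello there", "general kenobi"}

	// The naive approach: every phrase ever seen gets its own map entry,
	// which is what blows up the memory usage on larger logs.
	counts := make(map[string]int)
	for _, p := range phrases {
		counts[p]++
	}
	fmt.Println(counts) // map[general kenobi:1 hello there:2]
}
```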
As it turns out, running this on 50 MB of logs uses quite a lot of memory: somewhere around a gigabyte. My stopgap solution was to add a swapfile to my server, so I could actually run the script without filling up the half gigabyte of memory available and getting the process killed. This solution obviously wasn’t good enough.
Before we can get to the good stuff, some basic information about bloom filters. They’re one of my New Favourite Things, offering a way to check if something is in a set (i.e. it has been seen before) in constant time and constant space. Magnificent.
The way bloom filters work is conceptually very simple: start with an array of boolean values (or bits), all set to false; this is an empty bloom filter. To actually use it, two procedures need to be implemented: one for adding items and one for checking items.
Adding an item works by hashing it with several different hash functions and using each hash as an index into the array: the bucket at each index is set to true.
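Here is a minimal sketch of that procedure in Go. Deriving all the indices from one 64-bit FNV hash (double hashing) is my assumption for illustration; any set of well-distributed hash functions works:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a fixed-size array of boolean buckets.
type BloomFilter struct {
	buckets []bool
	hashes  int // how many hash functions to simulate
}

func NewBloomFilter(size, hashes int) *BloomFilter {
	return &BloomFilter{buckets: make([]bool, size), hashes: hashes}
}

// positions derives the bucket indices for an item. Instead of running
// several independent hash functions, it splits one 64-bit FNV hash into
// two halves and combines them (double hashing), which distributes well
// enough for a bloom filter.
func (f *BloomFilter) positions(item string) []int {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)

	ps := make([]int, f.hashes)
	for i := range ps {
		ps[i] = int((h1 + uint32(i)*h2) % uint32(len(f.buckets)))
	}
	return ps
}

// Add sets every bucket the item maps to.
func (f *BloomFilter) Add(item string) {
	for _, p := range f.positions(item) {
		f.buckets[p] = true
	}
}

func main() {
	f := NewBloomFilter(1024, 3)
	f.Add("hello there")
	fmt.Println(f.positions("hello there")) // the bucket indices now set to true
}
```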
Checking if an item is in the set is very similar to the previous procedure, except the buckets are only read rather than set to true. If all the buckets the item maps to are true, true is returned; otherwise false is returned.
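Continuing the sketch above, the check reuses the same positions helper:

```go
// Check hashes the item exactly as Add does, but only reads the buckets.
// It returns true only if every bucket the item maps to is already set.
func (f *BloomFilter) Check(item string) bool {
	for _, p := range f.positions(item) {
		if !f.buckets[p] {
			return false
		}
	}
	return true
}
```

Note that an item that was never added can still come back true if other items happen to have set all of its buckets; that is the false positive case discussed below.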
In a bloom counter, instead of having the buckets be bits or boolean values, they are integers. Adding is modified slightly: it now returns a boolean value indicating whether the number of times we’ve counted a phrase has hit a certain limit.
The modified add procedure looks like this:
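A sketch of that counter in Go, under the same double hashing assumption as before (this is an illustration of the idea, not my library's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomCounter is a bloom filter whose buckets are integer counters
// instead of single bits.
type BloomCounter struct {
	buckets []uint32
	hashes  int
	limit   uint32 // the count at which Add starts returning true
}

func NewBloomCounter(size, hashes int, limit uint32) *BloomCounter {
	return &BloomCounter{buckets: make([]uint32, size), hashes: hashes, limit: limit}
}

// Add increments every bucket the item maps to and reports whether all
// of those buckets have reached the limit. Collisions with other items
// can only inflate counts, so at worst an uncommon phrase sneaks through
// a little early; a phrase really seen limit times always returns true.
func (c *BloomCounter) Add(item string) bool {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)

	reachedLimit := true
	for i := 0; i < c.hashes; i++ {
		p := (h1 + uint32(i)*h2) % uint32(len(c.buckets))
		c.buckets[p]++
		if c.buckets[p] < c.limit {
			reachedLimit = false
		}
	}
	return reachedLimit
}

func main() {
	c := NewBloomCounter(1024, 3, 3)
	for i := 1; i <= 3; i++ {
		fmt.Printf("add %d: %v\n", i, c.Add("hello there"))
	}
	// typically prints false, false, true as the counters climb to the
	// limit (collisions inside the filter could flip it true earlier)
}
```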
What this actually does is filter out uncommon phrases, as phrases that aren’t encountered very often will (hopefully) never have all their counters incremented up to the limit. The bloom counter can be used as a kind of preprocessor for the data. Only phrases above the limit will be added to the data structure used for counting, rather than all the phrases encountered.
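To make the preprocessing idea concrete, here is one way a counting loop could use it. countPhrases and its parameters are hypothetical, not part of my library; the 500,000 buckets match the figure from the benchmark below:

```go
// countPhrases tallies exact counts only for phrases the bloom counter
// has let through. Phrases rarer than the limit never get a map entry,
// which is where the memory saving comes from.
func countPhrases(phrases []string, limit uint32) map[string]uint32 {
	filter := NewBloomCounter(500000, 3, limit)
	counts := make(map[string]uint32)

	for _, p := range phrases {
		if _, seen := counts[p]; seen {
			counts[p]++
		} else if filter.Add(p) {
			// The first `limit` occurrences were only tracked inside the
			// filter, so the exact count starts at the limit.
			counts[p] = limit
		}
	}
	return counts
}
```

A false positive in the filter just means an uncommon phrase gets promoted into the map a little early, so its starting count may be slightly high; once a phrase is in the map its count is exact from then on.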
As more things are added to the set, more buckets are going to be set to true (or pushed up to the limit, in the counting version). This increases the chance of the filter reporting false positives (reporting that something is in the set when it isn’t). In the absolute worst case, every single bucket will be set to true; this would cause the filter to return true for any item checked. The best ways to combat this are a good hash function (it doesn’t need to be secure; good distribution is the most important thing for a bloom filter) and more buckets. The number of buckets and the hash function needed will depend on the size of the problem being tackled.
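For picking those numbers, the textbook approximation of the false positive rate is handy. A quick calculator, with the parameters chosen purely as an example:

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate is the standard approximation of a bloom filter's
// false positive probability: (1 - e^(-k*n/m))^k, where m is the number
// of buckets, k the number of hash functions and n the number of
// distinct items added so far.
func falsePositiveRate(m, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	// e.g. 500,000 buckets, 3 hash functions, 100,000 distinct items:
	fmt.Printf("%.3f\n", falsePositiveRate(500000, 3, 100000)) // about 0.092
}
```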
It is also worth bearing in mind that, while bloom filters can report false positives, they will never report false negatives. There are some situations where this is very useful, for example, checking for malicious URLs in a browser. In this case, a negative result means the URL isn’t in the set of malicious URLs, so the browser can safely continue. A positive result means the URL could be in the set and should be submitted for further checks before continuing.
I rewrote my original naive implementation using a bloom counter. Some very unscientific benchmarks (me staring at htop as the program is running) show the memory usage sitting at around 15-20 MB with 500,000 buckets in the filter. Quite a lot better than the gigabyte or so my original script was using.
The source code for my final bloom counter library (written in Go, and only for counting strings so far) can be found here.