July 22, 2011

A Gotcha of C++ map and set

C++ std::map and std::set allow you to define your own less-than comparator for the key type. But one thing to bear in mind: The comparator must follow the rules of strick weak ordering! That is:
  • x < x is always false (irreflexivity)
  • If x < y, then y < x is false (asymmetric)
  • If x < y and y < z, then x < z (transitivity)
  • If x = y and y = z, then x = z (transitivity of equivalence)
Miss any one of these rules, you will get undefined behavior!

To see what it means, let me show you a mistake I made.

I was thinking of keeping a bunch of non-repeated (unique) floats, but I wish to compare the floats with some tolerance. For example, 1.0 and 1.01 should be same. So naturally, I used float as the map key, and the mapped value can be any type I need. Let's say here I use int.

In order to compare floats with tolerance, I wrote a float comparator like this:
struct FloatCmp
    bool operator()(float a, float b)
        const float epsilon = 0.01f;
        if (fabsf(a - b) < epsilon)
            return false; // a == b
        return a < b;
You can see line 6-7 checks if two floats are equal first before using the less-than comparison (line 8). Now, I can initialize my map container happily:
std::map<float, int, FloatCmp> table;
table[1.0f] = 1;
table[1.001f] = 2;   // same as table[1.0f] = 2
table[1.002f] = 3;   // same as table[1.0f] = 3
It all seemed to work perfectly. But I found sometimes it didn't work. It produced some duplicated keys. What did I do wrong?

The answer: I broke the rule of "transitivity of equivalence" (if x = y and y = z then x = z). Both FloatCmp(1.0f, 1.006f) and FloatCmp(1.006f, 1.012f) are true, but FloatCmp(1.0f, 1.012f) is not true! This makes the result depend on how you insert your elements. Consider the following code:
#include <iostream>
#include <map>

int main()
    std::map<float, int, FloatCmp> map1;
    std::map<float, int, FloatCmp> map2;

    float a = 1.000f;
    float b = 1.006f;
    float c = 1.012f;

    map1[a] = 1;
    map1[b] = 1;
    map1[c] = 1;

    map2[b] = 1;
    map2[a] = 1;
    map2[c] = 1;

    std::cout << map1.size() << std::endl;  // print 2
    std::cout << map2.size() << std::endl;  // print 1
    return 0;
After the insertions, map1 contains a and c, and map2 contains b. This is astonishing: Two maps with the same set of elements inserted can have different sizes because you insert them in different orders. It's not hard to figure out why by tracing the code.

So how to fix this? Well, as long as "transitivity of equivalence" is followed, there would be no problems. Finally I came up with this new comparator:
inline float discretize(float a)
    return floorf(a * 100.0f) / 100.0f;
struct FloatCmp
    bool operator()(float a, float b)
        float aa = discretize(a);
        float bb = discretize(b);
        if (aa == bb)
            return false;
        return aa < bb;
At line 9-10, discretize() function truncates (round off) floating-point numbers to 2 decimal digits. For example, discretize(1.1234f) returns 1.12f. And check if they're equal with == operator directly (line 11). In this way, I make sure "transitivity of equivalence" is strictly followed. Now if you relaunch the previous main program with this new comparator, both map1 and map2 will contain two elements as expected.

July 4, 2011

Python shelve vs. sqlite3

I have a huge Python dict that I can't store in the memory. Python provides a shelve module for this purpose. It acts like a regular dict but it can be saved into a file. I wonder its performance comparing to sqlite3, so I run this little test:
import shelve
import sqlite3

def test_sqlite3():
    conn = sqlite3.connect("debug.s3db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS [mydict] ("
                "[key] VARCHAR(255) PRIMARY KEY NOT NULL, "
                "[value] VARCHAR(255) NOT NULL)")
    for i in xrange(0, 1000000):
        cur.execute("INSERT INTO [mydict] (key, value) VALUES (?, ?)",
                    (str(i), str(i*2)))
def test_shelve():
    d = shelve.open("debug.shelf")
    for i in xrange(0, 1000000):
        d[str(i)] = str(i*2)
On my computer, test_sqlite3() cost me about 25 seconds, and about 51 seconds for test_shelve(). I also performed some fetch queries on these two, sqlite3 still runs faster than shelve. In conclusion, sqlite3 wins!

July 3, 2011

Some Notes on App Engine Datastore