Header: christian-rossow.de
Apply for a PhD/PostDoc position in my System Security research group at Saarland University.


If you can't explain it simply, you don't understand it well enough.

HowTo - Cumulative Distribution Function (CDF) with gnuplot and PostgreSQL

In statistics, CDFs are a common way to describe the probability that a random variable Z with a given probability distribution will be found at a value less than or equal to Z. Have a look at Figure 1. It shows a CDF for how many cars are passing on my street per day. You can read it like this: 35 or fewer cars (x-axis) are passing on my street in approximately 60% (y-axis) of all days.

Now comes the tricky thing. Listing 1 shows how I stored the data in my database. As everyone would do it, I just inserted one row per measurement interval (i.e., a day).

Listing 1: Plain measurement results in table 'cars_per_day'
day     cars
1          8
2         25
3         47
4         16
5         25
6         39
7          5

However, to draw a CDF, gnuplot expects the following format: Given a number of cars Z (x-value), what percentage of days at most Z cars passed on the street? To make this more clear, Listing 2 shows the outcome that gnuplot expects.

Listing 2: gnuplot data input for CDF
cars  | ratio of days
  5   |         0.143
  8   |         0.288
 16   |         0.429
 25   |         0.714
 39   |         0.857
 47   |         1.000

Clearly, we need to somehow convert the raw data we have to the format gnuplot expects. Whereas some people even write scripts for this, there is a very elegant solution with SQL and Window functions. I used PostgreSQL 8+ to draft my SQL query. As of February 2011, MySQL does not support Window functions yet, Oracle8i does. Thus, if you have a good ;) DBMS, you can just adapt the following SQL query to your own needs:

Listing 3: SQL query for CDF input data
    -- x-value: number of cars
    cars AS x, 
    -- y-value: percentage of days at most x cars passed the street
    --   calculated as the number of days with at most x cars devided by
    --   the total number days in the measurement period
    COUNT(day) OVER (ORDER BY cars) / 
       (SELECT COUNT(*) FROM cars_per_day)::real AS y
FROM cars_per_day

Save this SQL query in a file and retrieve the data readily formatted for gnuplot like this:

Listing 4: psql command to retrieve data CSV-separated
psql mydb --quiet --no-align -F ';' --pset t --pset footer -f my.sql

Once you have the data, take the following gnuplot script to plot a CDF and modify it to your needs.

Listing 4: Example gnuplot CDF script
set datafile separator ';'

set key right bottom
set xlabel "Cars per day"
set ylabel "Ratio of days"
set yrange [0:1]

set term png
set output "cdf_cars.png"
plot "cdf_cars.csv" using 1:2 notitle with linespoints

Finally, you should have your CDF:

Obviously you need to slightly change the SQL and gnuplot script to your needs, but this should be easy now.