HowTo - Cumulative Distribution Function (CDF) with gnuplot and PostgreSQL
In statistics, CDFs are a common way to describe the probability that a random variable Z with a given probability distribution will be found at a value less than or equal to Z. Have a look at Figure 1. It shows a CDF for how many cars are passing on my street per day. You can read it like this: 35 or fewer cars (x-axis) are passing on my street in approximately 60% (y-axis) of all days.
Now comes the tricky thing. Listing 1 shows how I stored the data in my database. As everyone would do it, I just inserted one row per measurement interval (i.e., a day).
Listing 1: Plain measurement results in table 'cars_per_day'
day cars 1 8 2 25 3 47 4 16 5 25 6 39 7 5
However, to draw a CDF, gnuplot expects the following format: Given a number of cars Z (x-value), what percentage of days at most Z cars passed on the street? To make this more clear, Listing 2 shows the outcome that gnuplot expects.
Listing 2: gnuplot data input for CDF
cars | ratio of days ------+--------------- 5 | 0.143 8 | 0.288 16 | 0.429 25 | 0.714 39 | 0.857 47 | 1.000
Clearly, we need to somehow convert the raw data we have to the format gnuplot expects. Whereas some people even write scripts for this, there is a very elegant solution with SQL and Window functions. I used PostgreSQL 8+ to draft my SQL query. As of February 2011, MySQL does not support Window functions yet, Oracle8i does. Thus, if you have a good ;) DBMS, you can just adapt the following SQL query to your own needs:
Listing 3: SQL query for CDF input data
SELECT -- x-value: number of cars cars AS x, -- y-value: percentage of days at most x cars passed the street -- calculated as the number of days with at most x cars devided by -- the total number days in the measurement period COUNT(day) OVER (ORDER BY cars) / (SELECT COUNT(*) FROM cars_per_day)::real AS y FROM cars_per_day
Save this SQL query in a file and retrieve the data readily formatted for gnuplot like this:
Listing 4: psql command to retrieve data CSV-separated
psql mydb --quiet --no-align -F ';' --pset t --pset footer -f my.sql
Once you have the data, take the following gnuplot script to plot a CDF and modify it to your needs.
Listing 4: Example gnuplot CDF script
set datafile separator ';' set key right bottom set xlabel "Cars per day" set ylabel "Ratio of days" set yrange [0:1] set term png set output "cdf_cars.png" plot "cdf_cars.csv" using 1:2 notitle with linespoints
Finally, you should have your CDF:
Obviously you need to slightly change the SQL and gnuplot script to your needs, but this should be easy now.