-
Header: christian-rossow.de
-
Apply for a PostDoc position in my System Security research group at CISPA, Germany.

https://www.christian-rossow.de

If you can't explain it simply, you don't understand it well enough.

HowTo - Calculate boxplot values in PostgreSQL


In statistics, box plots are a common way to express the distribution of data. Box plots take the minimum, maximum, median and two quartile values as input. Unfortunately, PostgreSQL does not ship aggregate functions to calculate the median, the 1st or the 3rd quartile. The main reason is most likely because it's not very efficient to calculate.

If you are searching for existing solutions, you will find very elegant solutions using PL/R. However, PL/R is not available as precompiled packages for CentOS and many other Linux distributions. PL/Python, on the other hand, is available for many systems. The following definitions can be used to create an aggregate function based on PL/Python that calculates the values necessary for a boxplot.

Boxplot data type, helper function and aggregate function
CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION _final_boxplot(a numeric[])
   RETURNS boxplot_values AS
$$
    a.sort()
    i = len(a)
    return ( a[0], a[i/4], a[i/2], a[i*3/4], a[-1] )
$$
LANGUAGE 'plpythonu' IMMUTABLE;
 
CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);
		

After installing these definitions, the boxplot values can be calculated as follows:

Boxplot usage example
SELECT customer_id as cid, (boxplot(price)).*
FROM orders
GROUP BY customer_id;

   cid |   min   |   q1    | median  |   q3    |   max
-------+---------+---------+---------+---------+---------
  1001 | 7.40209 | 7.80031 |  7.9551 | 7.99059 | 7.99903
  1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736
        

Some remarks on this solution:
  • Depending on the data type of column x, it's sometimes necessary to cast explicitly to x::numeric
  • Given a huge number of input values, the aggregate performs poorly, since all values need to be kept in memory.
  • Unfortunately, as of psql 8.4.x, PL/Python does not support arrays yet and casts all arrays to str values. It is therefore necessary to cast the string to a Python array, first.


Now take the graph tool of your choice (e.g., GNUplot) and draw your boxplot! :)


Yours

Christian

-
-