HowTo - Calculate boxplot values in PostgreSQL
In statistics, box plots are a common way to express the distribution of data. Box plots take the minimum, maximum, median and two quartile values as input. Unfortunately, PostgreSQL does not ship aggregate functions to calculate the median, the 1st or the 3rd quartile. The main reason is most likely because it's not very efficient to calculate. If you are searching for existing solutions, you will find very elegant solutions using PL/R. However, PL/R is not available as precompiled packages for CentOS and many other Linux distributions. PL/Python, on the other hand, is available for many systems. The following definitions can be used to create an aggregate function based on PL/Python that calculates the values necessary for a boxplot. Boxplot data type, helper function and aggregate function
CREATE TYPE boxplot_values AS ( min numeric, q1 numeric, median numeric, q3 numeric, max numeric ); CREATE OR REPLACE FUNCTION _final_boxplot(a numeric[]) RETURNS boxplot_values AS $$ a.sort() i = len(a) return ( a[0], a[i/4], a[i/2], a[i*3/4], a[-1] ) $$ LANGUAGE 'plpythonu' IMMUTABLE; CREATE AGGREGATE boxplot(numeric) ( SFUNC=array_append, STYPE=numeric[], FINALFUNC=_final_boxplot, INITCOND='{}' ); After installing these definitions, the boxplot values can be calculated as follows: Boxplot usage example
SELECT customer_id as cid, (boxplot(price)).* FROM orders GROUP BY customer_id; cid | min | q1 | median | q3 | max -------+---------+---------+---------+---------+--------- 1001 | 7.40209 | 7.80031 | 7.9551 | 7.99059 | 7.99903 1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736 Some remarks on this solution:
Now take the graph tool of your choice (e.g., GNUplot) and draw your boxplot! :)
Yours
|