Apply for a PostDoc/PhD position in my research group at CISPA in Dortmund, Germany.

https://www.christian-rossow.de

If you can't explain it simply, you don't understand it well enough.

HowTo: PostgreSQL - Custom Aggregate Functions

This article is supposed to give you a quick start guide to create your own aggregate functions within the open source database PostgreSQL. It should assist you by understanding the official documentation and answers some questions, which came up to me during my first attempts.

Basics about aggregate functions

These nice functions support you by aggregating multiple rows of a table. Assume you have a table called employees with the column salary and you are interested in the sum of all salaries. You could either sum it manually (just a joke) or use a standard aggregate function like SUM, that performs an operation for each dataset found and so sums up a total sum.

Snip 1 - Using an standardized aggregate function:

SELECT SUM(salary) AS total_sum FROM employees;

So what does this sum operation require? The big advantage of aggregate functions is now, that they can combine multiple rows, instead of maximum combine columns. In our example the SUM operation sums up all salaries, which are part of the queried rows. There is no other possibility to subsume rows than aggregate functions, unless you handle something on the part of an application.

The big advantage of PostgreSQL is that you can create custom aggregate functions, which satisfy the combination you imagined.

Quick start

Before reading this passage, please preposition the official documentation about CREATE AGGREGATE. Based on this I try to explain the main points and assist you at the most important steps.

Snip 2 - Synopsis of CREATE AGGREGATE:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
)

Imagine the AGGREGATE is a process, which has a defined state (of type STYPE, more later), which changes during iteration through the selected datasets. What specifically will be changed is defined through the function SFUNC. The final function FINALFUNC will be applied after aggregating and it's return value will be returned as result of the aggregation - the default of FINALFUNC is returning the last known state.
To assist the official documentation I give you some hints for finding the correct values.

STYPE: data type, in which the aggregator remembers it's current state. This state (of type STYPE) will be passed as first argument to the function SFUNC.
SFUNC: name of a function, which expects (state_data_type, input_data_type(s)) as argument and returns a value of the type of STYPE.
```
SFUNC(state<STYPE>, next input_data_type(s)) ---> new state<STYPE>
```
To get a list of all available functions, connect with psql, type \df <begin_of_expression> and press <Tab> (similar to Unix auto completition). If you choose <Enter> and detailed information about the function will be shown.

Snip 3 - list available functions (SFUNC's):
```
christian@rossow:~> psql -U myuser testdb

	testdb=> \df ip4

	    ip4                       ip4_in
	    ip4_and                   ip4_le
	    ip4_cmp                   ip4_xor
	    [...]

	testdb=> \df ip4_xor

	     schema |  name   | result data type | argument data types
	    --------+---------+------------------+--------------------
	     public | ip4_xor | ip4              | ip4, ip4
```
For more detailed information you can take a look in the PostgreSQL system table pg_proc.
FINALFUNC: connector between the last internal aggregator state (after aggregating) and the return value of the aggregation. Opens the possibilty to cast the last state and/or do a final operation on it.
```
FINALFUNC(state<STYPE>) ---> aggregator return value
```
INITCOND: start value of the internal state. It is strongly recommended that you state this.

Example

As I came into contact with custom aggregate functions, I had a table with many datasets containing IP-addresses. My idea was, simply put, to build up a hash code about IP-addresses, which I selected before, which can easily done by XOR-ing all IP-addresses.

Snip 4 - Finished aggregator for 'hashing' IPv4-addresses:

01  CREATE AGGREGATE hash_ips ( ip4 ) (
02	SFUNC = ip4_xor,
03	STYPE = ip4,
04	INITCOND = '0.0.0.0'
05  )

Description:
The aggregator is called hash_ips (line 01) and expects one paramter of type ip4 (line 01, look at pgFoundry for more information about ip4). It's initial state itself is also of type ip4 (line 03) and has the initial value of the IP-address 0.0.0.0 (line 04). For each dataset this get's XORed (by passing the function ip4_xor, line 02)) and will be returned after the iteration (because no FINALFUNC is stated).

Snip 5 - After creation you can use the aggregator like other aggregators.

SELECT hash_ips(column_with_ip) FROM mytable;

For more examples please have a look at the official examples.

Conclusion

The main idea was to simply create one-way-functions (called 'hash'-function), which succeeded smoothly with PostgreSQL. This shows once more, that an object relational DBMS puts developer in a position to enhance the existing database with exotic functions. We should start thinking about alternate all the MySQL installations for complex software solutions and start cooperating with a 'real' database - PostgreSQL.