Aggregates

 

Aggregates

Aggregates include Sum, Avg,  Count, Max and Min.    Max and Min aggregates work for dates and strings and also for uuid and binary values. Max and Min aggregates work for xN vector values.  

 

Using the Count aggregate function:

 

 

Aggregates ignore NULL values.   

 

The COLLECT Aggregate

The COLLECT aggregate collects values from a subgroup into a table.  COLLECT returns a table with one or more fields and zero or more records.  The values in that table normally are just regular values and are not tables.

 

Modern databases such as Oracle or PostgreSQL often allow control over the values seen by aggregates. For example, a user might write a query that contains several Sum aggregate functions using the same field with each Sum computing a partial sum utilizing a filter specified within the Sum, for example,

 

SUM(x FILTER ...)

 

The COLLECT aggregate provides a generic aggregate for performing such operations and other operations as well.   For example:

 

SELECT name, Count(*),

SPLIT (COLLECT property, value)

FROM mfd_meta GROUP BY name;

 

In the above example SPLIT is used merely to show which values are in each COLLECT.  

 

The table returned by COLLECT is typically fed into a function. This essentially allows creating custom aggregates, including with scripts.  

 

We can also COLLECT expressions similar to how SELECT can use expressions:

 

SELECT name, Count(*),

SPLIT (COLLECT property, StringToUppercase(value))

FROM mfd_meta GROUP BY name;

 

One way to understand what COLLECT does is to start with what GROUP does.   GROUP takes a table, sorts the records within that table into groups and then produces a record for each group by applying aggregates to derive that one record for the group.

 

Each aggregate takes a group, which is a subset of records from the original table, and from the values in that group computes one or more resulting values for the resulting record.   For example, the aggregate Sum(f) takes all values in the group, sums them and returns the sum.

 

COLLECT works the same way: it is like a SELECT which runs on a group.   COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT.

 

Notes on COLLECT

 

COLLECT supports ORDER BY using the same syntax as using ORDER BY in SELECT statements, including ordering by expression.

 

COLLECT supports DISTINCT, for example as in

 

(COLLECT DISTINCT value)

 

COLLECT supports WHERE.

 

Other Aggregates

The following aggregate functions are available:

 

 

 

Notes

Nulls not skipped - First and Last do not skip NULLs because if they did a construction like

 

SELECT First(a), First(b)

 

could return values from different records.

 

Unused arguments - A function that does not use one of its arguments will not fail to compile when passed an aggregate.  Instead, the aggregate is optimized away and is not computed.  For example:

 

FUNCTION f(t NVARCHAR) INT32 AS 5 END

SELECT name, f(Max(property))

FROM mfd_meta GROUP BY name;

 

Will compile and run OK.

 

Streaming mode - Aggregates operate in streaming mode, spilling excess data to disk if required. This both allows handling bigger amounts of data at predictable speed and also protects against running out of memory if the system is low on memory.

 

Protection against running out of memory is not guaranteed by streaming as it is still possible to run out of memory, but doing so is much more difficult.

 

Too far away  - When implementing aggregates any query engine must decide how far an aggregate can go from the producing GROUP, whether that is explicit or implicit.   For example, considering the fragment

 

SELECT ... Max(a) ... FROM t GROUP BY b

 

the query engine must decide what can be allowed or disallowed within the ellipses.  Manifold allows everything except nested SELECTs and functions.

 

Nested Selects -  An aggregate may not go into a nested SELECT.   So, for example, the following works:

 

FUNCTION f(t TABLE, u NVARCHAR) TABLE AS

(SELECT mfd_id, type FROM t WHERE type>u) END

SELECT name, SPLIT CALL f(mfd_root, Max(value))

FROM mfd_meta GROUP BY name;

 

But the following does not work:

 

SELECT name, SPLIT

(SELECT mfd_id, type FROM mfd_root

WHERE type>Max(value))

FROM mfd_meta GROUP BY name;

 

The above does not work because the inner SELECT treats Max(value) as its own, not that of the outer SELECT. If the inner SELECT accepted aggregates from the outer SELECT, we could end up with something like the following, which does not work:

 

SELECT name, SPLIT

  (SELECT mfd_id, type FROM mfd_root

WHERE Max(name) > Max(value)

   GROUP BY mfd_id, type)

FROM mfd_meta GROUP BY name;

 

In the above it would be difficult for a reader to tell which Max in the inner SELECT belongs to that SELECT and which belonged to the outer SELECT.   In such cases  it would be easy to create queries that would compile and run but which did not work as intended.  

 

To prevent such problems, in Manifold a SELECT acts as a fence for aggregates.  All aggregates inside a SELECT belong to that SELECT.

 

Functions - An aggregate may not cross over between the outside and the inside of the body of a FUNCTION.  That does not mean we cannot use aggregates within functions.  Aggregates may be used within functions but they either must be completely inside a function or completely outside a function.  The boundary between the inside and outside of a function is also like a fence that an aggregate may not cross.

 

For example, the following works because the aggregate is completely outside the function:

 

FUNCTION f(a INT32) INT32 AS a+100 END

SELECT name, f(Count(*))

FROM mfd_meta GROUP BY name;

 

The following works as well (aggregate completely inside):

 

FUNCTION f(t TABLE) TABLE AS

(SELECT name, Count(*)+100 FROM t GROUP BY name) END

EXECUTE CALL f(mfd_meta);

 

The following works too with aggregates inside and outside the function but  separated from each other:

 

FUNCTION f(t TABLE, n NVARCHAR) TABLE AS

(SELECT Count(*) FROM t WHERE name=n) END

SELECT SPLIT CALL f(mfd_meta, Max(name)) FROM mfd_root;

 

This works:

 

SELECT Min(mfd_id) + Max(mfd_id) FROM mfd_meta;

 

But the following does not work:

 

FUNCTION f(a INT32) INT32 AS Min(a) + Max(a) END

SELECT f(mfd_id) FROM mfd_meta;

 

But this does work:

 

FUNCTION f(a INT32, b INT32) INT32 AS a + b END

SELECT f(Min(mfd_id), Max(mfd_id)) FROM mfd_meta;

 

In general, everything else apart from nested SELECTs and mixing between the inside and outside of functions is fair game.   For example, the following works:

 

SELECT SPLIT (VALUES (Min(mfd_id)), (Max(mfd_id)))

FROM mfd_meta;

 

..and this works as well:

 

SELECT SPLIT

  (EXECUTE WITH (n NVARCHAR=Max(name))

    [[ SELECT * FROM mfd_root WHERE name=n ]])

FROM mfd_meta;

 

Examples of Union and Merge Aggregates

Merge and Union aggregates usually also have a non-aggregate "Pair" version of the function that operates between two such types.  For example, the aggregate GeomMergeLines function has a GeomMergeLinesPair equivalent that operates to merge two line objects.

 

We will try passing a non-line to GeomMergeLinesPair:

 

--SQL

EXECUTE CALL GeomToCoords(GeomMergeLinesPair(

  GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),

  GeomConvertToLine(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))

))

 

...it ignored the non-line and returned branches for the line, the second argument. If we passed two non-lines it would have returned a NULL value.

 

The non-aggregate merge and union functions only take two arguments.  If we have three objects, how do we merge them?

 

We can do it like this:

 

--SQL

SELECT GeomMergePoints(f) FROM

(VALUES

 (GeomMakePoint(VectorMakeX2(0, 0))),

 (GeomMakePoint(VectorMakeX2(1, 1))),

 (GeomMakePoint(VectorMakeX2(2, 2)))

AS (f))

 

Or like this in the case of five objects (the same approach works for more than five):

 

--SQL

FUNCTION MakeSeveralPoints(p GEOM, q GEOM, r GEOM, s GEOM, t GEOM) TABLE AS

  (VALUES (p), (q), (r), (s), (t) AS (f))

END

SELECT GeomMergePoints(f) FROM CALL MakeSeveralPoints(

  GeomMakePoint(VectorMakeX2(0, 0)),

  GeomMakePoint(VectorMakeX2(1, 1)),

  GeomMakePoint(VectorMakeX2(2, 2)),

  GeomMakePoint(VectorMakeX2(3, 3)),

  GeomMakePoint(VectorMakeX2(4, 4)))

 

Finally, we take a look at the difference between GeomMergeAreas and GeomUnionAreas.  We run the following one by one:

 

--SQL

SELECT SPLIT CALL GeomToCoords(GeomMergeAreas(f)) FROM

(VALUES

 (GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

 (GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

 (GeomMakeRect(VectorMakeX4(8, 8, 9, 9)))  -- C

AS (f))

 

--SQL

SELECT SPLIT CALL GeomToCoords(GeomUnionAreas(f)) FROM

(VALUES

 (GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

 (GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

 (GeomMakeRect(VectorMakeX4(8, 8, 9, 9)))  -- C

AS (f))

 

GeomMergeAreas produces three branches while GeomUnionAreas produces two branches because B is completely covered by A and disappears in the union.