Aggregates include Sum, Avg, Count, Max and Min. Max and Min aggregates work for dates and strings and also for uuid and binary values. Max and Min aggregates work for xN vector values.
Using the Count aggregate function:
Count(field) - will count the number of non-NULL values in field.
Count(expression) - will count the number of non-NULL values of the expression.
Count(*) - will always count the total number of records.
Aggregates ignore NULL values.
The COLLECT aggregate collects values from a subgroup into a table. COLLECT returns a table with one or more fields and zero or more records. The values in that table normally are just regular values and are not tables.
Modern databases such as Oracle or PostgreSQL often allow control over the values seen by aggregates. For example, a user might write a query that contains several Sum aggregate functions using the same field with each Sum computing a partial sum utilizing a filter specified within the Sum, for example,
SUM(x FILTER ...)
The COLLECT aggregate provides a generic aggregate for performing such operations and other operations as well. For example:
SELECT name, Count(*),
SPLIT (COLLECT property, value)
FROM mfd_meta GROUP BY name;
In the above example SPLIT is used merely to show which values are in each COLLECT.
The table returned by COLLECT is typically fed into a function. This essentially allows creating custom aggregates, including with scripts.
We can also COLLECT expressions similar to how SELECT can use expressions:
SELECT name, Count(*),
SPLIT (COLLECT property, StringToUppercase(value))
FROM mfd_meta GROUP BY name;
One way to understand what COLLECT does is to start with what GROUP does. GROUP takes a table, sorts the records within that table into groups and then produces a record for each group by applying aggregates to derive that one record for the group.
Each aggregate takes a group, which is a subset of records from the original table, and from the values in that group computes one or more resulting values for the resulting record. For example, the aggregate Sum(f) takes all values in the group, sums them and returns the sum.
COLLECT works the same way: it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT.
Notes on COLLECT
COLLECT supports ORDER BY using the same syntax as using ORDER BY in SELECT statements, including ordering by expression.
COLLECT supports DISTINCT, for example as in
(COLLECT DISTINCT value)
COLLECT supports WHERE.
The following aggregate functions are available:
GeomMergeAreas takes a set of areas and returns an area geom that contains branches. Curves and Z values are removed.
GeomMergeLines takes a set of lines and returns a line geom that contains branches. Curves and Z values are removed.
GeomMergePoints takes a set of lines and returns a line geom that contains branches. Z values are removed.
GeomUnionAreas takes a set of areas and returns their union.
GeomUnionRects takes a set of x4 values and returns their union.
First and Last return the first and last value respectively. Unlike most other aggregates, First and Last do not skip NULLs.
JoinAnd, JoinOr and JoinXor combine boolean values using And, Or and Xor operators.
JoinBitAnd, JoinBitOr and JoinBitXor combine numeric values using BitAnd, BitOr and BitXor operators.
Median returns the median value of an arbitrary type.
StDev and StDevPop compute sample standard deviation and population standard deviation.
Var and VarPop compute sample variance and population variance.
Covar and CovarPop compute sample covariance and population covariance, taking two parameters.
Corr computes correlation, taking two parameters.
Nulls not skipped - First and Last do not skip NULLs because if they did a construction like
SELECT First(a), First(b)
could return values from different records.
Unused arguments - A function that does not use one of its arguments will not fail to compile when passed an aggregate. Instead, the aggregate is optimized away and is not computed. For example:
FUNCTION f(t NVARCHAR) INT32 AS 5 END
SELECT name, f(Max(property))
FROM mfd_meta GROUP BY name;
Will compile and run OK.
Streaming mode - Aggregates operate in streaming mode, spilling excess data to disk if required. This both allows handling bigger amounts of data at predictable speed and also protects against running out of memory if the system is low on memory.
Protection against running out of memory is not guaranteed by streaming as it is still possible to run out of memory, but doing so is much more difficult.
Too far away - When implementing aggregates any query engine must decide how far an aggregate can go from the producing GROUP, whether that is explicit or implicit. For example, considering the fragment
SELECT ... Max(a) ... FROM t GROUP BY b
the query engine must decide what can be allowed or disallowed within the ellipses. Manifold allows everything except nested SELECTs and functions.
Nested Selects - An aggregate may not go into a nested SELECT. So, for example, the following works:
FUNCTION f(t TABLE, u NVARCHAR) TABLE AS
(SELECT mfd_id, type FROM t WHERE type>u) END
SELECT name, SPLIT CALL f(mfd_root, Max(value))
FROM mfd_meta GROUP BY name;
But the following does not work:
SELECT name, SPLIT
(SELECT mfd_id, type FROM mfd_root
WHERE type>Max(value))
FROM mfd_meta GROUP BY name;
The above does not work because the inner SELECT treats Max(value) as its own, not that of the outer SELECT. If the inner SELECT accepted aggregates from the outer SELECT, we could end up with something like the following, which does not work:
SELECT name, SPLIT
(SELECT mfd_id, type FROM mfd_root
WHERE Max(name) > Max(value)
GROUP BY mfd_id, type)
FROM mfd_meta GROUP BY name;
In the above it would be difficult for a reader to tell which Max in the inner SELECT belongs to that SELECT and which belonged to the outer SELECT. In such cases it would be easy to create queries that would compile and run but which did not work as intended.
To prevent such problems, in Manifold a SELECT acts as a fence for aggregates. All aggregates inside a SELECT belong to that SELECT.
Functions - An aggregate may not cross over between the outside and the inside of the body of a FUNCTION. That does not mean we cannot use aggregates within functions. Aggregates may be used within functions but they either must be completely inside a function or completely outside a function. The boundary between the inside and outside of a function is also like a fence that an aggregate may not cross.
For example, the following works because the aggregate is completely outside the function:
FUNCTION f(a INT32) INT32 AS a+100 END
SELECT name, f(Count(*))
FROM mfd_meta GROUP BY name;
The following works as well (aggregate completely inside):
FUNCTION f(t TABLE) TABLE AS
(SELECT name, Count(*)+100 FROM t GROUP BY name) END
EXECUTE CALL f(mfd_meta);
The following works too with aggregates inside and outside the function but separated from each other:
FUNCTION f(t TABLE, n NVARCHAR) TABLE AS
(SELECT Count(*) FROM t WHERE name=n) END
SELECT SPLIT CALL f(mfd_meta, Max(name)) FROM mfd_root;
This works:
SELECT Min(mfd_id) + Max(mfd_id) FROM mfd_meta;
But the following does not work:
FUNCTION f(a INT32) INT32 AS Min(a) + Max(a) END
SELECT f(mfd_id) FROM mfd_meta;
But this does work:
FUNCTION f(a INT32, b INT32) INT32 AS a + b END
SELECT f(Min(mfd_id), Max(mfd_id)) FROM mfd_meta;
In general, everything else apart from nested SELECTs and mixing between the inside and outside of functions is fair game. For example, the following works:
SELECT SPLIT (VALUES (Min(mfd_id)), (Max(mfd_id)))
FROM mfd_meta;
..and this works as well:
SELECT SPLIT
(EXECUTE WITH (n NVARCHAR=Max(name))
[[ SELECT * FROM mfd_root WHERE name=n ]])
FROM mfd_meta;
Merge and Union aggregates usually also have a non-aggregate "Pair" version of the function that operates between two such types. For example, the aggregate GeomMergeLines function has a GeomMergeLinesPair equivalent that operates to merge two line objects.
We will try passing a non-line to GeomMergeLinesPair:
--SQL
EXECUTE CALL GeomToCoords(GeomMergeLinesPair(
GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),
GeomConvertToLine(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))
))
...it ignored the non-line and returned branches for the line, the second argument. If we passed two non-lines it would have returned a NULL value.
The non-aggregate merge and union functions only take two arguments. If we have three objects, how do we merge them?
We can do it like this:
--SQL
SELECT GeomMergePoints(f) FROM
(VALUES
(GeomMakePoint(VectorMakeX2(0, 0))),
(GeomMakePoint(VectorMakeX2(1, 1))),
(GeomMakePoint(VectorMakeX2(2, 2)))
AS (f))
Or like this in the case of five objects (the same approach works for more than five):
--SQL
FUNCTION MakeSeveralPoints(p GEOM, q GEOM, r GEOM, s GEOM, t GEOM) TABLE AS
(VALUES (p), (q), (r), (s), (t) AS (f))
END
SELECT GeomMergePoints(f) FROM CALL MakeSeveralPoints(
GeomMakePoint(VectorMakeX2(0, 0)),
GeomMakePoint(VectorMakeX2(1, 1)),
GeomMakePoint(VectorMakeX2(2, 2)),
GeomMakePoint(VectorMakeX2(3, 3)),
GeomMakePoint(VectorMakeX2(4, 4)))
Finally, we take a look at the difference between GeomMergeAreas and GeomUnionAreas. We run the following one by one:
--SQL
SELECT SPLIT CALL GeomToCoords(GeomMergeAreas(f)) FROM
(VALUES
(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A
(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B
(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C
AS (f))
--SQL
SELECT SPLIT CALL GeomToCoords(GeomUnionAreas(f)) FROM
(VALUES
(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A
(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B
(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C
AS (f))
GeomMergeAreas produces three branches while GeomUnionAreas produces two branches because B is completely covered by A and disappears in the union.