One way to understand what aggregate functions
or clauses do is to start with what **GROUP BY** does. GROUP takes
a table, sorts the records within that table into groups and then produces
one record for each group by applying aggregates to derive that one record
for the group.

An **aggregate**
function or clause takes a group, which is a subset of records from the
original table, and from the values in that group computes one or more
resulting values for the resulting record. For example, the
aggregate **Sum(f)** takes all values
in the group, sums them and returns the sum. Aggregates ignore
**NULL **values.

Aggregates include **Sum**,** Avg**,**
** **Count**, **Max** and
**Min**. **Max**
and **Min** aggregates work for dates
and strings and also for **uuid** and
**binary** values. **Max** and
**Min** aggregates work for **xN** vector values. See
also the **SQL
Functions** and **INLINE**
topics for examples.

Depending on the argument, the **Count**
aggregate function counts either the number of non-NULL values in a field
or expression within a group, or it counts the total number of records
within a group:

**Count(field)**- will count the number of non-NULL values in**field**.**Count(expression)**- will count the number of non-NULL values of the**expression**.**Count(*)**- will always count the total number of records.

A classic example is finding the number
of duplicates in a **City**
field within a table, using **Count(*)**:

SELECT [City], Count(*) FROM [Employees]

GROUP BY [City]

HAVING Count(*) > 1;

See the discussion of the above example
in the **Queries**
topic.

The **COLLECT**
aggregate collects values from a subgroup into a table. COLLECT
returns a table with one or more fields and zero or more records. The
values in that table normally are just
regular values and are not tables.

Modern databases such as Oracle or PostgreSQL
often allow control over the values seen by aggregates. For example, a
user might write a query that contains several **Sum
**aggregate functions using the same field with each **Sum**
computing a partial sum utilizing a filter specified within the **Sum**, for example,

SUM(x FILTER ...)

The **COLLECT**
aggregate provides a generic aggregate for performing such operations
and other operations as well. For example:

SELECT name, Count(*),

SPLIT (COLLECT property, value)

FROM mfd_meta GROUP BY name;

In the above example SPLIT is used merely to show which values are in each COLLECT.

The table returned by COLLECT is typically fed into a function. This essentially allows creating custom aggregates, including with scripts.

We can also COLLECT expressions similar to how SELECT can use expressions:

SELECT name, Count(*),

SPLIT (COLLECT property, StringToUppercase(value))

FROM mfd_meta GROUP BY name;

One way to understand what COLLECT does is to start with what GROUP does. GROUP takes a table, sorts the records within that table into groups and then produces a record for each group by applying aggregates to derive that one record for the group.

Each aggregate takes a group, which is
a subset of records from the original table, and from the values in that
group computes one or more resulting values for the resulting record.
For example, the aggregate **Sum(f)**
takes all values in the group, sums them and returns the sum.

COLLECT works the same way: it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT.

**Notes
on COLLECT**

COLLECT supports ORDER BY using the same syntax as using ORDER BY in SELECT statements, including ordering by expression.

COLLECT supports DISTINCT, for example as in

(COLLECT DISTINCT value)

COLLECT supports WHERE.

The following aggregate functions are available:

Diversity - Takes a set of numbers and computes the total number of different values.

**DiversityIndex**- Takes a set of numbers and computes a measure of diversity using the formula**1 - sum(individualcount^2) / (totalcount^2)**A diversity index of 0 means that all values are the same.**GeomMergeAreas**takes a set of areas and returns an area geom that contains branches. Curves and Z values are removed.**GeomMergeLines**takes a set of lines and returns a line geom that contains branches. Curves and Z values are removed.**GeomMergePoints**takes a set of points and returns a multipoint geom that contains branches. Z values are removed.**GeomUnionAreas**takes a set of areas and returns their union.**GeomUnionRects**takes a set of**x4**values and returns their union.**First**and**Last**return the first and last value respectively. Unlike most other aggregates, First and Last do not skip NULLs.**JoinAnd**,**JoinOr**and**JoinXor**combine boolean values using**And**,**Or**and**Xor**operators.**JoinBitAnd**,**JoinBitOr**and**JoinBitXor**combine numeric values using**BitAnd**,**BitOr**and**BitXor**operators.**Median**returns the median value of an arbitrary type.**Major**- Takes a set of numbers and returns the most frequently occurring value.**StDev**and**StDevPop**compute sample standard deviation and population standard deviation.**Var**and**VarPop**compute sample variance and population variance.**Covar**and**CovarPop**compute sample covariance and population covariance, taking two parameters.**Corr**computes correlation, taking two parameters.

**2 billion
/ 1 billion record limitations** - Median, Diversity, DiversityIndex,
Major are currently limited to 2 billion values, as a practical matter
limiting their use to aggregates of no more than 2 billion records.. Corr,
Covar, CovarPop are currently limited to 1 billion pairs of values. Future
builds will remove those limits. StDev, StDevPop, Var, VarPop were
also limited to 2 billion values in the past, but no longer have that
limit.

**Median
aggregate ** - The behavior of Median on an even number of values
is an implementation choice. Consider taking the median of two integers,
1, and 2. What is the median? 8 selects the lowest value of
a pair of central values. 9 selects the highest value, primarily for historical
reasons. Manifold code could be adjusted to selecting the lowest value
like 8 does, simply to avoid breaking compatibility for little reason.
That might happen in future builds. However, it could be that
was is really needed here is a variant of Median that takes an average.

**Nulls
not skipped** - **First** and
**Last** do not skip NULLs because
if they did a construction like

SELECT First(a), First(b)

could return values from different records.

**Unused
arguments** - A function that does not use one of its arguments will
not fail to compile when passed an aggregate. Instead, the aggregate
is optimized away and is not computed. For example:

FUNCTION f(@t NVARCHAR) INT32 AS 5 END;

SELECT name, f(Max(property))

FROM mfd_meta GROUP BY name;

Will compile and run OK.

Streaming mode - Aggregates operate in streaming mode, spilling excess data to disk if required. This both allows handling bigger amounts of data at predictable speed and also protects against running out of memory if the system is low on memory.

Protection against running out of memory is not guaranteed by streaming as it is still possible to run out of memory, but doing so is much more difficult.

**Too far
away** - When implementing aggregates any query engine must
decide how far an aggregate can go from the producing GROUP, whether that
is explicit or implicit. For example, considering the fragment

SELECT ... Max(a) ... FROM t GROUP BY b

the query engine must decide what can be allowed or disallowed within the ellipses. Manifold allows everything except nested SELECTs and functions.

**Nested Selects -** An aggregate
may **not** go into a nested SELECT.
So, for example, the following works:

FUNCTION f(@t TABLE, @u NVARCHAR) TABLE AS

(SELECT mfd_id, type FROM @t WHERE type>@u) END;

SELECT name, SPLIT CALL f(mfd_root, Max(value))

FROM mfd_meta GROUP BY name;

But the following does **not**
work:

SELECT name, SPLIT

(SELECT mfd_id, type FROM mfd_root

WHERE type>Max(value))

FROM mfd_meta GROUP BY name;

The above does not work because the inner
SELECT treats **Max(value)** as its
own, not that of the outer SELECT. If the inner SELECT accepted aggregates
from the outer SELECT, we could end up with something like the following,
which does **not** work:

SELECT name, SPLIT

(SELECT mfd_id, type FROM mfd_root

WHERE Max(name) > Max(value)

GROUP BY mfd_id, type)

FROM mfd_meta GROUP BY name;

In the above it would be difficult for
a reader to tell which **Max** in
the inner SELECT belongs to that SELECT and which belonged to the outer
SELECT. In such cases it would be easy to create queries
that would compile and run but which did not work as intended.

To prevent such problems, in Manifold a SELECT acts as a fence for aggregates. All aggregates inside a SELECT belong to that SELECT.

**Functions**
- An aggregate may **not** cross over
between the outside and the inside of the body of a FUNCTION. That
does not mean we cannot use aggregates within functions. Aggregates
may be used within functions but they either must be completely inside
a function or completely outside a function. The boundary between
the inside and outside of a function is also like a fence that an aggregate
may not cross.

For example, the following works because the aggregate is completely outside the function:

FUNCTION f(@a INT32) INT32 AS @a+100 END;

SELECT name, f(Count(*))

FROM mfd_meta GROUP BY name;

The following works as well (aggregate completely inside):

FUNCTION f(@t TABLE) TABLE AS

(SELECT name, Count(*)+100 FROM @t GROUP BY name) END;

TABLE CALL f(mfd_meta);

The following works too with aggregates inside and outside the function but separated from each other:

FUNCTION f(@t TABLE, @n NVARCHAR) TABLE AS

(SELECT Count(*) FROM @t WHERE name=@n) END;

SELECT SPLIT CALL f(mfd_meta, Max(name)) FROM mfd_root;

This works:

SELECT Min(mfd_id) + Max(mfd_id) FROM mfd_meta;

But the following does **not**
work:

FUNCTION f(@a INT32) INT32 AS Min(@a) + Max(@a) END;

SELECT f(mfd_id) FROM mfd_meta;

But this does work:

FUNCTION f(@a INT32, @b INT32) INT32 AS @a + @b END;

SELECT f(Min(mfd_id), Max(mfd_id)) FROM mfd_meta;

In general, everything else apart from nested SELECTs and mixing between the inside and outside of functions is fair game. For example, the following works:

SELECT SPLIT (VALUES (Min(mfd_id)), (Max(mfd_id)))

FROM mfd_meta;

..and this works as well:

SELECT SPLIT

(EXECUTE WITH (@n NVARCHAR=Max(name))

[[ SELECT * FROM mfd_root WHERE name=@n ]])

FROM mfd_meta;

Merge and Union aggregates usually also
have a non-aggregate "Pair" version of the function that operates
between two such types. For example, the aggregate **GeomMergeLines**
function has a **GeomMergeLinesPair**
equivalent that operates to merge two line objects.

We will try passing a non-line to **GeomMergeLinesPair**:

--SQL

TABLE CALL GeomToCoords(GeomMergeLinesPair(

GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),

GeomConvertToLine(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))

))

...it ignored the non-line and returned branches for the line, the second argument. If we passed two non-lines it would have returned a NULL value.

The non-aggregate merge and union functions only take two arguments. If we have three objects, how do we merge them?

We can do it like this:

--SQL

SELECT GeomMergePoints(f) FROM

(VALUES

(GeomMakePoint(VectorMakeX2(0, 0))),

(GeomMakePoint(VectorMakeX2(1, 1))),

(GeomMakePoint(VectorMakeX2(2, 2)))

AS (f))

Or like this in the case of five objects (the same approach works for more than five):

--SQL

FUNCTION MakeSeveralPoints(@p GEOM, @q GEOM, @r GEOM, @s GEOM, @t GEOM) TABLE AS

(VALUES (@p), (@q), (@r), (@s), (@t) AS (f))

END;

SELECT GeomMergePoints(f) FROM CALL MakeSeveralPoints(

GeomMakePoint(VectorMakeX2(0, 0)),

GeomMakePoint(VectorMakeX2(1, 1)),

GeomMakePoint(VectorMakeX2(2, 2)),

GeomMakePoint(VectorMakeX2(3, 3)),

GeomMakePoint(VectorMakeX2(4, 4)))

Finally, we take a look at the difference
between **GeomMergeAreas** and **GeomUnionAreas**. We run the
following one by one:

--SQL

SELECT SPLIT CALL GeomToCoords(GeomMergeAreas(f)) FROM

(VALUES

(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C

AS (f))

--SQL

SELECT SPLIT CALL GeomToCoords(GeomUnionAreas(f)) FROM

(VALUES

(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C

AS (f))

**GeomMergeAreas**
produces three branches while **GeomUnionAreas**
produces two branches because **B**
is completely covered by **A** and
disappears in the union.