COLLECT

COLLECT is an aggregate that returns a table.  It is usually used with GROUP BY.  COLLECT gathers up the specified fields and their values from within each subgroup formed by GROUP BY, so those fields can be used in an arbitrary SQL expression or together with some other aggregate function.  That provides a way of doing aggregate operations using arbitrary SQL expressions, and not just using one of the built-in aggregate functions.   In the simplest form,  COLLECT just returns a list of values from a table within a desired field.

 

Modern databases such as Oracle or PostgreSQL allow some control over the values seen by aggregates. For example, a user might be able to write a query that contains several Sum aggregates of the same field with each Sum computing a partial sum guided by a filter specified within the Sum, for example,  SUM(x FILTER ...).   Manifold provides a generic aggregate, COLLECT, that allows doing these and other operations.   

 

The COLLECT aggregate returns a table of values collected from a subgroup, typically guided by GROUP BY.  The results table usually is then fed into a function or other clause, for example, SPLIT. This essentially allows creating custom aggregates and provides control over how values are used in other aggregates.    

 

The COLLECT statement is often used together with the SPLIT clause and the GROUP BY clause.  After reviewing this topic and the SPLIT and GROUP BY topics, the GROUP BY, COLLECT, and SPLIT Tutorial topic will help explain how those three SQL constructs work together.  

Example

The mfd_meta table in Manifold lists all properties for all components in a project.  Suppose we would like to create a table that for each component listed each property and its value for that component, and also listed the name of the component and the number of properties in it.  We would like to group the results by property name and show the results in order sorted by the total number of properties.  

 

That is easy to do using COLLECT:

 

SELECT name, Count(*) AS [Total Properties], SPLIT (COLLECT property, value)

  FROM mfd_meta GROUP BY name ORDER BY [Total Properties];

 

SPLIT is used in the above to show which values are in each COLLECT.  

 

A typical result:

 

 

As with SELECT, we can COLLECT expressions, for example:

 

SELECT name, Count(*) AS [Total Properties], SPLIT (COLLECT property, StringToUpperCase(value))

  FROM mfd_meta GROUP BY name ORDER BY [Total Properties];

 

In the above we use the StringToUpperCase SQL function to convert the contents of the Value field to upper case:

 

 

Since we have not aliased the result of the function to some desired column name (as we did with [Total Properties]) the column for that result is simply named result.

 

COLLECT Highlights:

 

 

 

 

 

Order of Precedence

Consider a query and discussion contributed by SQL master tjhb in the forum:

 

SELECT <field1>,

SPLIT

    (COLLECT <field2>

    ORDER BY <field2> DESC

    FETCH 3

    )

FROM <table>

GROUP BY <field1>

;

 

The order of evaluation is GROUP BY -> COLLECT -> ORDER BY -> FETCH (filter) -> SPLIT, as follows:

 

(GROUP BY)  In <table>, find the distinct values in <field1>, and for each value, group the records that share that value into a set.  

 

For each set A created by GROUP BY:

 

(COLLECT) List all values in <field2> into a new set B, then

 

(ORDER BY) Sort the records in set B, putting the highest value in <field2> at the top, then

 

(FETCH) Keep up to the first 3 records in set B, removing any others.

 

(SPLIT) List each member of set A into a separate record for each member of its own set B, writing the <field1> value from A beside each <field2> value from B.

 

The GROUP BY operates in conjunction with an aggregate.   In the above case the aggregate function is COLLECT.

Examples

Consider the Invoices table seen at right.   Each record has a Name, a Job classification, and a Payment.   Names and job classifications include some repetition, for example, with the name Alex occurring in three records, twice with Electrical jobs and once with a Carpentry job.

 

A simple SELECT query:

 

SELECT Job, Name

  FROM Invoices

  ORDER BY Job;

 

The query lists the job and name for all records with the records listed in alphabetical order by Job.

 

COLLECT is normally used to feed other functions, often using SPLIT.    For a quick refresher, let us see what SPLIT does with a simple SELECT query, like the above, but simply selecting Name within the SPLIT.

 

SELECT Job, SPLIT (SELECT Name FROM Invoices)

  FROM Invoices

  ORDER BY Job;

 

One might think that a SPLIT which permuted each instance of a Name from the results table of the SELECT together with the Job would return something similar to SELECT Job, Name FROM Invoices.   But that is not what SPLIT does.   

 

The SELECT statement within the SPLIT clause, SELECT Name FROM Invoices, returns a table with 14 records since there are that many records in the Invoices table.  For each record in that table the SELECT returns the value of the Name field, and thus returns 14 records.   SPLIT creates a permutation with all the records for the other field involved, in this case Job.   

 

The SELECT Job, SPLIT (SELECT.... statement is like saying SELECT Job, {for each thing in... (SELECT...)}...   For each record that results in a Job value for that initial SELECT, the SPLIT will generate 14 records.  We therefore end up with a table of 14 * 14 = 196 records.

 

Suppose we do not want 196 records but instead would like a more controlled permutation with each Name value extracted from the Invoices table to be used just once, so there is a total of only 14 records and not 196?   We can accomplish that by using the COLLECT aggregate within SPLIT together with GROUP BY to control how the COLLECT aggregate is used:

 

SELECT Job, SPLIT (COLLECT Name)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

In the above query GROUP BY Job operates first:  it scans the entire Invoices table and separates all the records into four groups or sets, a group of four records that all are Carpentry jobs, a group of two records that both are Electrical jobs, a group of five records that all are Masonry jobs and a group of three records that all are Plumbing jobs.

 

 

For each of the four groups created by GROUP BY Job, the COLLECT Name scans the group to see all the different values found in the Name field and makes a list of those values.  It "collects" the names found in each GROUP BY bundle.    

 

 

For example, the collection of Name values found in the group of Carpentry jobs is the set Tina, Alex, Tom, Tina.   The SPLIT permutes those four values from the collection with the single value of Carpentry for the Job so that in the case of Carpentry jobs the result of the SELECT Job, SPLIT(... is

 

 

For each of the other Job groups, the permutation of the Job with the collection of names within that group created by COLLECT has an analogous result.

 

The result of using COLLECT, except for the name of one of the columns,  is the same results table as the simple SELECT query:

 

SELECT Job, Name

  FROM Invoices

  ORDER BY Job;

 

If the result is the same as SELECT, why use COLLECT?  COLLECT provides helps us control other aggregates, for example, allowing us to create sub-totals using Sum.  If we were doing simple queries that did not involve aggregates we would not use COLLECT.  But in the case of queries that do use aggregates where we want more control, COLLECT can package inputs into those aggregates, as guided by the GROUP BY criteria, in ways that SELECT cannot.  We have seen a hint of such differences between COLLECT and a simple SELECT in the example above that showed how using SELECT within a SPLIT resulted in 196 records when only 14 were of interest.

 

To see how COLLECT works with other aggregates, let us consider a series of examples using the Sum aggregate function.   Sum takes a field from a set of records as an argument and returns a single record with the same field containing the sum of values of the field in the set.

 

Consider a simple SELECT using Sum and GROUP BY:

 

SELECT Job, Sum(Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

The results table reports the total of payments within each category.   For example, the two payments for Electrical jobs in the table, with a Payment of 43 and a Payment of 44, add up to the sum of 87 shown in the result column for the Electrical job category.

 

Suppose now that in addition to seeing a total for each category we also want to see each payment by name within that category.   Let us take one step in that direction by first creating a query that also reports a Name in the results table.

 

The wrong way to try to do that would be:

 

SELECT Job, Name, Sum(Payment)  --error

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

This is an error because the Sum aggregate is combining multiple records into one record, grouped by Job as guided by the GROUP BY clause.   The Job value is the same for all of those multiple records grouped into one because they are all being grouped by the Job value, so there is no question what job category to write next to the sum.  But that is not the case with the Name value.  The multiple records being combined into one sum record may have multiple names, so just writing Name does not say which Name value should be used.

 

Manifold could just pick a Name value at random from the names in the group of records that are all carpentry jobs, but professional quality SQL engines, including Manifold, usually  avoid such magic choices made by the engine instead of by the user.  Instead, they require explicit guidance, where if we set off to aggregate multiple records into one record we must explicitly state how each resulting field should be aggregated from multiple records into one record.   In our case, both Job and the result field for Sum are explicitly defined by the GROUP BY Job operating with Sum.    If we want to include a Name field in the resulting record we must explicitly state how one  Name value will be picked from the multiple records that are being aggregated by Sum.     

 

A convenient way to do that for lack of any brighter ideas is to use the First aggregate, which simply returns the first of multiple values it encounters:

 

SELECT Job, First(Name), Sum(Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

This approach adds a name, but the name in each record tells us nothing about the composition of the total sum.  The name value given may as well have been chosen at random, since the first name in an unordered list has no meaning other than the name value is a member of the list.

 

A different query using SELECT within a SPLIT shows the limitations of using SELECT:

 

SELECT Job, Sum(Payment), SPLIT (SELECT Name FROM Invoices)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

 

That is an even worse idea because it blindly permutes all the names in the table with all of the job groups in the table, including those names that do not appear in carpentry job records.  Jane, for example, appears in no carpentry jobs in the Invoices table.

 

The most useful approach is to use COLLECT within the SPLIT:

 

SELECT Job, Sum(Payment), SPLIT (COLLECT Name, Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

In the above query the GROUP BY Job clause first creates sets of records grouped by job category and then within each of those sets the COLLECT puts together a list of the Name and Payment for each such record.  Those are permuted with the Sum of the payments to create a list of records that will go into the results table for that job category.   

 

As an example, let us consider records where the Job is Carpentry.

 

 

GROUP BY Job creates a set of four records out of the Invoices table that have the value Carpentry as the Job.   The above group is the source population of records for what happens next, both in SELECT and also in COLLECT.

 

SPLIT permutes, that is, finds all the combinations of, two lists of records, in this case the two lists of records resulting from the action of GROUP BY Job.   One list of records is created inside the query engine by SELECT Job, Sum(Payment) as guided by GROUP BY Job.  That list for the carpentry group is one record:

 

 

The other list of records to be used in the permutation by SPLIT comes from COLLECT Name, Payment.   When COLLECT makes a list of all the Name and Payment occurrences in  the set of four records where the Job is Carpentry, the result is a list:

 

 

When SPLIT forms all combinations of the above two lists, one list with one record and the other list with four records, we get the Carpentry portion of the results table:

 

 

When Manifold does the above for all of the four job categories set up by GROUP BY Job we get a set of records of individual payments that is grouped by job category, and for each record has the total of payments in that category.

 

If we do not like the default name of result used for the Sum we can use the AS clause to specify any name for that column we prefer.

 

SELECT Job, Sum(Payment) AS Total, SPLIT (COLLECT Name, Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

The results table is the same, but just using a field name of Total instead of result.

 

ORDER BY within COLLECT

In the examples above we ordered the output table by the job category.   Suppose we would like to sort by Name as well?

 

We cannot use constructions such as:

 

SELECT Job, Sum(Payment), SPLIT (COLLECT Name, Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job, Name;    -- error

 

or

 

SELECT Job, Sum(Payment), SPLIT (COLLECT Name, Payment)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job, Invoices.Name;    -- error

 

Both of the above throw an error because the Sum aggregate collapses down different groups of names into a single record when creating a one record result such as:

 

 

We can, however, use ORDER BY with COLLECT to order the list it creates that will be permuted by SPLIT.   Instead of:

If we use the clause:

 

COLLECT Name, Payment ORDER BY Name

 

 

The list sent by COLLECT to be permuted by SPLIT will be ordered by Name.

 

The bottom line is that to order by Name within the larger ordering by Job we take advantage of how COLLECT supports the use of an ORDER BY clause within the COLLECT:

 

SELECT Job, Sum(Payment), SPLIT (COLLECT Name, Payment ORDER BY Name)

  FROM Invoices

  GROUP BY Job

  ORDER BY Job;

 

 

Notes

No brackets - In this topic to reduce visual clutter we do not use square [ ] brackets around simple field and table names.    Square brackets are optional if the name is unambiguous.

 

Action inside the query engine - Illustrations like the one below that are used in the narrative above help explain how a query functions inside the Manifold query engine in a narrative way.  

 

 

The query engine's function is in reality more complicated.  It does not actually build tables with column headings like Sum(Payment) for internal use.   But the illustration shows columns with names like that to make it easier for us to understand the process.

 

Use without GROUP BY - Is it possible to use COLLECT without GROUP BY?  Yes, it is possible but doing so misses the point of using COLLECT.  We can write:

 

SELECT SPLIT (COLLECT Name) FROM Invoices;

 

However, a query such as the above produces the same results, except for the name of the results column, as:

 

SELECT Name FROM Invoices;

 

COLLECT is in Manifold to assist when using aggregates and GROUP BY, and not to get to the same place as a SELECT by doing a back flip, possible though that may be.

 

See Also

Queries

 

Command Window

 

Query Builder

 

Aggregates

 

SQL Statements

 

SQL Functions

 

FETCH

 

GROUP BY

 

INLINE

 

ORDER BY

 

SPLIT Clause

 

GROUP BY, COLLECT, and SPLIT Tutorial