Why do Non-Capture Groups Exist?

See the Regular Expressions topic for the context of this essay.


The only difference between capture groups and non-capture groups is that a capture group stores the character sequence it matched for possible later re-use through a numbered back reference, while a non-capture group does not.  Both are used to group subexpressions, which is the main reason most people use parentheses ( ) within regular expressions.


For example, if we want to match a repeating sequence such as Urgent we could use either (Urgent)+ or (?:Urgent)+.  Both will match sequences such as UrgentUrgent and UrgentUrgentUrgent.  The only difference is that the capture group, written with plain parentheses ( ), stores the matched text internally in a results array from which we can summon it later in that same regular expression using the back reference \1, while the non-capture group, written with an opening sequence of (?: and a closing ), does not store the matched text.
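As a minimal sketch of that difference, assuming Python's standard re module (this essay does not depend on any particular regular expression engine, and the variable names are only illustrative), both forms match the same text but only the capture group leaves a stored result behind:

```python
import re

text = "UrgentUrgentUrgent"

# Same grouping, with and without capturing.
capturing = re.fullmatch(r"(Urgent)+", text)
non_capturing = re.fullmatch(r"(?:Urgent)+", text)

# Both patterns match the same overall text.
print(capturing.group(0))      # UrgentUrgentUrgent
print(non_capturing.group(0))  # UrgentUrgentUrgent

# Only the capture group stores a numbered result
# (for a repeated group, the last repetition is kept).
print(capturing.groups())      # ('Urgent',)
print(non_capturing.groups())  # ()

# The stored result can be reused inside the pattern with \1.
doubled = re.fullmatch(r"(Urgent)\1", "UrgentUrgent")
print(doubled is not None)     # True
```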


Most users of regular expressions do not use back references.  They use parentheses ( ) simply for grouping, the way they have learned to do from a lifetime of use in mathematics and most other programming.  There also seems to be no apparent harm in always using capture groups: who cares if a label is attached internally, since we are not required to ever use that label?  So why bother having non-capture groups at all?  Are they not just a syntax quirk whose only visible effect is to spoil an otherwise totally reliable method of determining the number of a given capture group, namely counting left parentheses?
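The numbering point can be illustrated with a short sketch, again assuming Python's re module and a made-up sample pattern.  A non-capture group is skipped when groups are numbered, so counting left parentheses no longer gives the group number:

```python
import re

# Hypothetical pattern: an optional-style title that is grouped but
# not captured, followed by two captured names.
compiled = re.compile(r"(?:Mr|Ms)\. (\w+) (\w+)")

# Three left parentheses appear in the pattern text...
left_parens = compiled.pattern.count("(")
print(left_parens)        # 3

# ...but only two of them open numbered capture groups.
print(compiled.groups)    # 2

match = compiled.match("Ms. Ada Lovelace")
print(match.group(1))     # Ada
print(match.group(2))     # Lovelace
```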


As far as syntax quirks go, to get insight into why capture groups are the way they are we have to look at design decisions going back many years.  A key notion is that although we may be interested in capture groups primarily for the grouping effect of parentheses ( ), the original impetus behind capture groups was a desire to capture text matched earlier in the regular expression.  When scripting, the capturing is especially important because it is more common to want to use captured results outside of the regular expression, as opposed to using back references within the regular expression itself, as might be done in interactive uses such as the Select panel template tab.
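The two kinds of re-use can be sketched as follows, assuming Python's re module and sample strings invented for illustration.  The first pattern uses a back reference inside the regular expression itself; the others use the captured results outside the pattern, in a replacement string and in ordinary code:

```python
import re

# Inside the pattern: \1 must repeat whatever group 1 matched,
# which finds an accidentally doubled word.
doubled_word = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled_word.group(0))   # is is

# Outside the pattern: captured pieces reused in a replacement string.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-31")
print(reordered)               # 31/01/2024

# Outside the pattern: captured pieces read directly by the script.
date = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-01-31")
year, month, day = date.groups()
print(year, month, day)        # 2024 01 31
```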


For a construction distinguished by its ability to capture, that is, to harvest chunks of prior results for later re-use, it was simply a less than ideal syntax choice to have the plain parentheses ( ) mean something more than grouping while the unusual construction (?: ) means just grouping.  It would have made more sense to have parentheses ( ) indicate grouping, perhaps even calling them simply "groups," while reserving a specially escaped syntax in the form of (?: ) to indicate grouping plus capturing of results within an internal array to be available later.


But whether or not the choice between ( ) and (?: ) was made backwards compared to how most people conceptualize ( ) does not answer why non-capture groups exist at all.  The answer is that in modern times there is no compelling reason beyond inertia from an earlier day.  It is true there is some overhead required to save subexpression results from capture groups in a result array, so there might be some performance boost from using a non-capture group within a regular expression when we have no intention of utilizing a back reference.  But given the confusing and little-known nature of the (?: ) syntax compared to the more expected and better known ( ), it is unlikely many people will use non-capture groups for that reason.


Another possible reason for non-capture groups is that in programming the capturing of results may have unexpected behavioral side effects, since some functions in regular expression libraries return different results when a pattern contains capture groups.  But that is also a case where using a mix of capture and non-capture groups is asking for errors, or at least confusion.
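One concrete case of such a side effect, shown here as a sketch that assumes Python's re module (other languages and libraries may behave differently), is that re.split and re.findall change what they return when the pattern contains a capture group:

```python
import re

# re.split: a capture group causes the separators to be returned too.
split_plain = re.split(r"(?:,|;)", "a,b;c")
split_captured = re.split(r"(,|;)", "a,b;c")
print(split_plain)      # ['a', 'b', 'c']
print(split_captured)   # ['a', ',', 'b', ';', 'c']

# re.findall: with a capture group, the group's text is returned
# instead of the whole match.
found_plain = re.findall(r"(?:ab)+", "abab ab")
found_captured = re.findall(r"(ab)+", "abab ab")
print(found_plain)      # ['abab', 'ab']
print(found_captured)   # ['ab', 'ab']
```

In both pairs the patterns match exactly the same text; only the presence of the capture group changes the returned values.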


The bottom line is that while it might have been prettier to set the default syntax of ( ) parentheses to simply mean grouping and to utilize an escaped (?: ) syntax to mean grouping with capturing, the syntax is what it is, and after over 40 years of embedding within numerous uses of regular expressions the syntax is not going to change.  That non-capture groups exist at all does not mean we must use them, except of course to bedevil colleagues with a syntax quirk that we know and they do not.  The syntax for capture groups versus non-capture groups is just another quirk to be learned that has been inherited from the great antiquity of regular expressions.


That antiquity is not a bad thing: one reason regular expressions have been around for so long is precisely that, despite the occasionally less than ideal choice of syntax, they are so tremendously useful and powerful.