Python RegEx through the ages
1. Introduction
Python has gone through three regular expression modules during its history (regexp, regex and re) and a fourth has been greenlighted (also called regex, confusingly). Furthermore, more than one of these were reïmplemented at least once during their existence. As they were superseded, disappeared from the documentation and were finally removed from the distribution, I am very much aware that the interface and functionality of the older modules may well be an important thing to have a reference for when trying to understand legacy Python code. The module name collision only adds to this, so I thought it would be sensible to write this up.
Further to the goal of allowing understanding and porting of legacy Python code, and in case any unwisely written code depends on particular details of an individual implementation, I also attempt to “document the undocumented” in terms of implementation details, low-level interfaces, numerical values of constants et cetera which did not qualify for coverage in the standard library manuals.
It is also of interest from an academic perspective, as each new module and each revision of an existing module contributed to the current interface as it stands. Outside of Python, the history of Python regular expressions is also fundamentally tied with the history of named groups in regular expressions, so this may be of passing interest to that topic.
This document is currently a work in progress, though an overview table is fairly complete. It also currently covers CPython only; coverage of Jython, PyPy or IronPython is a more distant goal, though the basic API should be compatible between them.
While limited detail on this is discernable from Python’s HISTORY file, this is not adequate to establish the API changes between the modules themselves. Other sources include old source distributions, historical VCS (which is currently incomplete for the earliest releases), older versions of the documentation, et cetera. In any case, I felt it sensible to write it up in one place. So here it is: the history of Python’s regular expressions.
2. First API
Python’s original regular expression support accepted Posix Extended syntax, and could use either UNIXv8 regular expressions or Henry Spencer’s reïmplementation (a version of which was included).
This was present in Python 0.9.1, which incorporated only minor changes from Python 0.9.0, the first public source release. Python 0.9.1 was posted on Usenet alt.sources and consequently preserved in archives (a tarball conversion was previously offered on python.org and still is on legacy.python.org).
The same cannot be said of the vast majority of Python 0.9.x release packages, which were distributed for limited periods only on CWI’s FTP site. While the HISTORY file does record the changes between these versions, taken from previous versions of the NEWS file, it does not do so in especially great detail (the NEWS file entries of 0.9.x are the equivalent of the “What’s new in” documents of 2.x and 3.x, not the detailed changelogs which they now are).
The cpython-fullhistory repository does however provide an archive of old VCS, tagged back to 0.9.8 (it actually goes back to 1990, before 0.9.0), but doesn’t seem to include all the files that made it into the release (in particular, regexpmodule.c is nowhere to be seen). Additionally, its directory structure in these old commits seems to be influenced by later file moves/renames rather than preserving the original directory structure evident in the release package. I have come to suspect that files which ceased to exist before a certain revision simply are not preserved. While that repository is now Mercurial, before that it was Subversion, before that it was CVS, and I don’t know whether even that was the first, so it probably cannot be assumed to be as useful or dependable as, e.g., modern Git.
2.1. The regexp module (original version)
Guido’s Python 0.9.1 library reference (provided as LaTeX in the alt.sources release) gives the following documentation:
3.4 Built-in Module regexp
This module provides a regular expression matching operation. It is always available. The module defines a function and an exception:
compile(pattern) Compile a regular expression given as a string into a regular expression object. The string must be an egrep-style regular expression; this means that the characters '(' ')' '*' '+' '?' '|' '^' '$' are special. (It is implemented using Henry Spencer’s regular expression matching functions.)
regexp.error Exception raised when a string passed to compile() is not a valid regular expression (e.g., unmatched parentheses) or when some other error occurs during compilation or matching ("no match found" is not an error).
Compiled regular expression objects support a single method:
exec(str) Find the first occurrence of the compiled regular expression in the string str. The return value is a tuple of pairs specifying where a match was found and where matches were found for subpatterns specified with '(' and ')' in the pattern. If no match is found, an empty tuple is returned; otherwise the first item of the tuple is a pair of slice indices into the search string giving the match found. If there were any subpatterns in the pattern, the returned tuple has an additional item for each subpattern, giving the slice indices into the search string where that subpattern was found.
Licence for the above documentation:
Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.
All Rights Reserved
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
2.2. The regexp module (versions from Python 0.9.2+)
The above API documentation is correct for Python 0.9.1. Following that version, the API was changed:
- The exec method was renamed to match.
- An additional match(pat, str) global function was added. This would compile the pattern and then call its match method, although at least some versions would reüse a compiled pattern when the same pattern was supplied consecutively.
Determining when these changes were made is complicated by regexpmodule.c not being tracked for reasons speculated above. The HISTORY file is silent on this, understandable for the second but somewhat frustrating for the first. By checking changes to API usage in other modules which are tracked, namely the grep module, it is however apparent that the .exec method had been renamed or at least aliased to .match by the time Python 0.9.2 was released.
In Python 0.9.5, the first regex module was introduced and the regexp module was reimplemented in Python as a wrapper. By this point, the match global function had been added, and the method was available only as .match.
The built-in exec function became a keyword in Python 1.0 (this decision was reversed in Python 3.0). In response, os.exec became os.execv. One might be tempted to speculate that the renaming of the regular expression .exec method in Python 0.9.2 may have been motivated by plans to do this, despite being somewhat far in advance. However, considering that os.exec was introduced under that name in Python 0.9.2, this is almost certainly not the case. Nonetheless, the renaming may have been influenced by the introduction of os.exec.
The regexp module was removed in Python 1.5.
3. Second API
Python 0.9.5, which introduced the regex module, was “released as Macintosh application only, 2 Jan 1992” per the HISTORY file, and appears to have used “GNU regex.c, from subdirectory regex”, per contemporary source comments in VCS. In Python 0.9.6 (evident from both VCS and the HISTORY file), Guido switched to a non-copylefted libre reïmplementation of GNU regex which Tatu Ylönen had written and posted to comp.sources.misc, thus avoiding bringing Python under the GPL.
3.1. The first regex module
Official documentation for the first regex module: Python 1.5.1, 1.5.2, 1.6† (undocumented in Python 2.0+)
Not to be confused with the modern, identically named (second) regex module planned for future inclusion in the standard library.

This was the main module, the C module, and the direct successor of the regexp module. However, there were many important differences from the comparatively simple first API:
- Defaulting to Emacs-style regular expression syntax, not to Posix Extended. In particular, escaped \( , \| , \) were used for groups and unescaped ( | ) were literal. The \s syntax table escapes (e.g. \s<) and the \= escape were not supported, due to not actually being attached to Emacs.
- A new global set_syntax function to set the syntax-flags integer; the regex_syntax module supplied these flags and precomposed such integers for the styles of awk, grep et cetera. The first API was emulated using “awk” mode, not “egrep” mode, because the latter mode treated newline as an “or” operator. Calling set_syntax also returned the previous syntax-flags integer so that it could be restored. Needless to say, this could be extremely thread-unsafe, particularly as there was originally no way to query the syntax without changing it (a get_syntax function was added much later, but not before the third API had already been introduced).
- The match method looked only at the start (or provided offset), while the new search method would scan the entire string. Both accepted an optional second argument (pos), which specified an offset to start looking from. Both were still available as global functions with single-pattern caches, but the usefulness of these is limited for the reason explained below (and they didn't accept the pos argument).
- match and search no longer returned tuples of tuples of slice indices. Instead, match returned a length (or -1 if it did not match) and search returned an offset (or -1). The tuple of tuples of slice indices was made available as the regs (for “registers”) attribute of the compiled regular expression object itself (…yes, really) following matching/searching, which was None if there was no match. Needless to say, this was not thread-safe.
- The first-API regexp.py file (wrapping the second-API first regex module) contained code for removing zero or more (-1, -1) tuples (i.e. groups that did not contribute) from the end of regs before returning. This presumably represents a further API difference.
The following additions were made to the API in Python 0.9.9:
- The compile function now accepted an optional second argument, called translate. This took a length-256 string of bytes onto which input bytes were to be mapped. The casefold global provided one of these for ASCII casefolding. Needless to say, this was not exactly Unicode-ready, but this was Python 0.9.9 and the early 1990s, and Unicode support was a major introduction in Python 2,† by which point the third API was already in place.
- Compiled regular expression objects provided a group method which returned the string matched by the group of a given number. If multiple arguments were given, a tuple was returned. Passing no arguments returned all groups.
- Additional properties (besides regs) were made available on the regular expression object:
  - last — the last string passed to match/search if a match was found. Otherwise None.
  - translate — the translate argument with which the compiled regular expression object was created. Otherwise None.
The symcomp function was added in Python 1.0. This worked as compile, but would parse a <name> string at the start of a parenthetical group as a group name, which could be passed to the group method. The Ylönen backend was modified to export the syntax code so it was freely visible to symcomp (but still not visible without side effects to native Python code until get_syntax was added much later). Supporting this, three new regular expression object properties were added:
- givenpat — the regular expression pattern from which the regular expression object was compiled.
- realpat — the regular expression pattern, but with group names stripped if symcomp was used.
- groupindex — dict mapping group names to group indices.
Following Emacs style, \b for word boundaries included the start and end of the string unconditionally, and \B for word non-boundaries excluded the start and end of the string unconditionally. The escapes \< for word start boundaries and \> for word end boundaries, on the other hand, would more conventionally match at the start and end of the string (respectively) if next to a word character.
Similarly, the underscore was not initially included in the definition of \w for word characters, again following Emacs style. This seems to have been changed in Python 1.5 as one of Jeffrey Ollie’s revisions, although without updating the documentation for the regex module (I’m unsure to what extent affecting the regex module was intentional here, since the change accompanied the addition of extra syntax table flags used only by the re1 module—see below). The Emacs syntax \_< and \_> for “symbol” boundaries (i.e. like \< and \> but treating the underscore as part of the word/symbol) was never supported.
Having been emitting DeprecationWarning since at least Python 2.1, the first regex module was at length removed in Python 2.5.
3.2. The regex_syntax module
The regex_syntax module was added in Python 0.9.5 and thereäfter functionally changed only once, six years later (in Python 1.5, with the addition of two more flags). Subsequent minor changes post-obsolescence added a module docstring and converted tabs in the comments to spaces.
The regex_syntax module was never separately documented; the documentation advised reading its source code. It defined the following syntax flags:
Flag name | Value | Description |
---|---|---|
RE_NO_BK_PARENS | 1 | Make plain () grouping and escaped \(\) act literal. |
RE_NO_BK_VBAR | 2 | Make plain | act as an or-operator and escaped \| act literal. |
RE_BK_PLUS_QM | 4 | Make plain +? act literal and escaped \+\? act as operators. |
RE_TIGHT_VBAR | 8 | Make | bind tighter than ^$ . |
RE_NEWLINE_OR | 16 | Make line breaks in the pattern behave as or-operators. |
RE_CONTEXT_INDEP_OPS | 32 | As (inaccurately) documented in source code comments, treat ^$*+? as literal in contexts where they don’t otherwise make sense. Actually, the other way around (makes them an error in contexts where they don’t make sense, as one might deduce from the flag name). |
RE_ANSI_HEX | 64 | Added to the module in 1.5: process \n , \x00 et cetera. Bizarrely, this seems to also have turned on e.g. \v10 for accessing group number 10, which actually collides with \v for vertical tab; turning it off would have made more sense. |
RE_NO_GNU_EXTENSIONS | 128 | Added to the module in 1.5; disable syntax for matching the start of the entire string (\` ), end of the entire string (\' ), word boundaries and word characters. |
The following pre-composed (by bitwise-or) flag combinations were also defined. Note that the {count}, {minimum,} and {minimum,maximum} syntaxes for repetitions were not supported at all, hence the lack of a flag for them (Emacs and grep styles backslash them, while egrep style does not).
Mode name | Flags |
---|---|
RE_SYNTAX_AWK | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS |
RE_SYNTAX_EGREP | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS, RE_NEWLINE_OR |
RE_SYNTAX_GREP | RE_BK_PLUS_QM, RE_NEWLINE_OR |
RE_SYNTAX_EMACS | No flags set (zero) |
The regex_syntax module was removed in Python 2.5.
3.3. The regsub module
Official documentation for the regsub module: Python 1.5.1, 1.5.2, 1.6† (undocumented in Python 2.0+)
The regsub module was added in either Python 0.9.7 or Python 0.9.8, and provided the functions listed below. These may seem peripheral, but they are quite important as a point of comparison with the third API. The main reason why they weren’t in the first regex module is presumably that the latter was a C module, whereas these were written in Python.
The following functions were present initially:
- sub (pat, repl, str)
- gsub (pat, repl, str)
- split (str, pat)
- Internal compile (pat)
- Internal expand (repl, regs, str)
These are mostly self-explanatory. The sub function replaced one instance, using backslash notation in the replacement for references to subpatterns. The gsub function did the same for multiple instances, but with zero-length matches adjacent to a previous match not counted (other zero-length matches were). The split function did not split at zero-length matches.
The regsub.compile internal function used by the others wrapped the usual one with its own multiple-pattern caching of compiled regular expression objects, which wouldn’t always have been a good idea due to the idiosyncrasies of the API. In particular, it did not originally keep track of the syntax flags, so it would have to be manually cleared when they were changed. This could be achieved through the somewhat crude assignment regsub.cache = {}. It was later changed to keep track of syntax flags (keying the cache with a (pat, syntax) tuple) and to offer a clear_cache() function, but only once the regex.get_syntax function had been added, by which point it had already been superseded by the third API.
The regsub.expand internal function, used by sub and gsub, took a replacement string containing backslash-references to groups (e.g. \1), along with the necessary match data (the regs tuple and the string which it indexes), and returned the substituted string.
The following API extensions were made in Python 1.4.
- capwords (s[, pat])
- split (str, pat[, maxsplit])
- splitx (str, pat[, maxsplit])
Python 1.4beta3 added a maxsplit argument to regsub.split, matching the invocation of string.split. It also added splitx as a version which retains the delimiters. This was used by capwords. When capwords was added in 1.4beta1, a different, incompatible use of the third argument to split was added for the same purpose, but this never made it into any non-alpha version of Python and so is largely inconsequential (hence breaking compatibility with it was not a concern).
Having been emitting DeprecationWarning since Python 2.1, the regsub module was at length removed in Python 2.5.
4. Third API: the re module
Due to the fundamentally thread-unsafe nature of the second API, a third API was introduced in Python 1.5 in the re module, accompanied by a switch to the more powerful Perl-style regular expression syntax.
The other thing which was introduced here was the separation of an implementation-specific low-level backend module written in C from a high-level API written in Python, rather than the entire main module being written in C. This allowed the functionality of the regsub module to be integrated into the re module, rather than being a separate Python module.
4.1. API changes from the first regex module to re
The third API introduced the concept of a “match object”: contrasting with the second API’s thread-unsafe practice of setting and exposing the properties and methods relating to specific match/search results on the compiled regular expression objects themselves, the match and search methods return “match objects” (not integers) upon which these are set and exposed. This allows the module to be fully thread-safe. A return value of None is encountered in the absence of a match; as None has a false boolean value while a match object has a true boolean value, this makes checking for a match very succinct.
The distinction between match (matches at the start / specified position) and search (matches anywhere) remains. When invoked as methods, they accept both pos and endpos arguments to define the start and end of the range to match from / search over. When invoked as globals, they accept neither argument.
While the regs attribute, now on the match object, remains in the module, it is not documented as part of the API, so is presumably not supposed to be accessed directly anymore (though it still sometimes is). Equivalent functionality is provided through the span method, which returns the (start, end) index tuple for a given group name or number, while the start and end methods return only the one index (the indices being, as previously, -1 if the group did not contribute). This allows names to be used interchangeably with numbers, unlike indexing regs directly.
The group method is accordingly now on the match objects but, except in very early versions (see re1 below), it now behaves like start, end and span in defaulting to group 0 when no arguments are supplied (rather than returning all groups). A new groups method was added to return all groups (1 and up).
Finally, symcomp is not present, because a dedicated named group syntax is supported by compile (see below).
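To make the new interface concrete, here is a minimal sketch against the present-day re module (the pattern and sample string are invented for illustration):

```python
import re

pattern = re.compile(r"(\w+)-(\d+)")

# search() scans the whole string; match() only tries at the start (or at pos).
m = pattern.search("ticket ABC-123 closed")
if m:                            # a match object is truthy, None is falsy
    print(m.group(0))            # 'ABC-123' -- group 0 is the whole match
    print(m.groups())            # ('ABC', '123') -- groups 1 and up
    print(m.span(2))             # (11, 14) -- slice indices of group 2
    print(m.start(1), m.end(1))  # 7 10

print(pattern.match("ticket ABC-123 closed"))  # None: no match at position 0
```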
4.2. API changes from regsub to re
The functionality of regsub is, due to the re module itself being written in Python, incorporated into re. Some differences to note:
- capwords is absent.
- sub behaves like the old gsub by default (gsub is noted to have been more commonly used). While the first three arguments are the same as before, it takes a count integer as an optional fourth argument. This defaults to 0, meaning no limit, but it can be set to the number of substitutions to make, e.g. 1 to emulate the old sub (see the sketch following this list).
- subn is invoked like sub, but returns a tuple of the result and the number of substitutions made.
- split will include the content of any capturing parenthetical groups in the return value, thus negating the need for splitx. A bug in 1.5.0 caused maxsplit to be ignored due to never incrementing the loop counter; this was fixed in 1.5.1 and later.
- The existing behaviour of split and gsub regarding zero-length matches was retained in their successors until Python 3.7, when they were changed to consider all zero-width matches.
- All of the above are, whilst still available as global functions, also available as methods on the compiled regular expression objects themselves.
- The flags argument explained below was added as a further optional argument to the global functions for the above in Python 3.1, which change was backported to Python 2.7. This was consistent with match and search already accepting it.
- The global purge function is equivalent to latter-era regsub’s clear_cache function.
- findall was added in Python 1.5.2. Invoked with the same pattern as match and search (though the global one didn’t acquire the optional flags argument until Python 2.4, which was still ahead of sub and friends), it returns a list of matching strings (left to right without overlaps). If the pattern contains a capturing group, matches of that group are returned instead of matches of the entire pattern. If the pattern contains multiple capturing groups, a list of tuples is returned.
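To illustrate several of the points above, a short sketch using the modern re module (the pattern and text are my own, chosen only for demonstration):

```python
import re

text = "2023-01-05, 2024-02-29"

# sub() behaves like the old gsub() by default; count=1 emulates the old sub().
print(re.sub(r"\d{4}", "YYYY", text))            # 'YYYY-01-05, YYYY-02-29'
print(re.sub(r"\d{4}", "YYYY", text, count=1))   # 'YYYY-01-05, 2024-02-29'

# subn() also reports how many substitutions were made.
print(re.subn(r"\d{4}", "YYYY", text))           # ('YYYY-01-05, YYYY-02-29', 2)

# split() keeps the contents of capturing groups, covering the old splitx().
print(re.split(r"(\s*,\s*)", text))              # ['2023-01-05', ', ', '2024-02-29']

# findall() returns matched strings, or tuples if there are multiple groups.
print(re.findall(r"(\d{4})-(\d{2})-\d{2}", text))  # [('2023', '01'), ('2024', '02')]
```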
4.3. Syntax changes from the first regex module to re
The new API was also taken as an opportunity to switch to the Perl-style regular expression syntax (a more powerful extension of the Posix Extended syntax), as opposed to the adjustable Emacs-style syntax previously supported. This means that:
- Unescaped ( | ) are always grouping (while this was possible behaviour under the first regex module, it was not the default).
- Minimal (non-greedy) matching is possible, by using +? and *? as non-greedy equivalents to + and *.
- The character class escapes \d \D for digits (and non-digits) and \s \S for whitespace join the existing \w \W for word characters. The \w definition intentionally includes the underscore.
- The \< \> escapes (for specific word boundaries) are dropped, while \b (for word boundaries per se) is retained. Except in re1, the behaviour of \b at the start and end of the string follows the union of the former \< and \>, rather than unconditionally matching; \B is in any case its inverse. Again, the underscore is intentionally included within a word.
- The less problematic and more immediately visually distinct \A and \Z replace \` and \' for the absolute start and end of the entire string. The ^ and $ also match only the start and end of the first and last line respectively, not the start and end of all lines, unless in multiline mode (see below), although $ still differs from \Z in that $ can match before the last of any trailing newlines (although it matches the absolute end of the string as well).
- The {count}, {minimum,} and {minimum,maximum} syntaxes for repetitions are supported.
- The \v syntax for groups with numbers greater than 9 is not supported; instead, the plain backslash-digit syntax is extended to take up to two digits, provided the first digit is not zero.
- Several extensions are available taking the otherwise-invalid (?…) form, including non-capturing groups and (insofar as this is supported by the underlying engine) look-ahead and look-behind assertions.
The (? syntactical extensions in the Perl syntax were further extended with named group support (noted in source comments to be “Python extensions”), superseding the earlier symcomp system, but keeping a similar syntax (with (?P<like>this) replacing the earlier (<like>this)). These extensions were later adopted by PCRE and by Perl, but with the inserted P made optional (it is still mandatory in Python). Needless to say, symcomp itself was not retained.
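A brief runnable sketch of the named-group syntax (and, for contrast, minimal matching) under the current re module; the patterns are invented for illustration:

```python
import re

# (?P<name>...) is the named-group syntax that superseded symcomp's (<name>...).
m = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "colour=red")
print(m.group("key"), m.group("value"))   # colour red
print(m.groupdict())                      # {'key': 'colour', 'value': 'red'}

# Greedy versus minimal (non-greedy) repetition:
print(re.findall(r"<.+>",  "<a><b>"))     # ['<a><b>'] -- greedy
print(re.findall(r"<.+?>", "<a><b>"))     # ['<a>', '<b>'] -- minimal
```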
A reconvert module was added to aid in conversion of existing patterns, but was later removed in Python 2.5.
4.4. Original flags of the re module
The adoption of Perl-style regular expressions also saw the removal of the existing syntax flags (though they were still accepted by reconvert) and the addition of the set of flags used by Perl-style regular expressions. In keeping with thread-safety, these flags are set for each compiled regular expression object and passed to compile and some other global functions such as match, as opposed to being set globally.
They can also be specified at the start of the pattern itself. The original such flags (supported by pre and noted as being “standard flags” in sre_parse source comments) were as follows. These constants were also made available under shorter names corresponding to the uppercase of their letter codes. Actual numerical values can and do differ between implementations, and are thus listed in the implementation details further below. Also note that this is not a complete list of the flags currently regarded as part of the API; see the documentation below for the sre implementation for the rest.
Syntax | Flag | Meaning |
---|---|---|
(?i) | IGNORECASE | Match the pattern case-insensitively; supersedes use of the casefold string. |
(?L) | LOCALE | Not in re1 ; makes \w\b et cetera and IGNORECASE follow the single-byte locale, not just ASCII. |
(?m) | MULTILINE | Makes ^$ match the start and end of any line, not just the respective first and last lines. |
(?s) | DOTALL | Makes . include the newline. |
(?x) | VERBOSE | Outside a hard-bracketed character class, whitespace and anything between # and newline ignored. |
※ Explanatory note: the ^ matches the start of a string and, in multiline mode, the start of a line. If the string ends in a newline, $ will match both before and after that newline, while in multiline mode it will also match the end of every line. The \A and \Z escapes (which replace the first regex module’s \` and \') match only the actual start and end of the string.
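A small sketch of the flag mechanism and of the ^/$ versus \A/\Z distinction, runnable against the present-day re module (the sample text is invented):

```python
import re

text = "first line\nsecond line\n"

# $ can match before a trailing newline as well as at the absolute end;
# \Z matches only at the absolute end of the string.
print(bool(re.search(r"line$", text)))    # True  ($ matches before the final '\n')
print(bool(re.search(r"line\Z", text)))   # False (\Z requires the very end)

# MULTILINE makes ^ and $ match at every line boundary, whether passed as a
# flag argument or embedded at the start of the pattern.
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
print(re.findall(r"(?m)^\w+", text))            # same result
```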
4.5. The re1 implementation of the third API
The short-lived first implementation of the re module was written by Jeffrey Ollie and backed by the reop module, a newly introduced direct interface to the pattern bytecode handling (as opposed to compilation) routines of Ylönen’s engine (which Ollie had substantially refactored). It was introduced in 1.5.0alpha3, and was superseded shortly thereafter in 1.5.0alpha4 by a new implementation of re (the one later renamed to pre). The original implementation was retained as re1 for the 1.5.0 release, then removed.
A cursory read through this module is very telling of the struggles to support a regular expression syntax not natively supported by the engine used. The bytecode compilation of the regular expression is done in pure Python, making the main module quite lengthy in comparison to its immediate replacement. Support is, perhaps understandably, still limited to what can be achieved with the same pattern-bytecode engine: in particular, it would appear that trying to use look-ahead assertions will raise an error with “zero-width positive [or negative] lookahead assertion is unsupported”.
The ALL_CAPITAL constants and the error exception from the backend reop module were exposed through. Other constants exported included the usual flags (not yet including LOCALE, UNICODE or ASCII), which were assigned the following values:
Flags | Value |
---|---|
IGNORECASE | 1 |
MULTILINE | 2 |
DOTALL | 4 |
VERBOSE | 8 |
In the only version of this implementation ever officially released (i.e. in Python 1.5.0, as re1), the RegexObject.split method never actually increments its loop counter, so its maxsplit argument actually does nothing. While this bug was also present in the main re (later pre) module in that version, it was fixed in Python 1.5.1, by which point the re1 module had been removed.
Uniquely among released versions (for a given value of “released”) of the re module, however, there is no groups method: the group method (despite now being on the match objects) still behaves as in the first regex module, returning all groups (1 and up) when no arguments are passed.
4.6. The pre implementation of the third API
Official documentation for the re module covering the pre implementation: Python 1.5.1, 1.5.2, 1.6†
In Python 1.5.0alpha4, the re.py which had been introduced only one alpha version earlier was deprecated and moved to re1.py (in Guido’s words, “just in case you need it for comparison”), being replaced with a new (API-compatible) re.py using a contemporary (late 1990s) version of Philip Hazel’s PCRE (Perl Compatible Regular Expressions) engine. Adoption of PCRE enabled use of look-ahead and look-behind assertions, which had been unavailable in re1 due to the limitations of the underlying engine. This accordingly became the first module to make it into a release under the re name.
Andrew Kuchling credits Neal Becker with bringing this engine to the attention of the Python String Special Interest Group (String-SIG), mentioning that it had been written for Exim but had been attracting attention due to Perl’s own regex code not being readily isolatable.
This new re.py was substantially shorter, as it offloaded the work of compiling (not just matching) the regular expressions onto the underlying pcre module (of no relation to the more recent binding module by the same name). Much of the other code was largely recycled from the original re/re1 though, including for example the split method, bringing with it the maxsplit bug (which was eventually fixed in Python 1.5.1).
Since the re/pre module’s first proper release in Python 1.5.0, and differing from re1, the group method now behaves like start, end and span in returning group 0 when no arguments are supplied; a new groups method will return all groups 1 and up (although in Python 1.5.0 itself, it would return a string in cases where a singleton tuple would be expected: this was fixed in Python 1.5.1). This new behaviour was inherited by subsequent implementations.
The flags had values as follows. The ANCHORED flag is used internally by the match method and is not intended to form part of the interface. As not all supported flags were exported in Python, but the unexported ones could theoretically be used as magic numbers, the names given to them in the C code are also listed.
Name from Python | Name from C | Value |
---|---|---|
IGNORECASE | PCRE_CASELESS | 1 |
VERBOSE | PCRE_EXTENDED | 2 |
ANCHORED | PCRE_ANCHORED | 4 |
MULTILINE | PCRE_MULTILINE | 8 |
DOTALL | PCRE_DOTALL | 16 |
(not exported) | PCRE_DOLLAR_ENDONLY | 32 |
(not exported) | PCRE_EXTRA | 64 |
(not exported) | PCRE_NOTBOL | 128 |
(not exported) | PCRE_NOTEOL | 256 |
LOCALE | PCRE_LOCALE | 512 |
A groupdict method, which returns a mapping of group names to matched strings, was added to the match objects of pre in Python 1.6,† as well as being supported by the then-new sre.
The pre implementation of the re module was never updated for more recent versions of PCRE, being instead superseded by the (API-compatible) sre/re module. It was retained as pre with a frozen PCRE version until it was removed altogether in Python 2.4, leaving future PCRE support open for third-party bindings.
4.7. The sre implementation of the third API
Official documentation for the re module covering the sre implementation: Python 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.current
SecretLabs SRE was written by Fredrik Lundh for Python as a Unicode-supporting implementation of the third API (introducing the UNICODE flag), and was introduced in Python 2.0 (actually 1.6).† Because it initially used a recursive matching scheme which could potentially run into stack limits (changed in Python 2.4.0alpha1), and because the SecretLabs engine might behave differently to PCRE, the older implementation was initially retained as pre, with the SecretLabs implementation being introduced as sre.
Upon the introduction of sre, the re module was set up to import-star from either sre or pre; while it was configured to use sre in the standard source tree, this was supposed to be edited by sites/vendors where necessary. The pre module was removed in Python 2.4, now that sre had been changed to match non-recursively, leaving re as a mere alias to sre. Importing the module as sre was then deprecated in Python 2.5, with the actual module code being moved to re. The sre name was then removed in Python 3.0, leaving re as the sole name of the module.
Versions of the sre implementation added several flags absent from re1 and pre. Of these, UNICODE, ASCII and eventually DEBUG made it into the documentation and may accordingly be considered later additions to the essential API, while TEMPLATE appears still to be an implementation detail.
As text strings were made Unicode by default in Python 3.0, UNICODE matching became the default. Matching in accordance with pure ASCII (i.e. despite the passing of Unicode strings) was added as the ASCII flag, while UNICODE became a no-op, retained for compatibility only. While LOCALE was retained (for use on byte strings only), its use is discouraged, and its usefulness is limited given that text strings are likely to be Unicode to begin with.
Syntax | Flag | Added in | Meaning |
---|---|---|---|
(?u) | UNICODE | 1.6 | Make \w\W\b\B et cetera and IGNORECASE follow Unicode (no-op in 3.x). |
(?t) | TEMPLATE | 1.6 | Disable backtracking (not a documented flag). |
(none) | DEBUG | 2.1 | Prints the pattern bytecode disassembly following compilation. |
(?a) | ASCII | 3.0 | Make \w\W\b\B et cetera and IGNORECASE follow plain ASCII. |
The actual numerical values both of these and of the other flags are as follows. In the original Python 1.6† version, DEBUG and ASCII were absent, but the numerical values were otherwise the same as in the current version.
Constant | Value |
---|---|
TEMPLATE | 1 |
IGNORECASE | 2 |
LOCALE | 4 |
MULTILINE | 8 |
DOTALL | 16 |
UNICODE | 32 |
VERBOSE | 64 |
DEBUG | 128 |
ASCII | 256 |
Although PCRE was subsequently updated with Unicode support, it was not re-adopted by the Python Standard Library (the existing modules were maintained as legacy and then removed), and descendants of the original sre module have been used to this day. This means that subsequent improvements to PCRE, and subsequent additions to its syntax, did not percolate down to Python, although a third-party binding exists for more recent PCRE versions.
The sre implementation in Python 2.0 (or possibly 1.6) seems to have introduced the .expand method of match objects, per its appearance in the documentation; despite no mention being made of it being new, it was never provided by any revision of re1 or pre. Interestingly, this seems to be the first time such a function was treated as part of the API rather than as an implementation detail. It functions much like its regsub predecessor, only with the addition of the \g<name> syntax for named group references. However, since it is a method on the actual match object, the only argument it takes is the repl string to expand. The same functionality had been provided by the internal global _expand in re1 and the internal global pcre.pcre_expand in pre, both of which took the match object as the first argument.
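A short sketch of .expand as it behaves in the current re module (pattern and replacement templates of my own devising):

```python
import re

m = re.match(r"(?P<user>\w+)@(?P<host>[\w.]+)", "guido@example.org")

# Numbered references use the regsub-style backslash syntax;
# \g<name> references named groups.
print(m.expand(r"host \2, user \1"))       # 'host example.org, user guido'
print(m.expand(r"\g<user> at \g<host>"))   # 'guido at example.org'
```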
Python 2.2 introduced finditer, which is similar to findall, differing only in that (a) it is a generator function and (b) it yields match objects as opposed to strings. Although pre was still being included at this point, finditer was not backported to it.
Python 2.4 introduced the (?(groupid)then|else) syntax for making part of a match conditional on another group having participated in the match.
The existing match and search operations were joined in Python 3.4 by the fullmatch operation, which will only match the entire string, or the entire range between pos and endpos.
Python 3.6 introduced the ability to obtain groups as strings using indexing syntax, i.e. aliasing group to __getitem__.
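These additions can be seen together in a small sketch against a current re (Python 3.6 or later; the sample data is invented):

```python
import re

pattern = re.compile(r"(\w+)=(\d+)")

# finditer() yields match objects rather than strings (Python 2.2+).
for m in pattern.finditer("a=1 b=22 c=333"):
    print(m[1], m[2])                      # indexing aliases group() (Python 3.6+)

# fullmatch() succeeds only if the whole string (or pos..endpos range) matches.
print(bool(pattern.fullmatch("a=1")))      # True
print(bool(pattern.fullmatch("a=1 b=2")))  # False
```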
Behaviour of sub, subn and split with regard to zero-length matches was changed in Python 3.7 to be more logical: split was changed so that it will split a string at a zero-length match, and sub so that zero-length matches adjacent to a non-empty match are also replaced. This changed a behaviour which had lasted since the original regsub module in Python 0.9.8.
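A minimal sketch of the difference; the printed results assume Python 3.7 or later (earlier versions either skipped such matches or refused to split on them):

```python
import re

# On Python 3.7+, split() will split at a zero-length match such as \b.
print(re.split(r"\b", "one two"))   # ['', 'one', ' ', 'two', '']

# On Python 3.7+, a zero-length match adjacent to a previous non-empty match
# is also replaced: '-a-b--c-' here, versus '-a-b-c-' on earlier versions.
print(re.sub(r"x*", "-", "abxc"))
```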
4.8. The second regex module
PyPI package: regex
In 2008, Matthew Barnett submitted a bug ticket including a major reworking of (at the time) the Python 2.5.2 re module (i.e. the sre implementation), adding atomic grouping and possessive quantifiers (i.e. variants of groups and greedy quantifiers which cannot backtrack), as well as variable-length look-behind assertions. He also mentioned that it was typically twice as fast as the standard one. This effort was joined by Jeffrey C. Jacobs (timehorse), who had already been working on improvements to the re module, slated at the time for Python 2.7.
※ Explanatory note: a lazy quantifier (e.g. +? or *?) matches the fewest possible of something that will allow the rest of the pattern to match. A greedy quantifier (e.g. + or *) matches as many of something as possible that will allow the rest of the pattern to match. A possessive quantifier (e.g. ++ or *+) matches as many of something as are present, even if this causes the rest of the pattern to fail.
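A quick illustration of the three kinds of quantifier, assuming the third-party regex package is installed (pip install regex); the standard re only gained possessive quantifiers later, in Python 3.11:

```python
import regex

sample = '"a" and "b"'

print(regex.match(r'".+"',  sample).group())   # '"a" and "b"' -- greedy
print(regex.match(r'".+?"', sample).group())   # '"a"'         -- lazy
print(regex.match(r'".++"', sample))           # None -- the possessive .++ also
                                               # swallows the final quote and
                                               # cannot give it back
```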
The changes became quite dramatic, with one of them radically reörganising the code, greatly reducing the number of support modules, changing the engine to use a node network rather than a linear bytecode sequence, and making other improvements. In light of this, Georg Brandl suggested releasing it as a stand-alone package, on the basis that it would be difficult to review, expecting it to have acquired enough use for any issues to have been ironed out by the time Python 2.7 came around. Barnett promptly renamed it to regex so it could be installed as an extension module (although it is unrelated to the historic module by that name).
Suffice it to say that it was not included in Python 2.7, although it does have a mention in the documentation and “in principle approval for eventual stdlib inclusion”, pending a PEP to sort out the details. Additionally, although the original SecretLabs engine was intended to add support for Unicode regular expressions, and succeeded in adding basic support, it has been the subject of criticism for not providing selectors for Unicode properties (besides those which correspond to the syntaxes for ASCII regular expression properties), for using casemapping for case-insensitive matching rather than the more appropriate casefolding, and for not offering proper handling of multi-codepoint grapheme clusters (e.g. how they might affect what constitutes sequences/bounds of word characters). The second regex module is considered vastly improved in this respect. (The bigger egregious flaw mentioned, of Python 2.7 and 3.2 treating Unicode strings as UCS-2 in several respects on Windows (as a result of PEP 261), was finally, and with much rejoicing, fixed in the very next version (Python 3.3) with the adoption of PEP 393.)
4.8.1. API extensions made by the second regex module
Some API extensions relative to the latest (Python 3.7) version of re are given below. Some of the simpler changes, such as the VERSION1 behaviour for zero-length matches and the addition of fullmatch, have also percolated through to recent versions of the standard re.
- splititer as a generator version of split
- overlapped argument for findall and finditer
- pos and endpos available on sub and subn
- For use with matches with repeated groups, captures, capturesdict, starts, ends and spans work like group, groupdict, start, end and span respectively, but return a list of values for every instance of that group in the match, not only the last (in the case of capturesdict, a dict with such lists as values); see the sketch following this list.
- literal_spaces argument to escape, suppresses escaping of spaces.
- special_only argument to escape, escapes only special characters.
- detach_string method on the match object, deletes the reference to the original string (useful if it is a large string).
- expandf, subf and subfn, differing from the ones without an f in that they use str.format-style format strings instead of backslash syntax.
- partial argument to match, search, fullmatch and finditer, allows truncated matches (i.e. where the pattern is still matching up, but runs into the end of the string before it is finished). Mainly useful for validating an input field whilst it is still being filled in. The match objects are given a partial attribute indicating whether the match is partial in this sense.
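A few of these extensions in a short sketch, again assuming the third-party regex package is installed; patterns and sample data are invented, and exact outputs may vary slightly between regex versions:

```python
import regex

# captures() returns every string matched by a repeated group, not just the last.
m = regex.match(r"(?:(\w+),)+", "one,two,three,")
print(m.group(1))      # 'three' -- standard behaviour: last repetition only
print(m.captures(1))   # ['one', 'two', 'three']

# partial=True accepts a match that runs into the end of the input,
# useful for validating a field that is still being typed.
m = regex.match(r"\d{4}-\d{2}-\d{2}", "2024-0", partial=True)
print(m.partial if m else "no match")   # True

# subf() uses str.format-style templates; {1}, {2} refer to groups 1 and 2.
print(regex.subf(r"(\w+)=(\w+)", "{2} <- {1}", "a=1 b=2"))   # '1 <- a 2 <- b'
```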
4.8.2. Flag changes between re and the second regex module
In an effort to remain backward compatible with re, and to provide additional functionality, the second regex module introduces a few more flags:
Syntax | Flag | Meaning |
---|---|---|
(?V0) | VERSION0 | Conservatively match the behaviour of the re module, zero-length behaviour depends on Python version. |
(?V1) | VERSION1 | Support scoped flags, set operations, default to full case-folding, new zero-length behaviour always. |
(?f) | FULLCASE | Make IGNORECASE use full case-folding, implied by VERSION1 but can still be disabled. |
(?w) | WORD | Use Unicode definitions of word boundaries, and consider all line breaks (rather than just LF). |
(?r) | REVERSE | Begin searching from the end of the string. |
(?p) | POSIX | Return the leftmost longest match, as stipulated by POSIX (takes longer). |
(?e) | ENHANCEMATCH | When handling a fuzzy-match sequence, try to improve the fit of the match found. |
(?b) | BESTMATCH | When handling a fuzzy-match sequence, exhaustively search for the least deviant match, not the first. |
The values given to these flags and of the existing flags are as follows (note that ASCII and DEBUG have different values than in re, although all of the other flags are either new in the second regex module or match re):
Constant | Value |
---|---|
TEMPLATE | 1 |
IGNORECASE | 2 |
LOCALE | 4 |
MULTILINE | 8 |
DOTALL | 16 |
UNICODE | 32 |
VERBOSE | 64 |
ASCII | 128 |
VERSION1 | 256 |
DEBUG | 512 |
REVERSE | 1024 |
WORD | 2048 |
BESTMATCH | 4096 |
VERSION0 | 8192 |
FULLCASE | 16384 |
ENHANCEMATCH | 32768 |
POSIX | 65536 |
4.8.3. Syntax changes between re and the second regex module
The second regex module introduces a number of powerful syntax innovations.
TODO more here.
Appendices
- → Appendix A: Low level support modules of Third API implementations (reop, first pcre, _sre and sre_*, _regex)
- → Appendix B: Other third-party regular expression modules (including the second pcre module)
- → Appendix C: Summary table
† Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar: it incorporates many but not all distinctively 2.x features and is accordingly not a true Python 1.x release; hence, “What’s new in Python 2.0” compares it with Python 1.5.2.