Python RegEx through the ages

1. Introduction

Python has gone through three regular expression modules during its history (regexp, regex and re) and a fourth has been greenlighted (also called regex, confusingly). Furthermore, more than one of these were reïmplemented at least once during their existence. As they were superseded, disappeared from the documentation and were finally removed from the distribution, I am very much aware that the interface and functionality of the older modules may well be an important thing to have a reference for if trying to understand legacy Python code. The module name collision only adds to this, so I thought it would be sensible to write this up.

Further to the goal of allowing understanding and porting of legacy Python code, and in case any unwisely written code depends on particular details of an individual implementation, I also attempt to “document the undocumented” in terms of implementation details, low-level interfaces, numerical values of constants et cetera which did not qualify for coverage in the standard library manuals.

It is also of interest from an academic perspective, as each new module and each revision of an existing module contributed to the current interface as it stands. Outside of Python, the history of Python regular expressions is also fundamentally tied with the history of named groups in regular expressions, so this may be of passing interest to that topic.

This document is currently a work in progress, though an overview table is fairly complete. It also currently covers CPython only; coverage of Jython, PyPy or IronPython is a more distant goal, though the basic API should be compatible between them.

While limited detail on this is discernible from Python’s HISTORY file, this is not adequate to establish the API changes between the modules themselves. Other sources include old source distributions, historical VCS (which is currently incomplete for the earliest releases), older versions of the documentation, et cetera. In any case, I felt it sensible to write it up in one place. So here it is: the history of Python’s regular expressions.

2. First API

Python’s original regular expression support accepted Posix Extended syntax, and could use either UNIXv8 regular expressions or Henry Spencer’s reïmplementation (a version of which was included).

This was present in Python 0.9.1. This incorporated only minor changes from Python 0.9.0, the first public source release. Python 0.9.1 was posted on Usenet alt.sources and consequently preserved in archives (a tarball conversion was previously offered on python.org and still is on legacy.python.org).

The same cannot be said of the vast majority of Python 0.9.x release packages, which were distributed for limited periods only on CWI’s FTP site. While the HISTORY file does detail changes between these versions, taken from previous versions of the NEWS file, this is not in especially great detail (the NEWS file entries of 0.9.x are the equivalent of the “What’s new in” documents of 2.x and 3.x, not the detailed changelogs which they now are).

The cpython-fullhistory repository does however provide an archive of old VCS, tagged back to 0.9.8 (it actually goes back to 1990, before 0.9.0), but doesn’t seem to include all the files that made it into the release (in particular, regexpmodule.c is nowhere to be seen). Additionally, its directory structure in these old commits seems to be influenced by later file moves/renames rather than preserving the original directory structure evident in the release package. I have come to suspect that files that ceased to exist before a certain revision simply are not preserved. Although that repository is now Mercurial, it was previously Subversion and before that CVS, and I don’t know if even that was the first, so it probably cannot be assumed to be as useful/dependable as e.g. modern Git.

2.1. The regexp module (original version)

Guido’s Python 0.9.1 library reference (provided as LaTeX in the alt.sources release) gives the following documentation:

3.4 Built-in Module regexp

This module provides a regular expression matching operation. It is always available. The module defines a function and an exception:

compile(pattern) Compile a regular expression given as a string into a regular expression object. The string must be an egrep-style regular expression; this means that the characters '(' ')' '*' '+' '?' '|' '^' '$' are special. (It is implemented using Henry Spencer’s regular expression matching functions.)

regexp.error Exception raised when a string passed to compile() is not a valid regular expression (e.g., unmatched parentheses) or when some other error occurs during compilation or matching ("no match found" is not an error).

Compiled regular expression objects support a single method:

exec(str) Find the first occurrence of the compiled regular expression in the string str. The return value is a tuple of pairs specifying where a match was found and where matches were found for subpatterns specified with '(' and ')' in the pattern. If no match is found, an empty tuple is returned; otherwise the first item of the tuple is a pair of slice indices into the search string giving the match found. If there were any subpatterns in the pattern, the returned tuple has an additional item for each subpattern, giving the slice indices into the search string where that subpattern was found.

Licence for the above documentation:

Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.

       All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
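
For readers porting code, the return shape of the exec method documented above can be approximated with the modern re module. This is only a sketch: the helper name legacy_exec is hypothetical, it relies on the regs attribute of match objects, and it mimics the documented tuple-of-pairs result only, not the egrep-style syntax of the original engine.

```python
import re

def legacy_exec(pattern, string):
    """Mimic the documented return shape of regexp's exec() (hypothetical helper)."""
    match = re.search(pattern, string)
    if match is None:
        # The 0.9.1 documentation specifies an empty tuple when nothing matches.
        return ()
    # regs holds (start, end) slice indices for the whole match and each subpattern.
    return match.regs

result = legacy_exec(r"(a+)(b+)", "xxaabbb")
no_match = legacy_exec(r"z", "xxaabbb")
print(result)    # ((2, 7), (2, 4), (4, 7))
print(no_match)  # ()
```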

2.2. The regexp module (versions from Python 0.9.2+)

The above API documentation is correct for Python 0.9.1. Following that version, the API was changed:

Determining when these changes were made is complicated by regexpmodule.c not being tracked for reasons speculated above. The HISTORY file is silent on this, understandable for the second but somewhat frustrating for the first. By checking changes to API usage in other modules which are tracked, namely the grep module, it is however apparent that the .exec method had been renamed or at least aliased to .match by the time Python 0.9.2 was released.

In Python 0.9.5, the first regex module was introduced and the regexp module was reimplemented in Python as a wrapper. By this point, the match global function had been added, and the method was available only as .match.

The built-in exec function became a keyword in Python 1.0 (this decision was reversed in Python 3.0). In response, os.exec became os.execv. One might be tempted to speculate that the renaming of the regular expression .exec method in Python 0.9.2 may have been motivated by plans to do this, despite being somewhat far in advance. However, considering that os.exec was introduced under that name in Python 0.9.2, this is almost certainly not the case. Nonetheless, the renaming may have been influenced by the introduction of os.exec.

The regexp module was removed in Python 1.5.

3. Second API

Python 0.9.5 introduced the regex module, was released as “Macintosh application only, 2 Jan 1992” per the HISTORY file and appears to have used “GNU regex.c, from subdirectory regex”, per contemporary source comments in VCS. In Python 0.9.6 (evident from both VCS and the HISTORY file), Guido switched to a non-copylefted libre reïmplementation of GNU regex which Tatu Ylönen had written and posted to comp.sources.misc, thus avoiding bringing Python under the GPL.

3.1. The first regex module

Official documentation for the first regex module: Python 1.5.1, 1.5.2, 1.6 (undocumented in Python 2.0+)

Not to be confused with the modern, identically named (second) regex module planned for future inclusion in the standard library.

This was the main module, the C module, and the direct successor of the regexp module. However, there were many important differences from the comparatively simple first API:

The following additions were made to the API in Python 0.9.9:

The symcomp function was added in Python 1.0. This worked as compile, but would parse a <name> string at the start of a parenthetical group as a group name, which could be passed to the group method. The Ylönen backend was modified to export the syntax code so it was freely visible to symcomp (but still not visible without side effects to native Python code until get_syntax was added much later). Supporting this, three new regular expression object properties were added:

Following Emacs style, \b for word boundaries included the start and end of the string unconditionally, and \B for word non-boundaries excluded the start and end of the string unconditionally. The escapes \< for word start boundaries and \> for word end boundaries, on the other hand, would more conventionally match at the start and end of the string (respectively) if next to a word character.

Similarly, the underscore was not initially included in the definition of \w for word characters, again following Emacs style. This seems to have been changed in Python 1.5 as one of Jeffrey Ollie’s revisions, although without updating the documentation for the regex module (I’m unsure how far the impact on the regex module was intentional here, since the change accompanied the addition of additional syntax table flags used only by the re1 module; see below). The Emacs syntax \_< and \_> for “symbol” boundaries (i.e. like \< and \> but treating the underscore as part of the word/symbol) was never supported.

Having been emitting DeprecationWarning since at least Python 2.1, the first regex module was at length removed in Python 2.5.

3.2. The regex_syntax module

The regex_syntax module was added in Python 0.9.5 and thereäfter functionally changed only once, six years later (in Python 1.5, with the addition of two more flags). Subsequent minor changes post-obsolescence added a module docstring and converted tabs in the comments to spaces.

The regex_syntax module was never separately documented; the documentation advised reading its source code. It defined the following syntax flags:

Flag name             Value  Description
RE_NO_BK_PARENS           1  Make plain () grouping and escaped \(\) act literal.
RE_NO_BK_VBAR             2  Make plain | act as an or-operator and escaped \| act literal.
RE_BK_PLUS_QM             4  Make plain +? act literal and escaped \+\? act as operators.
RE_TIGHT_VBAR             8  Make | bind tighter than ^$.
RE_NEWLINE_OR            16  Make line breaks in the pattern behave as or-operators.
RE_CONTEXT_INDEP_OPS     32  As (inaccurately) documented in source code comments, treat ^$*+? as literal in contexts where they don’t otherwise make sense. Actually, the other way around (makes them an error in contexts where they don’t make sense, as one might deduce from the flag name).
RE_ANSI_HEX              64  Added to the module in 1.5: process \n, \x00 et cetera. Bizarrely, this seems to also have turned on e.g. \v10 for accessing group number 10, which actually collides with \v for vertical tab; turning it off would have made more sense.
RE_NO_GNU_EXTENSIONS    128  Added to the module in 1.5; disable syntax for matching the start of the entire string (\`), end of the entire string (\'), word boundaries and word characters.

The following pre-composed (by bitwise-or) flag combinations were also defined. Note that the {count}, {minimum,} and {minimum,maximum} syntaxes for repetitions were not supported at all, hence the lack of a flag for them (Emacs and grep styles backslash them, while egrep style does not).

Mode name        Flags
RE_SYNTAX_AWK    RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS
RE_SYNTAX_EGREP  RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS, RE_NEWLINE_OR
RE_SYNTAX_GREP   RE_BK_PLUS_QM, RE_NEWLINE_OR
RE_SYNTAX_EMACS  No flags set (zero)
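
As the module itself is long gone, the following sketch reconstructs the constants from the two tables above so that magic numbers in legacy code can be decoded. The values are taken from the tables, not from the original source file; the FLAGS dictionary and the describe helper are hypothetical names added for illustration.

```python
# Individual syntax flags, with the values listed in the table above.
FLAGS = {
    "RE_NO_BK_PARENS": 1,
    "RE_NO_BK_VBAR": 2,
    "RE_BK_PLUS_QM": 4,
    "RE_TIGHT_VBAR": 8,
    "RE_NEWLINE_OR": 16,
    "RE_CONTEXT_INDEP_OPS": 32,
    "RE_ANSI_HEX": 64,
    "RE_NO_GNU_EXTENSIONS": 128,
}

# Pre-composed modes, built by bitwise-or as described above.
RE_SYNTAX_AWK = (FLAGS["RE_NO_BK_PARENS"] | FLAGS["RE_NO_BK_VBAR"]
                 | FLAGS["RE_CONTEXT_INDEP_OPS"])
RE_SYNTAX_EGREP = RE_SYNTAX_AWK | FLAGS["RE_NEWLINE_OR"]
RE_SYNTAX_GREP = FLAGS["RE_BK_PLUS_QM"] | FLAGS["RE_NEWLINE_OR"]
RE_SYNTAX_EMACS = 0

def describe(syntax):
    """Return the names of the flags set in a numeric syntax value (hypothetical helper)."""
    return [name for name, bit in FLAGS.items() if syntax & bit]

print(RE_SYNTAX_AWK, RE_SYNTAX_EGREP, RE_SYNTAX_GREP)  # 35 51 20
print(describe(51))
```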

The regex_syntax module was removed in Python 2.5.

3.3. The regsub module

Official documentation for the regsub module: Python 1.5.1, 1.5.2, 1.6 (undocumented in Python 2.0+)

The regsub module was added in either Python 0.9.7 or Python 0.9.8. It may seem peripheral, but it is quite important for comparison with the third API. The main reason why its functions weren’t in the first regex module is presumably that the latter was a C module, whereas these were written in Python.

The following functions were present initially:

These are mostly self-explanatory. The sub function replaced one instance, using backslash notation in the replacement for reference to subpatterns. The gsub did the same for multiple instances, but with zero-length matches adjacent to a previous match not counted (other zero-length matches are). The split function did not split at zero-length matches.

The regsub.compile internal function used by the others wrapped the usual one with its own multiple-pattern caching of compiled regular expression objects, which wouldn’t always have been a good idea due to the idiosyncrasies of the API. In particular, it did not originally keep track of the syntax flags, so the cache would have to be manually cleared when they were changed. This could be achieved through the somewhat crude assignment regsub.cache = {}. It was later changed to keep track of syntax flags (keying the cache with a (pat, syntax) tuple) and offer a clear_cache() function, but only once the regex.get_syntax function had been added, by which point it had already been superseded by the third API.

The regsub.expand internal function, used by sub and gsub, took a replacement string containing backslash-references to groups (e.g. \1), along with the necessary match data (the regs tuple and the string which it indexes), and returned the substituted string.

The following API extensions were made in Python 1.4.

Python 1.4beta3 added a maxsplit argument to regsub.split, matching the invocation of string.split. It also added splitx as a version which retains the delimiters. This was used by capwords. When capwords was added in 1.4beta1, a different, incompatible use of the third argument to split was added for the same purpose, but this never made it into any non-alpha version of Python and so is largely inconsequential (hence breaking compatibility with it was not a concern).

Having been emitting DeprecationWarning since Python 2.1, the regsub module was at length removed in Python 2.5.

4. Third API: the re module

Due to the fundamentally thread-unsafe nature of the second API, a third API was introduced in Python 1.5 in the re module, accompanied by a switch to the more powerful Perl-style regular expression syntax.

The other thing which was introduced here was the separation of an implementation-specific low-level backend module written in C from a high-level API written in Python, rather than the entire main module being written in C. This allowed the functionality of the regsub module to be integrated into the re module, rather than being a separate Python module.

4.1. API changes from the first regex module to re

The third API introduced the concept of a “match object”: contrasting with the second API’s thread-unsafe practice of setting and exposing the properties and methods relating to specific match/search results on the compiled regular expression objects themselves, the match and search methods return “match objects” (not integers) upon which these are set and exposed. This allows the module to be fully thread-safe. A return value of None is encountered in the absence of a match; as None has a false boolean value while a match object has a true boolean value, this makes it very succinctly possible to simply check for a match.
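
A minimal illustration of that truth-value check, using the current re module (the pattern and sample strings are arbitrary):

```python
import re

pattern = re.compile(r"\d+")

hit = pattern.search("order 66")         # a match object, which is truthy
miss = pattern.search("no digits here")  # None, which is falsy

for label, result in (("hit", hit), ("miss", miss)):
    if result:
        print(label, "matched", result.group())
    else:
        print(label, "did not match")
```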

The distinction between match (matches at the start / specified position) and search (matches anywhere) remains. When invoked as methods, they accept both pos and endpos arguments to define the start and end of the range to match from / search over. When invoked as globals, they accept neither argument.
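
The difference between the two operations, and the effect of pos and endpos on the method forms, can be sketched as follows with the current re module (sample data is arbitrary):

```python
import re

word = re.compile(r"[a-z]+")
text = "123abc456"

at_start = word.match(text)        # None: no letters at index 0
anywhere = word.search(text)       # finds "abc" wherever it occurs
from_pos = word.match(text, 3)     # pos=3: matching starts at index 3
bounded = word.match(text, 3, 5)   # endpos=5: the string is treated as ending there

print(at_start, anywhere.span(), from_pos.group(), bounded.group())
```

The global functions re.match and re.search take the pattern, the string and optional flags only, so a range has to be expressed through a compiled pattern as above.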

While the regs attribute, now on the match object, remains in the module, it is not documented as part of the API, so is presumably not supposed to be accessed directly anymore (though it still sometimes is). Equivalent functionality is provided through the span method, which returns the (start, end) index tuple for a given group name or number, while the start and end methods return only the one index (the indices being, as previously, -1 if the group did not contribute). This allows names to be used interchangeably with numbers, unlike indexing regs directly.
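
A short sketch of these accessors with the current re module, including the -1 indices for a group which did not contribute (the pattern is an arbitrary example):

```python
import re

m = re.match(r"(?P<sign>-)?(?P<digits>\d+)", "42")

by_name = m.span("digits")    # names and numbers are interchangeable here
by_number = m.span(2)
missing_start = m.start("sign")  # the optional group did not take part
missing_span = m.span(1)
raw_regs = m.regs                # undocumented, indexable by number only

print(by_name, by_number, missing_start, missing_span, raw_regs)
```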

The group method is accordingly now on the match objects but, except in very early versions (see re1 below), it now behaves like start, end and span in defaulting to group 0 when no arguments are supplied (rather than returning all groups). A new groups method was added to return all groups (1 and up).
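
The resulting division of labour between group and groups looks like this in the current re module (sample data is arbitrary):

```python
import re

m = re.match(r"(\w+)@(\w+)", "user@example")

whole = m.group()         # no arguments: group 0, the entire match
also_whole = m.group(0)
selected = m.group(1, 2)  # several arguments: a tuple of those groups
everything = m.groups()   # all groups from 1 upwards

print(whole, selected, everything)
```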

Finally, symcomp is not present, because a dedicated named group syntax is supported by compile (see below).

4.2. API changes from regsub to re

The functionality of regsub is, due to the re module itself being written in Python, incorporated into re. Some differences to note:

4.3. Syntax changes from the first regex module to re

The new API was also taken as an opportunity to switch to the more powerful Perl-style regular expression syntax (a more powerful extension of the Posix Extended syntax), as opposed to the adjustable Emacs-style syntax previously supported. This means that:

The (? syntactical extensions in the Perl syntax were further extended with named group support (noted in source comments to be “Python extensions”), superseding the earlier symcomp system, but keeping a similar syntax (with (?P<like>this) replacing the earlier (<like>this)). These extensions were later adopted by PCRE and by Perl, but with the inserted P made optional (it is still mandatory in Python). Needless to say, symcomp itself was not retained.
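
As a brief illustration of the named group syntax in the current re module, including the (?P=name) backreference form (the pattern is an arbitrary example):

```python
import re

# A quoted string: the closing quote must equal whatever the "quote" group matched.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

m = quoted.match("'hi' there")

print(m.group("quote"), m.group("body"))
print(m.group(1) == m.group("quote"))  # named groups remain numbered as well
```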

A reconvert module was added to aid in conversion of existing patterns, but was later removed in Python 2.5.

4.4. Original flags of the re module

The adoption of Perl style regular expressions also saw the removal of the existing syntax flags (though they were accepted by reconvert) and addition of the set of flags used by Perl-style regular expressions. Being thread-safe, these flags are set for each compiled regular expression object and passed to compile and some other global functions such as match, as opposed to being set globally.

They can also be specified at the start of the pattern itself. The original such flags (supported by pre and noted as being “standard flags” in sre_parse source comments) were as follows. These constants were also made available under shorter names corresponding to the uppercase of their letter codes. Actual numerical values can and do differ between implementations, and are thus listed in the implementation details further below. Also note that this is not a complete list of the flags currently regarded as part of the API; see the documentation of the sre implementation below for the rest.

Syntax  Flag        Meaning
(?i)    IGNORECASE  Match the pattern case-insensitively; supersedes use of the casefold string.
(?L)    LOCALE      Not in re1; makes \w\b et cetera and IGNORECASE follow the single-byte locale, not just ASCII.
(?m)    MULTILINE   Makes ^$ match the start and end of any line, not just the respective first and last lines.
(?s)    DOTALL      Makes . include the newline.
(?x)    VERBOSE     Outside a hard-bracketed character class, whitespace and anything between # and newline ignored.

 ※ Explanatory note: the ^ matches the start of a string and, in multiline mode, the start of a line. If the string ends in a newline, $ will match both before and after that newline, while in multiline mode it will also match the end of every line. The \A and \Z escapes (which replace the first regex module’s \` and \') match only the actual start and end of the string.
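
The anchoring rules in the note above, together with the inline and argument forms of the flags, can be checked against the current re module as follows (sample strings are arbitrary):

```python
import re

text = "first\nsecond\n"

# Without MULTILINE, ^ and $ anchor only at the ends of the whole string
# (allowing for one final newline), so no single line matches here.
whole_only = re.findall(r"^\w+$", text)
# With MULTILINE, given as an argument or inline as (?m), every line qualifies.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)
inline = re.findall(r"(?m)^\w+$", text)

# $ matches both before and after a trailing newline; \Z only at the very end.
dollar_positions = [m.start() for m in re.finditer(r"$", "a\n")]
z_positions = [m.start() for m in re.finditer(r"\Z", "a\n")]

# DOTALL lets . match a newline.
dot_plain = re.match(r"a.b", "a\nb")
dot_all = re.match(r"a.b", "a\nb", re.DOTALL)

print(whole_only, per_line, dollar_positions, z_positions)
```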

4.5. The re1 implementation of the third API

The short-lived first implementation of the re module was written by Jeffrey Ollie and backed by the reop module, a newly introduced direct interface to the pattern bytecode handling (as opposed to compilation) routines of Ylönen’s engine (which Ollie had substantially refactored). It was introduced in 1.5.0alpha3, and was superseded shortly thereafter in 1.5.0alpha4 by a new implementation of re (the one later renamed to pre). The original implementation was retained as re1 for the 1.5.0 release, then removed.

A cursory read through this module is very telling of the struggles to support a regular expression syntax itself not natively supported by the engine used. The bytecode compilation of the regular expression is done in pure Python, making the main module quite lengthy in comparison to its immediate replacement. Support is, perhaps understandably, still limited to what can be achieved with the same pattern-bytecode engine: in particular, it would appear that trying to use look-ahead assertions will raise an error with “zero-width positive [or negative] lookahead assertion is unsupported”.

The ALL_CAPITAL constants and the error exception from the backend reop module were exposed through it. Other constants exported included the usual flags (not yet including LOCALE, UNICODE or ASCII), which were assigned the following values:

Flags       Value
IGNORECASE  1
MULTILINE   2
DOTALL      4
VERBOSE     8

In the only version of this implementation ever officially released (i.e. in Python 1.5.0, as re1), the RegexObject.split method never actually increments its loop counter so its maxsplit argument actually does nothing. While this bug was also present in the main re (later pre) module in that version, it was fixed in Python 1.5.1, by which point the re1 module had been removed.

Uniquely among released versions (for a given value of “released”) of the re module, however, there is no groups method: the group method (despite now being on the match objects) still behaves like in the first regex module, returning all groups (1 and up) when no arguments are passed.

4.6. The pre implementation of the third API

Official documentation for the re module covering the pre implementation: Python 1.5.1, 1.5.2, 1.6

In Python 1.5.0alpha4, the re.py which had been introduced only one alpha version ago was deprecated and moved to re1.py (in Guido’s words, “just in case you need it for comparison”), being replaced with a new (API-compatible) re.py using a contemporary (late 1990s) version of Philip Hazel’s PCRE (Perl Compatible RegEx) engine. Adoption of PCRE enabled use of look-ahead and look-behind assertions, which had been unavailable in re1 due to the limitations of the underlying engine. This accordingly became the first module to make it into a release under the re name.

Andrew Kuchling credits Neal Becker with bringing this engine to the attention of the Python String Special Interest Group (String-SIG), mentioning that it had been written for Exim but had been attracting attention due to Perl’s own regex code not being readily isolatable.

This new re.py was substantially shorter, as it offloaded the work of compiling (not just matching) the regular expressions onto the underlying pcre module (of no relation to the more recent binding module by the same name). Much of the other code was largely recycled from the original re/re1 though, including for example the split method, bringing in with it the maxsplit bug (which was eventually fixed in Python 1.5.1).

Since the re/pre module’s first proper release in Python 1.5.0, and differing from re1, the group method now behaves like start, end and span in returning group 0 when no arguments are supplied; a new groups method will return all groups 1 and up (although in Python 1.5.0 itself, it would return a string in cases where a singleton tuple would be expected: this was fixed in Python 1.5.1). This new behaviour was inherited by subsequent implementations.

The flags had values as follows. The ANCHORED flag is used internally by the match method and is not intended to form part of the interface. As not all supported flags were exported in Python, but the others could theoretically be used as magic numbers, the names given to them in the C code are also listed.

Name from Python  Name from C          Value
IGNORECASE        PCRE_CASELESS            1
VERBOSE           PCRE_EXTENDED            2
ANCHORED          PCRE_ANCHORED            4
MULTILINE         PCRE_MULTILINE           8
DOTALL            PCRE_DOTALL             16
(not exported)    PCRE_DOLLAR_ENDONLY     32
(not exported)    PCRE_EXTRA              64
(not exported)    PCRE_NOTBOL            128
(not exported)    PCRE_NOTEOL            256
LOCALE            PCRE_LOCALE            512

A groupdict method, which returns a mapping of group names to matched strings, was added to the match objects of pre in Python 1.6, as well as being supported by the then-new sre.
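
The groupdict method as it exists in the current re module behaves as sketched below; the optional default argument shown is part of today’s API, and whether the Python 1.6 version accepted it is not something this example establishes.

```python
import re

m = re.match(r"(?P<user>\w+)(?:@(?P<host>\w+))?", "guido")

plain = m.groupdict()          # groups which did not take part map to None
with_default = m.groupdict("")  # or to a caller-supplied default

print(plain)
print(with_default)
```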

The pre implementation of the re module was never updated for more recent versions of PCRE, being instead superseded by the (API-compatible) sre/re module. It was retained as pre with a frozen PCRE version until it was removed altogether in Python 2.4, leaving future PCRE support open for third-party bindings.

4.7. The sre implementation of the third API

Official documentation for the re module covering the sre implementation: Python 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.current

SecretLabs SRE was written by Fredrik Lundh for Python as a Unicode-supporting implementation of the third API (introducing the UNICODE flag), and introduced in Python 2.0 (actually 1.6). Due to initially using a recursive matching scheme which would potentially run into stack limits (changed in Python 2.4.0alpha1), as well as the possibility of the SecretLabs engine behaving differently to PCRE, the older implementation was initially retained as pre, with the SecretLabs implementation being introduced as sre.

Upon the introduction of sre, the re module was set up to import-star from either sre or pre; while it was configured to use sre in the standard source tree, this was supposed to be edited by sites/vendors where necessary. The pre module was removed in Python 2.4, now that sre had been changed to match non-recursively, leaving re as a mere alias to sre. Importing the module as sre was then deprecated in Python 2.5, with the actual module code being moved to re. The sre name was then removed in Python 3.0, leaving re as the sole name of the module.

Versions of the sre implementation added several flags absent from re1 and pre. Of these, UNICODE, ASCII and eventually DEBUG made it into the documentation and may accordingly be considered later additions to the essential API, while TEMPLATE appears still to be an implementation detail.

As text strings were made Unicode by default in Python 3.0, UNICODE matching became the default. Matching in accordance with pure ASCII (i.e. despite the passing of Unicode strings) was added as the ASCII flag, while UNICODE became a no-op, retained for compatibility only. While LOCALE was retained (for use on byte strings only), its use is discouraged, and its usefulness is limited given that text strings are likely to be Unicode to begin with.
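
The Python 3 defaults described above can be observed directly. The sketch below uses \u escapes for the accented letters so that it does not depend on the encoding of the source file; the sample text is arbitrary.

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café"

# Text patterns follow Unicode by default in Python 3.
unicode_words = re.findall(r"\w+", text)
# The ASCII flag restricts \w to the ASCII word characters.
ascii_words = re.findall(r"\w+", text, re.ASCII)
# Byte patterns are ASCII-only unless LOCALE is used.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words, ascii_words, byte_words)
```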

Syntax  Flag      Added in  Meaning
(?u)    UNICODE   1.6       Make \w\W\b\B et cetera and IGNORECASE follow Unicode (no-op in 3.x).
(?t)    TEMPLATE  1.6       Disable backtracking (not a documented flag).
(none)  DEBUG     2.1       Prints the pattern bytecode disassembly following compilation.
(?a)    ASCII     3.0       Make \w\W\b\B et cetera and IGNORECASE follow plain ASCII.

The actual numerical values both of these and of the other flags are as follows. In the original Python 1.6 version, DEBUG and ASCII were absent, but the numerical values were otherwise the same as in the current version.

Constant  Value
TEMPLATE  1
IGNORECASE 2
LOCALE    4
MULTILINE  8
DOTALL    16
UNICODE    32
VERBOSE    64
DEBUG      128
ASCII      256
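
The documented flags in this table can be compared against a running CPython interpreter with a few lines. TEMPLATE is deliberately left out of the sketch, since it is undocumented and may not be present in every version; the names table_values and current_values are merely illustrative.

```python
import re

# Values as listed in the table above (TEMPLATE omitted).
table_values = {
    "IGNORECASE": 2,
    "LOCALE": 4,
    "MULTILINE": 8,
    "DOTALL": 16,
    "UNICODE": 32,
    "VERBOSE": 64,
    "DEBUG": 128,
    "ASCII": 256,
}

# What the interpreter running this code actually uses.
current_values = {name: int(getattr(re, name)) for name in table_values}

for name, value in current_values.items():
    print(f"{name:<11}{value:>4}")
```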

Although PCRE was subsequently updated with Unicode support, it was not re-adopted by the Python Standard Library (the existing modules were maintained as legacy and then removed), and descendants of the original sre module have been used to this day. This means that subsequent improvements to PCRE, and subsequent additions to its syntax, did not percolate down to Python, although a third-party binding exists for more recent PCRE versions.

The sre implementation in Python 2.0 (or possibly 1.6) seems to have introduced the .expand method of match objects, per appearance in documentation; despite no mention being made of it being new, it was never provided by any revision of re1 or pre. Interestingly, this seems to be the first time such a function has been treated as part of the API rather than as an implementation detail. It functions much like its regsub predecessor, only with the addition of the \g<name> syntax for named group references. However, since it’s a method on the actual match object, the only argument it takes is the repl string to expand. The same functionality had been provided by the internal global _expand in re1 and the internal global pcre.pcre_expand in pre, both of which took the match object as the first argument.
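
A short sketch of expand in the current re module, mixing numbered and \g<name> references (sample data is arbitrary):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Ada Lovelace")

# The template is the only argument; the match data comes from m itself.
surname_first = m.expand(r"\g<last>, \1")
numbered = m.expand(r"\2 / \g<1>")

print(surname_first)
print(numbered)
```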

Python 2.2 introduced finditer, which is similar to findall, differing only in that (a) it is a generator function and (b) it yields match objects as opposed to strings. Although pre was still being included at this point, this was not backported to pre.
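
The practical difference is that finditer keeps the position information which findall discards, as in this small example with the current re module:

```python
import re

text = "a1b22c333"

strings = re.findall(r"\d+", text)                    # plain strings
spans = [m.span() for m in re.finditer(r"\d+", text)]  # match objects, lazily produced

print(strings)
print(spans)
```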

Python 2.4 introduced the (?(groupid)then|else) syntax for matching conditional to another group having participated in the match.
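
As an illustration of the conditional syntax (here without an else branch), the following pattern requires a closing angle bracket only if an opening one was matched. The pattern is an arbitrary example, and fullmatch is used merely to keep the checks strict.

```python
import re

# Group 1 is the optional "<"; (?(1)>) demands ">" only if group 1 took part.
tagged = re.compile(r"(<)?\w+(?(1)>)")

ok_bracketed = tagged.fullmatch("<tag>")
ok_plain = tagged.fullmatch("tag")
bad_open = tagged.fullmatch("<tag")   # opened but never closed
bad_close = tagged.fullmatch("tag>")  # closed but never opened

print(bool(ok_bracketed), bool(ok_plain), bool(bad_open), bool(bad_close))
```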

The existing match and search operations were joined in Python 3.4 by the fullmatch operation, which will only match the entire string, or the entire range between pos and endpos.
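
A brief comparison of match and fullmatch, including the pos and endpos range, using the current re module (requires Python 3.4 or later):

```python
import re

digits = re.compile(r"\d+")

prefix = digits.match("123abc")            # succeeds: a match at the start suffices
entire = digits.fullmatch("123abc")        # fails: "abc" is left over
exact = digits.fullmatch("123")            # succeeds
ranged = digits.fullmatch("ab123cd", 2, 5)  # only the pos..endpos range must match

print(prefix.group(), entire, exact.group(), ranged.group())
```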

Python 3.6 introduced the ability to obtain groups as strings using indexing syntax, i.e. aliasing group to __getitem__.
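
With Python 3.6 or later, the indexing form reads as follows (sample data is arbitrary):

```python
import re

m = re.match(r"(?P<key>\w+)=(\w+)", "lang=python")

whole = m[0]      # same as m.group(0)
named = m["key"]  # same as m.group("key")
second = m[2]     # same as m.group(2)

print(whole, named, second)
```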

Behaviour of sub, subn and split with regards to zero-length matches was changed in Python 3.7 to be more logical: split was changed so it will split a string at a zero-length match, and sub so that zero-length matches adjacent to a non-empty match are also replaced. This changed a behaviour which had lasted since the original regsub module in Python 0.9.8.
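
The post-3.7 behaviour can be seen in the two examples below, which mirror ones given in the standard library documentation; the results shown assume Python 3.7 or later and will differ on older versions.

```python
import re

# split now splits at zero-length matches such as word boundaries.
pieces = re.split(r"\b", "Words, words, words.")

# sub now also replaces the empty match which follows the non-empty match "x".
replaced = re.sub(r"x*", "-", "abxd")

print(pieces)    # ['', 'Words', ', ', 'words', ', ', 'words', '.']
print(replaced)  # -a-b--d-
```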

4.8. The second regex module

PyPI package: regex

In 2008, Matthew Barnett submitted a bug ticket including a major reworking of (at the time) the Python 2.5.2 re module (i.e. the sre implementation), adding atomic grouping and possessive quantifiers (i.e. variants of groups and greedy quantifiers which cannot backtrack), as well as variable-length look-behind assertions. He also mentioned that it was typically two times as fast as the standard one. This effort was joined by Jeffrey C. Jacobs (timehorse), who had already been working on improvements to the re module, slated at the time for Python 2.7.

 ※ Explanatory note: a lazy quantifier (e.g. +? or *?) matches the fewest possible of something that will allow the rest of the pattern to match. A greedy quantifier (e.g. + or *) matches as many of something as possible that will allow the rest of the pattern to match. A possessive quantifier (e.g. ++ or *+) matches as many of something as are present, even if this causes the rest of the pattern to fail.
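
The three kinds of quantifier can be contrasted on one input. Lazy and greedy quantifiers work in any version of the standard re module; possessive quantifiers were only added to standard re in Python 3.11, so the sketch guards that case with a version check rather than assuming the third-party regex package is installed.

```python
import re
import sys

text = "<a><b>"

greedy = re.match(r"<.+>", text).group()  # takes as much as it can, then backs off
lazy = re.match(r"<.+?>", text).group()   # takes as little as it can

if sys.version_info >= (3, 11):
    # .++ swallows the final ">" and never gives it back, so the match fails.
    possessive = re.match(r"<.++>", text)
else:
    possessive = None  # syntax unavailable in standard re before 3.11

print(greedy, lazy, possessive)
```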

The changes became quite dramatic, with one of them radically reörganising the code, dramatically reducing the number of support modules, changing the engine to use a node network rather than a linear bytecode sequence, as well as other improvements. In light of this, Georg Brandl suggested releasing it as a stand-alone package, on the basis that it would be difficult to review, expecting it to have acquired enough use for any issues to have been ironed out by the time Python 2.7 came around. Barnett promptly renamed it to regex so it could be installed as an extension module (although it is unrelated to the historic module by that name).

Suffice it to say that it was not included in Python 2.7, although it does have a mention in the documentation and “in principle approval for eventual stdlib inclusion”, pending a PEP to sort out the details. Additionally, although intended to add support for Unicode regular expressions and succeeding in adding basic support, the original SecretLabs engine has been criticised for not providing selectors for Unicode properties (besides those which correspond to the syntaxes for ASCII regular expression properties), for using casemapping for case-insensitive matching rather than the more appropriate casefolding, and for not offering proper handling for multi-codepoint grapheme clusters (e.g. how they might affect what constitutes sequences/bounds of word characters). The second regex module is considered vastly improved in this respect. (The more egregious flaw mentioned alongside these, that of Python 2.7 and 3.2 treating Unicode strings as UCS-2 in several respects on Windows (as a result of PEP 261), was finally, and with much rejoicing, fixed in the very next version (Python 3.3) with the adoption of PEP 393.)

4.8.1. API extensions made by the second regex module

Some API extensions relative to the latest (Python 3.7) version of re are given below. Some of the simpler changes, such as the VERSION1 behaviour for zero-length matches and the addition of fullmatch, have also been percolated through to recent versions of standard re.

4.8.2. Flag changes between re and the second regex module

In an effort to remain backward compatible with re, and to provide additional functionality, the second regex module introduces a few more flags:

Syntax  Flag          Meaning
(?V0)   VERSION0      Conservatively match the behaviour of the re module, zero-length behaviour depends on Python version.
(?V1)   VERSION1      Support scoped flags, set operations, default to full case-folding, new zero-length behaviour always.
(?f)    FULLCASE      Make IGNORECASE use full case-folding, implied by VERSION1 but can still be disabled.
(?w)    WORD          Use Unicode definitions of word boundaries, and consider all line breaks (rather than just LF).
(?r)    REVERSE       Begin searching from the end of the string.
(?p)    POSIX         Return the leftmost longest match, as stipulated by POSIX (takes longer).
(?e)    ENHANCEMATCH  When handling a fuzzy-match sequence, try to improve the fit of the match found.
(?b)    BESTMATCH     When handling a fuzzy-match sequence, exhaustively search for the least deviant match, not the first.

The values given to these flags and of the existing flags are as follows (note that ASCII and DEBUG have different values than in re, although all of the other flags are either new in the second regex module or match re):

Constant      Value
TEMPLATE      1
IGNORECASE    2
LOCALE        4
MULTILINE     8
DOTALL        16
UNICODE       32
VERBOSE       64
ASCII         128
VERSION1      256
DEBUG         512
REVERSE       1024
WORD          2048
BESTMATCH     4096
VERSION0      8192
FULLCASE      16384
ENHANCEMATCH  32768
POSIX         65536
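
The claim about which values differ can be checked mechanically against whatever standard re module is at hand. The sketch below does not import the second regex module at all: the dictionary is simply transcribed from the table above, and its name (REGEX_FLAGS) is invented for this example. Flags absent from the installed re are skipped.

```python
import re

# Values transcribed from the table above (second regex module).
REGEX_FLAGS = {
    "TEMPLATE": 1, "IGNORECASE": 2, "LOCALE": 4, "MULTILINE": 8,
    "DOTALL": 16, "UNICODE": 32, "VERBOSE": 64, "ASCII": 128,
    "VERSION1": 256, "DEBUG": 512, "REVERSE": 1024, "WORD": 2048,
    "BESTMATCH": 4096, "VERSION0": 8192, "FULLCASE": 16384,
    "ENHANCEMATCH": 32768, "POSIX": 65536,
}

# Each flag occupies exactly one bit, and no two flags share a bit,
# so they can be combined freely with the | operator.
one_bit_each = all(v & (v - 1) == 0 for v in REGEX_FLAGS.values())
all_distinct = len(set(REGEX_FLAGS.values())) == len(REGEX_FLAGS)

# Names that also exist in standard re, but under a different value.
differing = sorted(
    name for name, value in REGEX_FLAGS.items()
    if hasattr(re, name) and int(getattr(re, name)) != value
)
print(one_bit_each, all_distinct, differing)
```

The practical consequence is that a numeric flag value saved from one module should not be handed to the other; passing the named constants of the module actually in use avoids the problem.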

4.8.3. Syntax changes between re and the second regex module

The second regex module introduces a number of powerful syntax innovations.

TODO more here.

Appendices

† Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar. It incorporates many but not all distinctively 2.x features, and is accordingly not a true Python 1.x release; hence, “What’s new in Python 2.0” compares it with Python 1.5.2.