Python RegEx through the ages
1. Introduction
Python has gone through three regular expression modules during its history (regexp, regex and re) and a fourth has been greenlighted (also called regex, confusingly). Furthermore, more than one of these were reïmplemented at least once during their existence. As they were superseded, disappeared from the documentation and were finally removed from the distribution, I am very much aware that the interface and functionality of the older modules may well be an important thing to have a reference for when trying to understand legacy Python code. The module name collision only adds to this, so I thought it would be sensible to write this up.
Further to the goal of allowing understanding and porting of legacy Python code, and in case any unwisely written code depends on particular details of an individual implementation, I also attempt to “document the undocumented” in terms of implementation details, low-level interfaces, numerical values of constants et cetera which did not qualify for coverage in the standard library manuals.
It is also of interest from an academic perspective, as each new module and each revision of an existing module contributed to the current interface as it stands. Outside of Python, the history of Python regular expressions is also fundamentally tied with the history of named groups in regular expressions, so this may be of passing interest to that topic.
This document is currently a work in progress, though an overview table is fairly complete. It also currently covers CPython only; coverage of Jython, PyPy or IronPython is a more distant goal, though the basic API should be compatible between them.
While limited detail on this is discernable from Python’s HISTORY file, this is not adequate to establish the API changes between the modules themselves. Other sources include old source distributions, historical VCS (which is currently incomplete for the earliest releases), older versions of the documentation, et cetera. In any case, I felt it sensible to write it up in one place. So here it is: the history of Python’s regular expressions.
2. First API
Python’s original regular expression support accepted Posix Extended syntax, and could use either UNIXv8 regular expressions or Henry Spencer’s reïmplementation (a version of which was included).
This was present in Python 0.9.1, which incorporated only minor changes from Python 0.9.0, the first public source release. Python 0.9.1 was posted on Usenet alt.sources and consequently preserved in archives (a tarball conversion was previously offered on python.org and still is on legacy.python.org).
The same cannot be said of the vast majority of Python 0.9.x release packages, which were distributed for limited periods only on CWI’s FTP site. While the HISTORY file does record the changes between these versions, taken from previous versions of the NEWS file, it does not do so in especially great detail (the NEWS file entries of 0.9.x are the equivalent of the “What’s new in” documents of 2.x and 3.x, not the detailed changelogs which they now are).
The cpython-fullhistory repository does however provide an archive of old VCS, tagged back to 0.9.8 (it actually goes back to 1990, before 0.9.0), but doesn’t seem to include all the files that made it into the release (in particular, regexpmodule.c is nowhere to be seen). Additionally, its directory structure in these old commits seems to be influenced by later file moves/renames rather than preserving the original directory structure evident in the release package. I have come to suspect that files which ceased to exist before a certain revision simply are not preserved. While that repository is now Mercurial, before that it was Subversion, before that it was CVS, and I don’t know whether even that was the first, so it probably cannot be assumed to be as useful or dependable as, e.g., modern Git.
2.1. The regexp module (original version)
Guido’s Python 0.9.1 library reference (provided as LaTeX in the alt.sources release) gives the following documentation:
3.4 Built-in Module regexp
This module provides a regular expression matching operation. It is always available. The module defines a function and an exception:
compile(pattern) Compile a regular expression given as a string into a regular expression object. The string must be an egrep-style regular expression; this means that the characters '(' ')' '*' '+' '?' '|' '^' '$' are special. (It is implemented using Henry Spencer’s regular expression matching functions.)
regexp.error Exception raised when a string passed to compile() is not a valid regular expression (e.g., unmatched parentheses) or when some other error occurs during compilation or matching ("no match found" is not an error).
Compiled regular expression objects support a single method:
exec(str) Find the first occurrence of the compiled regular expression in the string str. The return value is a tuple of pairs specifying where a match was found and where matches were found for subpatterns specified with '(' and ')' in the pattern. If no match is found, an empty tuple is returned; otherwise the first item of the tuple is a pair of slice indices into the search string giving the match found. If there were any subpatterns in the pattern, the returned tuple has an additional item for each subpattern, giving the slice indices into the search string where that subpattern was found.
Licence for the above documentation:
Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.
All Rights Reserved
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
2.2. The regexp module (versions from Python 0.9.2+)
The above API documentation is correct for Python 0.9.1. Following that version, the API was changed:
- The exec method was renamed to match.
- An additional match(pat, str) global function was added. This would compile the pattern and then call its match method, although at least some versions would reüse a compiled pattern when the same pattern was supplied consecutively.
Determining when these changes were made is complicated by regexpmodule.c not being tracked for reasons speculated above. The HISTORY file is silent on this, understandable for the second but somewhat frustrating for the first. By checking changes to API usage in other modules which are tracked, namely the grep module, it is however apparent that the .exec method had been renamed or at least aliased to .match by the time Python 0.9.2 was released.
In Python 0.9.5, the first regex module was introduced and the regexp module was reimplemented in Python as a wrapper. By this point, the match global function had been added, and the method was available only as .match.
The built-in exec function became a keyword in Python 1.0 (this decision was reversed in Python 3.0). In response, os.exec became os.execv. One might be tempted to speculate that the renaming of the regular expression .exec method in Python 0.9.2 may have been motivated by plans to do this, despite being somewhat far in advance. However, considering that os.exec was introduced under that name in Python 0.9.2, this is almost certainly not the case. Nonetheless, the renaming may have been influenced by the introduction of os.exec.
The regexp module was removed in Python 1.5.
3. Second API
Python 0.9.5, which introduced the regex module, was “released as Macintosh application only, 2 Jan 1992” per the HISTORY file, and appears to have used “GNU regex.c, from subdirectory regex”, per contemporary source comments in VCS. In Python 0.9.6 (evident from both VCS and the HISTORY file), Guido switched to a non-copylefted libre reïmplementation of GNU regex which Tatu Ylönen had written and posted to comp.sources.misc, thus avoiding bringing Python under the GPL.
3.1. The first regex module
Official documentation for the first regex module: Python 1.5.1, 1.5.2, 1.6† (undocumented in Python 2.0+)
Not to be confused with the modern, identically named (second) regex module planned for future inclusion in the standard library.

This was the main module, the C module, and the direct successor of the regexp module. However, there were many important differences from the comparatively simple first API:
- Defaulting to Emacs-style regular expression syntax, not to Posix Extended. In particular, escaped \( , \| , \) were used for groups and unescaped ( | ) were literal. The \s syntax table escapes (e.g. \s<) and the \= escape were not supported, due to not actually being attached to Emacs.
- A new global set_syntax function to set the syntax-flags integer; the regex_syntax module supplied these flags and precomposed such integers for the styles of awk, grep et cetera. The first API was emulated using “awk” mode, not “egrep” mode, because the latter mode treated newline as an “or” operator. Calling set_syntax also returned the previous syntax-flags integer so that it could be restored. Needless to say, this could be extremely thread-unsafe, particularly as there was originally no way to query the syntax without changing it (a get_syntax function was added much later, but not before the third API had already been introduced).
- The match method looked only at the start (or provided offset), while the new search method would scan the entire string. Both accepted an optional second argument (pos), which specified an offset to start looking from. Both were still available as global functions with single-pattern caches, but the usefulness of these is limited for the reason explained below (and they didn't accept the pos argument).
- match and search no longer returned tuples of tuples of slice indices. Instead, match returned a length (or -1 if it did not match) and search returned an offset (or -1). The tuple of tuples of slice indices was made available as the regs (for “registers”) attribute of the compiled regular expression object itself (…yes, really) following matching/searching, which was None if there was no match. Needless to say, this was not thread-safe.
- The first-API regexp.py file (wrapping the second-API first regex module) contained code for removing zero or more (-1, -1) tuples (i.e. groups that did not contribute) from the end of regs before returning. This presumably represents a further API difference.
The following additions were made to the API in Python 0.9.9:
- The compile function now accepted an optional second argument, called translate. This took a length-256 string of bytes onto which input bytes were to be mapped. The casefold global provided one of these for ASCII casefolding. Needless to say, this was not exactly Unicode-ready, but this was Python 0.9.9 and the early 1990s, and Unicode support was a major introduction in Python 2,† by which point the third API was already in place.
- Compiled regular expression objects provided a group method which returned the string matched by the group of a given number. If multiple arguments were given, a tuple was returned. Passing no arguments returned all groups.
- Additional properties (besides regs) were made available on the regular expression object:
  - last — the last string passed to match/search if a match was found. Otherwise None.
  - translate — the translate argument with which the compiled regular expression object was created. Otherwise None.
The symcomp function was added in Python 1.0. This worked as compile, but would parse a <name> string at the start of a parenthetical group as a group name, which could be passed to the group method. The Ylönen backend was modified to export the syntax code so it was freely visible to symcomp (but still not visible without side effects to native Python code until get_syntax was added much later). Supporting this, three new regular expression object properties were added:
- givenpat — the regular expression pattern from which the regular expression object was compiled.
- realpat — the regular expression pattern, but with group names stripped if symcomp was used.
- groupindex — dict mapping group names to group indices.
Following Emacs style, \b for word boundaries included the start and end of the string unconditionally, and \B for word non-boundaries excluded the start and end of the string unconditionally. The escapes \< for word start boundaries and \> for word end boundaries, on the other hand, would more conventionally match at the start and end of the string (respectively) if next to a word character.
Similarly, the underscore was not initially included in the definition of \w for word characters, again following Emacs style. This seems to have been changed in Python 1.5 as one of Jeffrey Ollie’s revisions, although without updating the documentation for the regex module (I’m unsure to what extent affecting the regex module was intentional here, since the change accompanied the addition of extra syntax table flags used only by the re1 module—see below). The Emacs syntax \_< and \_> for “symbol” boundaries (i.e. like \< and \> but treating the underscore as part of the word/symbol) was never supported.
Having been emitting DeprecationWarning since at least Python 2.1, the first regex module was at length removed in Python 2.5.
3.2. The regex_syntax module
The regex_syntax module was added in Python 0.9.5 and thereäfter functionally changed only once, six years later (in Python 1.5, with the addition of two more flags). Subsequent minor changes post-obsolescence added a module docstring and converted tabs in the comments to spaces.
The regex_syntax module was never separately documented; the documentation advised reading its source code. It defined the following syntax flags:
Flag name | Value | Description |
---|---|---|
RE_NO_BK_PARENS | 1 | Make plain () grouping and escaped \(\) act literal. |
RE_NO_BK_VBAR | 2 | Make plain | act as an or-operator and escaped \| act literal. |
RE_BK_PLUS_QM | 4 | Make plain +? act literal and escaped \+\? act as operators. |
RE_TIGHT_VBAR | 8 | Make | bind tighter than ^$ . |
RE_NEWLINE_OR | 16 | Make line breaks in the pattern behave as or-operators. |
RE_CONTEXT_INDEP_OPS | 32 | As (inaccurately) documented in source code comments, treat ^$*+? as literal in contexts where they don’t otherwise make sense. Actually, the other way around (makes them an error in contexts where they don’t make sense, as one might deduce from the flag name). |
RE_ANSI_HEX | 64 | Added to the module in 1.5: process \n , \x00 et cetera. Bizarrely, this seems to also have turned on e.g. \v10 for accessing group number 10, which actually collides with \v for vertical tab; turning it off would have made more sense. |
RE_NO_GNU_EXTENSIONS | 128 | Added to the module in 1.5; disable syntax for matching the start of the entire string (\` ), end of the entire string (\' ), word boundaries and word characters. |
The following pre-composed (by bitwise-or) flag combinations were also defined. Note that the {count}, {minimum,} and {minimum,maximum} syntaxes for repetitions were not supported at all, hence the lack of a flag for them (Emacs and grep styles backslash them, while egrep style does not).
Mode name | Flags |
---|---|
RE_SYNTAX_AWK | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS |
RE_SYNTAX_EGREP | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS, RE_NEWLINE_OR |
RE_SYNTAX_GREP | RE_BK_PLUS_QM, RE_NEWLINE_OR |
RE_SYNTAX_EMACS | No flags set (zero) |
The regex_syntax module was removed in Python 2.5.
3.3. The regsub module
Official documentation for the regsub module: Python 1.5.1, 1.5.2, 1.6† (undocumented in Python 2.0+)
The regsub module was added in either Python 0.9.7 or Python 0.9.8, and provided the functions listed below. These may seem peripheral, but they are quite important as a point of comparison with the third API. The main reason why they weren’t in the first regex module is presumably that the latter was a C module, whereas these were written in Python.
The following functions were present initially:
- sub (pat, repl, str)
- gsub (pat, repl, str)
- split (str, pat)
- Internal compile (pat)
- Internal expand (repl, regs, str)
These are mostly self-explanatory. The sub function replaced one instance, using backslash notation in the replacement for references to subpatterns. The gsub function did the same for multiple instances, but with zero-length matches adjacent to a previous match not counted (other zero-length matches were). The split function did not split at zero-length matches.
The regsub.compile internal function used by the others wrapped the usual one with its own multiple-pattern caching of compiled regular expression objects, which wouldn’t always have been a good idea due to the idiosyncrasies of the API. In particular, it did not originally keep track of the syntax flags, so it would have to be manually cleared when they were changed. This could be achieved through the somewhat crude assignment regsub.cache = {}. It was later changed to keep track of syntax flags (keying the cache with a (pat, syntax) tuple) and to offer a clear_cache() function, but only once the regex.get_syntax function had been added, by which point it had already been superseded by the third API.
The regsub.expand internal function, used by sub and gsub, took a replacement string containing backslash-references to groups (e.g. \1), along with the necessary match data (the regs tuple and the string which it indexes), and returned the substituted string.
The following API extensions were made in Python 1.4.
- capwords (s[, pat])
- split (str, pat[, maxsplit])
- splitx (str, pat[, maxsplit])
Python 1.4beta3 added a maxsplit argument to regsub.split, matching the invocation of string.split. It also added splitx as a version which retains the delimiters. This was used by capwords. When capwords was added in 1.4beta1, a different, incompatible use of the third argument to split was added for the same purpose, but this never made it into any non-alpha version of Python and so is largely inconsequential (hence breaking compatibility with it was not a concern).
Having been emitting DeprecationWarning since Python 2.1, the regsub module was at length removed in Python 2.5.
4. Third API: the re module
Due to the fundamentally thread-unsafe nature of the second API, a third API was introduced in Python 1.5 in the re module, accompanied by a switch to the more powerful Perl-style regular expression syntax.
The other thing which was introduced here was the separation of an implementation-specific low-level backend module written in C from a high-level API written in Python, rather than the entire main module being written in C. This allowed the functionality of the regsub module to be integrated into the re module, rather than being a separate Python module.
4.1. API changes from the first regex module to re
The third API introduced the concept of a “match object”: contrasting with the second API’s thread-unsafe practice of setting and exposing the properties and methods relating to specific match/search results on the compiled regular expression objects themselves, the match and search methods return “match objects” (not integers) upon which these are set and exposed. This allows the module to be fully thread-safe. A return value of None is encountered in the absence of a match; as None has a false boolean value while a match object has a true boolean value, this makes checking for a match very succinct.
The distinction between match (matches at the start / specified position) and search (matches anywhere) remains. When invoked as methods, they accept both pos and endpos arguments to define the start and end of the range to match from / search over. When invoked as globals, they accept neither argument.
While the regs attribute, now on the match object, remains in the module, it is not documented as part of the API, so is presumably not supposed to be accessed directly anymore (though it still sometimes is). Equivalent functionality is provided through the span method, which returns the (start, end) index tuple for a given group name or number, while the start and end methods return only the one index (the indices being, as previously, -1 if the group did not contribute). This allows names to be used interchangeably with numbers, unlike indexing regs directly.
The group method is accordingly now on the match objects but, except in very early versions (see re1 below), it now behaves like start, end and span in defaulting to group 0 when no arguments are supplied (rather than returning all groups). A new groups method was added to return all groups (1 and up).
Finally, symcomp is not present, because a dedicated named group syntax is supported by compile (see below).
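To make the new interface concrete, here is a minimal sketch against the present-day re module (the pattern and sample string are invented for illustration):

```python
import re

pattern = re.compile(r"(\w+)-(\d+)")

# search() scans the whole string; match() only tries at the start (or at pos).
m = pattern.search("ticket ABC-123 closed")
if m:                            # a match object is truthy, None is falsy
    print(m.group(0))            # 'ABC-123' -- group 0 is the whole match
    print(m.groups())            # ('ABC', '123') -- groups 1 and up
    print(m.span(2))             # (11, 14) -- slice indices of group 2
    print(m.start(1), m.end(1))  # 7 10

print(pattern.match("ticket ABC-123 closed"))  # None: no match at position 0
```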
4.2. API changes from regsub to re
The functionality of regsub is, due to the re module itself being written in Python, incorporated into re. Some differences to note:
- capwords is absent.
- sub behaves like the old gsub by default (gsub is noted to have been more commonly used). While the first three arguments are the same as before, it takes a count integer as an optional fourth argument. This defaults to 0, meaning no limit, but it can be set to the number of substitutions to make, e.g. 1 to emulate the old sub (see the sketch following this list).
- subn is invoked like sub, but returns a tuple of the result and the number of substitutions made.
- split will include the content of any capturing parenthetical groups in the return value, thus negating the need for splitx. A bug in 1.5.0 caused maxsplit to be ignored due to never incrementing the loop counter; this was fixed in 1.5.1 and later.
- The existing behaviour of split and gsub regarding zero-length matches was retained in their successors until Python 3.7, when they were changed to consider all zero-width matches.
- All of the above are, whilst still available as global functions, also available as methods on the compiled regular expression objects themselves.
- The flags argument explained below was added as a further optional argument to the global functions for the above in Python 3.1, which change was backported to Python 2.7. This was consistent with match and search already accepting it.
- The global purge function is equivalent to latter-era regsub’s clear_cache function.
- findall was added in Python 1.5.2. Invoked with the same pattern as match and search (though the global one didn’t acquire the optional flags argument until Python 2.4, which was still ahead of sub and friends), it returns a list of matching strings (left to right without overlaps). If the pattern contains a capturing group, matches of that group are returned instead of matches of the entire pattern. If the pattern contains multiple capturing groups, a list of tuples is returned.
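To illustrate several of the points above, a short sketch using the modern re module (the pattern and text are my own, chosen only for demonstration):

```python
import re

text = "2023-01-05, 2024-02-29"

# sub() behaves like the old gsub() by default; count=1 emulates the old sub().
print(re.sub(r"\d{4}", "YYYY", text))            # 'YYYY-01-05, YYYY-02-29'
print(re.sub(r"\d{4}", "YYYY", text, count=1))   # 'YYYY-01-05, 2024-02-29'

# subn() also reports how many substitutions were made.
print(re.subn(r"\d{4}", "YYYY", text))           # ('YYYY-01-05, YYYY-02-29', 2)

# split() keeps the contents of capturing groups, covering the old splitx().
print(re.split(r"(\s*,\s*)", text))              # ['2023-01-05', ', ', '2024-02-29']

# findall() returns matched strings, or tuples if there are multiple groups.
print(re.findall(r"(\d{4})-(\d{2})-\d{2}", text))  # [('2023', '01'), ('2024', '02')]
```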
4.3. Syntax changes from the first regex module to re
The new API was also taken as an opportunity to switch to the Perl-style regular expression syntax (a more powerful extension of the Posix Extended syntax), as opposed to the adjustable Emacs-style syntax previously supported. This means that:
- Unescaped ( | ) are always grouping (while this was possible behaviour under the first regex module, it was not the default).
- Minimal (non-greedy) matching is possible, by using +? and *? as non-greedy equivalents to + and *.
- The character class escapes \d \D for digits (and non-digits) and \s \S for whitespace join the existing \w \W for word characters. The \w definition intentionally includes the underscore.
- The \< \> escapes (for specific word boundaries) are dropped, while \b (for word boundaries per se) is retained. Except in re1, the behaviour of \b at the start and end of the string follows the union of the former \< and \>, rather than unconditionally matching; \B is in any case its inverse. Again, the underscore is intentionally included within a word.
- The less problematic and more immediately visually distinct \A and \Z replace \` and \' for the absolute start and end of the entire string. The ^ and $ also match only the start and end of the first and last line respectively, not the start and end of all lines, unless in multiline mode (see below), although $ still differs from \Z in that $ can match before the last of any trailing newlines (although it matches the absolute end of the string as well).
- The {count}, {minimum,} and {minimum,maximum} syntaxes for repetitions are supported.
- The \v syntax for groups with numbers greater than 9 is not supported; instead, the plain backslash-digit syntax is extended to take up to two digits, provided the first digit is not zero.
- Several extensions are available taking the otherwise-invalid (?…) form, including non-capturing groups and (insofar as this is supported by the underlying engine) look-ahead and look-behind assertions.
The (? syntactical extensions in the Perl syntax were further extended with named group support (noted in source comments to be “Python extensions”), superseding the earlier symcomp system, but keeping a similar syntax (with (?P<like>this) replacing the earlier (<like>this)). These extensions were later adopted by PCRE and by Perl, but with the inserted P made optional (it is still mandatory in Python). Needless to say, symcomp itself was not retained.
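A brief runnable sketch of the named-group syntax (and, for contrast, minimal matching) under the current re module; the patterns are invented for illustration:

```python
import re

# (?P<name>...) is the named-group syntax that superseded symcomp's (<name>...).
m = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "colour=red")
print(m.group("key"), m.group("value"))   # colour red
print(m.groupdict())                      # {'key': 'colour', 'value': 'red'}

# Greedy versus minimal (non-greedy) repetition:
print(re.findall(r"<.+>",  "<a><b>"))     # ['<a><b>'] -- greedy
print(re.findall(r"<.+?>", "<a><b>"))     # ['<a>', '<b>'] -- minimal
```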
A reconvert module was added to aid in conversion of existing patterns, but was later removed in Python 2.5.
4.4. Original flags of the re module
The adoption of Perl-style regular expressions also saw the removal of the existing syntax flags (though they were still accepted by reconvert) and the addition of the set of flags used by Perl-style regular expressions. In keeping with thread-safety, these flags are set for each compiled regular expression object and passed to compile and some other global functions such as match, as opposed to being set globally.
They can also be specified at the start of the pattern itself. The original such flags (supported by pre and noted as being “standard flags” in sre_parse source comments) were as follows. These constants were also made available under shorter names corresponding to the uppercase of their letter codes. Actual numerical values can and do differ between implementations, and are thus listed in the implementation details further below. Also note that this is not a complete list of the flags currently regarded as part of the API; see the documentation below for the sre implementation for the rest.
Syntax | Flag | Meaning |
---|---|---|
(?i) | IGNORECASE | Match the pattern case-insensitively; supersedes use of the casefold string. |
(?L) | LOCALE | Not in re1 ; makes \w\b et cetera and IGNORECASE follow the single-byte locale, not just ASCII. |
(?m) | MULTILINE | Makes ^$ match the start and end of any line, not just the respective first and last lines. |
(?s) | DOTALL | Makes . include the newline. |
(?x) | VERBOSE | Outside a hard-bracketed character class, whitespace and anything between # and newline ignored. |
※ Explanatory note: the ^ matches the start of a string and, in multiline mode, the start of a line. If the string ends in a newline, $ will match both before and after that newline, while in multiline mode it will also match the end of every line. The \A and \Z escapes (which replace the first regex module’s \` and \') match only the actual start and end of the string.
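A small sketch of the flag mechanism and of the ^/$ versus \A/\Z distinction, runnable against the present-day re module (the sample text is invented):

```python
import re

text = "first line\nsecond line\n"

# $ can match before a trailing newline as well as at the absolute end;
# \Z matches only at the absolute end of the string.
print(bool(re.search(r"line$", text)))    # True  ($ matches before the final '\n')
print(bool(re.search(r"line\Z", text)))   # False (\Z requires the very end)

# MULTILINE makes ^ and $ match at every line boundary, whether passed as a
# flag argument or embedded at the start of the pattern.
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
print(re.findall(r"(?m)^\w+", text))            # same result
```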
4.5. The re1 implementation of the third API
The short-lived first implementation of the re module was written by Jeffrey Ollie and backed by the reop module, a newly introduced direct interface to the pattern bytecode handling (as opposed to compilation) routines of Ylönen’s engine (which Ollie had substantially refactored). It was introduced in 1.5.0alpha3, and was superseded shortly thereafter in 1.5.0alpha4 by a new implementation of re (the one later renamed to pre). The original implementation was retained as re1 for the 1.5.0 release, then removed.
A cursory read through this module is very telling of the struggles to support a regular expression syntax not natively supported by the engine used. The bytecode compilation of the regular expression is done in pure Python, making the main module quite lengthy in comparison to its immediate replacement. Support is, perhaps understandably, still limited to what can be achieved with the same pattern-bytecode engine: in particular, it would appear that trying to use look-ahead assertions will raise an error with “zero-width positive [or negative] lookahead assertion is unsupported”.
The ALL_CAPITAL constants and the error exception from the backend reop module were exposed through. Other constants exported included the usual flags (not yet including LOCALE, UNICODE or ASCII), which were assigned the following values:
Flags | Value |
---|---|
IGNORECASE | 1 |
MULTILINE | 2 |
DOTALL | 4 |
VERBOSE | 8 |
In the only version of this implementation ever officially released (i.e. in Python 1.5.0, as re1), the RegexObject.split method never actually increments its loop counter, so its maxsplit argument actually does nothing. While this bug was also present in the main re (later pre) module in that version, it was fixed in Python 1.5.1, by which point the re1 module had been removed.
Uniquely among released versions (for a given value of “released”) of the re module, however, there is no groups method: the group method (despite now being on the match objects) still behaves as in the first regex module, returning all groups (1 and up) when no arguments are passed.
4.6. The pre implementation of the third API
Official documentation for the re module covering the pre implementation: Python 1.5.1, 1.5.2, 1.6†
In Python 1.5.0alpha4, the re.py which had been introduced only one alpha version earlier was deprecated and moved to re1.py (in Guido’s words, “just in case you need it for comparison”), being replaced with a new (API-compatible) re.py using a contemporary (late 1990s) version of Philip Hazel’s PCRE (Perl Compatible Regular Expressions) engine. Adoption of PCRE enabled use of look-ahead and look-behind assertions, which had been unavailable in re1 due to the limitations of the underlying engine. This accordingly became the first module to make it into a release under the re name.
Andrew Kuchling credits Neal Becker with bringing this engine to the attention of the Python String Special Interest Group (String-SIG), mentioning that it had been written for Exim but had been attracting attention due to Perl’s own regex code not being readily isolatable.
This new re.py was substantially shorter, as it offloaded the work of compiling (not just matching) the regular expressions onto the underlying pcre module (of no relation to the more recent binding module by the same name). Much of the other code was largely recycled from the original re/re1 though, including for example the split method, bringing with it the maxsplit bug (which was eventually fixed in Python 1.5.1).
Since the re/pre module’s first proper release in Python 1.5.0, and differing from re1, the group method now behaves like start, end and span in returning group 0 when no arguments are supplied; a new groups method will return all groups 1 and up (although in Python 1.5.0 itself, it would return a string in cases where a singleton tuple would be expected: this was fixed in Python 1.5.1). This new behaviour was inherited by subsequent implementations.
The flags had values as follows. The ANCHORED flag is used internally by the match method and is not intended to form part of the interface. As not all supported flags were exported in Python, but the unexported ones could theoretically be used as magic numbers, the names given to them in the C code are also listed.
Name from Python | Name from C | Value |
---|---|---|
IGNORECASE | PCRE_CASELESS | 1 |
VERBOSE | PCRE_EXTENDED | 2 |
ANCHORED | PCRE_ANCHORED | 4 |
MULTILINE | PCRE_MULTILINE | 8 |
DOTALL | PCRE_DOTALL | 16 |
(not exported) | PCRE_DOLLAR_ENDONLY | 32 |
(not exported) | PCRE_EXTRA | 64 |
(not exported) | PCRE_NOTBOL | 128 |
(not exported) | PCRE_NOTEOL | 256 |
LOCALE | PCRE_LOCALE | 512 |
A groupdict method, which returns a mapping of group names to matched strings, was added to the match objects of pre in Python 1.6,† as well as being supported by the then-new sre.
The pre implementation of the re module was never updated for more recent versions of PCRE, being instead superseded by the (API-compatible) sre/re module. It was retained as pre with a frozen PCRE version until it was removed altogether in Python 2.4, leaving future PCRE support open for third-party bindings.
4.7. The sre implementation of the third API
Official documentation for the re module covering the sre implementation: Python 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.current
SecretLabs SRE was written by Fredrik Lundh for Python as a Unicode-supporting implementation of the third API (introducing the UNICODE flag), and was introduced in Python 2.0 (actually 1.6).† Because it initially used a recursive matching scheme which could potentially run into stack limits (changed in Python 2.4.0alpha1), and because the SecretLabs engine might behave differently to PCRE, the older implementation was initially retained as pre, with the SecretLabs implementation being introduced as sre.
Upon the introduction of sre, the re module was set up to import-star from either sre or pre; while it was configured to use sre in the standard source tree, this was supposed to be edited by sites/vendors where necessary. The pre module was removed in Python 2.4, now that sre had been changed to match non-recursively, leaving re as a mere alias to sre. Importing the module as sre was then deprecated in Python 2.5, with the actual module code being moved to re. The sre name was then removed in Python 3.0, leaving re as the sole name of the module.
Versions of the sre implementation added several flags absent from re1 and pre. Of these, UNICODE, ASCII and eventually DEBUG made it into the documentation and may accordingly be considered later additions to the essential API, while TEMPLATE appears still to be an implementation detail.
As text strings were made Unicode by default in Python 3.0, UNICODE matching became the default. Matching in accordance with pure ASCII (i.e. despite the passing of Unicode strings) was added as the ASCII flag, while UNICODE became a no-op, retained for compatibility only. While LOCALE was retained (for use on byte strings only), its use is discouraged, and its usefulness is limited given that text strings are likely to be Unicode to begin with.
Syntax | Flag | Added in | Meaning |
---|---|---|---|
(?u) | UNICODE | 1.6 | Make \w\W\b\B et cetera and IGNORECASE follow Unicode (no-op in 3.x). |
(?t) | TEMPLATE | 1.6 | Disable backtracking (not a documented flag). |
(none) | DEBUG | 2.1 | Prints the pattern bytecode disassembly following compilation. |
(?a) | ASCII | 3.0 | Make \w\W\b\B et cetera and IGNORECASE follow plain ASCII. |
The actual numerical values both of these and of the other flags are as follows. In the original Python 1.6† version, DEBUG and ASCII were absent, but the numerical values were otherwise the same as in the current version.
Constant | Value |
---|---|
TEMPLATE | 1 |
IGNORECASE | 2 |
LOCALE | 4 |
MULTILINE | 8 |
DOTALL | 16 |
UNICODE | 32 |
VERBOSE | 64 |
DEBUG | 128 |
ASCII | 256 |
Although PCRE was subsequently updated with Unicode support, it was not re-adopted by the Python Standard Library (the existing modules were maintained as legacy and then removed), and descendants of the original sre module have been used to this day. This means that subsequent improvements to PCRE, and subsequent additions to its syntax, did not percolate down to Python, although a third-party binding exists for more recent PCRE versions.
The sre implementation in Python 2.0 (or possibly 1.6) seems to have introduced the .expand method of match objects, per its appearance in the documentation; despite no mention being made of it being new, it was never provided by any revision of re1 or pre. Interestingly, this seems to be the first time such a function was treated as part of the API rather than as an implementation detail. It functions much like its regsub predecessor, only with the addition of the \g<name> syntax for named group references. However, since it is a method on the actual match object, the only argument it takes is the repl string to expand. The same functionality had been provided by the internal global _expand in re1 and the internal global pcre.pcre_expand in pre, both of which took the match object as the first argument.
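A short sketch of .expand as it behaves in the current re module (pattern and replacement templates of my own devising):

```python
import re

m = re.match(r"(?P<user>\w+)@(?P<host>[\w.]+)", "guido@example.org")

# Numbered references use the regsub-style backslash syntax;
# \g<name> references named groups.
print(m.expand(r"host \2, user \1"))       # 'host example.org, user guido'
print(m.expand(r"\g<user> at \g<host>"))   # 'guido at example.org'
```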
Python 2.2 introduced finditer, which is similar to findall, differing only in that (a) it is a generator function and (b) it yields match objects as opposed to strings. Although pre was still being included at this point, finditer was not backported to it.
Python 2.4 introduced the (?(groupid)then|else) syntax for making part of a match conditional on another group having participated in the match.
The existing match and search operations were joined in Python 3.4 by the fullmatch operation, which will only match the entire string, or the entire range between pos and endpos.
Python 3.6 introduced the ability to obtain groups as strings using indexing syntax, i.e. aliasing group to __getitem__.
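These additions can be seen together in a small sketch against a current re (Python 3.6 or later; the sample data is invented):

```python
import re

pattern = re.compile(r"(\w+)=(\d+)")

# finditer() yields match objects rather than strings (Python 2.2+).
for m in pattern.finditer("a=1 b=22 c=333"):
    print(m[1], m[2])                      # indexing aliases group() (Python 3.6+)

# fullmatch() succeeds only if the whole string (or pos..endpos range) matches.
print(bool(pattern.fullmatch("a=1")))      # True
print(bool(pattern.fullmatch("a=1 b=2")))  # False
```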
Behaviour of sub, subn and split with regard to zero-length matches was changed in Python 3.7 to be more logical: split was changed so that it will split a string at a zero-length match, and sub so that zero-length matches adjacent to a non-empty match are also replaced. This changed a behaviour which had lasted since the original regsub module in Python 0.9.8.
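A minimal sketch of the difference; the printed results assume Python 3.7 or later (earlier versions either skipped such matches or refused to split on them):

```python
import re

# On Python 3.7+, split() will split at a zero-length match such as \b.
print(re.split(r"\b", "one two"))   # ['', 'one', ' ', 'two', '']

# On Python 3.7+, a zero-length match adjacent to a previous non-empty match
# is also replaced: '-a-b--c-' here, versus '-a-b-c-' on earlier versions.
print(re.sub(r"x*", "-", "abxc"))
```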
4.8. The second regex module
PyPI package: regex
In 2008, Matthew Barnett submitted a bug ticket including a major reworking of (at the time) the Python 2.5.2 re module (i.e. the sre implementation), adding atomic grouping and possessive quantifiers (i.e. variants of groups and greedy quantifiers which cannot backtrack), as well as variable-length look-behind assertions. He also mentioned that it was typically twice as fast as the standard one. This effort was joined by Jeffrey C. Jacobs (timehorse), who had already been working on improvements to the re module, slated at the time for Python 2.7.
※ Explanatory note: a lazy quantifier (e.g. +? or *?) matches the fewest possible of something that will allow the rest of the pattern to match. A greedy quantifier (e.g. + or *) matches as many of something as possible that will allow the rest of the pattern to match. A possessive quantifier (e.g. ++ or *+) matches as many of something as are present, even if this causes the rest of the pattern to fail.
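A quick illustration of the three kinds of quantifier, assuming the third-party regex package is installed (pip install regex); the standard re only gained possessive quantifiers later, in Python 3.11:

```python
import regex

sample = '"a" and "b"'

print(regex.match(r'".+"',  sample).group())   # '"a" and "b"' -- greedy
print(regex.match(r'".+?"', sample).group())   # '"a"'         -- lazy
print(regex.match(r'".++"', sample))           # None -- the possessive .++ also
                                               # swallows the final quote and
                                               # cannot give it back
```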
The changes became quite dramatic, with one of them radically reörganising the code, greatly reducing the number of support modules, changing the engine to use a node network rather than a linear bytecode sequence, and making other improvements. In light of this, Georg Brandl suggested releasing it as a stand-alone package, on the basis that it would be difficult to review, expecting it to have acquired enough use for any issues to have been ironed out by the time Python 2.7 came around. Barnett promptly renamed it to regex so it could be installed as an extension module (although it is unrelated to the historic module by that name).
Suffice it to say that it was not included in Python 2.7, although it does have a mention in the documentation and “in principle approval for eventual stdlib inclusion”, pending a PEP to sort out the details. Additionally, although the original SecretLabs engine was intended to add support for Unicode regular expressions, and succeeded in adding basic support, it has been the subject of criticism for not providing selectors for Unicode properties (besides those which correspond to the syntaxes for ASCII regular expression properties), for using casemapping for case-insensitive matching rather than the more appropriate casefolding, and for not offering proper handling of multi-codepoint grapheme clusters (e.g. how they might affect what constitutes sequences/bounds of word characters). The second regex module is considered vastly improved in this respect. (The bigger egregious flaw mentioned, of Python 2.7 and 3.2 treating Unicode strings as UCS-2 in several respects on Windows (as a result of PEP 261), was finally, and with much rejoicing, fixed in the very next version (Python 3.3) with the adoption of PEP 393.)
4.8.1. API extensions made by the second regex module
Some API extensions relative to the latest (Python 3.7) version of re are given below. Some of the simpler changes, such as the VERSION1 behaviour for zero-length matches and the addition of fullmatch, have also percolated through to recent versions of the standard re.
- splititer as a generator version of split
- overlapped argument for findall and finditer
- pos and endpos available on sub and subn
- For use with matches with repeated groups, captures, capturesdict, starts, ends and spans work like group, groupdict, start, end and span respectively, but return a list of values for every instance of that group in the match, not only the last (in the case of capturesdict, a dict with such lists as values); see the sketch following this list.
- literal_spaces argument to escape, suppresses escaping of spaces.
- special_only argument to escape, escapes only special characters.
- detach_string method on the match object, deletes the reference to the original string (useful if it is a large string).
- expandf, subf and subfn, differing from the ones without an f in that they use str.format-style format strings instead of backslash syntax.
- partial argument to match, search, fullmatch and finditer, allows truncated matches (i.e. where the pattern is still matching up, but runs into the end of the string before it is finished). Mainly useful for validating an input field whilst it is still being filled in. The match objects are given a partial attribute indicating whether the match is partial in this sense.
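A few of these extensions in a short sketch, again assuming the third-party regex package is installed; patterns and sample data are invented, and exact outputs may vary slightly between regex versions:

```python
import regex

# captures() returns every string matched by a repeated group, not just the last.
m = regex.match(r"(?:(\w+),)+", "one,two,three,")
print(m.group(1))      # 'three' -- standard behaviour: last repetition only
print(m.captures(1))   # ['one', 'two', 'three']

# partial=True accepts a match that runs into the end of the input,
# useful for validating a field that is still being typed.
m = regex.match(r"\d{4}-\d{2}-\d{2}", "2024-0", partial=True)
print(m.partial if m else "no match")   # True

# subf() uses str.format-style templates; {1}, {2} refer to groups 1 and 2.
print(regex.subf(r"(\w+)=(\w+)", "{2} <- {1}", "a=1 b=2"))   # '1 <- a 2 <- b'
```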
4.8.2. Flag changes between re and the second regex module
In an effort to remain backward compatible with re, and to provide additional functionality, the second regex module introduces a few more flags:
Syntax | Flag | Meaning |
---|---|---|
(?V0) | VERSION0 | Conservatively match the behaviour of the re module, zero-length behaviour depends on Python version. |
(?V1) | VERSION1 | Support scoped flags, set operations, default to full case-folding, new zero-length behaviour always. |
(?f) | FULLCASE | Make IGNORECASE use full case-folding, implied by VERSION1 but can still be disabled. |
(?w) | WORD | Use Unicode definitions of word boundaries, and consider all line breaks (rather than just LF). |
(?r) | REVERSE | Begin searching from the end of the string. |
(?p) | POSIX | Return the leftmost longest match, as stipulated by POSIX (takes longer). |
(?e) | ENHANCEMATCH | When handling a fuzzy-match sequence, try to improve the fit of the match found. |
(?b) | BESTMATCH | When handling a fuzzy-match sequence, exhaustively search for the least deviant match, not the first. |
The values given to these flags and of the existing flags are as follows (note that ASCII and DEBUG have different values than in re, although all of the other flags are either new in the second regex module or match re):
Constant | Value |
---|---|
TEMPLATE | 1 |
IGNORECASE | 2 |
LOCALE | 4 |
MULTILINE | 8 |
DOTALL | 16 |
UNICODE | 32 |
VERBOSE | 64 |
ASCII | 128 |
VERSION1 | 256 |
DEBUG | 512 |
REVERSE | 1024 |
WORD | 2048 |
BESTMATCH | 4096 |
VERSION0 | 8192 |
FULLCASE | 16384 |
ENHANCEMATCH | 32768 |
POSIX | 65536 |
4.8.3. Syntax changes between re and the second regex module
The second regex module introduces a number of powerful syntax innovations.
TODO more here.
Appendices
- → Appendix A: Low level support modules of Third API implementations (reop, first pcre, _sre and sre_*, _regex)
- → Appendix B: Other third-party regular expression modules (including the second pcre module)
- → Appendix C: Summary table
† Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar: it incorporates many but not all distinctively 2.x features and is accordingly not a true Python 1.x release; hence, “What’s new in Python 2.0” compares it with Python 1.5.2.