
Python RegEx through the ages

1. Overview table

| Included with | Base API | Engine | Main module name | C module name | Other modules | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 0.9.0–0.9.1 | First (exec) | Henry Spencer | regexp | regexp | | Posix Extended format. First public release of Python. Very simple API. |
| 0.9.2–0.9.4 | First (match) | Henry Spencer | regexp | regexp | | exec method renamed to match. |
| 0.9.5 | Second | GNU | (1st)regex | (1st)regex | regex_syntax | Emacs-style format by default, but changeable with flags. |
| 0.9.6–2.4 | Second | Ylönen based | (1st)regex | (1st)regex | regex_syntax, later also regsub | Changed to engine written by Tatu Ylönen. Emitting DeprecationWarning since 2.1, finally removed in 2.5.0alpha1. |
| 0.9.5–1.4 | First (match) | (same as (1st)regex) | regexp | (1st)regex | | Wrapper around the contemporary (1st)regex module. Module removed in 1.5.0beta1. |
| 1.5.0 | Third | Ylönen based | re or re1 | reop | | Perl-style format. New thread-safe API defined. Introduced as re in 1.5.0alpha3, implementation never fully debugged before it was superseded in 1.5.0alpha4. Retained as re1 for the 1.5.0 release, removed before 1.5.1. |
| 1.5–2.3 | Third | PCRE (old) | re or pre | (1st)pcre | | Introduced in 1.5.0alpha4 using what was then (late 1990s) a contemporary PCRE library. In particular, Unicode regular expressions were not yet supported. Renamed to pre in Python 2. |
| 1.6–present | Third | SecretLabs based | sre or re | _sre | sre_compile, sre_parse, sre_constants | Introduced with Python 2 as sre due to the need for a re implementation that supported Unicode regular expressions. Initially used a recursive matching scheme, this was changed in Python 2.4.0alpha1. |
| PyPI python-pcre | mostly Third | PCRE | (2nd)pcre | _pcre | | Extension binding by Arkadiusz Wahlig. Uses the third API except for substitution format string syntax, and can be configured to use the third-API syntax there too. Lacks the scanner APIs and DEBUG and LOCALE flags of the third API. Current PCRE does support Unicode regular expressions; note the name collision. |
| Planned for future inclusion; PyPI regex | Third with extensions | SecretLabs based | (2nd)regex | _regex | _regex_core | Extension by Matthew Barnett, forked from SRE. Prominent enough to get a mention in the documentation and in principle approval for eventual stdlib inclusion. Note the name collision with the second API. Backward-incompatible behaviour fixes behind a version switch selectable in regex syntax and in API. |

2. First API

Python’s original regular expression support could use either UNIXv8 regular expressions or Henry Spencer’s reïmplementation (a version of which was included).

This support was present in Python 0.9.1, which incorporated only minor changes from Python 0.9.0, the first public source release. Python 0.9.1 was posted to the Usenet group alt.sources and consequently preserved in archives (a tarball conversion was previously offered on python.org and still is on legacy.python.org).

The same cannot be said of the vast majority of Python 0.9.x release packages, which were distributed only for limited periods on CWI’s FTP site. The HISTORY file does record changes between these versions, taken from previous versions of the NEWS file, but not in great detail: the NEWS entries of the 0.9.x era are the equivalent of the What’s new in documents of 2.x and 3.x, not the detailed changelogs that NEWS entries have since become.

The cpython-fullhistory repository does however provide an archive of old VCS history, tagged back to 0.9.8 (the history itself goes back to 1990, before 0.9.0), but it doesn’t seem to include all the files that made it into the releases (in particular, regexpmodule.c is nowhere to be seen). Additionally, the directory structure in these old commits seems to be influenced by later file moves and renames rather than preserving the original directory structure evident in the release package. I have come to suspect that files which ceased to exist before a certain revision are simply not preserved. Although that repository is now Mercurial, it was previously Subversion and before that CVS, and I don’t know whether even CVS was the first system used, so it probably cannot be assumed to be as useful or dependable as, e.g., a modern Git history.

2.1. The regexp module (original version)

Guido’s Python 0.9.1 library reference (provided as LaTeX in the alt.sources release) gives the following documentation:

3.4 Built-in Module regexp

This module provides a regular expression matching operation. It is always available. The module defines a function and an exception:

compile(pattern) Compile a regular expression given as a string into a regular expression object. The string must be an egrep-style regular expression; this means that the characters '(' ')' '*' '+' '?' '|' '^' '$' are special. (It is implemented using Henry Spencer’s regular expression matching functions.)

regexp.error Exception raised when a string passed to compile() is not a valid regular expression (e.g., unmatched parentheses) or when some other error occurs during compilation or matching ("no match found" is not an error).

Compiled regular expression objects support a single method:

exec(str) Find the first occurrence of the compiled regular expression in the string str. The return value is a tuple of pairs specifying where a match was found and where matches were found for subpatterns specified with '(' and ')' in the pattern. If no match is found, an empty tuple is returned; otherwise the first item of the tuple is a pair of slice indices into the search string giving the match found. If there were any subpatterns in the pattern, the returned tuple has an additional item for each subpattern, giving the slice indices into the search string where that subpattern was found.

Licence for the above documentation:


Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.

       All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
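
As a rough illustration of the return convention documented for exec in section 2.1, the following sketch approximates it on top of the modern re module. The helper name exec_like is hypothetical, and the handling of groups that did not take part in the match (modern re reports (-1, -1) for them) is an assumption rather than something the 0.9.1 documentation specifies.

```python
import re


def exec_like(pattern, text):
    """Approximate the 0.9.1 regexp exec() return value using modern re.

    Returns an empty tuple when nothing matches; otherwise a tuple of
    (start, end) slice-index pairs: the whole match first, then one pair
    per parenthesised subpattern.
    """
    m = re.search(pattern, text)
    if m is None:
        return ()
    # Group 0 is the whole match; groups 1..n are the subpatterns.
    return tuple(m.span(i) for i in range(m.re.groups + 1))


pairs = exec_like(r"(a+)(b+)", "xxaabbb")
whole_start, whole_end = pairs[0]
matched_text = "xxaabbb"[whole_start:whole_end]
```

Because the pairs are slice indices, each one can be applied directly to the searched string, as the last line does for the whole match.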

2.2. The regexp module (versions from Python 0.9.2+)

The above API documentation is correct for Python 0.9.1. Following that version, the API was changed:

Determining when these changes were made is complicated by regexpmodule.c not being tracked for reasons speculated above. The HISTORY file is silent on this, understandable for the second but somewhat frustrating for the first. By checking changes to API usage in other modules which are tracked, namely the grep module, it is however apparent that the .exec method had been renamed or at least aliased to .match by the time Python 0.9.2 was released.

In Python 0.9.5, the (1st)regex module was introduced and the regexp module was reimplemented in Python as a wrapper. By this point, the match global function had been added, and the method was available only as .match.

The built-in exec function became a keyword in Python 1.0 (this decision was reversed in Python 3.0). In response, os.exec became os.execv. One might be tempted to speculate that the renaming of the regular expression .exec method in Python 0.9.2 was motivated by plans to do this, despite being somewhat far in advance. However, considering that os.exec was introduced under that name in Python 0.9.2, this is almost certainly not the case. Nonetheless, the renaming may have been influenced by the introduction of os.exec.

The regexp module was removed in Python 1.5.

3. Second API

Python 0.9.5, which per the HISTORY file was released on 2 Jan 1992 as a Macintosh application only, introduced the regex module. It appears to have used GNU regex.c, from the subdirectory regex, per contemporary source comments in VCS. In Python 0.9.6 (evident from both VCS and the HISTORY file), Guido switched to a non-copylefted libre reïmplementation of GNU regex which Tatu Ylönen had written and posted to comp.sources.misc, thus avoiding bringing Python under the GPL.

3.1. The first regex module

Official documentation for the first regex module: Python 1.5.1, 1.5.2, 1.6 (undocumented in Python 2.0+)

Not to be confused with the modern, identically named (2nd)regex module planned for future inclusion in the standard library.

This was the main module, the C module, and the direct successor of the regexp module. However, there are many important differences from the first API:

The following additions were made to the API in Python 0.9.9:

The symcomp function was added in Python 1.0. This worked as compile, but would parse a <name> string at the start of a parenthetical group as a group name, which could be passed to the group method. The Ylönen backend was modified to export the syntax code so it was freely visible to symcomp (but still not visible without side effects to native Python code until get_syntax was added much later). Supporting this, three new regular expression object properties were added:

Having been emitting DeprecationWarning since at least Python 2.1, the first regex module was at length removed in Python 2.5.
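
For comparison, the third API expresses the idea behind symcomp with the (?P<name>...) group syntax of the modern re module. The sketch below shows only that modern counterpart; it does not reproduce the Emacs-style syntax or the behaviour of the removed (1st)regex module, and the pattern and sample text are made up for illustration.

```python
import re

# In modern re, (?P<name>...) names a group; this plays the role that the
# <name> prefix played for symcomp in the first regex module.
ident = re.compile(r"(?P<id>[a-z][a-z0-9]*)")

m = ident.search("42 foo7 bar")
name = m.group("id")              # look the group up by its name
index = ident.groupindex["id"]    # mapping from group name to group number
```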

3.2. The regex_syntax module

The regex_syntax module was never separately documented; the documentation advised reading its source code.

Original version as added in Python 0.9.5, recovered from VCS archives:

# These bits are passed to regex.set_syntax() to choose among
# alternative regexp syntaxes.

# 1 means plain parentheses serve as grouping, and backslash
#   parentheses are needed for literal searching.
# 0 means backslash-parentheses are grouping, and plain parentheses
#   are for literal searching.
RE_NO_BK_PARENS = 1

# 1 means plain | serves as the "or"-operator, and \| is a literal.
# 0 means \| serves as the "or"-operator, and | is a literal.
RE_NO_BK_VBAR = 2

# 0 means plain + or ? serves as an operator, and \+, \? are literals.
# 1 means \+, \? are operators and plain +, ? are literals.
RE_BK_PLUS_QM = 4

# 1 means | binds tighter than ^ or $.
# 0 means the contrary.
RE_TIGHT_VBAR = 8

# 1 means treat \n as an _OR operator
# 0 means treat it as a normal character
RE_NEWLINE_OR = 16

# 0 means that a special characters (such as *, ^, and $) always have
#   their special meaning regardless of the surrounding context.
# 1 means that special characters may act as normal characters in some
#   contexts.  Specifically, this applies to:
#    ^ - only special at the beginning, or after ( or |
#    $ - only special at the end, or before ) or |
#    *, +, ? - only special when not after the beginning, (, or |
RE_CONTEXT_INDEP_OPS = 32

# Now define combinations of bits for the standard possibilities.
RE_SYNTAX_AWK = (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS)
RE_SYNTAX_EGREP = (RE_SYNTAX_AWK | RE_NEWLINE_OR)
RE_SYNTAX_GREP = (RE_BK_PLUS_QM | RE_NEWLINE_OR)
RE_SYNTAX_EMACS = 0

# (Python's obsolete "regexp" module used a syntax similar to awk.)

Licence for the above:


Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.

       All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
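
The regex_syntax constants are plain bit flags, so a combined syntax value can be taken apart again with bitwise tests. The sketch below repeats the constant values from the listing above; the describe_syntax helper and the FLAG_NAMES table are hypothetical names introduced only for this illustration and were not part of the module.

```python
# Constant values copied from the regex_syntax listing above.
RE_NO_BK_PARENS = 1
RE_NO_BK_VBAR = 2
RE_BK_PLUS_QM = 4
RE_TIGHT_VBAR = 8
RE_NEWLINE_OR = 16
RE_CONTEXT_INDEP_OPS = 32

RE_SYNTAX_AWK = RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS
RE_SYNTAX_EGREP = RE_SYNTAX_AWK | RE_NEWLINE_OR
RE_SYNTAX_GREP = RE_BK_PLUS_QM | RE_NEWLINE_OR
RE_SYNTAX_EMACS = 0

# Hypothetical lookup table mapping each bit to its name.
FLAG_NAMES = {
    RE_NO_BK_PARENS: "RE_NO_BK_PARENS",
    RE_NO_BK_VBAR: "RE_NO_BK_VBAR",
    RE_BK_PLUS_QM: "RE_BK_PLUS_QM",
    RE_TIGHT_VBAR: "RE_TIGHT_VBAR",
    RE_NEWLINE_OR: "RE_NEWLINE_OR",
    RE_CONTEXT_INDEP_OPS: "RE_CONTEXT_INDEP_OPS",
}


def describe_syntax(value):
    """Return the names of the flag bits set in a syntax value, lowest bit first."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if value & bit]


egrep_flags = describe_syntax(RE_SYNTAX_EGREP)
emacs_flags = describe_syntax(RE_SYNTAX_EMACS)
```

Since RE_SYNTAX_EMACS is zero, it sets none of the bits, which is consistent with the Emacs-style format being the default.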

3.3. The regsub module

Official documentation for the regsub module: Python 1.5.1, 1.5.2, 1.6 (undocumented in Python 2.0+)

The regsub module was added in either Python 0.9.7 or Python 0.9.8. It provided the functions described below. These may seem peripheral, but they are quite important in comparison to the third API. The main reason why they weren’t in the (1st)regex module is presumably that the latter was a C module, whereas these were written in Python.

The following functions were present initially:

These are mostly self-explanatory. The sub function replaces one instance, using backslash notation in the replacement to refer to subpatterns. The gsub function does the same for multiple instances, except that zero-length matches adjacent to a previous match are not counted (other zero-length matches are).
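
The distinction between replacing one instance and replacing all of them survives in the modern re module as the count argument of re.sub. The sketch below uses modern re as a stand-in, not the removed regsub module, and deliberately avoids zero-length matches, whose handling is not claimed to match regsub; the pattern and text are made up for illustration.

```python
import re

text = "a-b c-d"
swap = r"\2-\1"  # backslash references to the two subpatterns

# Roughly what regsub.sub did: replace only the first instance.
first_only = re.sub(r"(\w)-(\w)", swap, text, count=1)

# Roughly what regsub.gsub did: replace every instance.
every = re.sub(r"(\w)-(\w)", swap, text)
```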

The regsub.compile internal function used by the others wraps the usual one with its own multiple-pattern caching of compiled regular expression objects, which wouldn’t always have been a good idea due to the idiosyncrasies of the API. In particular, the cache did not originally keep track of the syntax flags, so it had to be cleared manually whenever they were changed. This could be achieved through the somewhat crude assignment regsub.cache = {}. It was later changed to keep track of syntax flags (keying the cache with a (pat, syntax) tuple) and to offer a clear_cache() function, but only once the regex.get_syntax function had been added, by which point it had already been superseded by the third API.

The following API extensions were made in Python 1.4:

Python 1.4beta3 added a maxsplit argument to regsub.split, matching the invocation of string.split. It also added splitx as a version which retains the delimiters. This was used by capwords. When capwords was added in 1.4beta1, a different, incompatible use of the third argument to split was added for the same purpose, but this never made it into any non-alpha version of Python and so is largely inconsequential (hence breaking compatibility with it was not a concern).
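
Both behaviours have counterparts in the modern re module, which can serve as an illustration: re.split accepts maxsplit, and wrapping the delimiter pattern in a capturing group makes it keep the delimiters in the result. The sketch below shows only these modern equivalents, not the regsub functions themselves.

```python
import re

text = "a,b,c"

# Limit the number of splits, as with the maxsplit argument.
limited = re.split(r",", text, maxsplit=1)

# Retain the delimiters, comparable in spirit to splitx: a capturing group
# around the delimiter makes re.split include each delimiter in the output.
with_delims = re.split(r"(,)", text)
```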

Having been emitting DeprecationWarning since Python 2.1, the regsub module was at length removed in Python 2.5.

4. Third API

Due to the fundamentally thread-unsafe nature of the second API, a third API was introduced in Python 1.5 in the re module. For example, the properties and methods relating to specific match/search results were moved to separate match objects, rather than being set on the compiled regular expression objects; the respective methods were changed to return these match objects rather than numbers.
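
A short sketch with the modern re module shows why separate match objects matter: each search returns its own result, so a later search on the same compiled pattern does not disturb an earlier one. The pattern and inputs are made up for illustration.

```python
import re

pair = re.compile(r"(\d+)-(\d+)")

# Each call returns an independent match object; nothing is stored on
# the compiled pattern itself.
first = pair.search("10-20")
second = pair.search("30-40")

first_low = first.group(1)    # still refers to the first search
second_low = second.group(1)
```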

The new API was also taken as an opportunity to switch to the Perl-style regular expression syntax (a significantly more powerful extension of the Posix Extended syntax), as opposed to the Emacs-style syntax previously supported. This also saw the removal of the existing syntax flags and the addition of the set of flags (multiline, etc.) used by Perl-style regular expressions. To keep the API thread-safe, these flags are set for each compiled regular expression object and passed to the calls that create such objects (such as compile), as opposed to being set globally.
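
The per-object flags can be illustrated with the modern re module: two patterns compiled from the same text with different flags coexist without affecting each other. The pattern and sample text are made up for illustration.

```python
import re

text = "one\ntwo"

# Same pattern text, different flags, no global state involved.
single = re.compile(r"^\w+$")
multi = re.compile(r"^\w+$", re.MULTILINE)

single_hits = single.findall(text)  # ^ and $ anchor to the whole string
multi_hits = multi.findall(text)    # ^ and $ anchor to each line
```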

The other thing which was introduced here was the separation of an implementation-specific low-level backend module written in C from a high-level API written in Python, rather than the entire main module being written in C. This allowed the functionality of the regsub module to be integrated into the re module, rather than being a separate Python module.

4.1. The re module

Official documentation for the re module: Python 1.5.1, 1.5.2, 1.6, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.current

TODO mention API changes from the Second API, et cetera.

4.2. The re1 implementation of the third API

TODO

4.2.1. The reop backend module

TODO

4.3. The pre implementation of the third API

TODO

4.3.1. The first pcre (backend) module

TODO

4.4. The sre implementation of the third API

SecretLabs SRE was written by Fredrik Lundh for Python as a Unicode-supporting implementation of the third API. Because the recursive version of SRE could run into stack limits, and because SRE might behave differently to PCRE, the older implementation was retained as pre and the SRE implementation was introduced as sre. The re module import-starred from one of them; it was configured to use sre in the standard source tree, although sites and vendors were expected to edit this where necessary.

4.4.1. The _sre module

TODO

4.4.2. The sre_constants module

TODO

4.4.3. The sre_compile module

TODO

4.4.4. The sre_parse module

TODO

4.5. The second regex module

4.6. Others

Wahlig’s (2nd)pcre uses the third API, except that its substitution syntax matches str.format (enable_re_template_mode() can be used to make the substitution functions compliant with the third API) and that it lacks the scanner APIs and the DEBUG and LOCALE flags.

General footnotes

1, 2 Colliding names used by multiple non-interchangeable modules (regex and pcre) are preceded by a disambiguating numerical superscript, rendered here as (1st) and (2nd).

† Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar. It incorporates many but not all distinctively 2.x features and is accordingly not a true Python 1.x release; hence, What’s new in Python 2.0 compares Python 2.0 with Python 1.5.2.