.. -=- Python RegEx through the ages -=-

= Python RegEx through the ages =

.. html-a:: :id: intro :name: intro

== 1. Introduction ==

Python has gone through three regular expression modules during its history (`regexp`, `regex` and `re`) and a fourth has been greenlighted (also called `regex`, confusingly). Furthermore, more than one of these was reïmplemented at least once during its existence. As they were superseded, disappeared from the documentation and were finally removed from the distribution, I am very much aware that the interface and functionality of the older modules may well be an important thing to have a reference for when trying to understand legacy Python code. The module name collision only adds to this, so I thought it would be sensible to write this up.

Further to the goal of allowing understanding and porting of legacy Python code, and in case any unwisely written code depends on particular details of an individual implementation, I also attempt to "document the undocumented" in terms of implementation details, low-level interfaces, numerical values of constants _et cetera_ which did not qualify for coverage in the standard library manuals.

It is also of interest from an academic perspective, as each new module and each revision of an existing module contributed to the current interface as it stands. Outside of Python, the history of Python regular expressions is also fundamentally tied to the history of named groups in regular expressions, so this may be of passing interest for that topic.

This document is currently a work in progress, though [an overview table](pythonregex40.html) is fairly complete. It also currently covers CPython only; coverage of Jython, PyPy or IronPython is a more distant goal, though the basic API should be compatible between them.

While limited detail on this is discernible from Python's HISTORY file, this is not adequate to establish the API changes between the modules themselves.
Other sources include old source distributions, historical VCS (which is currently incomplete for the earliest releases), older versions of the documentation, _et cetera_. In any case, I felt it sensible to write it up in one place. So here it is: the history of Python's regular expressions.

.. html-a:: :id: firstapi :name: firstapi

.. html-a:: :id: regexp :name: regexp

== 2. First API ==

Python's original regular expression support accepted Posix Extended syntax, and could use either UNIXv8 regular expressions or Henry Spencer's reïmplementation (a version of which was included). This was present in Python 0.9.1, which incorporated only minor changes from Python 0.9.0, the first public source release.

Python 0.9.1 was posted on Usenet alt.sources and consequently preserved in archives (a tarball conversion was previously offered on python.org and still is [on legacy.python.org](https://legacy.python.org/download/releases/early/Python-0.9.1.tar.gz)). The same cannot be said of the vast majority of Python 0.9.x release packages, which were distributed for limited periods only on CWI's FTP site. While the HISTORY file does detail changes between these versions, taken from previous versions of the NEWS file, it does not do so in especially great detail (the NEWS file entries of 0.9.x are the equivalent of the "What's new in" documents of 2.x and 3.x, not the detailed changelogs which they now are).

The [cpython-fullhistory](https://hg.python.org/cpython-fullhistory/graph) repository does however provide an archive of old VCS, tagged back to 0.9.8 (it actually goes back to 1990, before 0.9.0), but doesn't seem to include all the files that made it into the release (in particular, `regexpmodule.c` is nowhere to be seen). Additionally, its directory structure in these old commits seems to be influenced by later file moves/renames rather than preserving the original directory structure evident in the release package.
I have come to suspect that files that ceased to exist before a certain revision simply are not preserved. Although that repository is now Mercurial, before that it was Subversion, before that it was CVS and I don't know if even that was the first, so it probably cannot be assumed to be as useful/dependable as e.g. modern Git.

.. html-a:: :id: regexpexec :name: regexpexec

=== 2.1. The `regexp` module (original version) ===

Guido's Python 0.9.1 library reference (provided as LaTeX in the alt.sources release) gives the following documentation:

> === 3.4 Built-in Module `regexp` ===
>
> This module provides a regular expression matching operation. It is always available. The module defines a function and an exception:
>
> '''compile(pattern)''' Compile a regular expression given as a string into a regular expression object. The string must be an egrep-style regular expression; this means that the characters '(' ')' '\*' '+' '?' '|' '^' '$' are special. (It is implemented using Henry Spencer’s regular expression matching functions.)
>
> '''regexp.error''' Exception raised when a string passed to compile() is not a valid regular expression (e.g., unmatched parentheses) or when some other error occurs during compilation or matching ("no match found" is not an error).
>
> Compiled regular expression objects support a single method:
>
> '''exec(str)''' Find the first occurrence of the compiled regular expression in the string str. The return value is a tuple of pairs specifying where a match was found and where matches were found for subpatterns specified with '(' and ')' in the pattern. If no match is found, an empty tuple is returned; otherwise the first item of the tuple is a pair of slice indices into the search string giving the match found. If there were any subpatterns in the pattern, the returned tuple has an additional item for each subpattern, giving the slice indices into the search string where that subpattern was found.

Licence for the above documentation:

>! Copyright 1991 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands.
>!
>!        All Rights Reserved
>!
>! Permission to use, copy, modify, and distribute this software and its
>! documentation for any purpose and without fee is hereby granted,
>! provided that the above copyright notice appear in all copies and that
>! both that copyright notice and this permission notice appear in
>! supporting documentation, and that the names of Stichting Mathematisch
>! Centrum or CWI not be used in advertising or publicity pertaining to
>! distribution of the software without specific, written prior permission.
>!
>! STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
>! THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
>! FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
>! FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
>! WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
>! ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
>! OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

.. html-a:: :id: regexpmatch :name: regexpmatch

=== 2.2. The `regexp` module (versions from Python 0.9.2+) ===

The above API documentation is correct for Python 0.9.1. Following that version, the API was changed:

- The `exec` method was renamed to `match`.
- An additional '''match(pat, str)''' global function was added. This would compile the pattern and then call its `match` method, although at least some versions would reüse a compiled pattern when the same pattern was supplied consecutively.

Determining when these changes were made is complicated by `regexpmodule.c` not being tracked, for the reasons speculated above. The HISTORY file is silent on this, which is understandable for the second change but somewhat frustrating for the first.
By checking changes to API usage in other modules which *are* tracked, namely the `grep` module, it is however apparent that the `.exec` method had been renamed or at least aliased to `.match` by the time Python 0.9.2 was released.

.. the revision which Mercurial calls a67ab5d58146, in which the `grep` module was changed to use `.match` not `.exec`

In Python 0.9.5, the first `regex` module was introduced and the `regexp` module was reimplemented in Python as a wrapper. By this point, the `match` global function had been added, and the method was available only as `.match`.

The built-in `exec` function became a keyword in Python 1.0 (this decision was reversed in Python 3.0). In response, `os.exec` became `os.execv`. One might be tempted to speculate that the renaming of the regular expression `.exec` method in Python 0.9.2 may have been motivated by plans to do this, despite being somewhat far in advance. However, considering that `os.exec` was introduced under that name in Python 0.9.2, this is almost certainly not the case. Nonetheless, the renaming may have been influenced by the introduction of `os.exec`.

The `regexp` module was removed in Python 1.5.

.. html-a:: :id: secondapi :name: secandapi

== 3. Second API ==

Python 0.9.5, which introduced the `regex` module, was "released as Macintosh application only, 2 Jan 1992" per the HISTORY file, and appears to have used "GNU regex.c, from subdirectory regex", per contemporary source comments in VCS. In Python 0.9.6 (evident from both VCS and the HISTORY file), Guido switched to a non-copylefted libre reïmplementation of GNU regex which Tatu Ylönen had written and posted to comp.sources.misc, thus avoiding bringing Python under the GPL.

.. html-a:: :id: firstregex :name: firstregex

=== 3.1. The first `regex` module ===

''Official documentation for the first `regex` module: Python [1.5.1](https://docs.python.org/release/1.5.1/lib/module-regex.html), [1.5.2](https://docs.python.org/release/1.5.2/lib/module-regex.html), [1.6](https://docs.python.org/release/1.6/lib/module-regex.html)(^[†](#obelisk)^) (undocumented in Python 2.0+)''

''Not to be confused with the modern, identically named (second) `regex` module planned for future inclusion in the standard library.''

This was the main module, the C module, and the direct successor of the `regexp` module. However, there were many important differences from the comparatively simple first API:

- Defaulting to Emacs-style regular expression syntax, not to Posix Extended. In particular, escaped `\(`, `\|`, `\)` were used for groups and unescaped `(|)` were literal. The `\s` syntax table escapes (e.g. `\s<`) and the `\=` escape were not supported, since the engine was not actually attached to Emacs.
- A new global `set_syntax` function to set the syntax-flags integer. The `regex_syntax` module supplied these flags and precomposed such integers for the styles of awk, grep _et cetera_. The first API was emulated using "awk" mode, not "egrep" mode, because the latter mode treated newline as an "or" operator. Calling `set_syntax` also returned the previous syntax-flags integer so that it could be restored. Needless to say, this could be extremely thread-unsafe, particularly as there was originally no way to query the syntax without changing it (a `get_syntax` function was added much later, but not before the third API had already been introduced).
- The `match` method looked only at the start (or provided offset), while the new `search` method would scan the entire string. They accepted an optional second argument (`pos`), which specified an offset to start looking from. Both were still available as global functions with single-pattern caches, but the usefulness of these is limited for the reason explained below (and they didn't accept the `pos` argument).
- `match` and `search` no longer returned tuples of tuples of slice indices. Instead, `match` returned a length (or `-1` if there was no match) and `search` returned an offset (or `-1`). The tuple of tuples of slice indices was made available as the `regs` (for "registers") attribute of the compiled regular expression object itself (…yes, really) following matching/searching, which was `None` if there was no match. Needless to say, this was not thread-safe.
- The first-API `regexp.py` file (wrapping the second-API first `regex` module) contained code for removing zero or more `(-1, -1)` tuples (i.e. groups that did not contribute) from the end of `regs` before returning. This presumably represents a further API difference.

The following additions were made to the API in Python 0.9.9:

- The `compile` function now accepted an optional second argument, called `translate`. This took a length-256 string of bytes onto which input bytes were to be mapped. The `casefold` global provided one of these for ASCII casefolding. Needless to say, this was not exactly Unicode-ready, but this was Python 0.9.9 and the early 1990s, and Unicode support was a major introduction in Python 2,(^[†](#obelisk)^) by which point the third API was already in place.
- Compiled regular expression objects provided a `group` method which returned the string matched by the group of a given number. If multiple arguments were given, a tuple was returned. Passing no arguments returned all groups.
- Additional properties (besides `regs`) were made available on the regular expression object:
  - `last`: the last string passed to match/search if a match was found. Otherwise `None`.
  - `translate`: the `translate` argument with which the compiled regular expression object was created. Otherwise `None`.
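To make the shape of this interface concrete, here is a rough emulation written against the modern `re` module. The class name `SecondApiPattern` is hypothetical, the pattern uses Perl-style rather than Emacs-style syntax, and only `match`, `search`, `regs` and `last` are covered as described above. It is a sketch of the calling convention under those assumptions, not a reconstruction of the original C module.

```python
import re


class SecondApiPattern:
    """Hypothetical stand-in for a second-API compiled pattern object."""

    def __init__(self, pattern):
        # Assumption: modern Perl-style syntax, not the old Emacs-style default.
        self._prog = re.compile(pattern)
        self.regs = None  # slice-index pairs from the most recent successful call
        self.last = None  # subject string from the most recent successful call

    def _record(self, m, string):
        # Match results are stored on the pattern object itself, which is
        # exactly why the old interface could not be thread-safe.
        if m is None:
            self.regs = None
            self.last = None
        else:
            self.regs = tuple(m.span(i) for i in range(self._prog.groups + 1))
            self.last = string

    def match(self, string, pos=0):
        """Return the length of a match anchored at `pos`, or -1."""
        m = self._prog.match(string, pos)
        self._record(m, string)
        return -1 if m is None else m.end() - m.start()

    def search(self, string, pos=0):
        """Return the offset of the first match at or after `pos`, or -1."""
        m = self._prog.search(string, pos)
        self._record(m, string)
        return -1 if m is None else m.start()


prog = SecondApiPattern(r"(\d+)-(\d+)")
offset = prog.search("ab 12-345")
regs_after_search = prog.regs
length = prog.match("12-345 x")
regs_after_match = prog.regs  # the earlier search results have been overwritten
print(offset, regs_after_search, length, regs_after_match)
```

Because the second call replaces `regs`, any other code still holding the pattern object and expecting the first result would silently read the wrong indices.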
The `symcomp` function was added in Python 1.0. This worked as `compile`, but would parse a `<name>` string at the start of a parenthetical group as a group name, which could be passed to the `group` method. The Ylönen backend was modified to export the syntax code so it was freely visible to `symcomp` (but still not visible without side effects to native Python code until `get_syntax` was added much later). Supporting this, three new regular expression object properties were added:

- `givenpat`: the regular expression pattern from which the regular expression object was compiled.
- `realpat`: the regular expression pattern, but with group names stripped if `symcomp` was used.
- `groupindex`: dict mapping group names to group indices.

Following Emacs style, `\b` for word boundaries included the start and end of the string unconditionally, and `\B` for word non-boundaries excluded the start and end of the string unconditionally. The escapes `\<` for word start boundaries and `\>` for word end boundaries, on the other hand, would more conventionally match at the start and end of the string (respectively) if next to a word character.

Similarly, the underscore was not initially included in the definition of `\w` for word characters, again following Emacs style. This seems to have been changed in Python 1.5 as one of Jeffrey Ollie's revisions, although without updating the documentation for the `regex` module (I'm unsure how far the effect on the `regex` module was intentional here, since the change accompanied the addition of further syntax table flags used only by the `re1` module; see below). The Emacs syntax `\_<` and `\_>` for "symbol" boundaries (i.e. like `\<` and `\>` but treating the underscore as part of the word/symbol) was never supported.

Having been emitting DeprecationWarning since at least Python 2.1, the first `regex` module was at length removed in Python 2.5.

.. html-a:: :id: firstregexflags :name: firstregexflags

=== 3.2. The `regex_syntax` module ===

The `regex_syntax` module was added in Python 0.9.5 and thereafter functionally changed only once, six years later (in Python 1.5, with the addition of two more flags). Subsequent minor changes post-obsolescence added a module docstring and converted tabs in the comments to spaces.

The `regex_syntax` module was never separately documented; the documentation advised reading [its source code](https://hg.python.org/cpython-fullhistory/log/v2.4/Lib/regex_syntax.py). It defined the following syntax flags:

| Flag name            | Value | Description                                                       |
|----------------------|------:|-------------------------------------------------------------------|
| RE_NO_BK_PARENS      |      1| Make plain `()` grouping and escaped `\(\)` act literal.          |
| RE_NO_BK_VBAR        |      2| Make plain `|`:code: act as an or-operator and escaped `\|`:code: act literal.|
| RE_BK_PLUS_QM        |      4| Make plain `+?` act literal and escaped `\+\?` act as operators.  |
| RE_TIGHT_VBAR        |      8| Make `|`:code: bind tighter than `^$`.                            |
| RE_NEWLINE_OR        |     16| Make line breaks in the pattern behave as or-operators.           |
| RE_CONTEXT_INDEP_OPS |     32| As (inaccurately) documented in source code comments, treat `^$*+?` as literal in contexts where they don't otherwise make sense. Actually, the other way around (makes them an error in contexts where they don't make sense, as one might deduce from the flag name). |
| RE_ANSI_HEX          |     64| Added to the module in 1.5: process `\n`, `\x00` _et cetera_. Bizarrely, this seems to also have turned on e.g. `\v10` for accessing group number 10, which actually collides with `\v` for vertical tab; turning it off would have made more sense. |
| RE_NO_GNU_EXTENSIONS |    128| Added to the module in 1.5; disable syntax for matching the start of the entire string (`\\\``:code:), end of the entire string (`\'`), word boundaries and word characters. |

The following pre-composed (by bitwise-or) flag combinations were also defined. Note that the `{count}`, `{minimum,}` and `{minimum,maximum}` syntaxes for repetitions were not supported at all, hence the lack of a flag for them (Emacs and `grep` styles backslash them, while `egrep` style does not).

| Mode name       | Flags                                                               |
|-----------------|---------------------------------------------------------------------|
| RE_SYNTAX_AWK   | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS                |
| RE_SYNTAX_EGREP | RE_NO_BK_PARENS, RE_NO_BK_VBAR, RE_CONTEXT_INDEP_OPS, RE_NEWLINE_OR |
| RE_SYNTAX_GREP  | RE_BK_PLUS_QM, RE_NEWLINE_OR                                        |
| RE_SYNTAX_EMACS | No flags set (zero)                                                 |

The `regex_syntax` module was removed in Python 2.5.

.. html-a:: :id: regsub :name: regsub

=== 3.3. The `regsub` module ===

''Official documentation for the `regsub` module: Python [1.5.1](https://docs.python.org/release/1.5.1/lib/module-regsub.html), [1.5.2](https://docs.python.org/release/1.5.2/lib/module-regsub.html), [1.6](https://docs.python.org/release/1.6/lib/module-regsub.html)(^[†](#obelisk)^) (undocumented in Python 2.0+)''

The `regsub` module was added in either Python 0.9.7 or Python 0.9.8. It may seem peripheral, but it is quite important for comparison with the third API. The main reason why its functions weren't in the first `regex` module is presumably that the latter was a C module, whereas these were written in Python.

The following functions were present initially:

- '''sub''' (''pat, repl, str'')
- '''gsub''' (''pat, repl, str'')
- '''split''' (''str, pat'')
- Internal '''compile''' (''pat'')
- Internal '''expand''' (''repl, regs, str'')

These are mostly self-explanatory. The `sub` function replaced one instance, using backslash notation in the replacement for reference to subpatterns. The `gsub` function did the same for multiple instances, but with zero-length matches adjacent to a previous match not counted (other zero-length matches were). The `split` function did not split at zero-length matches.
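The zero-length rule described for `gsub` is easiest to see by example. The following sketch applies that rule on top of the modern `re` module. The function name `gsub_old_rule` is hypothetical, the pattern and replacement use modern Perl-style syntax rather than Emacs style, and the code is an illustration of the rule as stated above, not the `regsub` source.

```python
import re


def gsub_old_rule(pat, repl, string):
    """Replace every match, skipping empty matches that touch the previous match."""
    prog = re.compile(pat)
    pieces = []
    copied_to = 0    # everything before this index has already been emitted
    prev_end = None  # end index of the previous counted match
    for m in prog.finditer(string):
        if m.start() == m.end() and m.start() == prev_end:
            continue  # zero-length match adjacent to a previous match: not counted
        pieces.append(string[copied_to:m.start()])
        pieces.append(m.expand(repl))  # resolves \1-style group references
        copied_to = m.end()
        prev_end = m.end()
    pieces.append(string[copied_to:])
    return "".join(pieces)


result_empty = gsub_old_rule(r"x*", "-", "abxd")
result_groups = gsub_old_rule(r"(\d+)", r"<\1>", "a1b22")
print(result_empty)   # -a-b-d-
print(result_groups)  # a<1>b<22>
```

In the first call the empty match immediately after the `x` is dropped, so only one `-` appears between `b` and `d`.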
The `regsub.compile` internal function used by the others wrapped the usual one with its own multiple-pattern caching of compiled regular expression objects, which wouldn't always have been a good idea due to the idiosyncrasies of the API. In particular, it did not originally keep track of the syntax flags, so the cache had to be cleared manually when they were changed. This could be achieved through the somewhat crude assignment `regsub.cache = {}`. It was later changed to keep track of syntax flags (keying the cache with a `(pat, syntax)` tuple) and offer a `clear_cache()` function, but only once the `regex.get_syntax` function had been added, by which point it had already been superseded by the third API.

The `regsub.expand` internal function, used by `sub` and `gsub`, took a replacement string containing backslash-references to groups (e.g. `\1`), along with the necessary match data (the `regs` tuple and the string which it indexes), and returned the substituted string.

The following API extensions were made in Python 1.4.

- '''capwords''' (''s''['', pat''])
- '''split''' (''str, pat''['', maxsplit''])
- '''splitx''' (''str, pat''['', maxsplit''])

Python 1.4beta3 added a `maxsplit` argument to `regsub.split`, matching the invocation of `string.split`. It also added `splitx` as a version which retains the delimiters. This was used by `capwords`. When `capwords` was added in 1.4beta1, a different, incompatible use of the third argument to `split` was added for the same purpose, but this never made it into a final release of Python and so is largely inconsequential (hence breaking compatibility with it was not a concern).

Having been emitting DeprecationWarning since Python 2.1, the `regsub` module was at length removed in Python 2.5.

.. html-a:: :id: thirdapi :name: thirdapi

.. html-a:: :id: re :name: re

== 4. Third API: the `re` module ==

Due to the fundamentally thread-unsafe nature of the second API, a third API was introduced in Python 1.5 in the `re` module, accompanied by a switch to the more powerful Perl-style regular expression syntax.

The other thing which was introduced here was the separation of an implementation-specific low-level backend module written in C from a high-level API written in Python, rather than the entire main module being written in C. This allowed the functionality of the `regsub` module to be integrated into the `re` module, rather than being a separate Python module.

.. html-a:: :id: reapi :name: reapi

=== 4.1. API changes from the first `regex` module to `re` ===

The third API introduced the concept of a "match object": contrasting with the second API's thread-unsafe practice of setting and exposing the properties and methods relating to specific match/search results on the compiled regular expression objects themselves, the `match` and `search` methods return "match objects" (not integers) upon which these are set and exposed. This allows the module to be fully thread-safe. A return value of `None` is encountered in the absence of a match; as `None` has a false boolean value while a match object has a true boolean value, this makes it very succinctly possible to simply check for a match.

The distinction between `match` (matches at the start / specified position) and `search` (matches anywhere) remains. When invoked as methods, they accept both `pos` and `endpos` arguments to define the start and end of the range to match from / search over. When invoked as globals, they accept neither argument.

While the `regs` attribute, now on the match object, remains in the module, it is not documented as part of the API, so is presumably not supposed to be accessed directly anymore (though it still sometimes is).
Equivalent functionality is provided through the `span` method, which returns the (start, end) index tuple for a given group name or number, while the `start` and `end` methods return only the one index (the indices being, as previously, `-1` if the group did not contribute). This allows names to be used interchangeably with numbers, unlike indexing `regs` directly.

The `group` method is accordingly now on the match objects but, except in very early versions (see `re1` below), it now behaves like `start`, `end` and `span` in defaulting to group 0 when no arguments are supplied (rather than returning all groups). A new `groups` method was added to return all groups (1 and up).

Finally, `symcomp` is not present, because a dedicated named group syntax is supported by `compile` (see below).

.. html-a:: :id: resubapi :name: resubapi

=== 4.2. API changes from `regsub` to `re` ===

The functionality of `regsub` is, due to the `re` module itself being written in Python, incorporated into `re`. Some differences to note:

- `capwords` is absent.
- `sub` behaves like the old `gsub` by default (`gsub` is [noted](https://web.archive.org/web/19980526014452/http://www.python.org:80/doc/howto/regex-to-re/) to have been more commonly used). While the first three arguments are the same as before, it takes a count integer as an optional fourth argument. This defaults to `0`, meaning no limit, but it can be set to a number of substitutions to make, e.g. `1` to emulate old `sub`.
- `subn` is invoked like `sub`, but returns a tuple of the result and the number of substitutions made.
- `split` will include the content of any capturing parenthetical groups in the return value, thus negating the need for `splitx`. A bug in 1.5.0 caused `maxsplit` to be ignored because the loop counter was never incremented; this was fixed in 1.5.1.
- The existing behaviour of `split` and `gsub` regarding zero-length matches was retained in their successors until Python 3.7, when they were changed to consider all zero-width matches.
- All of the above are, whilst still available as global functions, also available as methods on the compiled regular expression objects themselves.
- The flags argument explained below was added as a further optional argument to the global functions for the above in Python 3.1, a change which was backported to Python 2.7. This was consistent with `match` and `search` already accepting it.
- The global `purge` function is equivalent to latter-era `regsub`'s `clear_cache` function.
- `findall` was added in Python 1.5.2. Invoked with the same arguments as `match` and `search` (though the global one didn't acquire the optional `flags` argument until Python 2.4, which was still ahead of `sub` and friends), it returns a list of matching strings (left to right without overlaps). If the pattern contains a capturing group, matches of that group are returned instead of matches of the entire pattern. If the pattern contains multiple capturing groups, a list of tuples is returned.

.. html-a:: :id: resyntax :name: resyntax

=== 4.3. Syntax changes from the first `regex` module to `re` ===

The new API was also taken as an opportunity to switch to the Perl-style regular expression syntax (a more powerful extension of the Posix Extended syntax), as opposed to the adjustable Emacs-style syntax previously supported. This means that:

- Unescaped `(|)` are always grouping (while this was possible behaviour under the first `regex` module, it was not the default).
- Minimal (non-greedy) matching is possible, by using `+?` and `*?` as non-greedy equivalents to `+` and `*`.
- The character class escapes `\d\D` for digits (and non-digits) and `\s\S` for whitespace join the existing `\w\W` for word characters. The `\w` definition intentionally includes the underscore.
- The `\<\>` escapes (for specific word boundaries) are dropped, while `\b` (for word boundaries *per se*) is retained. Except in `re1`, the behaviour of `\b` at the start and end of the string follows the union of the former `\<` and `\>`, rather than unconditionally matching; `\B` is in any case its inverse. Again, the underscore is intentionally included within a word.
- The less problematic and more immediately visually distinct `\A` and `\Z` replace `\``:code: and `\'` for the absolute start and end of the entire string. The `^` and `$` also match only the start and end of the first and last line respectively, not the start and end of all lines, unless in multiline mode (see below), although `$` still differs from `\Z` in that `$` can match before the last of any trailing newlines (although it matches the absolute end of the string as well).
- The `{count}`, `{minimum,}` and `{minimum,maximum}` syntaxes for repetitions are supported.
- The `\v` syntax for groups with numbers greater than 9 is not supported; instead, the plain `\`+digit syntax is extended to take up to two digits, provided the first digit is not zero.
- Several extensions are available taking the otherwise-invalid `(?…)` form, including non-capturing groups and (insofar as this is supported by the underlying engine) look-ahead and look-behind assertions.

The `(?` syntactical extensions in the Perl syntax were further extended with named group support (noted in [source comments](https://hg.python.org/cpython-fullhistory/file/v1.5/Lib/re1.py#l975) to be "Python extensions"), superseding the earlier `symcomp` system, but keeping a similar syntax (with `(?P<name>this)` replacing the earlier `(<name>this)`). These extensions were later [adopted by PCRE](https://github.com/tdhock/regex-tutorial/blob/master/README.org) and by Perl, but with the inserted `P` made optional (it is still mandatory in Python). Needless to say, `symcomp` itself was not retained.
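As a brief illustration of the points above, the following sketch exercises named groups, counted repetition and minimal matching using the current `re` module (the modern successor of the implementations described here, so behaviour in the 1.5-era modules may have differed in detail). The sample strings and variable names are made up for the example.

```python
import re

# Named groups use the (?P<name>...) extension; {4} and {2} are counted repetitions.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 1997-12 or so")

whole = m.group()              # no arguments: group 0, the whole match
year = m.group("year")         # a group looked up by name
also_year = m.group(1)         # the same group looked up by number
parts = m.groups()             # all groups from 1 upwards
month_span = m.span("month")   # (start, end) slice indices of one group
names = dict(date.groupindex)  # mapping of group name to group number

# Minimal matching: +? stops at the first closing bracket, + at the last.
greedy = re.match(r"<.+>", "<a><b>").group()
minimal = re.match(r"<.+?>", "<a><b>").group()

print(whole, year, parts, month_span, names)
print(greedy, minimal)
```

Since a failed search returns `None`, a plain `if m:` test is enough to distinguish a match from a non-match before calling any of these methods.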
A `reconvert` module was added to aid in conversion of existing patterns, but was later removed in Python 2.5.

.. html-a:: :id: reflags :name: reflags

=== 4.4. Original flags of the `re` module ===

The adoption of Perl style regular expressions also saw the removal of the existing syntax flags (though they were accepted by `reconvert`) and addition of the set of flags used by Perl-style regular expressions. Being thread-safe, these flags are set for each compiled regular expression object and passed to `compile` and some other global functions such as `match`, as opposed to being set globally.

They can also be specified at the start of the pattern itself. The original such flags (supported by `pre` and noted as being "standard flags" in `sre_parse` source comments) were as follows. These constants were also made available under shorter names corresponding to the uppercase of their letter codes. Actual numerical values can and do differ between implementations, and are thus listed in the implementation details further below. Also note that this is not a complete list of the flags currently regarded as part of the API; see the documentation for the `sre` implementation below for the rest.

====== ========== ======================================================================================================
Syntax Flag       Meaning
====== ========== ======================================================================================================
`(?i)` IGNORECASE Match the pattern case-insensitively; supersedes use of the `casefold` string.
`(?L)` LOCALE     Not in `re1`; makes `\w\b` _et cetera_ and `IGNORECASE` follow the single-byte locale, not just ASCII.
`(?m)` MULTILINE  Makes `^$` match the start and end of any line, not just the respective first and last lines.
`(?s)` DOTALL     Makes `.` include the newline.
`(?x)` VERBOSE    Outside a hard-bracketed character class, whitespace and anything between `#` and newline ignored.
====== ========== ======================================================================================================

:kome: Explanatory note: the `^` matches the start of a string and, in multiline mode, the start of a line. If the string ends in a newline, `$` will match both before and after that newline, while in multiline mode it will also match the end of every line. The `\A` and `\Z` escapes (which replace the first `regex` module's `\``:code: and `\'`) match only the actual start and end of the string.

.. html-a:: :id: refirst :name: refirst

=== 4.5. The `re1` implementation of the third API ===

.. Handy link - https://hg.python.org/cpython-fullhistory/file/v1.5/Lib/re1.py

The short-lived first implementation of the `re` module was written by Jeffrey Ollie and backed by the `reop` module, a newly introduced direct interface to the pattern bytecode handling (as opposed to compilation) routines of Ylönen's engine (which Ollie had substantially refactored). It was [introduced](https://hg.python.org/cpython-fullhistory/file/v2.0/Misc/HISTORY#l4374) in 1.5.0alpha3, and was superseded shortly thereafter in 1.5.0alpha4 by a new implementation of `re` (the one later renamed to `pre`). The original implementation was retained as `re1` for the 1.5.0 release, then removed.

A cursory read through this module is very telling of the struggles to support a regular expression syntax itself not natively supported by the engine used. The bytecode compilation of the regular expression is done in pure Python, making the main module quite lengthy in comparison to its immediate replacement. Support is, perhaps understandably, still limited to what can be achieved with the same pattern-bytecode engine: in particular, it would appear that trying to use look-ahead assertions will raise an error with "zero-width positive [or negative] lookahead assertion is unsupported".
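For contrast with the `re1` limitation just described, the sketch below runs look-ahead assertions against the current `re` module, and also checks the inline-flag and anchor behaviour from section 4.4. It reflects the modern implementation only; the test strings and variable names are chosen for the example and nothing here is taken from `re1` itself.

```python
import re

# Look-ahead assertions: rejected by re1, accepted by later implementations.
followed = re.search(r"foo(?=bar)", "foobar")      # "foo" only if "bar" follows
not_followed = re.search(r"foo(?!bar)", "foobar")  # "foo" only if "bar" does not follow

# A flag given inline at the start of the pattern, and the same flag as a constant.
inline_flag = re.match(r"(?i)python", "PYTHON")
constant_flag = re.match(r"python", "PYTHON", re.IGNORECASE)

# $ may match just before a single trailing newline; \Z matches only the true end.
dollar = re.search(r"b$", "ab\n")
absolute_end = re.search(r"b\Z", "ab\n")

# Without MULTILINE, ^ and $ refer to the whole string; with it, to every line.
single_line = re.findall(r"^\w+$", "one\ntwo\n")
multi_line = re.findall(r"(?m)^\w+$", "one\ntwo\n")

print(followed.span(), not_followed)
print(inline_flag.group(), constant_flag.group())
print(dollar.span(), absolute_end)
print(single_line, multi_line)
```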

The ALL_CAPITAL constants and the `error` exception from the backend [`reop` module](pythonregex20.html#reop) were re-exported. Other constants exported included the usual flags (not yet including `LOCALE`, `UNICODE` or `ASCII`), which were assigned the following values:

| Flags      | Value |
|------------|------:|
| IGNORECASE |     1 |
| MULTILINE  |     2 |
| DOTALL     |     4 |
| VERBOSE    |     8 |

In the only version of this implementation ever officially released (i.e. in Python 1.5.0, as `re1`), the `RegexObject.split` method never increments its loop counter, so its `maxsplit` argument has no effect. While this bug was also present in the main `re` (later `pre`) module in that version, it was fixed in Python 1.5.1, by which point the `re1` module had been removed.

Uniquely among released versions (for a given value of "released") of the `re` module, however, there is no `groups` method: the `group` method (despite now being on the match objects) still behaves as in the first `regex` module, returning all groups (1 and up) when no arguments are passed.

.. html-a::
   :id: pre
   :name: pre

=== 4.6. The `pre` implementation of the third API ===

''Official documentation for the `re` module covering the `pre` implementation: Python [1.5.1](https://docs.python.org/release/1.5.1/lib/module-re.html), [1.5.2](https://docs.python.org/release/1.5.2/lib/module-re.html), [1.6](https://docs.python.org/release/1.6/lib/module-re.html)(^[†](#obelisk)^)''

In Python 1.5.0alpha4, the `re.py` which had been introduced only one alpha version ago was deprecated and moved to `re1.py` (in [Guido's words](https://hg.python.org/cpython-fullhistory/rev/f6f2511bb6ba), "just in case you need it for comparison"), being [replaced](https://hg.python.org/cpython-fullhistory/file/v2.0/Misc/HISTORY#l3722) with a new (API-compatible) `re.py` using a contemporary (late 1990s) version of Philip Hazel's PCRE (Perl Compatible RegEx) engine.
Adoption of PCRE enabled use of look-ahead and look-behind assertions, which had been unavailable in `re1` due to the limitations of the underlying engine. This accordingly became the first module to make it into a release under the `re` name.

Andrew Kuchling [credits](https://web.archive.org/web/19980422202951/http://starship.skyport.net:80/crew/amk/regex/) Neal Becker with bringing this engine to the attention of the Python String Special Interest Group (String-SIG), mentioning that it had been written for Exim but had been attracting attention due to Perl's own regex code not being readily isolatable.

This new `re.py` was substantially shorter, as it offloaded the work of compiling (not just matching) the regular expressions onto the underlying [`pcre` module](pythonregex20.html#firstpcre) (of no relation to the more recent binding module by the same name). Much of the other code was recycled from the original `re`/`re1` though, including for example the `split` method, bringing in with it the `maxsplit` bug (which was eventually fixed in Python 1.5.1).

Since the `re`/`pre` module's first proper release in Python 1.5.0, and differing from `re1`, the `group` method behaves like `start`, `end` and `span` in returning group 0 when no arguments are supplied; a new `groups` method returns all groups 1 and up (although in Python 1.5.0 itself, it would return a string in cases where a singleton tuple would be expected: this was fixed in Python 1.5.1). This new behaviour was inherited by subsequent implementations.

The flags had values as follows. The `ANCHORED` flag is used internally by the `match` method and is not intended to form part of the interface. As not all supported flags were exported in Python, but the others could theoretically be used as magic numbers, the names given to them in the C code are also listed.

Name from Python|Name from C        |Value
----------------|-------------------|----:
IGNORECASE      |PCRE_CASELESS      |1
VERBOSE         |PCRE_EXTENDED      |2
ANCHORED        |PCRE_ANCHORED      |4
MULTILINE       |PCRE_MULTILINE     |8
DOTALL          |PCRE_DOTALL        |16
(not exported)  |PCRE_DOLLAR_ENDONLY|32
(not exported)  |PCRE_EXTRA         |64
(not exported)  |PCRE_NOTBOL        |128
(not exported)  |PCRE_NOTEOL        |256
LOCALE          |PCRE_LOCALE        |512

A `groupdict` method, which returns a mapping of group names to matched strings, was added to the match objects of `pre` in Python 1.6,(^[†](#obelisk)^) as well as being supported by the then-new `sre`.

The `pre` implementation of the `re` module was never updated for more recent versions of PCRE, being instead superseded by the (API-compatible) `sre`/`re` module. It was retained as `pre` with a frozen PCRE version until it was removed altogether in Python 2.4, leaving future PCRE support open for [third-party bindings](pythonregex30.html#secondpcre).

.. html-a::
   :id: sre
   :name: sre

=== 4.7. The `sre` implementation of the third API ===

''Official documentation for the `re` module covering the `sre` implementation: Python [2.0](https://docs.python.org/release/2.0/lib/module-re.html), [2.1](https://docs.python.org/release/2.1/lib/module-re.html), [2.2](https://docs.python.org/release/2.2/lib/module-re.html), [2.3](https://docs.python.org/release/2.3/lib/module-re.html), [2.4](https://docs.python.org/release/2.4/lib/module-re.html), [2.5](https://docs.python.org/release/2.5/lib/module-re.html), [2.6](https://docs.python.org/2.6/library/re.html), [2.7](https://docs.python.org/2/library/re.html), [3.0](https://docs.python.org/3.0/library/re.html), [3.1](https://docs.python.org/3.1/library/re.html), [3.2](https://docs.python.org/3.2/library/re.html), [3.3](https://docs.python.org/3.3/library/re.html), [3.4](https://docs.python.org/3.4/library/re.html), [3.5](https://docs.python.org/3.5/library/re.html), [3.6](https://docs.python.org/3.6/library/re.html), [3.7](https://docs.python.org/3.7/library/re.html), [3.current](https://docs.python.org/3/library/re.html)''

SecretLabs SRE was written by Fredrik Lundh for Python as a Unicode-supporting implementation of the third API (introducing the `UNICODE` flag), and introduced in Python 2.0 (actually 1.6).(^[†](#obelisk)^) Because it originally used a recursive matching scheme which could potentially run into stack limits (changed in Python 2.4.0alpha1), and because of the possibility of the SecretLabs engine behaving differently to PCRE, the older implementation was initially retained as `pre`, with the SecretLabs implementation being introduced as `sre`.

Upon the introduction of `sre`, the `re` module was set up to import-star from either `sre` or `pre`; while it was configured to use `sre` in the standard source tree, this was supposed to be edited by sites/vendors where necessary. The `pre` module was removed in Python 2.4, now that `sre` had been changed to match non-recursively, leaving `re` as a mere alias to `sre`. Importing the module as `sre` was then deprecated in Python 2.5, with the actual module code being moved to `re`. The `sre` name was then removed in Python 3.0, leaving `re` as the sole name of the module.

Versions of the `sre` implementation added several flags absent from `re1` and `pre`. Of these, `UNICODE`, `ASCII` and eventually `DEBUG` made it into the documentation and may accordingly be considered later additions to the essential API, while `TEMPLATE` appears still to be an implementation detail.

As text strings were made Unicode by default in Python 3.0, `UNICODE` matching became the default. Matching in accordance with pure ASCII (i.e. despite the passing of Unicode strings) was added as the `ASCII` flag, while `UNICODE` became a no-op, retained for compatibility only. While `LOCALE` *was* retained (for use on byte strings only), its use is discouraged, and its usefulness is limited given that text strings are likely to be Unicode to begin with.
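
The Python 3 situation just described (Unicode matching by default, `ASCII` as the opt-out, `UNICODE` as a redundant no-op) can be illustrated with a short sketch against the current `re` module. The sample string and variable names are illustrative assumptions of mine, not anything from the `sre` sources.

```python
import re

# Illustrative text containing non-ASCII letters ("naïve café"),
# written with escapes so the source file encoding does not matter.
sample = "na\u00efve caf\u00e9"

# For str patterns in Python 3, \w follows Unicode by default.
unicode_words = re.findall(r"\w+", sample)

# The ASCII flag restricts \w to [a-zA-Z0-9_], splitting at the accented letters.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

# The same flag expressed inline at the start of the pattern.
inline_ascii = re.findall(r"(?a)\w+", sample)

# Passing UNICODE explicitly changes nothing for str patterns on 3.x.
explicit_unicode = re.findall(r"\w+", sample, re.UNICODE)

# Byte patterns only ever match ASCII word characters (absent LOCALE).
byte_words = re.findall(rb"\w+", sample.encode("utf-8"))
```

Note that `ascii_words` and `byte_words` break the text in the same places, which is the sense in which `ASCII` makes a text pattern behave as a byte pattern would.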

====== ========== ======== ===========================================================================
Syntax Flag       Added in Meaning
====== ========== ======== ===========================================================================
`(?u)` UNICODE    1.6      Make `\w\W\b\B` _et cetera_ and `IGNORECASE` follow Unicode (no-op in 3.x).
`(?t)` TEMPLATE   1.6      Disable backtracking (not a documented flag).
(none) DEBUG      2.1      Prints the pattern bytecode disassembly following compilation.
`(?a)` ASCII      3.0      Make `\w\W\b\B` _et cetera_ and `IGNORECASE` follow plain ASCII.
====== ========== ======== ===========================================================================

The actual numerical values both of these and of the other flags are as follows. In the original Python 1.6(^[†](#obelisk)^) [version](https://hg.python.org/cpython-fullhistory/file/v1.6/Lib/sre_constants.py#l172), `DEBUG` and `ASCII` were absent, but the numerical values were otherwise the same as in the [current version](https://github.com/python/cpython/blob/3.6/Lib/sre_constants.py#L175).

| Constant   | Value |
|------------|------:|
| TEMPLATE   |     1 |
| IGNORECASE |     2 |
| LOCALE     |     4 |
| MULTILINE  |     8 |
| DOTALL     |    16 |
| UNICODE    |    32 |
| VERBOSE    |    64 |
| DEBUG      |   128 |
| ASCII      |   256 |

Although PCRE was subsequently updated with Unicode support, it was not re-adopted by the Python Standard Library (the existing modules were maintained as legacy and then removed), and descendants of the original `sre` module have been used to this day. This means that subsequent improvements to PCRE, and subsequent additions to its syntax, did not percolate down to Python, although a [third-party binding](pythonregex30.html#secondpcre) exists for more recent PCRE versions.

The `sre` implementation in Python 2.0 (or possibly 1.6) seems to have introduced the `.expand` method of match objects, judging by its appearance in the documentation; although no mention is made of it being new, it was never provided by any revision of `re1` or `pre`.
Interestingly, this seems to be the first time such a function has been treated as part of the API rather than as an implementation detail. It functions much like [its `regsub` predecessor](#regsub), only with the addition of the `\g` syntax for named group references. However, since it is a method on the actual match object, the only argument it takes is the `repl` string to expand. The same functionality had been provided by the internal global [`_expand`](pythonregex20.html#reopfunc) in `re1` and the internal global [`pcre.pcre_expand`](pythonregex20.html#firstpcrefunc) in `pre`, both of which took the match object as the first argument.

Python 2.2 introduced `finditer`, which is similar to `findall`, differing only in that (a) it is a generator function and (b) it yields match objects as opposed to strings. Although `pre` was still being included at this point, this was not backported to `pre`.

Python 2.4 introduced the `(?(groupid)then|else)` syntax for matching conditionally on whether another group has participated in the match.

The existing `match` and `search` operations were joined in Python 3.4 by the `fullmatch` operation, which will only match the entire string, or the entire range between `pos` and `endpos`.

Python 3.6 introduced the ability to obtain groups as strings using indexing syntax, i.e. `__getitem__` on match objects behaves like `group`.

Behaviour of `sub`, `subn` and `split` with regard to zero-length matches was changed in Python 3.7 to be more logical: `split` was changed so it will split a string at a zero-length match, and `sub` so that zero-length matches adjacent to a non-empty match are also replaced. This changed a behaviour which had lasted since the original `regsub` module in Python 0.9.8.

.. html-a::
   :id: secondregex
   :name: secondregex

=== 4.8. The second `regex` module ===

''PyPI package: [regex](https://pypi.org/project/regex/)''

In 2008, Matthew Barnett submitted [a bug ticket](https://bugs.python.org/issue3825) including a major reworking of (at the time) the Python 2.5.2 `re` module (i.e. the `sre` implementation), adding atomic grouping and possessive quantifiers (i.e. variants of groups and greedy quantifiers which cannot backtrack), as well as variable-length look-behind assertions. He also mentioned that it was typically twice as fast as the standard one. This effort was joined by Jeffrey C. Jacobs (timehorse), who had already been [working on](https://bugs.python.org/issue2636) improvements to the `re` module, slated at the time for Python 2.7.

:kome: Explanatory note: a lazy quantifier (e.g. `+?` or `*?`) matches the fewest possible of something that will allow the rest of the pattern to match. A greedy quantifier (e.g. `+` or `*`) matches as many of something as possible that will allow the rest of the pattern to match. A possessive quantifier (e.g. `++` or `*+`) matches as many of something as are present, even if this causes the rest of the pattern to fail.

The changes became quite dramatic, with [one of them](https://bugs.python.org/issue2636#msg90954) radically re{\"o}rganising the code, dramatically reducing the number of support modules, changing the engine to use a node network rather than a linear bytecode sequence, as well as other improvements. In light of this, Georg Brandl [suggested](https://bugs.python.org/issue2636#msg90961) releasing it as a stand-alone package, on the basis that it would be difficult to review, expecting it to have acquired enough use for any issues to have been ironed out by the time Python 2.7 came around. Barnett promptly renamed it to `regex` so it could be installed as an extension module (although it is unrelated to the historic module by that name).
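
To make the lazy/greedy/possessive distinction from the explanatory note above concrete, here is a minimal sketch using only the standard `re` module. It assumes (as far as I am aware) that standard `re` accepts possessive quantifiers only from Python 3.11 onwards, so that part is guarded; the second `regex` module accepts the same syntax, but is not imported here since it may not be installed. The sample string and variable names are illustrative.

```python
import re

# Illustrative input: two adjacent angle-bracketed tags.
text = "<a><b>"

# Greedy: `.+` takes as much as it can while still letting `>` match,
# so the match runs to the last `>`.
greedy = re.match(r"<.+>", text).group()

# Lazy: `.+?` takes as little as it can, so the match stops at the first `>`.
lazy = re.match(r"<.+?>", text).group()

# Possessive: `.++` swallows everything up to the end of the string and refuses
# to give any of it back, so the trailing `>` can never match.
# Older versions of standard `re` reject the syntax with re.error.
try:
    possessive = re.match(r"<.++>", text)
except re.error:
    possessive = "unsupported"
```

Where the syntax is accepted, `possessive` ends up as `None`: the pattern fails outright rather than backtracking, which is exactly the "even if this causes the rest of the pattern to fail" case.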

Suffice it to say that it was not included in Python 2.7, although it does have a mention in the documentation and ["in principle approval for eventual stdlib inclusion"](https://wiki.python.org/moin/Python2orPython3#Supporting_Python_2_and_Python_3_in_a_common_code_base), pending a PEP to sort out the details.

Additionally, although intended to add support for Unicode regular expressions and succeeding in adding basic support, the original SecretLabs engine has been the subject of [criticism](https://web.archive.org/web/20140110033548/http://dheeb.files.wordpress.com/2011/07/gbu.pdf) for not providing selectors for Unicode properties (besides those which correspond to the syntaxes for ASCII regular expression properties), for using casemapping for case-insensitive matching rather than the more appropriate casefolding, and for not offering proper handling for multi-codepoint grapheme clusters (e.g. how they might affect what constitutes sequences/bounds of word characters). The second `regex` module is considered vastly improved in this respect. (The more egregious flaw mentioned there, of Python 2.7 and 3.2 treating Unicode strings as UCS-2 in several respects on Windows (as a result of PEP 261), was finally, and with much rejoicing, fixed in the very next version (Python 3.3) with the adoption of PEP 393.)

.. Note: read the enhancement log from bottom up: some of the higher stuff is enhancing on top of the lower stuff.

.. html-a::
   :id: secondregexapi
   :name: secondregexapi

==== 4.8.1. API extensions made by the second `regex` module ====

Some API extensions relative to the latest (Python 3.7) version of `re` are given below. Some of the simpler changes, such as the VERSION1 behaviour for zero-length matches and the addition of `fullmatch`, have also percolated through to recent versions of standard `re`.

- `splititer` as a generator version of `split`
- `overlapped` argument for `findall` and `finditer`
- `pos` and `endpos` available on `sub` and `subn`
- For use with matches with repeated groups, `captures`, `capturesdict`, `starts`, `ends` and `spans` work like `group`, `groupdict`, `start`, `end` and `span` respectively, but return a list of values of every instance of that group in the match, not only the last (in the case of `capturesdict`, a dict with such lists as values).
- `literal_spaces` argument to `escape`, suppresses escaping of spaces.
- `special_only` argument to `escape`, escapes only special characters.
- `detach_string` method on the match object, deletes reference to the original string (useful if it is a large string).
- `expandf`, `subf` and `subfn`, differing from the ones without an `f` in that they use `str.format` style format strings instead of backslash syntax.
- `partial` argument to `match`, `search`, `fullmatch` and `finditer`, allows truncated matches (i.e. where the pattern is still matching up, but runs into the end of the string before it has finished). Mainly useful for validating an input field whilst it is still being filled in. The match objects are given a `partial` attribute indicating if the match is partial in this sense.

.. html-a::
   :id: secondregexflags
   :name: secondregexflags

==== 4.8.2. Flag changes between `re` and the second `regex` module ====

In an effort to remain backward compatible with `re`, and to provide additional functionality, the second `regex` module introduces a few more flags:

======= ============ =======================================================================================================
Syntax  Flag         Meaning
======= ============ =======================================================================================================
`(?V0)` VERSION0     Conservatively match the behaviour of the `re` module, zero-length behaviour depends on Python version.
`(?V1)` VERSION1     Support scoped flags, set operations, default to full case-folding, new zero-length behaviour always.
`(?f)`  FULLCASE     Make `IGNORECASE` use full case-folding, implied by `VERSION1` but can still be disabled.
`(?w)`  WORD         Use Unicode definitions of word boundaries, and consider all line breaks (rather than just LF).
`(?r)`  REVERSE      Begin searching from the end of the string.
`(?p)`  POSIX        Return the leftmost longest match, as stipulated by POSIX (takes longer).
`(?e)`  ENHANCEMATCH When handling a fuzzy-match sequence, try to improve the fit of the match found.
`(?b)`  BESTMATCH    When handling a fuzzy-match sequence, exhaustively search for the least deviant match, not the first.
======= ============ =======================================================================================================

The values given to these flags and to the existing flags are as follows (note that `ASCII` and `DEBUG` have different values than in `re`, although all of the other flags are either new in the second `regex` module or match `re`):

| Constant     | Value |
|--------------|------:|
| TEMPLATE     |     1 |
| IGNORECASE   |     2 |
| LOCALE       |     4 |
| MULTILINE    |     8 |
| DOTALL       |    16 |
| UNICODE      |    32 |
| VERBOSE      |    64 |
| ASCII        |   128 |
| VERSION1     |   256 |
| DEBUG        |   512 |
| REVERSE      |  1024 |
| WORD         |  2048 |
| BESTMATCH    |  4096 |
| VERSION0     |  8192 |
| FULLCASE     | 16384 |
| ENHANCEMATCH | 32768 |
| POSIX        | 65536 |

.. html-a::
   :id: secondregexsyntax
   :name: secondregexsyntax

==== 4.8.3. Syntax changes between `re` and the second `regex` module ====

The second `regex` module introduces a number of powerful syntax innovations. TODO more here.

.. html-a::
   :id: appendixlinks
   :name: appendixlinks

== Appendices ==

- [→ Appendix A: Low level support modules of Third API implementations](pythonregex20.html) (`reop`, first `pcre`, `_sre` and `sre_*`, `_regex`)
- [→ Appendix B: Other third-party regular expression modules](pythonregex30.html) (including the second `pcre` module)
- [→ Appendix C: Summary table](pythonregex40.html)

.. html-a::
   :id: obelisk
   :name: obelisk

† Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar: it incorporates many but not all distinctively 2.x features, and is accordingly not a true Python 1.x release; hence, "What's new in Python 2.0" compares it with Python 1.5.2.