Python RegEx through the ages, Appendix A: Low level support modules of Third API implementations

← Return to main document

A.1. The reop module

Notable for being the first C support module for a standard library regular expression module written in Python (as opposed to the main module itself being written in C) and for pretty much nothing else, having only lasted one patchlevel of one minor version (1.5.0), and having been obsolete even by the time that one was released.

Not at all a regular expression module in its own right, its limited functionality divides broadly into two categories:

While the name reop might refer to the re opcodes handled by reop.match and reop.search, its op suffix is also evocative of the strop (string optimisations) module, which was used to provide faster C implementations of some of the functions in the string module before the functions in question were made methods on the str and unicode types in Python 2.x (and removed from the string module altogether in Python 3.0). Unlike strop though, where there is a pure-Python fallback for all its content, there is no pure-Python version of reop.match or reop.search (although there is for expand_escape and _expand).

The module itself exports:

A.1.1. Functions of the reop module

These functions are not officially documented. Below is my writeup.

reop.match (buffer:str, num_regs:int, flags:int, can_be_null:int, fastmap:str, anchor:int, string:str, pos:int) => regs:tuple
reop.search (buffer:str, num_regs:int, flags:int, can_be_null:int, fastmap:str, anchor:int, string:str, pos:int) => regs:tuple

Expand/Hide Spoiler

Functions for executing the pre-compiled bytecode of a regular expression against a string. The distinction between match (at the start of the string / at the given offset) and search (anywhere in the string / anywhere beginning at the given offset) was as usual.

The buffer argument contains the bytecode for the regular expression, as opposed to the regular expression itself. See opcode documentation below.

num_regs refers to the number of registers (i.e. the regular expression as a whole and all capturing parenthetical groups).

flags has the same meaning as in re.compile, although only IGNORECASE is consulted at this stage (the rest having been consulted during bytecode compilation), and then only as an injunction to casefold the string (using the reop.casefold string) before searching it for the pattern (the casefolding of the pattern itself having been done during bytecode compilation).

The next three of the arguments appear to be pre-computed information useful for quick checks to exclude obvious non-matches from time-wasting scrutiny.

can_be_null is an integer of zero if the regular expression cannot match an empty string, one if it can, and two if a match can begin with a line feed but not be totally empty.

fastmap is a 256-byte buffer intended to be indexed with character bytes; a zero-byte at a given index denotes a character which cannot be the first character of a match and a one-byte denotes a character which can.

anchor is an integer of one if the regular expression can only match the start of a line, two if it can only match the start of the string, and zero otherwise.

The last two arguments are the usual arguments to the match and search methods: string is the string which should be matched against (or scanned for) the regular expression, while pos is the position in the string to match at or begin searching from.

The return value is the equivalent to the regs attribute of the match object, or None in the event of no match.

reop.expand_escape (pattern:str, index:int, context:int=NORMAL) => (escape_type:int, value:str, index:int)
reop._expand (match:MatchObject, repl:str) => result:str

Expand/Hide Spoiler

C implementations of two internal functions implemented in pure Python in the re1 module itself, but overwritten with the (presumably faster) C ones at the end of the file, with a commented injuction to comment out those lines if bugs in the C versions are suspected. They both had functionality related to parsing or processing escape sequences.

_expand was used by the sub and subn functions/methods, it took a match object and a replacement string (with group references) and returned the replacement string with the group references substituted in, using the replacement mode of expand_escape.

expand_escape was used to process escape sequences when tokenising the pattern, including some of the escape sequences used in Python normally (such as \t or \x) and also the special ones used in the Perl regular expression syntax (\b, \w…). Unrecognised escapes were treated like in JS (\HH), not like in Python (\H\\H). Its return values (in a tuple) were a token type (one of the constants listed below), a value which could depend on the token type, and the new index from which to resume tokenisation.

The value can be an actual character for CHAR, a syntax_table flag specification for SYNTAX or NOT_SYNTAX, a number or string identifying a group for MEMORY_REFERENCE, a list of characters for SET, an empty string for several of the others.

The optional third argument of expand_escape can be any of the constants NORMAL, CHARCLASS or REPLACEMENT:

  • Normal mode is used by the compile function and interprets most of the escapes, and uses SYNTAX and NOT_SYNTAX for the character classes.
  • Charclass mode is used by the compile function while inside a hard-bracketed set. It uses SET for the character-class related escapes (so they can be added to the set), and treats the other regular-expression-specific escapes the same as unknown escapes.
  • Replacement mode is used by the _expand function, and permits numerical group references and normal Python escapes, as well as the otherwise unused \g<name> references (treated as an unknown escape by the other modes), and treats the rest as unknown escapes.

A.1.2. Integer constants of the reop module

The following integer constants are exported in the module.

Flags used by syntax_table:

Name from Python Name from C Value
word Sword 1
whitespace Swhitespace 2
digit Sdigit 4
(not exported) Soctaldigit 8
(not exported) Shexdigit 16

Other constants used by expand_escape included contexts (modes) and token types. As a sidenote, the definitions of these are also included in the re1 module itself but commented out.

Contexts:

Mode constantValue
NORMAL 0
CHARCLASS 1
REPLACEMENT 2

Tokens:

Token type constantValue
CHAR 0
MEMORY_REFERENCE 1
SYNTAX 2
NOT_SYNTAX 3
SET 4
WORD_BOUNDARY 5
NOT_WORD_BOUNDARY 6
BEGINNING_OF_BUFFER7
END_OF_BUFFER 8

A.1.3. Byte-code opcodes used by the Ylönen regular expression engine

The following opcodes are used by the bytecode interpreter of the Ylönen regular expression engine, as accessible through the reop C support module in Python 1.5.0 only. (While they had also been an implementation detail of the regex module since Python 0.9.6, and technically remained so (albeit deprecated) as late as Python 2.4, they were not visible or accessible from Python code.)

Each opcode takes up one byte and is followed by zero or more bytes; the total sequence length is fixed for each opcode. The details of the set and exact opcodes are interesting to note insofar they assume a single-byte character, i.e. the Ylönen engine is not equipped to meaningfully handle Unicode in any form.

While exact could be supplemented with 2-byte and 4-byte equivalents, set would need to be 8 KiB (plus opcode) long to properly handle even UCS/2 (136 KiB for all of Unicode), and Unicode support might therefore be best off using a completely different paradigm. As it happened, it was deprecated before Unicode support was added to Python, so this never truly became a concern.

A more prurient concern at the time would no doubt have been the lack of any ability to unconsume characters except on failure, which would have made implementation of look-ahead assertions difficult if not impossible, to say nothing of look-behind.

CodeNameBytes including opcodeDescription
0 end1Sentinel marking the end of the pattern
1 bol1Zero-length matches the beginning of a line
2 eol1Zero-length matches the end of a line
3 set33256-element bitvector indicating matching (single byte) characters, byte-indexed by five most significant bits and then bit-indexed (LSb first) by the remaining three.
4 exact2Fixed (single byte) character match
5 anychar1Match any character except the line feed
6 start_memory2Begin an enumerated group (with the given 8-bit number)
7 end_memory2End an enumerated group
8 match_memory2Reference an enumerated group
9 jump3Jump to a given bytecode offset (as a little-endian signed 16-bit integer, relative to the end of the jump instruction).
10 star_jump3Used in the implementations of * and +. Optimised to update_failure_jump or repeat1 when it is deemed safe to do so, otherwise the same as jump.
11 failure_jump3Try matching the rest of the pattern and, failing that, roll back to here and jump to the point indicated.
12 update_failure_jump3Update topmost failure point and jump (i.e. the target should be a failure_jump). Used in loops only if (a) the loop doesn’t itself contain any jumps, and (b) the pattern immediately following the loop cannot match the same string as the pattern within the loop (so partial rollback of the loop will not happen).
13 dummy_failure_jump3Jump unconditionally, but push a dummy failure point (used in the implementation of +, so that star_jump can be optimised to update_failure_jump without overwriting a wrong / non-existent failure point the first time around).
14 begbuf1Zero-length matches the start of the string (not merely line)
15 endbuf1Zero-length matches the end of the string (not merely line)
16 wordbeg1Zero-length matches the start of a word
17 wordend1Zero-length matches the end of a word
18 wordbound1Zero-length matches the start or end of a word
19 notwordbound1Not the start or end of a word
20*syntaxspec2Matching certain syntax-table flags (i.e. word, whitespace, digit, octal, hex)
21*notsyntaxspec2Not matching certain syntax-table flags
22 repeat13Added as part of Jeffrey Ollie’s revisions to the Ylönen engine for the 1.5 release; like update_failure_jump, but the loop must contain only one instruction and match a single character per iteration.

 * When the (pre-Ollie) Ylönen engine is compiled with Emacs integration, opcode 20 is used for emacs_at_dot (used for the \= escape), while syntaxspec and notsyntaxspec are opcodes 21 and 22 respectively. This has never been the case in Python.

A.2. The first pcre module

A remarkably simple regular expression module in its own right, the first pcre module holds the distinction of being the first low level support module to a standard Python regular expression module to make it into a release before being deprecated, the first one to survive multiple patchlevels of a minor version without being deprecated and/or removed, and the first one to survive multiple minor and indeed major Python versions without being removed (deprecation notwithstanding).

Unlike its immediate predecessor reop, the first pcre module was in principle usable by itself, albeit lacking documentation due to being, really, an implementation detail. Incompatible with yet eerily reminiscent of the the original regexp module in its interface and exported functionality, it exported:

A.2.1. Functions of the first pcre (C support) module

The functions were as follows:

pcre.pcre_expand (match:MatchObject, repl:str) => result:str

Used by the sub and subn functions/methods, this did exactly the same thing as the older reop._expand() which it replaced, i.e. it took a match object and a replacement string (with group references) and returned the replacement string with the group-references substituted in.

pcre.pcre_compile (pattern:str, flags:int, groupindex:dict) => code:Pcre

Returned a compiled low-level, regular expression object, exporting only a match method as documented below. Mappings of names to numbers for named groups would be written to groupindex, which should initially be an empty dict.

method Pcre.match (string:str, pos:int, endpos:int, options:int) => regs:tuple

Searched the string for a given pattern; the options argument would be set to the ANCHORED constant by re.match (i.e. matching only at the start of the range) and to 0 by re.search. Returns a regs tuple of the usual format, or None if there is no match.

A.3. Support modules of the sre implementation of re

A.3.1. The _sre C support module

TODO

A.3.2. The sre_constants support module

TODO

A.3.3. The sre_compile support module

TODO

A.3.4. The sre_parse support module

TODO

A.4. The _regex C support module

TODO