.. -=- Python RegEx through the ages, Appendix A -=NO=-=MENU=- -=- = Python RegEx through the ages, Appendix A: Low level support modules of Third API implementations = [← Return to main document](pythonregex.html#appendixlinks) .. html-a:: :id: reop :name: reop == A.1. The `reop` module == Notable for being the first C support module for a standard library regular expression module written in Python (as opposed to the main module itself being written in C) and for pretty much nothing else, having only lasted one patchlevel of one minor version (1.5.0), and having been obsolete even by the time that one was released. Not at all a regular expression module in its own right, its limited functionality divides broadly into two categories: - It provides some utilities for working with escape sequences, especially character class escape sequences. While these could be computed / implemented in Python (and, in fact, pure-Python implementations are included but overridden in `re1.py`), doing so in C is faster. - It provides an interface over the bytecode handling backend of Ylönen's engine, allowing the compiling frontend to be rewritten in Python to handle the different syntax. This, of course, does not have a pure-Python equivalent. While the name `reop` might refer to the "`re` opcodes" handled by `reop.match` and `reop.search`, its "op" suffix is also evocative of the `strop` ("`string` optimisations") module, which was used to provide faster C implementations of some of the functions in the `string` module before the functions in question were made methods on the `str` and `unicode` types in Python 2.x(^[†](#obelisk)^) (and removed from the `string` module altogether in Python 3.0). Unlike `strop` though, where there is a pure-Python fallback for all its content, there is no pure-Python version of `reop.match` or `reop.search` (although there is for `expand_escape` and `_expand`). The module itself exports: - An `error` exception. - The `casefold` string, as explained for the [first `regex` module](pythonregex.html#firstregex), used in response to the `IGNORECASE` flag (interestingly this is accessed from the `reop` module itself, never from `re1`). The comments about how Unicode-ready this isn't apply here also. - A `syntax_table` dictionary (mapping one-byte byte-strings to flag integers indicating what character classes apply, per the flag constants documented below). - The integer constants listed below. - The functions documented below. .. html-a:: :id: reopfunc :name: reopfunc === A.1.1. Functions of the `reop` module === These functions are not officially documented. Below is my writeup. .. html-dl:: .. html-dt:: reop.'''match''' (''buffer''\:str, ''num_regs''\:int, ''flags''\:int, ''can_be_null''\:int, ''fastmap''\:str, ''anchor''\:int, ''string''\:str, ''pos''\:int) ⇒ ''regs'':tuple .. html-dt:: reop.'''search''' (''buffer''\:str, ''num_regs''\:int, ''flags''\:int, ''can_be_null''\:int, ''fastmap''\:str, ''anchor''\:int, ''string''\:str, ''pos''\:int) ⇒ ''regs'':tuple .. html-dd:: Functions for executing the pre-compiled bytecode of a regular expression against a string. The distinction between `match` (at the start of the string / at the given offset) and `search` (anywhere in the string / anywhere beginning at the given offset) was as usual. The `buffer` argument contained the bytecode for the regular expression, as opposed to the regular expression itself. See opcode documentation below. `num_regs` referred to the number of registers (i.e. the regular expression as a whole and all capturing parenthetical groups). `flags` has the same meaning as in `re.compile`, although only `IGNORECASE` is consulted at this stage (the rest having been consulted during bytecode compilation), and then only as an injunction to casefold the string (using the `reop.casefold` string) before searching it for the pattern: the transformations of the pattern itself to search a casefolded string are assumed to have been done during bytecode compilation. The next three of the arguments appear to have been pre-computed information useful for "quick checks" to exclude obvious non-matches from time-wasting scrutiny: - `can_be_null` was an integer of zero if the regular expression couldn't match an empty string, one if it could, and two if a match could begin with a line feed but not be totally empty. - `fastmap` was a 256-byte buffer intended to be indexed with character bytes; a zero-byte at a given index denoted a character which couldn't be the first character of a match and a one-byte denoted a character which could. - `anchor` was an integer of one if the regular expression could only match the start of a line, two if it could only match the start of the string, and zero otherwise. The last two arguments were the usual arguments to the `match` and `search` methods: `string` was the string which should be matched against (or scanned for) the regular expression, while `pos` was the position in the string to match at or begin searching from. The return value was the equivalent to the `regs` attribute of the match object, or `None` in the event of no match. .. html-dt:: reop.'''\_expand''' (''match''\:MatchObject, ''repl''\:str) ⇒ ''result''\:str .. html-dd:: Was used by the `sub` and `subn` functions/methods. Took a match object and a replacement string (with group references) and returned the replacement string with the group references substituted in, using the replacement mode of `expand_escape`. That is to say, it served much the same function as the old [`regsub.expand`](pythonregex.html#regsub), with the addition of the `\g` syntax for named group references, but was passed the match string/ranges data in the form of a match object. It also corresponds to the `expand` method subsequently added to the match objects themselves in the `sre` implementation of `re`. An alternative implementation in pure Python was present in the source of the `re1` module itself, but overwritten with the (presumably faster) C one at the end of the file, with a commented injuction to comment out those lines if bugs in the C versions are suspected. .. html-dt:: reop.'''expand_escape''' (''pattern''\:str, ''index''\:int, ''context''\:int=`NORMAL`) ⇒ (''escape_type''\:int, ''value''\:str, ''index''\:int) .. html-dd:: Was used by the `compile` function to process escape sequences when tokenising patterns, and also used by `_expand` to process escapes in the `repl` string. Depending on mode (`context`), it processed some of the escape sequences used in Python normally (such as `\t` or `\x`), the special ones used in the Perl regular expression syntax (`\b`, `\w`…), and group references. Unrecognised escapes were treated like in JS (`\H`≡`H`), not like in Python (`\H`≡`\\H`). The optional third argument could be any of the constants `NORMAL`, `CHARCLASS` or `REPLACEMENT`: - Normal mode is used by the `compile` function and interprets most of the escapes, and uses `SYNTAX` and `NOT_SYNTAX` for the character classes. - Charclass mode is used by the `compile` function while inside a hard-bracketed set. It uses `SET` for the character-class related escapes (so they can be added to the set), and treats the other regular-expression-specific escapes the same as unknown escapes. - Replacement mode is used by the `_expand` function, and permits numerical group references (e.g. `\1`) and normal Python escapes, as well as the otherwise unused `\g` references (treated as an unknown escape by the other modes), and treats the rest as unknown escapes. Its return values (in a tuple) were a token type (one of the constants listed below), a "value" which could depend on the token type, and the new index from which to resume tokenisation. The "value" could be an actual character for `CHAR`, a `syntax_table` flag specification for `SYNTAX` or `NOT_SYNTAX`, a number or string identifying a group for `MEMORY_REFERENCE`, a list of characters for `SET`, an empty string for several of the others. Like with `_expand`, an alternative implementation in pure Python was present in the source of the `re1` module itself, but overwritten with the (presumably faster) C one at the end of the file. .. html-a:: :id: reopconst :name: reopconst === A.1.2. Integer constants of the `reop` module === The following integer constants were exported in the module. Flags used by `syntax_table`: | Name from Python | Name from C | Value | |------------------|-------------|------:| |word |Sword |1 | |whitespace |Swhitespace |2 | |digit |Sdigit |4 | |(not exported) |Soctaldigit |8 | |(not exported) |Shexdigit |16 | Other constants used by `expand_escape` included "contexts" (modes) and token types. As a sidenote, the definitions of these were also included in the `re1` module itself but commented out. Contexts: |Mode constant|Value| |-------------|----:| |NORMAL |0 | |CHARCLASS |1 | |REPLACEMENT |2 | Tokens: |Token type constant|Value| |-------------------|----:| |CHAR |0 | |MEMORY_REFERENCE |1 | |SYNTAX |2 | |NOT_SYNTAX |3 | |SET |4 | |WORD_BOUNDARY |5 | |NOT_WORD_BOUNDARY |6 | |BEGINNING_OF_BUFFER|7 | |END_OF_BUFFER |8 | .. html-a:: :id: reopcodes :name: reopcodes === A.1.3. Byte-code opcodes used by the Ylönen regular expression engine === The following opcodes were used by the bytecode interpreter of the Ylönen regular expression engine, as accessible through the `reop` C support module in Python 1.5.0 only. (While they had also been an implementation detail of the `regex` module since Python 0.9.6, and technically remained so (albeit deprecated) as late as Python 2.4, they were not visible or accessible from Python code.) Each opcode took up one byte and was followed by zero or more bytes; the total sequence length was fixed for each opcode. The details of the `set` and `exact` opcodes are interesting to note insofar they assumed a single-byte character, i.e. the Ylönen engine was not equipped to meaningfully handle Unicode in any form. While `exact` could be supplemented with 2-byte and 4-byte equivalents, `set` would need to be 8 KiB (plus opcode) long to properly handle even UCS/2 (136 KiB for all of Unicode), and Unicode support might therefore be best off using a completely different paradigm. As it happened, it was deprecated before Unicode support was added to Python, so this never truly became a concern. At the time, a more salient concern would no doubt have been the lack of any ability to unconsume characters except on failure, which would have made implementation of look-ahead assertions difficult if not impossible, to say nothing of look-behind. (This being presumably why, despite them being standard features of Perl-style regular expressions, the `re1` implementation of `re` lacked support for lookaheads; or rather, would detect attempts to use them and raise an error denoting lack of support.) |Code|Name|Bytes including opcode|Description| |----|---|---|---| |0 |end|1|Sentinel marking the end of the pattern| |1 |bol|1|Zero-length matches the beginning of a line| |2 |eol|1|Zero-length matches the end of a line| |3 |set|33|256-element bitvector indicating matching (single byte) characters, byte-indexed by five most significant bits and then bit-indexed (LSb first) by the remaining three.| |4 |exact|2|Fixed (single byte) character match| |5 |anychar|1|Match any character except the line feed| |6 |start_memory|2|Begin an enumerated group (with the given 8-bit number)| |7 |end_memory|2|End an enumerated group| |8 |match_memory|2|Reference an enumerated group| |9 |jump|3|Jump to a given bytecode offset (as a little-endian signed 16-bit integer, relative to the end of the jump instruction).| |10 |star_jump|3|Used in the implementations of `*` and `+`. Optimised to `update_failure_jump` or `repeat1` when it is deemed safe to do so, otherwise the same as `jump`.| |11 |failure_jump|3|Try matching the rest of the pattern and, failing that, roll back to here and jump to the point indicated.| |12 |update_failure_jump|3|Update topmost failure point and jump (i.e. the target should be a `failure_jump`). Used in loops only if (a) the loop doesn't itself contain any jumps, and (b\) the pattern immediately following the loop cannot match the same string as the pattern within the loop (so partial rollback of the loop will not happen).| |13 |dummy_failure_jump|3|Jump unconditionally, but push a dummy failure point (used in the implementation of `+`, so that `star_jump` can be optimised to `update_failure_jump` without overwriting a wrong / non-existent failure point the first time around).| |14 |begbuf|1|Zero-length matches the start of the string (not merely line)| |15 |endbuf|1|Zero-length matches the end of the string (not merely line)| |16 |wordbeg|1|Zero-length matches the start of a word| |17 |wordend|1|Zero-length matches the end of a word| |18 |wordbound|1|Zero-length matches the start or end of a word| |19 |notwordbound|1|Not the start or end of a word| |20\*|syntaxspec|2|Matching certain syntax-table flags (i.e. word, whitespace, digit, octal, hex)| |21\*|notsyntaxspec|2|Not matching certain syntax-table flags| |22 |repeat1|3|Added as part of Jeffrey Ollie's revisions to the Ylönen engine for the 1.5 release; like `update_failure_jump`, but the loop must contain only one instruction and match a single character per iteration.|  \* When the (pre-Ollie) Ylönen engine is compiled with Emacs integration, opcode 20 is used for `emacs_at_dot` (used for the `\=` escape), while `syntaxspec` and `notsyntaxspec` are opcodes 21 and 22 respectively. This has never been the case in Python. .. repeat1 added in rev 18a639db048c .. html-a:: :id: firstpcre :name: firstpcre == A.2. The first `pcre` module == A remarkably simple regular expression module in its own right, the first `pcre` module holds the distinction of being the first low level support module to a standard Python regular expression module to make it into a release before being deprecated, the first one to survive multiple patchlevels of a minor version without being deprecated and/or removed, and the first one to survive multiple minor and indeed major Python versions without being removed (deprecation notwithstanding). Unlike its immediate predecessor `reop`, the first `pcre` module was in principle usable by itself, albeit lacking documentation due to being, really, an implementation detail. Incompatible with yet eerily reminiscent of the the original `regexp` module in its interface and exported functionality, it exported: - The usual pre-Unicode `re` module flags, plus the `ANCHORED` flag which was used by the low-level object's `match` method. Values were as documented for the `pre` module [in the main document](pythonregex.html#pre). - An `error` exception. - Two functions, `pcre_compile` and `pcre_expand`. .. html-a:: :id: firstpcrefunc :name: firstpcrefunc === A.2.1. Functions of the first `pcre` (C support) module === The functions were as follows: .. html-dl:: .. html-dt:: pcre.'''pcre_expand''' (''match''\:MatchObject, ''repl''\:str) ⇒ ''result''\:str .. html-dd:: Used by the `sub` and `subn` functions/methods, this did exactly the same thing as the older `reop._expand()` which it replaced, i.e. it took a match object and a replacement string (with group references) and returned the replacement string with the group-references substituted in. Accordingly, it corresponds to the `expand` method subsequently added to the match objects themselves in the `sre` implementation (but never in the `pre` implementation). .. html-dt:: pcre.'''pcre_compile''' (''pattern''\:str, ''flags''\:int, ''groupindex''\:dict) ⇒ ''code''\:Pcre .. html-dd:: Accepted a regular expression pattern, compiled it, and returned a compiled low-level regular expression object, exporting only a `match` method as documented below. Mappings of names to numbers for named groups would be written to `groupindex`, which was expected to initially be an empty dict. .. html-dt:: ''method'' Pcre.'''match''' (''string''\:str, ''pos''\:int, ''endpos''\:int, ''options''\:int) ⇒ ''regs''\:tuple .. html-dd:: Searched the string for a given pattern; the `options` argument would be set to the `ANCHORED` flag by `re.match` (i.e. matching only at the start of the range) and to `0` by `re.search`. Returned a `regs` tuple of the usual format, or `None` if there is no match. .. html-a:: :id: sresupport :name: sresupport == A.3. Support modules of the `sre` implementation of `re` == === A.3.1. The `_sre` C support module === TODO === A.3.2. The `sre_constants` support module === TODO === A.3.3. The `sre_compile` support module === TODO === A.3.4. The `sre_parse` support module === TODO .. html-a:: :id: secondregexsupport :name: secondregexsupport == A.4. The `_regex` C support module == TODO .. html-a:: :id: obelisk :name: obelisk † Python 1.6 is basically the state of Python 2 at the point that Guido left CNRI, released under contractual obligation or something similar: it incorporates many but not all distinctively 2.x features, it is accordingly not a true Python 1.x release; hence, "What's new in Python 2.0" compares it with Python 1.5.2.