Musings on the salvage of SGML
The current status quo is as follows:
- Most applications of SGML, such as DocBook, adopted XML in later versions. Most SGML-derived formats from after the turn of the millennium, such as XAML, used XML from the beginning. Outside of isolated legacy systems, the main hold-outs were HTML and BBCode.
- Although software which implemented BBCode in its heyday may still support it, it has been nearly fully displaced by various dialects of Markdown. Unlike BBCode, Markdown makes no attempt to be definable in terms of SGML.
- Yes, BBCode counts as SGML about as much as HTML ever did (making heavy use of `SHORTREF`s and syntax-token redefinition notwithstanding).
- I don’t know of any pre-existing `SYNTAX` FPI for the BBCode syntax though.
- HTML never fully adopted XML.
- HTML files may still need to be written to be parsable with an XML parser in certain specific contexts (iXBRL-format financial reports, for example, are expected to be polyglot documents, readable as XML by XBRL software and as HTML by a browser; e-book formats also tend to target minimum specs with an XML parser but not necessarily an HTML5 parser).
- Many years of stigmatising tag omission die hard. Omitting `<head>` and `<body>` opening and closing tags is still, in practice, incorrectly perceived as something merely condoned by browsers, rather than “correct” HTML as it actually is.
- HTML5 is not defined in terms of SGML, since its de facto parsing requirements for compatibility with the web as it exists cannot be expressed in terms of SGML as it currently exists.
- The major clinchers are:
- The conditional handling of self-closing syntax (which is ignored on HTML-namespace elements, and honoured on SVG-namespace or MathML-namespace elements, where the namespaces are inferred by recognising the respective root-element names).
- The different “misnested” handling of certain omitted tags (e.g. if an omitted closing `</i>` tag is inferred before a closing `</b>` tag, it will also infer an omitted opening `<i>` tag straight after the closing `</b>` tag).
- As particularly evident for misnested `<a>` tags, attributes behave as `#CURRENT` in this case even if they don’t normally.
- What does this mean for `id=`?
- How would HTML’s handling for duplicated `id=`s interact with an extended SGML layer (or XLink or HyTime, for that matter)?
- HTML5’s inference of omitted tags doesn’t derive implicitly from DTD content models (note that SGML tag-omission specifiers state which elements are allowed to omit tags without flagging up a validation error; they do not affect the logic for determining if an omitted tag will be inferred).
- Different end conditions for `CDATA` elements.
- Even `CDATA` elements without special handling, such as `<xmp>` or `<style>`, end their `CDATA` spans at a `</` token followed by the element name and a tag-terminating byte rather than merely at the first `</` token seen.
- The `<plaintext>` element’s `CDATA` span ends only at an end-of-entity token, and the rules for when the `<script>` element’s `CDATA` span ends are remarkably complex.
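The difference between the two end conditions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are made up, the SGML side is reduced to “stop at the first `</`” as described above, and the HTML side covers only the ordinary case (`<xmp>`, `<style>` and friends), not the `<plaintext>` or `<script>` special cases.

```python
import re

def sgml_cdata_end(text: str) -> int:
    """Simplified SGML-style rule: the span ends at the first `</` seen."""
    return text.find("</")

def html_rawtext_end(text: str, element: str) -> int:
    """HTML-style rule: `</`, then the element's own name (any case), then a
    tab, line feed, form feed, space, `/` or `>`. Returns -1 if never closed."""
    closer = re.compile("</" + re.escape(element) + r"[\t\n\f />]", re.IGNORECASE)
    match = closer.search(text)
    return match.start() if match else -1

css = 'p::after { content: "</p>"; }</STYLE>trailing'
print(repr(css[:sgml_cdata_end(css)]))             # cut short inside the string literal
print(repr(css[:html_rawtext_end(css, "style")]))  # the whole stylesheet
```

The same stylesheet is truncated under the first rule and survives intact under the second, which is the sort of divergence an SGML-based definition of HTML would have to accommodate.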
- Boolean attributes, as they are treated in HTML5, have diverged somewhat from the SGML-based concept (which felt somewhat kludgy to begin with).
- To say nothing of attributes which are for historic reasons sometimes boolean and sometimes not (consider: `<table border>`, `<table border=border>`, `<table border="">` and `<table border=1>`).
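To see how a present-day tokeniser reports those four spellings, here is a small sketch using Python’s standard-library `html.parser`. That module is not a conforming HTML5 parser, and `AttrCollector` and `border_values` are names invented for this illustration, but it does take the “attribute name with an omitted value” view rather than the SGML “attribute value with an omitted name” view.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Records the attributes of every start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(dict(attrs))

def border_values(markup: str):
    """Return the value reported for `border` on each start tag in `markup`."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return [attrs.get("border", "<absent>") for attrs in collector.seen]

forms = '<table border><table border=border><table border=""><table border=1>'
print(border_values(forms))
```

The bare `border` comes back as the name `border` with no value (`None`), not as a value whose attribute name was omitted, while the other three spellings yield three different strings for what was historically the same setting.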
- For web compatibility, it would need to be possible to disable any SGML syntax that would cause existing HTML documents to be interpreted differently, e.g. marked sections.
- `DOCTYPE` declarations serve a completely different purpose, and behave in a completely different way, in HTML5 versus in SGML.
- For the purposes of validation, the HTML5 schema is defined using DSDL (that is, RELAX-NG supplemented by Schematron), with a bespoke datatypes library. More on this later.
- The full XML format cannot be defined in terms of the original SGML format.
- Qualified clauses in the XML specification have the effect of defining two versions, the full “for compatibility” version which is defined in terms of WebSGML (a later annex to the SGML standard extending it), and the restricted “for interoperability” version which is defined in terms of the original SGML specification as described in Goldfarb; each has a separate SGML declaration.
- The clinchers are (a) hexadecimal numeric character references (which cannot be used in the “for interoperability” version), and (b) self-closing tags (which, in the “for interoperability” version, elements defined as `EMPTY` in a DTD, and specifically a DTD rather than an XSD or RNG, must always use, and all others must never use).
- The “for compatibility” clauses (i.e. restrictions which only exist for the purpose of keeping XML valid WebSGML) themselves imply an aspirational future XML version where SGML compatibility is no longer relevant and neither set of restrictions will apply.
- As explained in Goldfarb, requiring enumerated values of attributes to be unique for an entire element (as opposed to merely prohibiting ambiguous attribute-name omissions in the document instance) was a deliberate restriction to avoid confusing users. Whether this actually achieves that is questionable, considering that people tend to think of HTML boolean attributes as omitting the attribute value (as they indeed are in HTML5), rather than omitting the name as SGML considers them to be—this is also a lurking gotcha for anyöne who tries to use a DTD to schematise XML, since XML doesn’t use attribute-name omission in the first place.
- For conceptualising the hierarchical content of an HTML or XML document, the W3C and later WHATWG DOM (Document Object Model) has largely displaced the earlier concepts of ESIS (Element Structure Information Set) and Property Set as applied to SGML.
- Except for the obsolete “level 1” DOM (which I’m not sure even has any “pure” implementations in practice), the concept of XML namespaces is baked into the DOM. This applies even to HTML5, where HTML, SVG or MathML elements are implicitly assigned to their respective XML namespaces by the HTML parser.
- This has the extra-fun result that there’s effectively an HTML serialisation of SVG, which is distinct from the XML serialisation of SVG, e.g. permitting unquoted attributes.
- The issue which is addressed by XML namespaces (defining element and attribute semantics in a manner independent of the overall schema of the document) had already been addressed in the context of HyTime hypertext linking, in the form of “architectural forms”. This is now incredibly obscure, while XML namespaces are ubiquitous.
- XML is case-sensitive
- The WHATWG DOM treats HTML elements differently due to them beïng case-insensitive.
- Case-insensitive XML formats are not unheard of, e.g. Microsoft ASX/WMX playlists (if they can be considered XML).
- Although, ASX seems to have been deprecated as a playlist format in favour of a subset of SMIL, including in Microsoft contexts.
- SGML, and by extension DTDs, have no innate understanding of a namespace.
- Strictly speaking, this is also true of the core specification for XML: XML namespaces are a separate specification, while the core XML specification treats colons as merely part of the name (although names starting with `xml` are reserved for W3C use, so the core XML layer knows that the attribute name `xmlns:svg` has some schema-independent W3C-assigned semantic, just not what that semantic is; the SGML layer doesn’t even know that).
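The layering can be observed directly with Python’s standard library, which offers both views of the same document. This is a small sketch rather than anything normative: the sample document and the helper name `raw_names` are made up for the illustration.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

DOC = ('<svg:svg xmlns:svg="http://www.w3.org/2000/svg">'
       '<svg:rect width="1" svg:height="2"/></svg:svg>')

def raw_names(doc: str):
    """Core-XML view: no namespace processing, colons are just name characters."""
    names = []
    parser = xml.parsers.expat.ParserCreate()  # namespace handling left off
    parser.StartElementHandler = lambda name, attrs: names.append((name, sorted(attrs)))
    parser.Parse(doc, True)
    return names

# Namespace-aware view: prefixes are resolved and xmlns declarations consumed.
root = ET.fromstring(DOC)
rect = root[0]

print(raw_names(DOC))
print(root.tag, sorted(rect.attrib))
```

In the first view `xmlns:svg` is an ordinary attribute and `svg:rect` an ordinary name. In the second, the element names carry the namespace URI, and the unprefixed `width` attribute stays in no namespace at all rather than inheriting that of its element.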
- In the other direction, the DOM has no means of representing an internal DTD subset: it can only represent a `DOCTYPE` declaration in terms of a tag name, optional public ID, and optional system ID.
- DTDs not beïng themselves “XML” as the term came to be understood came to be regarded as a mistake. Whether this was actually the case is something I am inclined to question.
- That XSD can itself be defined in XSD (ditto for RELAX-NG) means that those definitions provide valuable documentation of the format, both by description and by example. Such is not possible for the DTD format, let alone the SGML declaration.
- XML DTDs are a subset of SGML DTDs.
- There is nothing in principle preventing an XML validator from implementing more of the full DTD format than the subset defined in the core XML specification, since many if not most of the excluded features are still theoretically relevant to XML.
- The XML specification (W3C) is open-access, while the SGML specification (ISO) is not (most advisable way to get it is arguably to buy a second-hand copy of Goldfarb, although that doesn’t include the WebSGML extensions). An implementation based solely on the XML specification would be entirely unaware of DTD features outside of that subset, which would appear to be syntax errors.
- Principal removed features are:
- Inclusions and exclusions.
- `&` groups (RELAX-NG reïntroduces a modified version of this).
- Granted, their SGML semantic has the fatal flaw that using `+` or `*` operators inside an `&` group does not behave in a sensible manner.
- The HTML4 DTDs kludge around this using inclusions for most `<head>` elements.
- RELAX-NG’s reïntroduction of the `&` operator changes the semantic to address this.
- I think even `&&` would be syntactically unambiguous for the alternative semantic in an extended DTD format.
- However, this does mean that unordered groups cannot be expressed in the XML subset of DTD without highly repetitive syntax resulting from expanding them to a `|` group of `,` groups for every single permutation (which is a factorial blowup in a naïve approach, not quite as bad in a recursive approach).
- Note that in the context in which they appear, only parameter entity references can occur, not general entity references, so the use of ampersand is not an issue.
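To put numbers on that blowup, here is a sketch of both expansions in Python. The function names are invented for the illustration, and it handles only plain element names (no occurrence indicators or nested groups).

```python
import re
from itertools import permutations

def expand_naive(names):
    """(a & b & c) as a choice between every ordering written out in full."""
    orderings = ["(" + ", ".join(p) + ")" for p in permutations(names)]
    return "(" + " | ".join(orderings) + ")"

def expand_recursive(names):
    """Same language, but each choice commits to one first element and then
    recurses on the rest, so shared prefixes are written only once."""
    if len(names) == 1:
        return names[0]
    branches = []
    for i, head in enumerate(names):
        rest = names[:i] + names[i + 1:]
        branches.append(f"({head}, {expand_recursive(rest)})")
    return "(" + " | ".join(branches) + ")"

def name_count(model: str) -> int:
    """Number of element-name tokens in a content model string."""
    return len(re.findall(r"\w+", model))

print(expand_recursive(["a", "b", "c"]))
for n in range(2, 6):
    names = list("abcde")[:n]
    print(n, name_count(expand_naive(names)), name_count(expand_recursive(names)))
```

For four names the naïve form mentions 96 element names and the recursive one 64; both still grow factorially, the recursive form merely with a smaller constant. The recursive form also has the side benefit that no two branches of a choice start with the same element.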
- `#CONREF` attributes.
- Can be converted from SGML DTD to RELAX-NG directly. Cannot be converted to XML DTD nor (so far as I know) to XSD.
- Granted, they are much less powerful than RELAX-NG, but the “contains content OR this attribute but not both” semantic does crop up reasonably often (e.g. although the HTML `<script>` tag doesn’t use `#CONREF` on the `src=` attribute, it fits the intended semantic impeccably).
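As a rough illustration of that semantic outside of any schema language, the check below flags elements that carry both the attribute and content. It is a sketch only: `conref_violations` is a made-up helper, the sample document is XML rather than HTML, and a real `#CONREF` is enforced by the SGML parser itself rather than by a post-hoc check.

```python
import xml.etree.ElementTree as ET

def conref_violations(root, tag, attr):
    """Elements named `tag` that have both the `attr` attribute and content."""
    offenders = []
    for element in root.iter(tag):
        has_content = bool((element.text or "").strip()) or len(element) > 0
        if attr in element.attrib and has_content:
            offenders.append(element)
    return offenders

sample = ET.fromstring(
    '<body>'
    '<script src="a.js"/>'
    '<script>inline()</script>'
    '<script src="b.js">alsoInline()</script>'
    '</body>'
)
print([el.get("src") for el in conref_violations(sample, "script", "src")])
```

Only the third `<script>` is reported: it supplies both a `src=` reference and inline content, which is exactly the combination a `#CONREF` declaration would rule out.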
- Tag omission specifications (SGML strictly speaking mandates them when `OMITTAG` is enabled, and permits but ignores them otherwise). This one isn’t relevant to XML.
- XSDs theoretically replaced DTDs.
- Although XSD is still a current specification, the more powerful RELAX-NG has become a formidable competitor for hierarchical-schema definitions.
- Although RELAX-NG’s capability to define hierarchical schema structures with attributes is mostly a superset of its precursors (full SGML DTD, XML DTD, and XSD), it lacks the inclusions/exclusions features of the SGML DTD format.
- Exclusions, in a DSDL schema, are expressed with a separate Schematron file accompanying the RELAX-NG schema.
- Before the combination of RELAX-NG and Schematron in DSDL, a kludge was used in some places which expressed each individual exclusion in a separate RELAX-NG schema, which all had to be applied to the document instance in parallel to the main schema file. The standardisation (at the ISO level, no less) of the combination of RELAX-NG with Schematron rendered this long-winded kludge obsolete.
- Schematron assertions are more powerful than SGML DTD exclusions.
- DTD inclusions are very awkward: SGML doesn’t treat them as part of the content in every respect (the same arcane whitespace-handling considerations that nominally define the difference between `SDATA` and processing instructions so far as the SGML layer is concerned), they make the content models of everything else very confusing to the user, and their usage in the HTML4 DTDs for elements within `<head>` is rendered unnecessary by RELAX-NG’s modified `&` semantics.
- However, HyTime makes extensive use of DTD inclusions for architectural forms in its meta-DTDs.
- One can, of course, include the element in every content model that it could theoretically appear in via inclusions (minus any places where it self-evidently does not belong), then enforce descent from the “inclusion” ancestor in a Schematron file. This does not entail any weird whitespace behaviour, and avoids allowing the elements in places where they would absolutely not be expected.
- Schematron isn’t really suitable for use in inferring SGML tag omission though.
- A subset of it (constrained to only define assertions relating to SGML structures which precede, rather than follow, the context element) could definitely be used for that though.
- Inferring closing-tag omission via an SGML exclusion (except if it’s excluding something directly included in the content model of the element itself, in which case, just remove it) sounds fairly surprising to the user anyway (it would necessarily close at least two tags).
- And yet, may well be necessary to parse existing SGML content.
- Which would presumably use a DTD anyway, so what’s the problem?
- Unlike XSD, RELAX-NG has no innate capability to define scalar datatypes, although it can reference datatype libraries by namespace URI, and pass parameters to those datatypes. RELAX-NG is often used with the same datatype libraries as XSD, which might themselves be defined with XSD (in which case, the RELAX-NG datatype parameter concept maps onto the XSD datatype “facet” concept), and/or with a plugin for the validator software.
- Same applies to DSDL-DTD (DSDL part 9).
- This means that XSD is possibly still “the” industry standard for defining datatypes (or declaring primitive datatypes) themselves.
- Not necessarily true. DSDL part 5 is purpose-designed for defining datatypes and, unlike XSD, is capable of defining some degree of value-parsing, not just matching, of datatypes.
- SGML’s concept of a `NOTATION` encompasses both the concept of a datatype in a schema, and the concept of a content type (file type) of a resource.
- SGML’s ability to define attribute datatypes is severely limited.
- Besides `NMTOKEN`, `NMTOKENS` and `CDATA`, the keywords for defining attribute datatypes are mostly… not all that useful (not all of the numbers one deals with are positive integers, after all).
- Unlike its usage when defining unparsed external entities, the `CDATA` keyword used when defining an attribute datatype is not followed by a notation name. This is a severe oversight.
- This cannot be trivially fixed since a following name-token already has a defined meaning, i.e. the name of the next attribute.
- Neither `SDATA` (the format of an attribute rarely varies depending on operating system, especially on the web) nor `NDATA` (implies that the format has to be treated as a binary stream, and not as text in the document’s encoding) is really correct in the context of the datatype of an attribute.
- Since XML only allows `NDATA`, not `CDATA` or `SDATA` (nor `SUBDOC` even), when defining unparsed external entities, one could argue that XML has broadened the meaning of `NDATA` from the original “`NONSGML` data” to “`NOTATION` data”. So that is probably the least bad option.
- Eh, SGML over-uses the `CDATA` keyword anyway, so overloading the `NDATA` keyword would be fairly par for the course.
- What about `RCDATA`, given that “`CDATA`” attributes are really `RCDATA` anyway?
- That could imply that ampersand-references would be replaced twice (as happens for internal text entities, once when it is defined and again when it is transcluded; internal `CDATA` entities, by contrast, have their ampersand-references replaced only once, when the entity is defined).
- See comments about `CDATA` content models below.
- Strictly speaking, where the `CDATA` (or `NDATA`) keyword is followed by a notation name, that can in turn be followed by an attribute list in square brackets.
- Does anyöne actually use this syntax? In every DTD I’ve seen, notation attributes are only ever set by means of default values in an `ATTLIST` (for a single notation name, where a given notation FPI can be used to declare multiple notation names).
- Notation attributes very cleanly correspond to (a) parameters in MIME types, (b) facets in XSD XML datatypes, (c) datatype parameters in RELAX-NG.
- De facto, DTDs seem to have settled on defining parameter entities representing datatypes (in later such examples, they seem to have acquired a naming convention of e.g. `%Number.datatype;`; this is not the case in earlier such examples). The value of this parameter entity is just one of the generic keywords, often `CDATA`. Presumably, however, the intent is that a validator aware of this convention could make use of the additional information.
- Notation FPIs for these attribute datatypes are declared by some of the W3C’s XHTML DTDs. There is no way in standard (Web)SGML to associate them with the parameter entities besides their declared notation names beïng (case-insensitive) matches of the parameter entity names, however.
- But, one could redefine the entities, even in an internal DTD subset, with e.g. `<!ENTITY % URI.datatype "NDATA URI">` if targeting an implementation with such a nonstandard extension.
- How does this apply to content models? e.g., some HTML4 DTDs use `%Script;` (defined as `CDATA`) as both a content model and an attribute keyword.
- This implies that one could use e.g. `NDATA xbm` as a content model.
- Or `CDATA xbm` (I don’t think that’s actually ambiguous in this case)?
- What about `RCDATA xbm`?
- The standard way would be to define a `#NOTATION` attribute with a default and sole-permitted value `xbm`, and declare the content model as `CDATA`, `RCDATA` or `(#PCDATA)` as appropriate.
- Speaking of, is there a reason to require `(#PCDATA)` as opposed to just accepting `PCDATA`? The content model gets treated as a keyword anyway unless it’s a group, and the function of the `#` character is to denote a keyword in a syntactic context where a name could also appear. (One could also condone `#PCDATA` with a `#` but no group, which does occasionally show up in DTDs, so presumably some software is condoning it anyway; and by extension, unnecessary prefixing of keywords with `#` in general.)
- Although that’s probably a bad idea if you want to be able to sanitise DTDs (or documents embedding them).
- DSDL-DTD (DSDL part 9) takes a different approach using processing instructions, effectively however meaning that each attribute is defined twice (not very DRY).
- SGML’s ability to define element-content datatypes is less limited, but still potentially annoying (if the content datatype depends on an attribute, it must depend on exactly one attribute, and the enumerated values of that attribute must be valid names and distinct from the values of any other enumerated attribute on that element; if the element content’s datatype doesn’t vary, one needs to pollute the attribute space with an attribute with a single valid value which is also used as the default value).
- The lack of ability to define open-ended notation attributes is also somewhat annoying.
- Polluting the attribute space with fixed-valued attributes is also how the aforementioned architectural forms work.
- Allowing a DTD to set values of attributes in namespaces that aren’t necessarily declared in the document itself would cleanly ameliorate that, and set the ground for architectural forms and XML namespaces cleanly complementing one another.
- Does this interact with DSDL-DTD (DSDL part 9) and, if so, how?
- How does this interact with LPDs?
- Speaking of XML namespaces and attributes, the fact that unprefixed attributes are treated as beïng in the null namespace rather than inheriting the namespace of the element they belong to is… decidedly annoying.
- Good luck changing that in a way that doesn’t break e.g. RELAX-NG semantics.
- `NOTATION` FPIs are, ironically, the only FPI class for which there is no discernible file format that a resolved locator is supposed to point to.
- All the others theoretically can be understood as pointing to some sort of SGML fragment or resource file:
- `SD` resolves to an SGML declaration.
- `CAPACITY`, `SYNTAX` and `CHARSET` can resolve to the respective fragments of an SGML declaration.
- Yes, an SGML declaration can define a `CHARSET` in terms of one or more other `CHARSET`s, or eventually in terms of `SDATA` with e.g. OpenType glyph names.
- `SYNTAX` might in practice also resolve to an entire SGML declaration, in FPIs that have to pass the `FORMAL` feature on implementations without WebSGML’s addition of the `SD` FPI class.
- `DTD`, `ELEMENTS`, `ENTITIES` and `SHORTREF` resolve to DTD subsets (the difference between the four is less important in practice since a parameter entity transclusion doesn’t actually look at the FPI class; theoretically the latter three only contain specific types of markup declaration, although this is not always strictly adhered to in practice).
- `DOCUMENT` and `SUBDOC` resolve to SGML documents (resources in an SGML format), while `TEXT` resolves to an SGML document fragment to be substituted as a text entity during parsing.
- `LPD` resolves to a link-process definition (more on them later).
- `NONSGML` resolves to a resource in a non-SGML format.
- To what extent are all FPIs resolvable? One could refer to a physical book, for example, so as to hyperlink to a particular page in a physical book.
- Hyperlinks to physical media are very in-line with early SGML idealism (Goldfarb, for example, uses a compact page/line reference notation to include so-described “push-button” hyperlinks despite beïng print medium).
- What FPI class does a physical medium reference use? (Probably `NONSGML`, though I wouldn’t be surprised to find people using `TEXT` or `DOCUMENT` for that in spite of their meanings in SGML entity terms.)
- FPIs of `NONSGML` class also get used for IDs for software exporting vCard (address book) and iCalendar files.
- Except for the ones that omit the FPI class altogether, or make other mistakes trying to imitate FPIs without understanding their syntax (or imitating such imitations), or make no attempt to be an FPI. One could argue for treating vCard and iCalendar as irrelevant to actual SGML FPIs (if you need a unique identifier, just use the URL of your software’s website; clearly virtually no calendar or address-book software handling vCard or iCalendar in practice actually knows or much cares about what an FPI is).
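For concreteness, here is a sketch of pulling the class out of an FPI in Python. It is deliberately simplified: `parse_fpi` and the `FPI` tuple are made-up names, the unavailable-text indicator and display versions are ignored, and the class list is just the one enumerated above (including WebSGML’s `SD`).

```python
from typing import NamedTuple, Optional

FPI_CLASSES = {
    "CAPACITY", "CHARSET", "DOCUMENT", "DTD", "ELEMENTS", "ENTITIES", "LPD",
    "NONSGML", "NOTATION", "SD", "SHORTREF", "SUBDOC", "SYNTAX", "TEXT",
}

class FPI(NamedTuple):
    owner: str
    registered: Optional[bool]  # None when the owner is an ISO publication
    text_class: str
    description: str
    language: str

def parse_fpi(fpi: str) -> FPI:
    """Split an FPI into owner, class, description and language."""
    parts = fpi.split("//")
    if parts[0] in ("+", "-"):
        registered: Optional[bool] = parts[0] == "+"
        owner, rest = parts[1], parts[2:]
    else:
        registered = None
        owner, rest = parts[0], parts[1:]
    if len(rest) < 2:
        raise ValueError(f"not enough fields in {fpi!r}")
    text_class, _, description = rest[0].partition(" ")
    if text_class not in FPI_CLASSES:
        raise ValueError(f"{text_class!r} is not an FPI class")
    return FPI(owner, registered, text_class, description, rest[1])

print(parse_fpi("-//W3C//NOTATION XHTML Datatype: URI//EN"))
print(parse_fpi("ISO 8879:1986//NOTATION Formal Public Identifier//EN"))
```

An FPI-shaped product identifier with no class keyword, such as the hypothetical `-//Example Corp//Example Calendar 1.0//EN`, is rejected by this parser, since `Example` is not a class.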
- The built-in XSD datatype library also defines URLs with fragments for referencing the individual datatypes (for example, `http://www.w3.org/2001/XMLSchema#dateTime`). These point to the XSD file defining/declaring the datatypes, which sets `id=` attributes on the individual derived datatype definitions / primitive datatype declarations, and are the obvious thing to resolve notation FPIs for datatypes to.
- The XSD file does not fully specify the primitive datatypes, only the ways in which the derived datatypes are constrained, since XSD is insufficiently powerful (unlike DSDL part 5).
- Strictly speaking, XML `SYSTEM` identifiers are not supposed to contain URL fragment parts.
- Which makes sense for external parsed entities, but not really for notations.
- The WHATWG HTML standard defines the (IANA-registered) `about:html-kind` URI as a notation identifier for the datatype of the `kind` string-valued property of list-items of the `audioTracks`, `videoTracks` and `textTracks` properties (i.e. Javascript attributes as opposed to SGML attributes) of DOM nodes for `<media>` elements. I’m not entirely sure where this gets used.
- MIME types can be turned into URIs by prefixing `http://www.iana.org/assignments/media-types/`. I say URIs, not URLs, because retrieving the MIME type registration document (if any exists) would not be useful to an SGML implementation.
- Could a primitive datatype (e.g. `octetStream`, since all MIME types are subsets of `application/octet-stream`) be declared in an XSD (or DSDL part 5) file taking a MIME type (and/or an Apple UTI, Windows Registry GUID, etc) as a facet?
- Using a datatype library URI of `http://www.iana.org/assignments/media-types/video/` with a datatype name of `webm`, for example, could be another option (since treating portions appended to the namespace URI like namespaced names has precedent in W3C’s CURIE format).
- Conversely, should XML datatypes from the XSD, XHTML and HTML5 datatype libraries be given MIME types (say, in the unregistered or personal trees)?
- This makes some sense for those with FPIs defined by XHTML, possibly for the built-in XSD ones, maybe for those of the remaining HTML5 ones which could plausibly be re-used elsewhere. It does not make much sense for anything else.
- Presumably one could use `+json` or `+yaml` suffixes (favouring the former where both apply) if the datatype so happens to be a subset of the syntax (e.g. many of the numeric datatypes count as subsets of JSON).
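The two MIME-to-identifier mappings floated above are simple enough to sketch. Both helper names are invented for this illustration, and MIME parameters (anything after a `;`) are simply dropped, which a real mapping would presumably want to turn into facets or notation attributes instead.

```python
IANA_MEDIA_TYPES = "http://www.iana.org/assignments/media-types/"

def mime_to_uri(mime: str) -> str:
    """Whole-URI form: the prefix followed by type/subtype."""
    essence = mime.split(";", 1)[0].strip().lower()
    return IANA_MEDIA_TYPES + essence

def mime_to_datatype(mime: str):
    """Library-plus-name form: (datatype library URI, datatype name)."""
    essence = mime.split(";", 1)[0].strip().lower()
    top_level, _, subtype = essence.partition("/")
    return IANA_MEDIA_TYPES + top_level + "/", subtype

print(mime_to_uri("video/webm"))
print(mime_to_datatype('video/webm; codecs="vp9"'))
```

Concatenating the pair returned by `mime_to_datatype` gives back the same URI as `mime_to_uri`, which is the CURIE-like property that makes the second form attractive.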
- The paradigm which everyöne has settled for seems to be that, with the possible exception of icons conceptually “built in” to a document format, references to embedded external resources appear in the document (as what, from a purely data-model perspective, might be considered a specialised type of hyperlink), not the prologue.
- This implies that such references are usually not unparsed entities.
- They’d presumably be referenced by a URI.
- There’s no fundamental reason not to reference one by an FPI.
- Can or should FPIs be used for hyperlinks in a document body?
- FPIs exist both for URIs (`-//W3C//NOTATION XHTML Datatype: URI//EN`) and for FPIs themselves (`ISO 8879:1986//NOTATION Formal Public Identifier//EN`), so one could readily mark a `#CONREF`/`href`/etc attribute as one of them, given an ability to specify notations for attributes themselves.
- This applies even to subdocuments (even though XML doesn’t allow the `SUBDOC` keyword on external entities). Yes, the XSD schema pointed to by using e.g. `xsi:noNamespaceSchemaLocation` in XML is a subdocument: it’s a document in the same concrete syntax (the XML one) as the main document, with its own independent schema, referenced from the main document.
- The SGML declaration `CHARSET` defines the character set whose codepoints numeric character references refer to.
- In HTML and XML, this is Unicode, regardless of the character encoding of the document itself.
- XML character encoding declarations are SGML processing instructions; HTML character encoding declarations are just void elements.
- Even before XML, this didn’t work well for variable-width encodings (e.g. the SGML declaration used in practice with pre-XML `Shift_JIS` or `EUC-JP` documents actually defines `csEucFixWidJapanese`, in terms of the individual ISO/IEC-2022 G-sets, as the SGML character set).
- So, the entire edifice makes the flawed (and thoroughly incorrect in the present UTF-8 age) assumption that character set (for numeric character references) equals character encoding (for reading the document off the disk: interpreting `NDATA` as `PCDATA`, `RCDATA` or `CDATA`, to use SGML terms).
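The XML behaviour is easy to demonstrate with Python’s standard library. In this small sketch the sample document is made up; it is stored as ISO-8859-1 bytes, an encoding which has no euro sign, yet numeric character references still address Unicode codepoints.

```python
import xml.etree.ElementTree as ET

# The document's bytes are ISO-8859-1, so the literal e-acute is the single byte 0xE9.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<p>caf\u00e9 &#8364; &#x20AC;</p>'
).encode("iso-8859-1")

text = ET.fromstring(latin1_doc).text
print(text)
print([hex(ord(ch)) for ch in text if ord(ch) > 0x7F])
```

Both the decimal and the hexadecimal reference come out as U+20AC, a codepoint that the document’s own encoding cannot represent directly; the character set for references and the encoding of the bytes are two separate things.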
- An SGML declaration defines capacity limits and quantity limits. WebSGML added the ability to disable them outright, and XML did so, effectively deprecating them. The advent of the Billion Laughs attack made it clear that, actually, limits are important.
- I’m not sure how fine-grained you could make a mitigation for billion-laughs-style attacks using the existing quantity and capacity limits provided by SGML. Introducing more (not fewer) limitable quantities and capacities might be sensible.
- Also, a more fine-grained approach to disabling the limits, as opposed to all-or-nothing as in WebSGML.
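As an idea of what one additional limitable quantity might look like, the sketch below computes the fully expanded length of an entity from its declarations alone, without ever building the expanded text, and refuses once a cap is exceeded. Everything here is hypothetical (the function names, the cap, and the simplified reference syntax); it is not a quantity that SGML or XML actually defines.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")

def expanded_length(name, entities, limit, cache=None, active=()):
    """Length of entity `name` after full expansion; raises OverflowError
    as soon as the running total exceeds `limit`."""
    if cache is None:
        cache = {}
    if name in cache:
        return cache[name]
    if name in active:
        raise ValueError(f"entity {name!r} refers to itself")
    text = entities[name]
    total = len(ENTITY_REF.sub("", text))  # literal characters only
    if total > limit:
        raise OverflowError(f"entity {name!r} exceeds {limit} characters")
    for ref in ENTITY_REF.findall(text):
        total += expanded_length(ref, entities, limit, cache, active + (name,))
        if total > limit:
            raise OverflowError(f"entity {name!r} expands past {limit} characters")
    cache[name] = total  # each declaration is measured only once
    return total

def laughs(levels=9):
    """Billion-laughs-style declarations: each level repeats the previous ten times."""
    entities = {"lol0": "lol"}
    for level in range(1, levels + 1):
        entities[f"lol{level}"] = f"&lol{level - 1};" * 10
    return entities

print(expanded_length("lol9", laughs(), limit=10**12))
```

With a cap of a million characters the same call fails fast instead, after looking at only a handful of declarations. Because the cap is on the expanded size rather than on nesting depth or declaration count, it targets the actual resource being exhausted.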
- Should the SGML declaration format be extended, replaced outright, or both? It is easily the least legible markup declaration in the entirety of SGML.
- If replaced, would that mean that `SYNTAX`es and `CAPACITY`s would now have more than one format? How would they be differentiated, if so?
- Likewise with `CHARSET`s, although defining them in Unicode Technical Standard 22 format would arguably be an improvement.
- Would any replacement still be referenceäble with an `SGML` markup declaration using an `SD` FPI? If so, how would that be distinguished?
- Would such a replacement be a markup declaration, processing instruction (unlikely tbh given how profoundly an SGML declaration can change the parsing of everything that follows, processing instructions included), or subdocument? If a subdocument, would it be in the Reference Concrete Syntax or in XML (or in the HTML Syntax, for that matter)? Note that an SGML declaration is always in the Reference Concrete Syntax (to avoid the cyclic dependency of needing to read the SGML declaration as a prerequisite to beïng able to read the SGML declaration), regardless of what syntax everything following it is in.
- If a subdocument, then it would in principle have a schema definition (e.g. RELAX-NG), which would serve as valuable documentation.
- If a subdocument, would its FPI class be `SD` or `SUBDOC`?
- Given that no part of SGML currently pays attention to the FPI class except possibly when validating the FPI itself (and we shouldn’t necessarily require a `PUBLIC` identifier to be present in the first place, and a `SYSTEM` identifier wouldn’t provide an FPI class), this shouldn’t be used to differentiate between the two formats.