DOMDocument appendXML with special characters

I am retrieving some HTML strings from my database and I would like to parse these strings into my DOMDocument. The problem is that DOMDocument emits warnings on special characters.

Warning: DOMDocumentFragment::appendXML() [domdocumentfragment.appendxml]: Entity: line 2: parser error : Entity ‘nbsp’ not defined in page.php on line 189

I wonder why this happens and how to solve it. Here are some code fragments from my page. How can I fix this kind of warning?

$doc = new DOMDocument();

// .. create some elements first, like some divs and a h1 ..

while($row = mysql_fetch_array($result))
{
    $messageEl = $doc->createDocumentFragment();
    $messageEl->appendXML($row['message']); // emits the warning here

    $otherElement->appendChild($messageEl);
}

echo $doc->saveHTML();

I also found something about validation, but when I apply that, my page won’t load anymore. The code I tried for that was something like this.

$implementation = new DOMImplementation();
$dtd = $implementation->createDocumentType('html','-//W3C//DTD XHTML 1.0 Transitional//EN','http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd');

$doc = $implementation->createDocument('','',$dtd);
$doc->validateOnParse = true;
$doc->formatOutput = true;

// in the same while loop, I used the following:
$messageEl = $doc->createDocumentFragment();
$doc->validate(); // this stopped my code, without any error or warning
$messageEl->appendXml($row['message']);

Thanks in advance!


4 Answers
answer

There is no &nbsp; in XML. The only character entities that have an actual name defined (instead of using a numeric reference) are &amp;, &lt;, &gt;, &quot; and &apos;.

That means you have to use the numeric equivalent of a non-breaking space, which is &#160; or (in hex) &#xA0;.
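As a rough illustration of the numeric-reference approach, the sketch below swaps the named entity for its numeric form before calling appendXML(). The sample string and the variable names ($message, $xmlSafe, $container) are made up for this example, and only &nbsp; is handled; any other HTML-only entity in your data would need the same treatment.

```php
<?php
// Minimal sketch: make an HTML snippet acceptable to appendXML() by
// replacing the HTML-only entity &nbsp; with its numeric reference.
$doc = new DOMDocument('1.0', 'UTF-8');
$container = $doc->createElement('div');
$doc->appendChild($container);

// Hypothetical message as it might come out of the database.
$message = '<p>Hello&nbsp;world</p>';

// &#160; is the numeric character reference for a non-breaking space.
$xmlSafe = str_replace('&nbsp;', '&#160;', $message);

$fragment = $doc->createDocumentFragment();
$appended = $fragment->appendXML($xmlSafe); // true on success

if ($appended) {
    $container->appendChild($fragment);
}

echo $doc->saveHTML();
```

Note that this only addresses the entity problem. The snippet still has to be well-formed XML in every other respect (closed tags, quoted attributes), otherwise appendXML() will complain about those parts instead.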

If you are trying to save HTML into an XML container, then save it as text. HTML and XML may look similar but they are very distinct. appendXML() expects well-formed XML as an argument. Use the nodeValue property instead; it will XML-encode your HTML string without any warnings.

// document fragment is completely unnecessary
$otherElement->nodeValue = $row['message'];
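Another way to store the string purely as text, sketched here as an alternative rather than as what the answer above prescribes, is to wrap it in a text node with createTextNode(). The sample message is invented for the example. Be aware that with this approach the markup is not parsed at all: it ends up escaped in the output and would be displayed literally in a browser.

```php
<?php
// Minimal sketch: insert the database string as plain text, so the
// serializer escapes <, > and & instead of trying to parse them.
$doc = new DOMDocument('1.0', 'UTF-8');
$otherElement = $doc->createElement('div');
$doc->appendChild($otherElement);

// Hypothetical message containing markup and an HTML-only entity.
$message = '<p>Fish&nbsp;&amp; chips</p>';

// The text node holds the string verbatim; nothing is parsed here.
$otherElement->appendChild($doc->createTextNode($message));

// Serialize just the element to inspect the escaped result.
$xml = $doc->saveXML($otherElement);
echo $xml, "\n";
```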
answer

That’s a tricky one because it’s actually multiple issues in one.

Like Tomalak points out, there is no &nbsp; in XML. So you did the right thing specifying a DOMImplementation, because in XHTML there is &nbsp;. But for DOM to know that the document is XHTML, you have to load and validate against the DTD. The DTD is located at

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

but because there are millions of requests to that page daily, the W3C decided to block access to it unless a User-Agent is sent in the request. To supply a User-Agent you have to create a custom stream context.

In code:

// make sure DOM passes a User Agent when it fetches the DTD
libxml_set_streams_context(
    stream_context_create(
        array(
            'http' => array(
                'user_agent' => 'PHP libxml agent',
            )
        )
    )
);

// specify the implementation
$imp = new DOMImplementation;

// create a DTD (here: for XHTML)
$dtd = $imp->createDocumentType(
    'html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
);

// then create a DOMDocument with the configured DTD
$dom = $imp->createDocument(NULL, "html", $dtd);
$dom->encoding = 'UTF-8';
$dom->validate();

$fragment = $dom->createDocumentFragment();
$fragment->appendXML('
    <head><title>XHTML test</title></head>
    <body><p>Some text with a &nbsp; entity</p></body>
    '
);
$dom->documentElement->appendChild($fragment);
$dom->formatOutput = TRUE;
echo $dom->saveXml();

This still takes some time to complete (don’t ask me why) but in the end, you’ll get (reformatted for SO):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>XHTML test</title>
    </head>
    <body>
        <p>Some text with a &nbsp; entity</p>
    </body>
</html>

Also see DOMDocument::validate() problem

answer

I do see the problem in question, and also that the question has been answered, but if I may I’d like to suggest a thought from my past dealing with similar problems.

It just might be so that your task requires including tagged data from the database in the resulting XML, but may or may not require parsing. If it’s merely data for inclusion, and not structured parts of your XML, you can place strings from the database in CDATA section(s), effectively bypassing all validation errors at this stage.
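A minimal sketch of the CDATA idea, assuming the database string only needs to be carried along and not parsed: createCDATASection() wraps the string so the XML parser never looks inside it. The element name "message" and the sample string are made up for the example.

```php
<?php
// Minimal sketch: carry an HTML string inside a CDATA section so that
// entities such as &nbsp; are never interpreted by the XML parser.
$doc = new DOMDocument('1.0', 'UTF-8');
$messageEl = $doc->createElement('message');
$doc->appendChild($messageEl);

// Hypothetical message as it might come out of the database.
$html = '<p>Hi&nbsp;there</p>';

// The CDATA section stores the string verbatim.
$messageEl->appendChild($doc->createCDATASection($html));

$xml = $doc->saveXML();
echo $xml;
```

One caveat worth checking in your own data: a string that itself contains the sequence ]]> would end the CDATA section early, so such input needs to be split or escaped first.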

answer

While smarty might be a good bet (why reinvent the wheel for the 14th time?), etranger might have a point. There are situations in which you don’t want to use something overkill like a complete new (and unstudied) package, but rather just want to post some data from a database that happens to contain HTML an XML parser has issues with.

Warning, the following is a simple solution, but don’t do it unless you’re SURE you can get away with it! (I did this when I had about 2 hours before a deadline and didn’t have time to study, let alone implement, something like smarty…)

Before sticking the string into an appendXML function, run it through a preg_replace. For instance, replace all &nbsp; entities with [some_prefix]_nbsp. Then, on the page where you show the HTML, do it the other way around.

And Presto! =)

Example code:
Code that puts text into a document fragment:

// add text tag to p tag.
// print("CCMSSelTextBody::getDOMObject: strText: ".$this->m_strText."<br>\n");
$this->m_strText = preg_replace("/&nbsp;/", "__nbsp__", $this->m_strText);
$domTextFragment = $domDoc->createDocumentFragment();
$domTextFragment->appendXML(utf8_encode($this->m_strText));
$p->appendChild($domTextFragment);
// $p->appendChild(new DOMText(utf8_encode($this->m_strText)));

Code that parses the string and writes the HTML:

// Instantiate template.
$pTemplate = new CTemplate($env, $pageID, $pUser, $strState);

// Parse tag-sets.
$pTemplate->parseTXTTags();
$pTemplate->parseCMSTags();

// present the html code.
$html = $pTemplate->getPageHTML();
$html = preg_replace("/__nbsp__/", "&nbsp;", $html);
print($html);

It’s probably a good idea to think up a stronger replacement string. If you insist on being thorough, do an md5 on a time() value and hardcode the result as a prefix. So in the first snippet:

$this->m_strText = preg_replace("/&nbsp;/", "4597ee308cd90d78aa4655e76bf46ee0_nbsp", $this->m_strText);

And in the second:

$html = preg_replace("/4597ee308cd90d78aa4655e76bf46ee0_nbsp/", "&nbsp;", $html);

Do the same for any other tags and stuff you need to circumvent.

This is a hack, and not good code by any stretch of the imagination. But it saved my life and I wanted to share it with other people who run into this particular problem with minutes to spare.

Use the above at your own risk.

  • Ted Guild

    Due to the volume of DTD requests W3C not only blocks user agents that do not identify themselves but has set up a tarpit to encourage software and library authors to use a catalog.

    Wouldn’t it be more efficient to validate your markup using local resources instead of going over the net to retrieve something that has not changed since August of 2002 each and every time your code runs? At the very least these libraries should more fully implement HTTP and take advantage of the caching directives which would mean they would only retrieve the DTD once every three months. W3C’s tarpit delay is meant to emphasize this so developers notice and file bug reports/feature requests with the library maintainers.
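    One way to act on this in PHP, sketched here under assumptions, is to register a resolver with libxml_set_external_entity_loader() that maps the XHTML public identifiers to local copies of the files. The function name resolveLocalDtd and the dtd/ directory are hypothetical; you would have to download the DTD and its three entity files yourself, and I have not verified the behaviour against every PHP and libxml version.

```php
<?php
// Minimal sketch: resolve the XHTML 1.0 DTD and its entity sets from
// local files instead of fetching them from www.w3.org on every run.
// The dtd/ directory and its contents are assumptions for this example.
function resolveLocalDtd($publicId, $systemId, $context)
{
    $local = [
        '-//W3C//DTD XHTML 1.0 Transitional//EN' => __DIR__ . '/dtd/xhtml1-transitional.dtd',
        '-//W3C//ENTITIES Latin 1 for XHTML//EN' => __DIR__ . '/dtd/xhtml-lat1.ent',
        '-//W3C//ENTITIES Symbols for XHTML//EN' => __DIR__ . '/dtd/xhtml-symbol.ent',
        '-//W3C//ENTITIES Special for XHTML//EN' => __DIR__ . '/dtd/xhtml-special.ent',
    ];

    if ($publicId !== null && isset($local[$publicId])) {
        return $local[$publicId]; // path to the local copy
    }

    return null; // anything else is not loaded at all
}

// From here on, libxml asks the resolver before loading external entities.
$registered = libxml_set_external_entity_loader('resolveLocalDtd');
```

    With the resolver in place, the createDocumentType() / validate() code from the answer above can stay as it is, and the custom User-Agent stream context should no longer be needed, since no HTTP request is made for the mapped identifiers.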