This article is intended for PHP developers at all levels who are
interested in using the new XML functionality in PHP 5. Only basic,
general knowledge about XML is assumed. However, it's an advantage if
you have already worked with XML in PHP.
Introduction
In today's Internet world, XML isn't just a buzzword anymore, but a
widely accepted and used standard. Therefore XML support was taken
more seriously for PHP 5 than it was in PHP 4. In PHP 4 you were
almost always faced with non-standard, API-breaking, memory leaking,
incomplete functionality. Although some of these deficiencies were
dealt with in the 4.3 series of PHP 4, the developers nevertheless
decided to dump almost everything and start from scratch in PHP 5.
This article will give an introduction to all the new exciting
features PHP 5 has to offer regarding XML.
XML in PHP 4
PHP has had XML support from its early days. While this was "only" a
SAX based interface, it did at least allow parsing any XML documents
without too much hassle. Further XML support came with PHP 4 and the
domxml extension. Later the XSLT extension, with Sablotron as backend,
was added. During the PHP 4 life cycle, additional features like HTML,
XSLT and DTD-validation were added to the domxml extension.
Unfortunately, since the xslt and domxml extensions never really left
the experimental stage, and changed their API more than once, they
were not enabled by default, and frequently not installed on hosts.
Furthermore, the domxml extension did not implement the DOM standard
defined by the W3C, but had its own method-naming. While this was
improved in the 4.3 series of PHP, together with a lot of memory leak
and other fixes, it never reached a truly stable stage, and it was
almost impossible to really fix the deeper issues. Also, only the SAX
extension was enabled by default, so the other extensions never
achieved widespread distribution.
For all these reasons, the PHP XML developers decided to start from
scratch for PHP 5, and to follow commonly used standards.
XML in PHP 5
Almost everything regarding XML support was rewritten for PHP 5. All
the XML extensions are now based on the excellent libxml2 library by
the GNOME project. This allows for interoperability between the
different extensions, so that the core developers only need to work
with one underlying library. For example, the quite complex and now
largely improved memory management had to be implemented only once for
all XML-related extensions.
In addition to the better-known SAX support inherited from PHP 4, PHP
5 supports DOM according to the W3C standard and XSLT with the very
fast libxslt engine. It also incorporates the new PHP-specific
SimpleXML extension and a much improved, standards-compliant SOAP
extension. Given the increasing importance of XML, the PHP developers
decided to enable more XML support by default. This means that you now
get SAX, DOM and SimpleXML enabled out of the box, which ensures that
they will be installed on many more servers in the future. XSLT and
SOAP support, however, still need to be explicitly configured into a
PHP build.
Streams support
All the XML extensions now support PHP streams throughout, even if you
try to access a stream not directly from PHP. In PHP 5 you can access
a PHP stream, for example, from an include directive inside an XML
document or XSLT stylesheet (such as xi:include or xsl:include); in
short, everywhere you can access a normal file.
Streams in general were introduced in PHP 4.3 and were further
improved in PHP 5 as a way of generalizing file-access, network-
access, and other operations that share a common set of functions. You
can even implement your own streams with PHP code, and thus unify and
simplify access to your data. See the PHP documentation for more
details about that.
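As a rough illustration of a user-defined stream feeding the DOM extension, here is a minimal sketch. The class name ArticleStream, the articles:// scheme and the sample document are all made up for this example; only stream_wrapper_register() and the wrapper method names come from PHP itself. It assumes the libxml-based loaders stat a URL before opening it, which is why url_stat() is implemented as well.

```php
<?php
// Hypothetical wrapper: serves XML strings kept in a static array.
final class ArticleStream
{
    /** @var array<string, string> XML documents keyed by name */
    public static $documents = [];

    /** @var resource|null Filled in by PHP when a stream context is used */
    public $context;

    private $data = '';
    private $position = 0;

    public function stream_open($path, $mode, $options, &$openedPath)
    {
        // articles://latest  ->  "latest"
        $name = (string) parse_url($path, PHP_URL_HOST);
        if (!isset(self::$documents[$name])) {
            return false;
        }
        $this->data = self::$documents[$name];
        $this->position = 0;
        return true;
    }

    public function stream_read($count)
    {
        $chunk = (string) substr($this->data, $this->position, $count);
        $this->position += strlen($chunk);
        return $chunk;
    }

    public function stream_eof()
    {
        return $this->position >= strlen($this->data);
    }

    public function stream_stat()
    {
        return ['size' => strlen($this->data)];
    }

    public function stream_close()
    {
    }

    // Assumption: the XML loaders check that the URL exists before opening it.
    public function url_stat($path, $flags)
    {
        $name = (string) parse_url($path, PHP_URL_HOST);
        return isset(self::$documents[$name])
            ? ['size' => strlen(self::$documents[$name])]
            : false;
    }
}

stream_wrapper_register('articles', 'ArticleStream');
ArticleStream::$documents['latest'] =
    '<articles><item><title>XML in PHP 5</title></item></articles>';

// The DOM extension reads from the custom stream like from any file.
$dom = new DOMDocument();
$loaded = $dom->load('articles://latest');
$rootName = $loaded ? $dom->documentElement->nodeName : null;
```

The same URL could then be handed to any other function that accepts a file name, which is the point of unifying data access behind a stream.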
SAX
SAX stands for Simple API for XML. It's a callback-based interface for
parsing XML documents. SAX support has been available since PHP 3 and
hasn't changed a lot since then. For PHP 5 the API is unchanged, so
your old code should still work. The only difference is that it's not
based on the expat library anymore, but on the libxml2 library.
This change introduced some problems with namespace support, which are
currently resolved in libxml2 2.6, but not in older versions of
libxml2. Therefore, if you use xml_parser_create_ns(), you are
strongly advised to install libxml2 2.6 or above on your system.
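To make the callback style concrete, here is a small sketch that collects title texts with the SAX functions. The inline document and its titles are invented sample data, and closures are used as handlers, which requires a much newer PHP than the one this article was written for.

```php
<?php
// Sample data for illustration only.
$xml = '<articles>'
     . '<item><title>First sample title</title></item>'
     . '<item><title>Second sample title</title></item>'
     . '</articles>';

$titles = [];
$inTitle = false;
$buffer = '';

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle = false;
        }
    }
);

// Text may arrive in several chunks, so it is accumulated in a buffer.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}
```

Because the parser never builds a tree, this approach keeps memory use low, at the price of tracking state by hand.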
DOM
DOM (Document Object Model) is a standard for accessing XML document
trees, defined by the W3C. In PHP 4, the domxml extension was used for
doing just that. The main problem with the domxml extension was that
it didn't follow the standard method names. It also had memory leak
issues for a long time (they were fixed in PHP 4.3).
The new DOM extension is completely based on the W3C standard,
including method and property names. If you're familiar with DOM from
other languages, for example in JavaScript, it will be much easier for
you to code similar functionality in PHP. You don't have to check the
documentation all the time, because the methods and parameters are
identical.
As a consequence of this new W3C compatibility, your old domxml-based
scripts won't work anymore. The API is quite different in PHP 5. But
if you used the "almost W3C compatible" method names available in PHP
4.3, porting isn't such a big deal. You only need to change the
loading and saving methods, and remove the underscore in the method
names (the DOM standard uses studlyCaps). Other adjustments here and
there may be necessary, but the main logic can stay the same.
Reading the DOM
I will not explain all the features of the DOM extension in this
article; that would be overkill. You may want to bookmark the
documentation available at http://www.w3.org/DOM, which basically
corresponds to the implementation in PHP 5.
For most of the examples in this article we will use the same XML file; a
much-simplified version of the RSS available at zend.com. Paste the
following into a text file and save it as articles.xml:
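<articles>
 <item>
  <title>...</title>
  <link>http://www.zend.com/zend/week/week172.php</link>
 </item>
 <item>
  <title>...</title>
  <link>http://www.zend.com/zend/tut/tut-hatwar3.php</link>
 </item>
</articles>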
To load this example into a DOM object, you have to create a
DomDocument object, and then load the XML file:
$dom = new DomDocument();
$dom->load("articles.xml");
As mentioned above, you could use a PHP stream to load an XML
document. You would do this by writing:
$dom->load("file:///articles.xml");
(or any other type of stream, as appropriate).
If you want to output the XML document to the browser or as standard
output, use:
print $dom->saveXML();
If you want to save it to a file, use:
print $dom->save("newfile.xml");
(Note that this action will send the filesize to stdout.)
There's not much functionality in this example, of course, so let's do
something more useful: let's grab all the titles. There are different
ways to do this, the easiest one being to use
getElementsByTagName($tagname):
$titles = $dom->getElementsByTagName("title");
foreach($titles as $node) {
    print $node->textContent . " ";
}
The textContent property used here is a
convenience property to access all the text nodes of an element
quickly. The W3C way to read this would have been:
$node->firstChild->data;
(but only if you were sure that firstChild was the text node you
needed, otherwise you would have to loop through all the child nodes
to find that).
One other thing to notice is that getElementsByTagName() returns a
DomNodeList, and not an array as the similar function
get_elements_by_tagname() did in PHP 4. But as you can see in the
example, you can easily loop through it with a foreach directive. You
could also directly access the nodes with $titles->item(0). This would
return the first title element.
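The points above can be tried out without a file on disk. The following sketch loads an inline sample document (the titles are invented) and reads the title elements through textContent, through firstChild->data, and through item().

```php
<?php
$dom = new DOMDocument();
// Sample data for illustration only.
$dom->loadXML(
    '<articles>'
    . '<item><title>First sample title</title></item>'
    . '<item><title>Second sample title</title></item>'
    . '</articles>'
);

// A DOMNodeList, not an array, but it can be iterated with foreach.
$titles = $dom->getElementsByTagName('title');

$viaTextContent = [];
$viaFirstChild = [];
foreach ($titles as $node) {
    $viaTextContent[] = $node->textContent;
    // Only safe here because each title contains exactly one text node.
    $viaFirstChild[] = $node->firstChild->data;
}

$count = $titles->length;               // number of matching elements
$first = $titles->item(0)->textContent; // direct access by index
```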
Another approach to getting all the titles would be to loop through
the nodes starting with the root element. As you can see, this is way
more complicated, but it's also more flexible should you need more
than just the title elements.
foreach ($dom->documentElement->childNodes as $articles) {
    //if node is an element (nodeType == 1) and the name is "item", loop further
    if ($articles->nodeType == 1 && $articles->nodeName == "item") {
        foreach ($articles->childNodes as $item) {
            //if node is an element and the name is "title", print it.
            if ($item->nodeType == 1 && $item->nodeName == "title") {
                print $item->textContent . " ";
            }
        }
    }
}
XPath is something like SQL for XML. With XPath you can query an XML
document for a specific node matching some criteria. To get all the
title nodes with XPath, just do the following:
$xp = new DomXPath($dom);
$titles = $xp->query("/articles/item/title");
foreach ($titles as $node) {
    print $node->textContent . " ";
}
This is almost the same code as with getElementsByTagName(), but XPath
is much more powerful. For example, if we had a title element as a
child of the articles element (instead of being the child of an item
element), getElementsByTagName() would return it as well. With
/articles/item/title we only pick up the title elements that are
placed at the desired level. This is just a simple example; further
possibilities
might be:
* /articles/item[position() = 1]/title returning the title element
of the first item element.
* /articles/item/title[@id = '23'] returning all title elements
having an attribute id with the value 23
* /articles//title returning all title elements that are placed
below articles
You can also query for elements which have a specific sibling element,
or which have a certain text content, or using namespaces, etc. If you
have to query XML documents a lot, learning to use XPath properly will
save you a lot of time. It's much easier to use, faster in execution,
and requires less code than the standard DOM methods.
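Here is a short sketch of the expressions listed above, run against an inline sample document. The document content, the id values and the helper name titleTexts() are made up for the example; DOMXPath and query() are the real API.

```php
<?php
$dom = new DOMDocument();
// Sample data: one title directly below the root, two inside items.
$dom->loadXML(
    '<articles>'
    . '<title>Feed title</title>'
    . '<item><title id="23">First sample title</title></item>'
    . '<item><title id="42">Second sample title</title></item>'
    . '</articles>'
);

$xp = new DOMXPath($dom);

// Hypothetical helper: run an expression and collect the text of each hit.
function titleTexts(DOMXPath $xp, $expression)
{
    $texts = [];
    foreach ($xp->query($expression) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

$itemTitles = titleTexts($xp, '/articles/item/title');
$firstOnly  = titleTexts($xp, '/articles/item[position() = 1]/title');
$byId       = titleTexts($xp, "/articles/item/title[@id = '23']");
$allBelow   = titleTexts($xp, '/articles//title');

// For comparison: getElementsByTagName() ignores the nesting level.
$byTagName = $dom->getElementsByTagName('title')->length;
```

Note how the first expression skips the title that sits directly below the root, while getElementsByTagName() and /articles//title both include it.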
Writing to the DOM
The Document Object Model can not only be read and queried; you can
also manipulate and write to it. (The DOM standard is a little
verbose, because its writers tried to support just about every
imaginable situation, but it does the job very well). See the next
example, where a new element is added to our articles.xml:
$item = $dom->createElement("item");
$title = $dom->createElement("title");
$titletext = $dom->createTextNode("XML in PHP5");
$title->appendChild($titletext);
$item->appendChild($title);
$dom->documentElement->appendChild($item);
print $dom->saveXML();
First, we create all the needed nodes: an item element, a title
element and a text node containing the title of the item. Then we
chain all the nodes together by appending the text node to the title
element and appending the title element to the item element. Finally
we insert the item element into the root element articles, and voilà!
- we have a new article listed in our XML document.
Extending Classes
While the above examples were all doable with PHP 4 and the domxml
extension (only the API was a little bit different), the ability to
extend DOM classes with your own code is a new feature of PHP 5. This
makes it possible to write more readable code. Here's the whole
example again, re-written to extend the DomDocument class:
class Articles extends DomDocument {
    function __construct() {
        //has to be called!
        parent::__construct();
    }
    function addArticle($title) {
        $item = $this->createElement("item");
        $titlespace = $this->createElement("title");
        $titletext = $this->createTextNode($title);
        $titlespace->appendChild($titletext);
        $item->appendChild($titlespace);
        $this->documentElement->appendChild($item);
    }
}
$dom = new Articles();
$dom->load("articles.xml");
$dom->addArticle("XML in PHP5");
print $dom->save("newfile.xml");
HTML
An often-overlooked feature in PHP 5 is the HTML support in libxml2.
You can not only load well-formed XML documents with the DOM
extension, but you can also load not-well-formed HTML documents, treat
them as regular DomDocument objects, and use all the available methods
and features such as XPath and SimpleXML.
This HTML capability is very useful if you need to access content from
a website you don't control. With the help of XPath, XSLT or SimpleXML
you avoid a lot of coding, as compared with using regular expressions
or a SAX parser. This is especially useful if the HTML document is not
well structured (a frequent problem!).
The code below fetches the php.net index page, parses it and prints
the text of the first title element:
$dom = new DomDocument();
$dom->loadHTMLFile("http://www.php.net/");
$title = $dom->getElementsByTagName("title");
print $title->item(0)->textContent;
Note that you may get errors as part of your output when expected
elements are not found.
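If you would rather not depend on a remote page, the same idea can be sketched with loadHTML() and an inline string. The markup below is invented and deliberately sloppy (unclosed paragraphs, no closing html tag). The libxml_use_internal_errors() call, which I believe was added after PHP 5.0, keeps parser complaints out of the output and lets you inspect them instead.

```php
<?php
// Deliberately sloppy sample markup, for illustration only.
$html = '<html><head><title>Sample page</title></head>'
      . '<body><p>First paragraph<p>Second paragraph</body>';

// Collect parser messages instead of printing them as warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$problems = libxml_get_errors(); // may be empty, depending on the input
libxml_clear_errors();
libxml_use_internal_errors($previous);

$title = $dom->getElementsByTagName('title')->item(0)->textContent;

// Once parsed, the document can be queried with XPath like any other.
$xp = new DOMXPath($dom);
$paragraphs = [];
foreach ($xp->query('//p') as $p) {
    $paragraphs[] = trim($p->textContent);
}
```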
If you're one of those people still outputting HTML 4 code on their
web pages, there is good news for you, too. The DOM extension can not
only load HTML documents, but can also save them as HTML 4. Just use
$dom->saveHTML() after you have built up your DOM document. Note that,
for simply making HTML code W3C standards compliant, you're far better
off using the Tidy extension. The HTML support in libxml2 is not tuned
for every possible case, and doesn't cope well with input in uncommon
formats.
Validation
Validation of XML documents is getting more and more important. For
example, if you get an XML document from some foreign source, you need
to verify that it follows a certain format before you can process it.
Luckily it's not necessary to write your own validating code in PHP,
because you can use one of the three widely used standards for doing
this: DTD, XML Schema or RelaxNG.
* DTD is a standard that comes from SGML days, and lacks some of
the newer XML features (like namespaces). Also, because it's not
written in XML, it's not easily parsed and/or transformed.
* XML Schema is a standard defined by the W3C. It's very extensive
and has taken care of almost every imaginable need for validating XML
documents.
* RelaxNG was an answer to the complex XML Schema standard, and
was created by an independent group. More and more programs support
RelaxNG, since it's much easier to implement than XML Schema.
If you don't have legacy schema documents, or overly complex XML
documents, go for RelaxNG. It's easier to write, easier to read, and
more and more tools support it. There's even a tool called Trang,
which automatically creates a RelaxNG document from sample XML
document(s). Furthermore, only RelaxNG (and the aging DTDs) are fully
supported by libxml2, although full XML Schema support is coming
along.
The syntax for validating XML documents is quite simple:
* $dom->validate(); (takes no argument; it validates against the DTD
declared in the document itself)
* $dom->relaxNGValidate('articles.rng');
* $dom->schemaValidate('articles.xsd');
At present, these all simply return true or false. Errors are dumped
out as PHP warnings. Obviously this is not the ideal way to give good
feedback to the user, and it will be enhanced in one of the releases
after PHP 5.0.0. The exact implementation is currently under
discussion, but will certainly lead to better error reporting for
parse errors and so on.
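As a sketch of how validation results can be collected rather than dumped as warnings, the example below validates two inline documents against an internal DTD and a small RelaxNG grammar. The schemas, the documents and the function name validateArticles() are made up for illustration. It relies on libxml_use_internal_errors() and relaxNGValidateSource(), which to my knowledge are available in current PHP versions but were not part of 5.0.0.

```php
<?php
// Sample DTD, embedded as an internal subset.
$dtd = '<!DOCTYPE articles ['
     . '<!ELEMENT articles (item*)>'
     . '<!ELEMENT item (title)>'
     . '<!ELEMENT title (#PCDATA)>'
     . ']>';

// Sample RelaxNG grammar describing the same structure.
$rng = '<element name="articles" xmlns="http://relaxng.org/ns/structure/1.0">'
     . '<zeroOrMore><element name="item">'
     . '<element name="title"><text/></element>'
     . '</element></zeroOrMore>'
     . '</element>';

// Hypothetical helper: returns both verdicts plus the collected messages.
function validateArticles($xml, $rng)
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $result = [
        'dtd' => $dom->validate(),                  // uses the declared DTD
        'rng' => $dom->relaxNGValidateSource($rng), // schema given as string
    ];
    $result['messages'] = array_map(
        function ($error) {
            return trim($error->message);
        },
        libxml_get_errors()
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $result;
}

$good = validateArticles(
    $dtd . '<articles><item><title>Sample</title></item></articles>',
    $rng
);
// The item element lacks its required title child.
$bad = validateArticles($dtd . '<articles><item/></articles>', $rng);
```

The messages array gives you something to show the user instead of a bare true or false.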
SimpleXML
SimpleXML is the latest addition to the XML family in PHP. The goal of
the SimpleXML extension is to provide easy access to XML documents
using standard object properties and iterators. This extension doesn't
have many methods, but it's quite powerful nonetheless. Getting all
the title nodes from our document requires even less code than before:
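$sxe = simplexml_load_file("articles.xml");
foreach($sxe->item as $item) {
    print $item->title ." ";
}
The first line loads the file and returns a SimpleXML
object. Then it gets all elements named item with the property
$sxe->item. You can also access attributes with array syntax,
using: $item->title['id'].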
As you can see, there's a lot of magic behind this, and there are
different ways to get the desired result. For example, $item->title[0]
returns the same result as the example. On the other hand,
foreach($sxe->item->title as $item) only returns the first title, and
not all the titles stored in the document (as I - coming from XPath -
would have expected).
SimpleXML is actually one of the first extensions to use most of the
new features available with Zend Engine 2. It's therefore also the
testing ground for these new features. You should be aware that bugs
and unexpected behavior are not uncommon during this stage of
development.
Besides the traditional "loop through all the nodes" approach, as
shown in the example above, there's also an XPath interface in
SimpleXML, which provides even easier access to individual nodes:
foreach($sxe->xpath('/articles/item/title') as $item) {
    print $item . " ";
}
Given more complex or deeply nested XML documents, you'll find that
using XPath together with SimpleXML saves you a lot of typing.
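The behaviors described in this section can be compared side by side in one small sketch. The inline document, its titles and its id attributes are invented sample data.

```php
<?php
// Sample data for illustration only.
$sxe = simplexml_load_string(
    '<articles>'
    . '<item><title id="23">First sample title</title></item>'
    . '<item><title id="42">Second sample title</title></item>'
    . '</articles>'
);

// Property access plus iteration: one entry per item element.
$titles = [];
$ids = [];
foreach ($sxe->item as $item) {
    $titles[] = (string) $item->title;       // element content
    $ids[]    = (string) $item->title['id']; // attribute via array syntax
}

// $sxe->item without an index means the first item only.
$onlyFirstItem = [];
foreach ($sxe->item->title as $title) {
    $onlyFirstItem[] = (string) $title;
}

// The XPath interface returns an array of SimpleXML objects.
$viaXpath = [];
foreach ($sxe->xpath('/articles/item/title') as $title) {
    $viaXpath[] = (string) $title;
}
```

The explicit (string) casts matter once you store or compare values, since the properties themselves are objects rather than plain strings.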
Writing to SimpleXML documents
You can not only parse and read, but also change SimpleXML documents.
At least, to some extent:
$sxe->item->title = "XML in PHP5 "; //new text content for the title element
$sxe->item->title['id'] = 34; // new attribute for the title element
$xmlString = $sxe->asXML(); // returns the SimpleXML object as a serialized XML string
print $xmlString;
Interoperability
As SimpleXML is also based on libxml2, you can easily convert
SimpleXML objects to DomDocument objects and vice versa without a big
impact on speed (the document doesn't have to be copied internally).
With this mechanism you can have the best of both worlds, using the
tool best suited for the job in hand. It works with the following
methods:
* $sxe = simplexml_import_dom($dom);
* $dom = dom_import_simplexml($sxe);
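A quick way to see that no copy is made is to change the document through one API and read it back through the other. The sketch below does that in both directions, using an invented sample document and a made-up count attribute.

```php
<?php
$dom = new DOMDocument();
// Sample data for illustration only.
$dom->loadXML('<articles><item><title>Old title</title></item></articles>');

// Wrap the DOM tree in a SimpleXML object and change it there...
$sxe = simplexml_import_dom($dom);
$sxe->item->title = 'New title';

// ...the change is visible through the original DOM object.
$afterSimpleXml = $dom->getElementsByTagName('title')->item(0)->textContent;

// The other direction: get a DOM element for the SimpleXML root.
$element = dom_import_simplexml($sxe);
$element->setAttribute('count', '1');

// The attribute set through DOM shows up in SimpleXML.
$countSeenBySimpleXml = (string) $sxe['count'];
```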
XSLT
XSLT is a language for transforming XML documents into other XML
documents. XSLT is itself written in XML, and belongs to the family of
functional languages, which have a different approach to that of
procedural and object-orientated languages like PHP.
There were two different XSLT processors implemented in PHP 4:
Sablotron (in the more widely used and known xslt extension), and
libxslt (within the domxml extension). The two APIs were not
compatible with each other, and their feature sets were also
different.
In PHP 5, only the libxslt processor is supported. Libxslt was chosen
because it's also based on libxml2 and therefore fits perfectly into
the XML concept of PHP 5.
It would theoretically be possible to port the Sablotron binding to
PHP 5 as well, but unfortunately no one has done this yet. Therefore, if
you're using Sablotron you will have to switch to the libxslt
processor for PHP 5. libxslt is - with the exception of the JavaScript
support - feature-equivalent to Sablotron. Even the useful Sablotron-
specific scheme handlers can be reimplemented with the much more
powerful and portable PHP streams. In addition, libxslt is one of the
fastest XSLT implementations available, so you'll get a nice speed
boost for free (the execution speed can be double that of Sablotron).
As with all the other extensions discussed in this article, you can
exchange XML documents from the XSL extension to the DOM extension and
vice versa. In fact you have to, as ext/xsl doesn't have an interface
to load and save XML documents, but uses the one from the DOM
extension.
You don't need many methods for starting an XSLT transformation, and
there is no W3C standard for it, therefore the API was "borrowed" from
Mozilla.
First, you need an XSLT stylesheet. Paste the following into a new
file and save it as articles.xsl:
Transform">
Then call it with a PHP script:
<?php
/* load the xml file and stylesheet as domdocuments */
$xsl = new DomDocument();
$xsl->load("articles.xsl");
$inputdom = new DomDocument();
$inputdom->load("articles.xml");
/* create the processor and import the stylesheet */
$proc = new XsltProcessor();
$proc->importStylesheet($xsl);
$proc->setParameter("", "titles", "Titles");
/* transform and output the xml document */
$newdom = $proc->transformToDoc($inputdom);
print $newdom->saveXML();
?>
The above example first loads the XSLT stylesheet articles.xsl with
the help of the DOM method load(). Then it creates a new XsltProcessor
object, which imports the loaded XSLT stylesheet for later execution.
Parameters can be set with setParameter(namespaceURI, name, value),
and finally it starts the transformation with
transformToDoc($inputdom), which returns a new DomDocument.
This API has the advantage that you can make dozens of XSLT
transformations with the same stylesheet, just loading it once and
reusing it, as transformToDoc() can be applied to different XML
documents.
Besides transformToDoc(), there are two other transformation methods;
transformToXML($dom), which returns a string, and transformToURI($dom,
$uri), which saves the transformation to a file or a PHP stream. Note
that if you want to use an XSLT feature such as <xsl:output
method="html"/> or indent="yes", you can't use transformToDoc(),
because the DomDocument cannot retain this information. These
directives will be used only if you output the transformation directly
to a string or a file.
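To see the output directives take effect, here is a self-contained sketch that uses transformToXml() with an inline stylesheet. The stylesheet, the sample document and the function name renderTitles() are all invented for this example and are not the articles.xsl discussed above. Since XSLT support has to be configured explicitly, the call is skipped when the xsl extension is missing.

```php
<?php
// Hypothetical stylesheet: prints a heading parameter and a list of titles.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>
  <xsl:param name="titles"/>
  <xsl:template match="/articles">
    <h2><xsl:value-of select="$titles"/></h2>
    <ul>
      <xsl:for-each select="item/title">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Sample data for illustration only.
$xml = '<articles>'
     . '<item><title>First sample title</title></item>'
     . '<item><title>Second sample title</title></item>'
     . '</articles>';

function renderTitles($xml, $stylesheet, $heading)
{
    $xsl = new DOMDocument();
    $xsl->loadXML($stylesheet);

    $input = new DOMDocument();
    $input->loadXML($xml);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    $proc->setParameter('', 'titles', $heading);

    // Serializing to a string honors xsl:output, unlike transformToDoc().
    return (string) $proc->transformToXml($input);
}

// Only run the transformation when ext/xsl is available.
$html = extension_loaded('xsl')
    ? renderTitles($xml, $stylesheet, 'Titles')
    : null;
```

With method="html" the result is serialized as HTML, so it carries no XML declaration.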
Calling PHP Functions
One of the latest features added to the XSLT extension is the ability
to call any PHP function from within an XSLT stylesheet. While XML/
XSLT purists will certainly dislike this (such stylesheets won't be
portable anymore, and could easily mix logic and design), it can be
very useful in some special cases. XSLT is very limited when it comes
down to functions. Even outputting a date in different languages can
be painful to implement - but with this feature it's no more
complicated than with PHP itself. Here's the PHP snippet for adding a
function into XSLT:
<?php
function dateLang () {
    return strftime("%A");
}
$xsl = new DomDocument();
$xsl->load("datetime.xsl");
$inputdom = new DomDocument();
$inputdom->load("today.xml");
$proc = new XsltProcessor();
/* allow the stylesheet to call PHP functions */
$proc->registerPhpFunctions();
/* import the stylesheet */
$proc->importStylesheet($xsl);
/* transform and output the xml document */
$newdom = $proc->transformToDoc($inputdom);
print $newdom->saveXML();
?>
Here's the XSLT stylesheet, datetime.xsl, that will call that
function:
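<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:php="http://php.net/xsl">
<xsl:template match="/">
<xsl:value-of select="php:function('dateLang')"/>
</xsl:template>
</xsl:stylesheet>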
And here's an absolute minimal XML file, today.xml, to pass through
the stylesheet (although articles.xml would achieve the same result):
The stylesheet above, together with the PHP script and any xml file
loaded, will output the current weekday name in the language defined
in the locale settings. You could add more arguments to
php:function(), which would also be passed to the PHP function.
Additionally, there's php:functionString(). This function
automatically converts all input parameters to strings, so that you
don't need to convert them when they reach PHP.
Note that you will need to call $proc->registerPhpFunctions(); before
the transformation, otherwise the PHP function-calls will not work for
security reasons (can you always trust your XSLT stylesheets?). A more
refined access system (i.e. one that limits access to specific
methods) is not available yet, but would not be impossible to
implement in a future PHP 5 release.
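In current PHP versions, registerPHPFunctions() accepts a list of function names, which gives you the kind of restricted access mentioned above. The sketch below passes an argument to php:function() and allows only one callback. The function shout(), the helper transformWithCallbacks(), the stylesheet and the sample document are all made up for this example, and the call is skipped when the xsl extension is missing.

```php
<?php
// Hypothetical callback that the stylesheet is allowed to call.
function shout($text)
{
    return strtoupper($text) . '!';
}

// Hypothetical stylesheet: passes the first title to the PHP function.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:php="http://php.net/xsl"
    exclude-result-prefixes="php">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of
        select="php:function('shout', string(/articles/item/title))"/>
  </xsl:template>
</xsl:stylesheet>
XSL;

function transformWithCallbacks($xml, $stylesheet, array $allowed)
{
    $xsl = new DOMDocument();
    $xsl->loadXML($stylesheet);

    $input = new DOMDocument();
    $input->loadXML($xml);

    $proc = new XSLTProcessor();
    // Only the listed functions may be called from the stylesheet.
    $proc->registerPHPFunctions($allowed);
    $proc->importStylesheet($xsl);

    return (string) $proc->transformToXml($input);
}

// Only run the transformation when ext/xsl is available.
$result = extension_loaded('xsl')
    ? transformWithCallbacks(
        '<articles><item><title>xml in php 5</title></item></articles>',
        $stylesheet,
        ['shout']
    )
    : null;
```

Wrapping the argument in string() hands the callback a plain string; without it, a node-set would arrive as an array of DOM nodes.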
Summary
XML support in PHP has taken a great step forward. It is standards-
compliant, well behaved, feature-rich, interoperable - and enabled-by-
default functionality can now be taken for granted.
PHP 4's much-disliked domxml extension has been completely rewritten.
The new DOM extension follows the W3C standard almost to the letter, and
has also resolved a lot of internal memory problems. With the added
support of some general PHP features, such as class inheritance and
stream support, even more powerful and tightly integrated XML
applications will be possible.
The newly added SimpleXML extension is an easy and fast way to access
XML documents. It can save you a lot of coding, especially if you have
structured documents or are able to use the power of XPath.
Thanks to libxml2, the underlying library used for all PHP 5 XML
extensions, validation of XML documents using DTD, RelaxNG or (to some
extent) XML Schema is now supported.
XSLT support also got a facelift and now uses the libxslt library,
which should improve performance over the old Sablotron library.
Furthermore, the ability to call PHP functions from within XSLT
stylesheets allows you to write more powerful (though unfortunately
less portable) XSLT code.
If you used XML in PHP 4 or in another language, you will love the XML
support in PHP 5. XML in PHP 5 is much improved, is standards-
compliant, and is finally on a par with other tools and languages.
Links
PHP 4 specific
* Domxml extension: http://www.php.net/domxml/
* Sablotron extension: http://www.php.net/xslt/
* Libxslt: http://www.php.net/manual/en/function.domxml-xslt-stylesheet.php
PHP 5 specific
* SimpleXML: http://www.php.net/simplexml/
* Streams: http://www.php.net/manual/en/ref.stream.php
Standards
* DOM: http://www.w3.org/DOM
* XSLT: http://www.w3.org/TR/xslt
* XPath: http://www.w3.org/TR/xpath
* XML Schema: http://www.w3.org/XML/Schema
* RelaxNG: http://relaxng.org/
* Xinclude: http://www.w3.org/TR/xinclude/