Recently I was faced with a directory full of XML files containing test data. The tests were all failing due to a couple of unwanted elements within each test file.
The question: how to write an XSL stylesheet that removes the offending elements?
After some head scratching, the following stylesheet seemed to do the trick:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform: copy every node and attribute to the output -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: output nothing for badElement or its descendants -->
  <xsl:template match="badElement"/>
</xsl:stylesheet>
This reproduces the content of the original XML file, minus any elements called badElement and everything inside them.
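To make that concrete, here is a small made-up input file. The element names testData, case, input and detail are purely illustrative and are not taken from the real test data; only badElement comes from the stylesheet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical test file: every name except badElement is invented -->
<testData>
  <case id="1">
    <input>42</input>
    <badElement>
      <detail>unwanted</detail>
    </badElement>
  </case>
</testData>
```

Running the stylesheet over a file like this should leave testData, case (with its id attribute) and input untouched, while badElement and its detail child disappear. The whitespace that surrounded badElement is ordinary text and is copied through, so the output may contain a blank line where the element used to be.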
How does it work?
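Well, the first part is called an identity transform. This template matches every element, attribute, text node, comment and processing instruction in the source document. For each one, xsl:copy makes a shallow copy of the node, and the nested apply-templates then processes its attributes and children in turn, so the whole tree ends up copied to the output.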
The second template matches only elements called badElement. Clearly, a badElement in the source document matches both the identity template and the badElement template. The XSLT specification defines what happens when a source node matches more than one template: each template has a priority, and the one with the highest priority wins. By default, a more specific match gets a higher priority. In XSLT 1.0 a plain element name such as badElement has a default priority of 0, while node() and @* have -0.5, which is why the badElement node is processed by the badElement template. That template does nothing: it produces no output and contains no apply-templates, so the element's child nodes are never processed either. This is why badElement and its child nodes are filtered out of the target document.
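Since the original problem involved a couple of unwanted elements rather than one, the same empty template can cover several names by joining them with the union operator. The sketch below assumes a second element called otherBadElement, which is a hypothetical name used only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform: copy everything that no other template claims -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- One empty template for both names; otherBadElement is a made-up example -->
  <xsl:template match="badElement|otherBadElement"/>
</xsl:stylesheet>
```

Each alternative in the pattern is treated as if it were its own template rule, so both names still take precedence over the identity template.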
If in doubt over which template will take priority, you can specify one explicitly with the priority attribute on the xsl:template declaration in the stylesheet.
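As a minimal sketch of that, the stylesheet below sets the priority by hand. The value 1 is an arbitrary choice for this example; any number higher than the competing template's priority would do.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform, left at its default priority -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Explicit priority: this rule beats any matching rule with a lower value -->
  <xsl:template match="badElement" priority="1"/>
</xsl:stylesheet>
```

Here the explicit value is redundant, because the default priorities already favour the badElement template, but it documents the intent and guards against surprises if more general or more complex patterns are added later.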