Custom conversions of XML to JSON in XSLT

There are already several resources online devoted to converting XML to JSON with XSLT. Unfortunately, most of these resources describe only how to generate a quick JSON view closely resembling the original XML. This might be all you need, and if so you are in luck. I’ve seen several very useful XSLT templates on the web to do just this. But what if you would like to tweak your JSON output a little? Supposedly, JSON is preferred by developers as a simpler and more straightforward standard than their old foe XML. Why then should we slavishly copy the mistakes of the past into the future?

Let’s take some sample XML I pulled from the World Heritage Centre describing a cultural heritage site in Brazil (I also edited it slightly):

<site>
    <date_inscribed>1983</date_inscribed>
    <http_url>http://whc.unesco.org/en/list/275</http_url>
    <id_number>275</id_number>
    <image_url>http://whc.unesco.org/uploads/sites/site_275.jpg</image_url>
    <iso_code>ar,br</iso_code>
    <latitude>-28.5433333300</latitude>
    <location>State of Rio Grande do Sul, Brazil; Province of Misiones, Argentina</location>
    <longitude>-54.2658333300</longitude>
    <region>Latin America and the Caribbean</region>
    <short_description>&lt;p&gt;The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil,
        and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and
        Santa Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They
        are the impressive remains of five Jesuit missions, built in the land of the Guaranis during
        the 17th and 18th centuries. Each is characterized by a specific layout and a different
        state of conservation.&lt;/p&gt;</short_description>
    <states>Argentina,Brazil</states>
</site>

Now let’s pass it through one of the many XML to JSON XSLT templates I came across online. This one is from “Convert XML to JSON using XSLT”, and the output looks like:

{"site": {
    "date_inscribed": "1983",
    "http_url": "http://whc.unesco.org/en/list/275",
    "id_number": "275",
    "image_url": "http://whc.unesco.org/uploads/sites/site_275.jpg",
    "iso_code": "ar,br",
    "latitude": "-28.5433333300",
    "location": "State of Rio Grande do Sul, Brazil; Province of Misiones, Argentina",
    "longitude": "-54.2658333300",
    "region": "Latin America and the Caribbean",
    "short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.</p>",
    "states": "Argentina,Brazil"
}}

This XML is fairly simple and does not contain any attributes, but I think the example does convey what is possible with XSLT. Now we have a fairly accurate rendition of our original XML object.

But looking at this output there are a few things I’d like to change:

–‘location’, ‘states’ and ‘iso_code’ contain multiple values. Those should really be arrays.

–I would like to use the GeoJSON standard for encoding latitude and longitude

–The coordinates should be numbers, not strings

With the changes described above, the JSON above would end up looking something like:

{"site": {
    "date_inscribed": "1983",
    "http_url": "http://whc.unesco.org/en/list/275",
    "id_number": "275",
    "image_url": "http://whc.unesco.org/uploads/sites/site_275.jpg",
    "iso_codes": [
        "ar",
        "br"
    ],
    "geometry": {
        "type": "Point",
        "coordinates": [
            -54.26583333,
            -28.54333333
        ]
    },
    "locations": [
        "State of Rio Grande do Sul, Brazil",
        "Province of Misiones, Argentina"
    ],
    "region": "Latin America and the Caribbean",
    "short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.</p>",
    "states": [
        "Argentina",
        "Brazil"
    ]
}}

If I were feeling really confident, I might want to associate the country names with their codes and perhaps group all of the location data in its own object. But for now let’s be happy with the changes we have made. We are now using a common standard for our latitude and longitude values, which will hopefully give developers a leg up on consuming our data. Note that GeoJSON lists a position as longitude first, then latitude. We have also parsed those multi-value fields into arrays, which will allow for better searching across these fields. Note that in ‘iso_codes’ and ‘states’ the delimiter is a comma and in ‘locations’ it is a semi-colon. We’ve just ironed out that discrepancy and made parsing this data a lot easier on our JSON users.

Of course, now we need to write our own XSLT, and that’s mostly what I wanted to discuss here. It may seem that <> is not too far from {}, but there are syntactical differences that make a transform challenging – especially when we wish to dig into the data and make some changes.

When writing XSLT one can employ a push method, where the source tree is pushed through a set of templates; a pull method, where specific nodes are retrieved and employed in the desired fashion; or a hybrid of the two. To generate the JSON above we’ll definitely need to use the hybrid approach. Here’s an XSLT that will translate our source data to the JSON above:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="utf-8" media-type="application/json"/>
    <xsl:template match="site">
        <xsl:text>{"site":{</xsl:text>

        <xsl:apply-templates/>
        <!-- GeoJSON positions are ordered longitude, latitude -->
        <xsl:text>"geometry": {"type":"Point","coordinates":[</xsl:text>
        <xsl:value-of select="longitude"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="latitude"/>
        <xsl:text>]}</xsl:text>
        <xsl:text>}}</xsl:text>
    </xsl:template>
    
    <!-- String values from /site -->
    <xsl:template match="date_inscribed|http_url|id_number|image_url|region|short_description">
        <xsl:text>"</xsl:text>
        <xsl:value-of select="local-name()"/>
        <xsl:text>":"</xsl:text>
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>",</xsl:text>
    </xsl:template>
    
    <!-- comma separated array values from /site -->
    <xsl:template match="iso_code|states">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,','))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="local-name()='iso_code'">
                <xsl:text>iso_codes</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="local-name()"/>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:text>":[</xsl:text>
        <xsl:for-each select="$tokens">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>],</xsl:text>
    </xsl:template>
    
    <!-- semi-colon separated array values from /site -->
    <xsl:template match="location">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,';'))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="local-name()='location'">
                <xsl:text>locations</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="local-name()"/>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:text>":[</xsl:text>
        <xsl:for-each select="$tokens">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>],</xsl:text>
    </xsl:template>
    
    <!-- Catch-all for every other node or attribute: emit nothing for it, -->
    <xsl:template match="node()|@*">
        <!-- but keep walking its attributes and child nodes so the templates above still fire -->
        <xsl:apply-templates select="@*|node()"/>
    </xsl:template>
</xsl:stylesheet>

The source data uses inconsistent pluralization, so I’ve adjusted some of the local-names. It also uses different separator tokens, as mentioned above, so I’ve had to create a few near-duplicate templates. Everything in the main template is representative of the ‘pull’ methodology. There is really no true source in the original document for the GeoJSON we want to output, so we construct it within the ‘site’ template.
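The duplication between the comma and semi-colon templates could be reduced with a shared named template. What follows is only a sketch of that idea, not part of the stylesheet above: the template name `array-member` and its parameters `key` and `separator` are made up for illustration, and the separator is passed straight to tokenize(), so it is treated as a regular expression.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="utf-8"/>

    <!-- Each multi-value element only declares its output key and its delimiter -->
    <xsl:template match="iso_code">
        <xsl:call-template name="array-member">
            <xsl:with-param name="key" select="'iso_codes'"/>
            <xsl:with-param name="separator" select="','"/>
        </xsl:call-template>
    </xsl:template>

    <xsl:template match="states">
        <xsl:call-template name="array-member">
            <xsl:with-param name="key" select="'states'"/>
            <xsl:with-param name="separator" select="','"/>
        </xsl:call-template>
    </xsl:template>

    <xsl:template match="location">
        <xsl:call-template name="array-member">
            <xsl:with-param name="key" select="'locations'"/>
            <xsl:with-param name="separator" select="';'"/>
        </xsl:call-template>
    </xsl:template>

    <!-- Hypothetical helper: splits the current element's text into a JSON array member -->
    <xsl:template name="array-member">
        <xsl:param name="key"/>
        <xsl:param name="separator"/>
        <xsl:text>"</xsl:text>
        <xsl:value-of select="$key"/>
        <xsl:text>":[</xsl:text>
        <xsl:for-each select="distinct-values(tokenize(., $separator))">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <!-- Same trailing comma as the original templates, so they stay interchangeable -->
        <xsl:text>],</xsl:text>
    </xsl:template>

    <!-- Suppress all other text so this sketch prints only the three array members -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

Run on its own against the sample XML, this prints just the three array members. Dropped into the full stylesheet, the three matching templates and the helper would take the place of the two array templates, and the final text() template would not be needed.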

This transform will leave an empty slot where a value should be if one of the expected elements in the source XML is missing. Go ahead and comment out latitude and see what happens; the coordinates array ends up with a missing entry next to the comma, which is not valid JSON. This is only an issue for objects created in the aforementioned ‘pull’ style in the main template.

We could go ahead and add an if statement here to test for the presence of an element before converting, but then we run the risk of introducing a trailing comma which would invalidate the JSON output. And in fact, the more we add to the ‘pull’ section of the transform the higher the risk of creating invalid, or at least not useful, JSON.
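One way around the trailing-comma problem inside a pull-style template is to collect each JSON member as a separate string and let string-join() place the commas. The sketch below is a minimal illustration under that assumption, not a replacement for the full stylesheet: it handles only ‘region’ and the geometry, the variable name `members` is arbitrary, and values are not escaped.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xsl:output method="text" encoding="utf-8"/>

    <xsl:template match="site">
        <!-- Each xsl:if contributes one string, or nothing when the source element is absent -->
        <xsl:variable name="members" as="xs:string*">
            <xsl:if test="region">
                <xsl:sequence select="concat('&quot;region&quot;:&quot;', normalize-space(region), '&quot;')"/>
            </xsl:if>
            <!-- Only build the geometry when both coordinates exist; longitude goes first in GeoJSON -->
            <xsl:if test="latitude and longitude">
                <xsl:sequence select="concat('&quot;geometry&quot;:{&quot;type&quot;:&quot;Point&quot;,&quot;coordinates&quot;:[',
                    normalize-space(longitude), ',', normalize-space(latitude), ']}')"/>
            </xsl:if>
        </xsl:variable>
        <xsl:text>{"site":{</xsl:text>
        <!-- Commas appear only between members that were actually produced -->
        <xsl:value-of select="string-join($members, ',')"/>
        <xsl:text>}}</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

If latitude is commented out of the source, the geometry member is simply skipped and the remaining output stays well-formed. The cost is that every member has to be assembled with concat(), which gets unwieldy as the object grows.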

One strategy I would recommend is to split your transform into two steps. In the first step you generate an XML view of your desired JSON, and in the second you transform this intermediate ‘JSON-ML’ into actual JSON. Here’s an example of what I mean, using a JSON-ML format I’ve borrowed from MarkLogic:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:template match="site">
        <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:element name="site" namespace="http://marklogic.com/xdmp/json/basic">
                <xsl:attribute name="type">
                    <xsl:text>object</xsl:text>
                </xsl:attribute>
                <xsl:apply-templates/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
    <!-- String values from /site -->
    <xsl:template match="date_inscribed|http_url|id_number|image_url|region|short_description">
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            <xsl:value-of select="normalize-space(.)"/>
        </xsl:element>
    </xsl:template>
    <!-- comma separated array values from /site -->
    <xsl:template match="iso_code|states">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,','))"/>
        <xsl:variable name="object-name">
            <xsl:choose>
                <xsl:when test="local-name()='iso_code'">
                    <xsl:text>iso_codes</xsl:text>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="local-name()"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:variable>
        <xsl:element name="{$object-name}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>array</xsl:text>
            </xsl:attribute>
            <xsl:for-each select="$tokens">
                <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
                    <xsl:attribute name="type">
                        <xsl:text>string</xsl:text>
                    </xsl:attribute>
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
    <!-- semi-colon separated array values from /site -->
    <xsl:template match="location">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,';'))"/>
        <!-- Pluralize the name to match the other array fields -->
        <xsl:element name="locations" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>array</xsl:text>
            </xsl:attribute>
            <xsl:for-each select="$tokens">
                <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
                    <xsl:attribute name="type">
                        <xsl:text>string</xsl:text>
                    </xsl:attribute>
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
    
    <!-- Catch-all for every other node or attribute: emit nothing for it, -->
    <xsl:template match="node()|@*">
        <!-- but keep walking its attributes and child nodes so the templates above still fire -->
        <xsl:apply-templates select="@*|node()"/>
    </xsl:template>
</xsl:stylesheet>

Which will generate the following JSON-ML:

<json xmlns="http://marklogic.com/xdmp/json/basic">
   <site type="object">
      <date_inscribed type="string">1983</date_inscribed>
      <http_url type="string">http://whc.unesco.org/en/list/275</http_url>
      <id_number type="string">275</id_number>
      <image_url type="string">http://whc.unesco.org/uploads/sites/site_275.jpg</image_url>
      <iso_codes type="array">
         <json type="string">ar</json>
         <json type="string">br</json>
      </iso_codes>
      <locations type="array">
         <json type="string">State of Rio Grande do Sul, Brazil</json>
         <json type="string">Province of Misiones, Argentina</json>
      </locations>
      <region type="string">Latin America and the Caribbean</region>
      <short_description type="string">&lt;p&gt;The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.&lt;/p&gt;</short_description>
      <states type="array">
         <json type="string">Argentina</json>
         <json type="string">Brazil</json>
      </states>
   </site>
</json>
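The first-step stylesheet above does not emit the GeoJSON geometry. A sketch of one way it might be added, staying in push style, is a template that fires on latitude only when a sibling longitude exists. This fragment is an assumption about how to extend the stylesheet, not something taken from it, and it is meant to be pasted inside that stylesheet's xsl:stylesheet element. It relies on the type="number" convention that the generic second-step stylesheet further down already handles.

```xml
<!-- Assumed addition: build the geometry object from latitude and its sibling longitude.
     If either element is missing this template never matches, so nothing is emitted. -->
<xsl:template match="latitude[../longitude]">
    <geometry xmlns="http://marklogic.com/xdmp/json/basic" type="object">
        <type type="string">Point</type>
        <coordinates type="array">
            <!-- GeoJSON positions are ordered longitude, latitude -->
            <json type="number"><xsl:value-of select="normalize-space(../longitude)"/></json>
            <json type="number"><xsl:value-of select="normalize-space(.)"/></json>
        </coordinates>
    </geometry>
</xsl:template>
```

The longitude element itself still falls through to the catch-all template and produces nothing, so the coordinates are not emitted twice. The values are copied as they appear in the source; checking that they really are numeric is left out of this sketch.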

And finally, here is a generic stylesheet for transforming any JSON-ML document to valid JSON:

<xsl:stylesheet version="2.0" xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:appl="http://ap.org/schemas/03/2005/appl"
    xmlns:json="http://marklogic.com/xdmp/json/basic" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsl:output omit-xml-declaration="yes" method="text" encoding="UTF-8"
        media-type="application/json"/>
    
    <xsl:template match="json:json">
        <xsl:text>{</xsl:text>
        <xsl:for-each select="child::*[not(string-length(.)=0)]">
            <xsl:choose>
                <xsl:when test="normalize-space()=''"/>
                <xsl:otherwise>
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
        <xsl:text>}</xsl:text>
    </xsl:template>
    
    <xsl:template name="recurse">
        <xsl:choose>
            <xsl:when test="@type='string'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>"</xsl:text>
                <xsl:value-of select="."/>
                <xsl:text>"</xsl:text>
            </xsl:when>
            <xsl:when test="@type='number' or @type='boolean'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:value-of select="."/>
            </xsl:when>
            <xsl:when test="@type='object'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>{</xsl:text>
                <xsl:for-each select="child::*[not(string-length(.)=0)]">
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:for-each>
                <xsl:text>}</xsl:text>
            </xsl:when>
            <xsl:when test="@type='array'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>[</xsl:text>
                <xsl:for-each select="child::*[not(string-length(.)=0)]">
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:for-each>
                <xsl:text>]</xsl:text>
            </xsl:when>
            <xsl:otherwise/>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

This approach avoids outputting an unnecessary trailing comma when an element is missing, and also nicely filters out empty values, but it obviously adds some processing time to your transform. It may also be unnecessary for your purposes, if the push style transform is sufficient for your desired data model.
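If an XSLT 3.0 processor is available (an assumption; the stylesheets above are written for 2.0), the second step can be handed to the standard xml-to-json() function instead of a hand-written stylesheet. It expects its own intermediate vocabulary in the http://www.w3.org/2005/xpath-functions namespace (map, array, string, number, boolean, null, with a key attribute on the members of a map) rather than the MarkLogic one used here. A minimal sketch covering only two fields:

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="site">
        <!-- Step one: describe the desired JSON in the standard XML vocabulary -->
        <xsl:variable name="json-xml">
            <map xmlns="http://www.w3.org/2005/xpath-functions">
                <map key="site">
                    <string key="region">
                        <xsl:value-of select="normalize-space(region)"/>
                    </string>
                    <array key="states">
                        <xsl:for-each select="tokenize(states, ',')">
                            <string><xsl:value-of select="normalize-space(.)"/></string>
                        </xsl:for-each>
                    </array>
                </map>
            </map>
        </xsl:variable>
        <!-- Step two: the processor serializes it, handling commas and string escaping -->
        <xsl:value-of select="xml-to-json($json-xml)"/>
    </xsl:template>
</xsl:stylesheet>
```

The trade-off is the dependency on a 3.0 processor. The hand-written second step above works with any 2.0 processor.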

I’ll end with a few other templates which may be useful to the would-be JSON XSLT writer.

In my example above, the XML that is encoded in short_description is already serialized, but if you need to do the work of serialization yourself, try the following:

Source XML:

    <short_description>
        <p>The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa
            Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are
            the impressive remains of five Jesuit missions, built in the land of the Guaranis during
            the 17th and 18th centuries. Each is characterized by a specific layout and a different
            state of conservation.</p>
    </short_description>

XSLT:

   <!-- serialize xml to string -->
    
    <xsl:template match="*" mode="serialize">
        <xsl:text>&lt;</xsl:text>
        <xsl:value-of select="name(.)"/>
        <xsl:text>&gt;</xsl:text>
        <xsl:apply-templates mode="serialize"/>
        <xsl:text>&lt;/</xsl:text>
        <xsl:value-of select="name(.)"/>
        <xsl:text>&gt;</xsl:text>
    </xsl:template>

    <xsl:template match="short_description">
        <xsl:variable name="desc">
            <xsl:apply-templates select="./*" mode="serialize"/>
        </xsl:variable>
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            <xsl:value-of select="normalize-space($desc)"/>
        </xsl:element>
    </xsl:template>
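The serialize template above writes element names and text but silently drops attributes. If the embedded markup carries attributes, a variant along the following lines could be used in place of the original `match="*"` template. This is a sketch only: attribute values are copied as-is, with no XML escaping of quotes, ampersands or angle brackets.

```xml
<!-- Variant of the serialize template that also writes out attributes -->
<xsl:template match="*" mode="serialize">
    <xsl:text>&lt;</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:for-each select="@*">
        <!-- Emit each attribute as name="value" (values are not escaped in this sketch) -->
        <xsl:text> </xsl:text>
        <xsl:value-of select="name()"/>
        <xsl:text>="</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>"</xsl:text>
    </xsl:for-each>
    <xsl:text>&gt;</xsl:text>
    <xsl:apply-templates mode="serialize"/>
    <xsl:text>&lt;/</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:text>&gt;</xsl:text>
</xsl:template>
```

The double quotes this variant writes around attribute values must themselves be escaped before the string is placed inside a JSON value.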

And finally a few templates for escaping characters that would have potentially ill effects on your JSON output:

Source XML:

    <short_description>
        <p>The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San
            Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa
            Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are
            the impressive remains of five Jesuit missions, built in the land of the "Guaranis" during
            the 17th and 18th centuries. Each is characterized by a specific layout and a different
            state of conservation.</p>
    </short_description>

XSLT:

<!-- Escape the backslash (\) before everything else. -->
    <xsl:template name="escape-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <xsl:when test="contains($s,'\')">
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'\'),'\\')"/>
                </xsl:call-template>
                <xsl:call-template name="escape-string">
                    <xsl:with-param name="s" select="substring-after($s,'\')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="$s"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <!-- Escape the double quote ("). -->
    <xsl:template name="escape-quot-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <xsl:when test="contains($s,'&quot;')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'&quot;'),'\&quot;')"/>
                </xsl:call-template>
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="substring-after($s,'&quot;')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="$s"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template name="encode-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <!-- tab -->
            <xsl:when test="contains($s,'&#x9;')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'&#x9;'),'\t',substring-after($s,'&#x9;'))"/>
                </xsl:call-template>
            </xsl:when>
            <!-- line feed -->
            <xsl:when test="contains($s,'&#xA;')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'&#xA;'),'\n',substring-after($s,'&#xA;'))"/>
                </xsl:call-template>
            </xsl:when>
            <!-- carriage return -->
            <xsl:when test="contains($s,'&#xD;')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'&#xD;'),'\r',substring-after($s,'&#xD;'))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$s"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template match="short_description">
        <xsl:variable name="desc">
            <xsl:apply-templates select="./*" mode="serialize"/>
        </xsl:variable>
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            
            <xsl:call-template name="escape-string">
                <xsl:with-param name="s" select="normalize-space($desc)"/>
            </xsl:call-template>
        </xsl:element>
    </xsl:template>
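Since the stylesheets in this post already rely on XSLT 2.0 functions such as tokenize(), the same escaping could also be written with replace() inside a stylesheet function. The sketch below is an alternative under that assumption, not the templates above: the function name `my:escape-json` and the `my` namespace URI are invented for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/json-escape">
    <xsl:output method="text" encoding="UTF-8"/>

    <!-- Hypothetical helper: escapes backslash first, then quote, tab, line feed, carriage return -->
    <xsl:function name="my:escape-json" as="xs:string">
        <xsl:param name="s" as="xs:string?"/>
        <!-- In the pattern, '\\' is a regex for one backslash; in the replacement, each '\\' yields one -->
        <xsl:variable name="backslash" select="replace($s, '\\', '\\\\')"/>
        <xsl:variable name="quote" select="replace($backslash, '&quot;', '\\&quot;')"/>
        <xsl:variable name="tab" select="replace($quote, '&#x9;', '\\t')"/>
        <xsl:variable name="lf" select="replace($tab, '&#xA;', '\\n')"/>
        <xsl:sequence select="replace($lf, '&#xD;', '\\r')"/>
    </xsl:function>

    <!-- Usage: emit one escaped JSON member -->
    <xsl:template match="short_description">
        <xsl:text>"short_description":"</xsl:text>
        <xsl:value-of select="my:escape-json(normalize-space(.))"/>
        <xsl:text>"</xsl:text>
    </xsl:template>

    <!-- Suppress all other text so the sketch prints only that member -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

As in the recursive templates, the backslash has to be handled before anything else, otherwise the backslashes introduced by the later replacements would be doubled.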

Which will ultimately produce:

"short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the \"Guaranis\" during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.</p>"

There are perhaps better technologies for generating JSON from XML, but if XSLT is your preferred tool then by all means you should use it.
