Generating new triples in MarkLogic with SPARQL CONSTRUCT (and INSERT)

SPARQL is known mostly as a query language, but via the CONSTRUCT operator it can also generate new triples. This can be useful for delivering a custom snippet of RDF to a user, but it can also be used to write new data back to the database, enriching what was already there. MarkLogic’s triple store supports the SPARQL standard, including CONSTRUCT queries, and the results can easily be incorporated back into the dataset using the XQuery Semantics API. Here’s a quick demo.

I have a set of geography terms which have already been linked to the Geonames dataset. Here’s an example:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> ;
 <http://www.w3.org/2004/02/skos/core#inScheme> <http://cv.ap.org/a#geography> ;
 <http://www.w3.org/2003/01/geo/wgs84_pos#long> "-70.76255"^^xs:decimal ;
 <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "43.07176"^^xs:decimal ;
 <http://www.w3.org/2004/02/skos/core#exactMatch> <http://sws.geonames.org/5091383/> ;
 <http://www.w3.org/2004/02/skos/core#broader> <http://cv.ap.org/id/9531546082C6100487B5DF092526B43E> ;
 <http://www.w3.org/2004/02/skos/core#prefLabel> "Portsmouth"@en .

If we look at the same term via the New York Times’ Linked Open Data service we’ll see a set of equivalent terms, including the Geonames resource for Portsmouth:


<http://data.nytimes.com/10237454346559533021> <http://www.w3.org/2002/07/owl#sameAs> <http://data.nytimes.com/portsmouth_nh_geo> ,
 <http://dbpedia.org/resource/Portsmouth%2C_New_Hampshire> ,
 <http://rdf.freebase.com/ns/en.portsmouth_new_hampshire> ,
 <http://sws.geonames.org/5091383/> .

Oh, hey, we have the same Geonames URI. Guess what we can do with that? More links!

After ingesting the NYTimes data into MarkLogic, I was able to write a SPARQL query to begin connecting the two datasets, using the Geonames URI as glue.


 PREFIX cts: <http://marklogic.com/cts#>
 PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX owl: <http://www.w3.org/2002/07/owl#>
 SELECT ?s ?n 
 WHERE
 {
 ?s skos:inScheme <http://cv.ap.org/a#geography> .
 ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
 ?s skos:exactMatch ?gn .
 ?n owl:sameAs ?gn .
 } 
 LIMIT 2

Returning:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://data.nytimes.com/10237454346559533021>
<http://cv.ap.org/id/662030807D5B100482BDC076B8E3055C> <http://data.nytimes.com/10616800927985096861>

Now, if we want to generate triples instead of SPARQL results, we simply swap out our SELECT for a CONSTRUCT operator, like so:


PREFIX cts: <http://marklogic.com/cts#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?s skos:exactMatch ?n .}
WHERE
 {
 ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
 ?s skos:inScheme <http://cv.ap.org/a#geography> .
 ?s skos:exactMatch ?gn .
 ?n owl:sameAs ?gn .
 } 
LIMIT 2

Returning:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10237454346559533021> .
<http://cv.ap.org/id/662030807D5B100482BDC076B8E3055C> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10616800927985096861> .

We have a few options for writing our newly generated triples back to the database, but let’s start with MarkLogic’s XQuery Semantics API, in particular the sem:rdf-insert function. Here’s a bit of XQuery that runs the SPARQL query above and inserts the resulting triples into the <geography> graph in the database:


xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
                PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                PREFIX owl: <http://www.w3.org/2002/07/owl#>
                CONSTRUCT { ?s skos:exactMatch ?n .}
                WHERE
                {
                ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
                ?s skos:inScheme <http://cv.ap.org/a#geography> .
                ?s skos:exactMatch ?gn .
                ?n owl:sameAs ?gn .
                }"

let $triples := sem:sparql($sparql, (), (), ())

return
(
sem:rdf-insert($triples, ("override-graph=geography"))
)

Now if we look at the triples for my original term, we should see an additional skos:exactMatch pointing to the NYTimes resource:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> ;
<http://www.w3.org/2004/02/skos/core#inScheme> <http://cv.ap.org/a#geography> ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> "-70.76255"^^xs:decimal ; 
<http://www.w3.org/2003/01/geo/wgs84_pos#lat> "43.07176"^^xs:decimal ; 
<http://www.w3.org/2004/02/skos/core#exactMatch> <http://sws.geonames.org/5091383/> ;
<http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10237454346559533021> ;
<http://www.w3.org/2004/02/skos/core#broader> <http://cv.ap.org/id/9531546082C6100487B5DF092526B43E> ;
<http://www.w3.org/2004/02/skos/core#prefLabel> "Portsmouth"@en .

Another option for writing the new triples back to the database is SPARQL itself. The most recent version, SPARQL 1.1, defines an update language which includes the useful operator INSERT. We can modify our earlier SPARQL query like so:


PREFIX cts: <http://marklogic.com/cts#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT
{ GRAPH <geography> { ?s skos:exactMatch ?n .} }
WHERE
{
  GRAPH <nytimes>
    {
    ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
    ?n owl:sameAs ?gn .
    } .
 GRAPH <geography>
    {
    ?s skos:inScheme <http://cv.ap.org/a#geography> .
    ?s skos:exactMatch ?gn . 
    }.
}

The multiple GRAPH statements allow me to query across two graphs but write to only one. And if we wanted to replace an existing skos:exactMatch triple rather than append to our existing statements, we would precede our INSERT clause with a DELETE. This DELETE/INSERT operation is described in detail in the SPARQL 1.1 Update specification.
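For instance, a DELETE/INSERT that swaps the Geonames skos:exactMatch for the NYTimes one might look something like this (a sketch only, reusing the graph names from above — whether you actually want to discard the Geonames link is a modeling decision):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
DELETE
{ GRAPH <geography> { ?s skos:exactMatch ?gn .} }
INSERT
{ GRAPH <geography> { ?s skos:exactMatch ?n .} }
WHERE
{
  GRAPH <nytimes>
    {
    ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
    ?n owl:sameAs ?gn .
    }
  GRAPH <geography>
    {
    ?s skos:inScheme <http://cv.ap.org/a#geography> .
    ?s skos:exactMatch ?gn .
    }
}

The WHERE clause binds exactly as before; the DELETE template removes the matched Geonames triple and the INSERT template writes the NYTimes one in its place.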

MarkLogic 8, not yet released, will include support for the SPARQL 1.1 Update language (among other new semantic capabilities). Since I am lucky enough to be part of the Early Access program for MarkLogic 8, I was able to run the query above and see that it generated the new triples correctly.

Neither CONSTRUCT nor INSERT is exactly a new technology, but it’s great to see how they might be used within the context of a MarkLogic application. For my own work cleaning and enriching vocabulary data these methods have proved quite valuable, and I look forward to digging into the rest of the SPARQL 1.1 features coming to MarkLogic 8 in the near future.

Searching RDF vocabulary data in MarkLogic 7

Recently, I’ve been experimenting with MarkLogic’s new(ish) semantic capabilities (here’s a quick overview of what MarkLogic is offering with their semantics toolkit). In particular, I’ve been trying to build a simple interface for searching across vocabulary data in RDF. This turned out to be an interesting exercise, since MarkLogic’s current semantic efforts are targeted at an intersection of “documents, data, and now RDF triples.”

It’s fairly easy to set up the triple store; the quick-start documentation should get you up and running in short order. For the purposes of this brief document I’m using a small content set from DBpedia, derived from the following SPARQL query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
DESCRIBE ?s 
WHERE {
?s rdf:type <http://dbpedia.org/class/yago/ProfessionalMagicians> .
}

I downloaded the set in RDF/XML (using DBpedia’s SPARQL endpoint) and loaded it into MarkLogic using mlcp:

mlcp.bat import -host localhost -port 8040 -username [name] -password [pass] -input_file_path C:\data\magicians.rdf -mode local -input_file_type RDF -output_collections magician -output_uri_prefix /triplestore/

Now if you open up QConsole and ‘explore’ the data you’ll see that all of our triples have been packaged up into discrete documents:

/triplestore/1105189df46c20c7-0-11170.xml
/triplestore/1105189df46c20c7-0-11703.xml
/triplestore/1105189df46c20c7-0-12614.xml
/triplestore/1105189df46c20c7-0-13346.xml

Each one contains 100 triples, and each triple looks something like:

<sem:triple>
<sem:subject>http://dbpedia.org/resource/Criss_Angel</sem:subject>
<sem:predicate>http://dbpedia.org/property/birthDate</sem:predicate>
<sem:object datatype="http://www.w3.org/2001/XMLSchema#date">1967-12-18+02:00</sem:object>
</sem:triple>

Available query structures fall into three categories:

  • CTS queries (cts:*)
  • SPARQL queries (sem:*)
  • Hybrid CTS/SPARQL

These query functions are documented in MarkLogic’s Semantics Developer’s Guide.

But before we dig into some sample queries, let’s try the Search API. It has some appeal as a solution, since it can provide easy pagination, result counts, and all the nice features of CTS (stemming/lemmatization) to boot.

Let’s search for terms which mention the word ‘paranormal’:

search:search('paranormal')

This returns the familiar results, but you will quickly realize that they are not particularly helpful if what you are interested in is subject matches rather than document matches.

<search:response snippet-format="snippet" total="21" start="1" page-length="10" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="" xmlns:search="http://marklogic.com/appservices/search">
    <search:result index="1" uri="/triplestore/ad02c265f21a9295-0-55.xml" path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)" score="208896" confidence="0.7033283" fitness="0.8208094">
        <search:snippet>
            <search:match path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)/sem:triples/sem:triple[97]/sem:object">http://dbpedia.org/resource/Category:<search:highlight>Paranormal</search:highlight>_investigators</search:match>
            <search:match path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)/sem:triples/sem:triple[100]/sem:object">...and scientific skeptic best known for his challenges to <search:highlight>paranormal</search:highlight> claims and pseudoscience. Randi is the founder of the...</search:match>
        </search:snippet>
    </search:result>
</search:response>

Let’s try the same search with a direct CTS query, but this time we’ll allow for wildcarding:

cts:search(collection(),
cts:and-query((cts:collection-query("magician"), 
cts:word-query("paranormal*","wildcarded")))
)

This is nice, but it again returns the entire document containing matches. And since our triples were arbitrarily added to these documents via MLCP, we have subjects in our results that we don’t care about.

Let’s try the same query in pure SPARQL:

DESCRIBE ?s
WHERE{ 
?s ?p ?o.
FILTER regex(?o, "paranormal", "i")
}

This works pretty well and returns all the triples for those subjects we are interested in, but it’s rather slow. I’m guessing the pure SPARQL FILTER query here is not particularly optimized. As a comparison, we can actually insert some CTS into our SPARQL query if we wish, like so:

PREFIX cts: <http://marklogic.com/cts#>
DESCRIBE ?s 
WHERE{ 
?s ?p ?o .
 FILTER cts:contains(?o, cts:word-query("paranormal")) 
}

Compared to the previous query this is blazing fast, though not yet sub-second. We can speed things up a bit more by using a hybrid CTS/SPARQL approach, where we pass a cts:query as an option to sem:sparql. This reduces the set of documents in scope before the SPARQL executes, and so may offer a performance boost. Of course, to continue to drill down to only the relevant subjects (not documents), the word query effectively has to be expressed twice:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $query := cts:word-query('paranormal',"case-insensitive")
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

If we want to dynamically generate our SPARQL queries we can send a $bindings map to sem:sparql containing our variables. Here’s a more dynamic version of the above:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }"
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

There’s an interesting byproduct of this approach, however. Once you have filtered the set of documents using the CTS query option, you have also potentially limited the triples available to your SPARQL query. So if your subject has triples spanning two documents (which happens, given the arbitrary way MLCP groups triples into documents) and your CTS query matches only the first, any triples from the second document that you expect your SPARQL to return will appear to be missing.

So, our approach is flawed, but let us press on anyway. What can I do with those triples in a search interface? There are a few options here. Certainly, we can flesh out our SPARQL query to use SELECT and whatever array of properties we need for our display (label, description, etc.) and then pass the results through sem:query-results-serialize to generate SPARQL XML:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
               SELECT DISTINCT ?s ?c
                WHERE{ 
                   ?s ?p ?o .
                   ?s rdfs:comment ?c .
                   FILTER ( lang(?c) = 'en' )
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),())  
return
(
sem:query-results-serialize($results)
)

Or, if you’d rather serialize the resulting triples in a particular format such as RDF/XML:

sem:rdf-serialize($results,'rdfxml')

I mentioned pagination earlier. Certainly, this would be a lot easier with the Search API, but it can be done with pure SPARQL and a bit of imagination:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
declare namespace sparql = "http://www.w3.org/2005/sparql-results#";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")

let $search-page-size := 2
let $search-start := 0 (: SPARQL's OFFSET is zero-based :)
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql :=  fn:concat(
                "PREFIX cts: <http://marklogic.com/cts#>
                SELECT DISTINCT ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }
                LIMIT ",
                $search-page-size,
                " OFFSET ",
                $search-start
               )
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:query-results-serialize($results)
)

One last issue I encountered with this approach: relying on SPARQL means that every user granted access to the search interface needs the sem:sparql execute privilege. SPARQL 1.1 allows updates to be made to the database via queries. Though this feature is not currently included in MarkLogic’s implementation of SPARQL, it might be in version 8. Does this mean that SPARQL privileges are not something you’d want to hand out to read-only users? Perhaps.
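For what it’s worth, granting that privilege to a role is a one-time bit of setup. This is a sketch only: the role name is made up, and the privilege action URI is an assumption you should verify against your version’s security documentation before running anything.

xquery version "1.0-ml";
(: Run against the Security database (e.g. from QConsole).
   Assumptions: a "vocab-search" role already exists, and the sem:sparql
   execute privilege uses the action URI below — verify both. :)
import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

sec:privilege-add-roles(
  "http://marklogic.com/xdmp/privileges/sem-sparql",
  "execute",
  "vocab-search")

Whether you want a broadly assigned role to carry this privilege once SPARQL Update arrives is exactly the question raised above.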

After building my own search interface using some of the approaches described above, I feel that MarkLogic is at its best when it’s used as a document store. So perhaps the best approach is to mirror the construction of a document repository, with each term in its own document and the RDF triples embedded alongside. Something like this abbreviated and modified LC record for Harry Houdini:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://id.loc.gov/authorities/names/n79096862">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Houdini,
            Harry, 1874-1926</skos:prefLabel>
        <skos:exactMatch rdf:resource="http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:inScheme rdf:resource="http://id.loc.gov/authorities/names"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:altLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Weiss,
            Ehrich, 1874-1926</skos:altLabel>
    </rdf:Description>
    <sem:triples xmlns:sem="http://marklogic.com/semantics">
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
            <sem:object>http://www.w3.org/2004/02/skos/core#Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#prefLabel</sem:predicate>
            <sem:object xml:lang="en">Houdini, Harry, 1874-1926</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#exactMatch</sem:predicate>
            <sem:object>http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#inScheme</sem:predicate>
            <sem:object>http://id.loc.gov/authorities/names</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#altLabel</sem:predicate>
            <sem:object xml:lang="en">Weiss, Ehrich, 1874-1926</sem:object>
        </sem:triple>
    </sem:triples>
</rdf:RDF>

I’ll be trying this approach next.