Searching RDF vocabulary data in MarkLogic 7

Recently, I’ve been experimenting with MarkLogic’s new(ish) semantic capabilities (here’s a quick overview of what MarkLogic is offering with their semantics toolkit). In particular, I’ve been trying to build a simple interface for searching across vocabulary data in RDF. This turned out to be an interesting exercise, since MarkLogic’s current semantic efforts are targeted at an intersection of “documents, data, and now RDF triples.”

It’s fairly easy to set up the triple store; the quick start documentation should get you up and running in short order. For the purposes of this brief document I’m using a small content set from DBPedia derived from the following SPARQL query:

DESCRIBE ?s 
WHERE {
?s rdf:type <http://dbpedia.org/class/yago/ProfessionalMagicians> .
}

I downloaded my set as RDF/XML (using DBPedia’s SPARQL endpoint) and loaded it into MarkLogic using mlcp:

mlcp.bat import -host localhost -port 8040 -username [name] -password [pass] -input_file_path C:\data\magicians.rdf -mode local -input_file_type RDF -output_collections magician -output_uri_prefix  /triplestore/

Now if you open up QConsole and ‘explore’ the data you’ll see that all of our triples have been packaged up into discrete documents:

/triplestore/1105189df46c20c7-0-11170.xml
/triplestore/1105189df46c20c7-0-11703.xml
/triplestore/1105189df46c20c7-0-12614.xml
/triplestore/1105189df46c20c7-0-13346.xml

Each one contains 100 triples, and each triple looks something like:

<sem:triple>
<sem:subject>http://dbpedia.org/resource/Criss_Angel</sem:subject>
<sem:predicate>http://dbpedia.org/property/birthDate</sem:predicate>
<sem:object datatype="http://www.w3.org/2001/XMLSchema#date">1967-12-18+02:00</sem:object>
</sem:triple>
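
To check how the triples were distributed across documents, a quick sketch like the following should list each document URI with its triple count. It assumes the documents sit in the magician collection used in the mlcp command above, and that each document has a sem:triples root element wrapping the sem:triple elements, as the snippet paths later in this post suggest.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

(: One line per document in the "magician" collection: URI and triple count :)
for $doc in fn:collection("magician")
return fn:concat(
  xdmp:node-uri($doc), ": ",
  fn:count($doc/sem:triples/sem:triple), " triples"
)
```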

Available query structures fall into three categories:

  • CTS queries (cts:*)
  • SPARQL queries (sem:*)
  • Hybrid CTS/SPARQL

The documentation for the available queries is here.

But before we dig into some sample queries, let’s try the Search API. It has some appeal as a solution, since it can provide easy pagination, result counts, and all the nice features of CTS (stemming/lemmatization) to boot.

Let’s search for terms which mention the word ‘paranormal’:

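import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
search:search('paranormal')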

This returns the familiar results, but you will quickly realize that they are not particularly helpful if what you are interested in is subject matches rather than document matches.

<search:response snippet-format="snippet" total="21" start="1" page-length="10" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="" xmlns:search="http://marklogic.com/appservices/search">
    <search:result index="1" uri="/triplestore/ad02c265f21a9295-0-55.xml" path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)" score="208896" confidence="0.7033283" fitness="0.8208094">
        <search:snippet>
            <search:match path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)/sem:triples/sem:triple[97]/sem:object">http://dbpedia.org/resource/Category:<search:highlight>Paranormal</search:highlight>_investigators</search:match>
            <search:match path="fn:doc(&quot;/triplestore/ad02c265f21a9295-0-55.xml&quot;)/sem:triples/sem:triple[100]/sem:object">...and scientific skeptic best known for his challenges to <search:highlight>paranormal</search:highlight> claims and pseudoscience. Randi is the founder of the...</search:match>
        </search:snippet>
    </search:result>
</search:response>

Let’s try the same search with a direct CTS query, but this time we’ll allow for wildcarding:

cts:search(collection(),
cts:and-query((cts:collection-query("magician"), 
cts:word-query("paranormal*","wildcarded")))
)

This is nice, but it again returns the entire document containing matches. And since our triples were arbitrarily added to these documents via MLCP, we have subjects in our results that we don’t care about.
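
One way to narrow document matches down to subjects is to post-filter the returned documents in XQuery. The following is only a sketch: it reuses the same wildcarded word query, keeps just the triples whose object matches, and returns their distinct subject IRIs. It assumes the sem:triples/sem:triple document layout produced by mlcp, and the filtering happens after the search, so it does nothing for performance.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

let $word := cts:word-query("paranormal*", "wildcarded")
let $docs := cts:search(
  fn:collection(),
  cts:and-query((cts:collection-query("magician"), $word))
)
(: Keep only triples whose object matches, then return their distinct subjects :)
return fn:distinct-values(
  $docs/sem:triples/sem:triple[cts:contains(sem:object, $word)]/sem:subject/fn:string(.)
)
```

This yields subject IRIs only; the remaining triples for each subject still have to be fetched separately.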

Let’s try the same query in pure SPARQL:

DESCRIBE ?s
WHERE{ 
?s ?p ?o.
FILTER regex(?o, "paranormal", "i")
}

This works pretty well and returns all the triples for those subjects we are interested in, but it’s rather slow. I’m guessing the pure SPARQL FILTER query here is not particularly optimized. As a comparison, we can actually insert some CTS into our SPARQL query if we wish, like so:

PREFIX cts: <http://marklogic.com/cts#>
DESCRIBE ?s 
WHERE{ 
?s ?p ?o .
 FILTER cts:contains(?o, cts:word-query("paranormal")) 
}

Compared to the previous query this is blazing fast, though not a sub-second query yet. We can speed things up a bit by using a hybrid CTS/SPARQL approach where we pass a cts:query as an option to sem:sparql. This reduces the set of documents in our search scope before executing the SPARQL, and so may offer a boost to performance. Of course, to continue to drill down to only the relevant subjects (not documents) we need to apply the CTS query twice: once as the sem:sparql option to narrow the documents, and once inside the FILTER to narrow the triples:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $query := cts:word-query('paranormal',"case-insensitive")
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

If we want to dynamically generate our SPARQL queries we can send a $bindings map to sem:sparql containing our variables. Here’s a more dynamic version of the above:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }"
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

There’s an interesting byproduct of this approach, however. Once you have filtered the set of documents using the CTS query option, you have also potentially limited the triples available to your SPARQL query. So, if your subject has triples spanning two documents (which happens due to the arbitrary method MLCP employs in uploading your content), and your CTS query only matches the first, any triples from the second document which you expect to return with your SPARQL will appear to be missing.
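
A possible workaround, sketched below, is to split the work into two steps: use the CTS-restricted SPARQL query only to find the matching subjects, then fetch all triples for those subjects without the word query in scope. This is an untested sketch with several assumptions: that each SELECT result row can be read with map:get, that cts:triples is available and the triple index is enabled, and that a collection query on magician is an acceptable stand-in for the graph restriction. Unlike DESCRIBE, cts:triples returns only the triples whose subject matches, so the output may not be identical.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

let $q := 'paranormal'
let $bindings := map:map()
let $put := map:put($bindings, "q", $q)

(: Step 1: find matching subjects, with the CTS query narrowing the scope :)
let $subjects :=
  for $row in sem:sparql(
    "PREFIX cts: <http://marklogic.com/cts#>
     SELECT DISTINCT ?s
     WHERE {
       ?s ?p ?o .
       FILTER cts:contains(?o, cts:word-query(?q))
     }",
    $bindings,
    "default-graph=magician",
    cts:word-query($q, "case-insensitive")
  )
  return map:get($row, "s")

(: Step 2: fetch every triple for those subjects, without the word query,
   so triples stored in non-matching documents are included :)
let $triples := cts:triples($subjects, (), (), "=", (), cts:collection-query("magician"))
return sem:rdf-serialize($triples, 'rdfxml')
```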

So, our approach is flawed, but let us press on anyway. What can I do with those triples in a search interface? There are a few options here. Certainly, we can flesh out our SPARQL query to use SELECT and whatever array of properties we need for our display (label, description, etc.) and then pass the results through sem:query-results-serialize to generate SPARQL XML:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
               SELECT DISTINCT ?s ?c
                WHERE{ 
                   ?s ?p ?o .
                   ?s rdfs:comment ?c .
                   FILTER ( lang(?c) = 'en' )
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),())  
return
(
sem:query-results-serialize($results)
)

Or, if your query returns triples (DESCRIBE or CONSTRUCT rather than SELECT) and you’d rather serialize them in a particular format such as RDF/XML:

sem:rdf-serialize($results,'rdfxml')

I mentioned pagination earlier. Certainly, this would be a lot easier with the Search API, but it can be done with pure SPARQL and a bit of imagination:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
declare namespace sparql = "http://www.w3.org/2005/sparql-results#";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")

let $search-page-size := 2
let $search-start := 1
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql :=  fn:concat(
                "PREFIX cts: <http://marklogic.com/cts#>
                SELECT DISTINCT ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }",
                " LIMIT ",
                $search-page-size,
                " OFFSET ",
                (: SPARQL OFFSET is zero-based, while $search-start is one-based :)
                $search-start - 1
               )
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:query-results-serialize($results)
)
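
A search interface usually needs a total result count alongside each page. One option for small result sets, sketched here, is to run the SELECT once without LIMIT/OFFSET and slice the results in XQuery with fn:subsequence. The ORDER BY is an addition to keep page contents stable between requests, and pulling back every match will not scale to large vocabularies, so treat this as an illustration rather than a recommendation.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

let $q := 'paranormal'
let $search-page-size := 2
let $search-start := 1  (: one-based position of the first result on the page :)
let $bindings := map:map()
let $put := map:put($bindings, "q", $q)
let $all := sem:sparql(
  "PREFIX cts: <http://marklogic.com/cts#>
   SELECT DISTINCT ?s
   WHERE {
     ?s ?p ?o .
     FILTER cts:contains(?o, cts:word-query(?q))
   }
   ORDER BY ?s",
  $bindings,
  "default-graph=magician",
  cts:word-query($q, "case-insensitive")
)
return (
  (: Total number of matching subjects, for "page X of Y" displays :)
  fn:concat("total: ", fn:count($all)),
  (: fn:subsequence is one-based, so $search-start is used as-is :)
  sem:query-results-serialize(fn:subsequence($all, $search-start, $search-page-size))
)
```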

One last issue I encountered with this approach is that relying on SPARQL means every user granted access to this search interface needs the sem:sparql execute privilege. SPARQL 1.1 allows for updates to be made to the database via queries. Though this feature is not currently included in MarkLogic’s implementation of SPARQL, it might be in version 8. Does this mean that SPARQL privileges are not something you’d want to hand out to read-only users? Perhaps.

After building my own search interface using some of the approaches described above, I feel that MarkLogic is at its best when it’s used as a document store. So, perhaps the best approach is to mirror the construction of a document repository, with each term in its own document that also carries embedded RDF triples. Something like this abbreviated and modified LC record for Harry Houdini:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://id.loc.gov/authorities/names/n79096862">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Houdini,
            Harry, 1874-1926</skos:prefLabel>
        <skos:exactMatch rdf:resource="http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:inScheme rdf:resource="http://id.loc.gov/authorities/names"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:altLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Weiss,
            Ehrich, 1874-1926</skos:altLabel>
    </rdf:Description>
    <sem:triples xmlns:sem="http://marklogic.com/semantics">
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
            <sem:object>http://www.w3.org/2004/02/skos/core#Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#prefLabel</sem:predicate>
            <sem:object xml:lang="en">Houdini, Harry, 1874-1926</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#exactMatch</sem:predicate>
            <sem:object>http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#inScheme</sem:predicate>
            <sem:object>http://id.loc.gov/authorities/names</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#altLabel</sem:predicate>
            <sem:object xml:lang="en">Weiss, Ehrich, 1874-1926</sem:object>
        </sem:triple>
    </sem:triples>
</rdf:RDF>

I’ll be trying this approach next.
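
With one document per term, a document match is also a subject match, so a plain CTS search may be enough. A minimal sketch of what that could look like, assuming the term documents are loaded into a hypothetical collection named terms and follow the layout of the Houdini record above:

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";
declare namespace skos = "http://www.w3.org/2004/02/skos/core#";

(: "terms" is a hypothetical collection holding one document per term :)
for $doc in cts:search(fn:collection("terms"), cts:word-query("houdini"))
return
  <result uri="{xdmp:node-uri($doc)}">
    <subject>{fn:string(($doc//sem:subject)[1])}</subject>
    <label>{fn:string(($doc//skos:prefLabel)[1])}</label>
  </result>
```

The same documents should also work with the Search API, which would bring back pagination and result counts without any SPARQL.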
