Deploying a pre-trained object detection model on Google Cloud ML Engine

The Tensorflow detection model zoo provides several extremely useful pre-trained object detection models. And though we have the option of using one of these models in a transfer learning scenario to train our own custom model, occasionally the pre-trained model will provide everything we need.

For example, we may wish to add human actions to our own image classifier. Instead of collecting and labeling images and training our own model, we might decide to employ one of the existing models trained on the AVA dataset and map a subset of these labels to our own taxonomy. For example:

  • Stand
  • Sit
  • Walk
  • Run
  • Dance
  • Fight

These pre-trained models have an expected input of a ‘tensor.’ We can confirm this by downloading and inspecting the model. In the zoo there is just one AVA model. Let’s download and extract the model:

wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz
tar -xzf faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz

If we have Tensorflow installed we can use the saved_model_cli to inspect it:

saved_model_cli show --dir faster_rcnn_resnet101_ava_v2.1_2018_04_30/saved_model/ --all

Which will return the following:


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['inputs'] tensor_info:
dtype: DT_UINT8
shape: (-1, -1, -1, 3)
name: image_tensor:0
The given SavedModel SignatureDef contains the following output(s):
outputs['detection_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 100, 4)
name: detection_boxes:0
outputs['detection_classes'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 100)
name: detection_classes:0
outputs['detection_scores'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 100)
name: detection_scores:0
outputs['num_detections'] tensor_info:
dtype: DT_FLOAT
shape: (-1)
name: num_detections:0
Method name is: tensorflow/serving/predict

From this we can tell that the expected input is a uint8 image tensor. While this may work with ML Engine, all the other examples I have seen use a different input type: encoded_image_string_tensor.

So, how can we update the saved model so it allows for this different input type? Essentially, we need to install the Tensorflow Object Detection library and then re-export this pre-trained model with the preferred input signature. Note that the outputs seen above will be fine for our purposes (as we will see later).

Follow the installation guide to install the Object Detection library.

Note 1:

I am running Anaconda on a Google Compute instance, and had an issue with bunzip2. I resolved this by first installing bzip2 before beginning the installation process:

sudo apt-get install bzip2

Note 2:

I also had some issues installing Protobuf even though I was following the installation guide. Here are the steps that worked for me:


mkdir protobuf
cd protobuf
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.7.1/protoc-3.7.1-linux-x86_64.zip
unzip protoc-3.7.1-linux-x86_64.zip

Now add the following line to .bashrc in your home directory:

export PATH=/home/dfox/protobuf/bin${PATH:+:${PATH}}

And then activate the modified PATH using source:

source .bashrc

Lastly, you should be able to run the final step in the installation guide. Change directories into TensorFlow/models/research/ and run the following command:

protoc object_detection/protos/*.proto --python_out=.

Exporting the model:

We have downloaded our pre-trained model, and we have installed the Object Detection library. At this point, we can run export_inference_graph.py to modify the input of our pre-trained model. Here is the sample command provided by Google:


python3 export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path path/to/sample/config \
--trained_checkpoint_prefix path/to/model/checkpoint \
--output_directory path/to/output/for/mlengine

The sample pipeline configs can be found in the Tensorflow repository under object_detection/samples/configs. Pick the config that will work with your model.

Here is the command I used for the AVA model:


python TensorFlow/models/research/object_detection/export_inference_graph.py --input_type encoded_image_string_tensor --pipeline_config_path TensorFlow/models/research/object_detection/samples/configs/faster_rcnn_resnet101_ava_v2.1.config --trained_checkpoint_prefix faster_rcnn_resnet101_ava_v2.1_2018_04_30/model.ckpt --output_directory ava_for_mlengine

Note that there is no file named model.ckpt in the AVA directory; this is the prefix of the files created for the model checkpoint.

Now we should have a directory ava_for_mlengine, with everything we need to deploy this model:


$ ls ava_for_mlengine/

checkpoint model.ckpt.data-00000-of-00001 model.ckpt.meta saved_model
frozen_inference_graph.pb model.ckpt.index pipeline.config

Now our model is ready to be deployed on ML Engine, which we can do using the gcloud command line tool.

We begin by copying our new saved model to a GCS bucket which can be accessed by ML Engine:

gsutil cp -r ava_for_mlengine/saved_model/ gs://${GCS_BUCKET_NAME}/faster_rcnn_resnet101_ava/

Next we can create our model:

gcloud ml-engine models create faster_rcnn_resnet101_ava --regions us-central1

And finally create our model version (v1):

gcloud ml-engine versions create v1 --model faster_rcnn_resnet101_ava --origin=gs://${GCS_BUCKET_NAME}/faster_rcnn_resnet101_ava/saved_model --framework tensorflow --runtime-version=1.13

Note that I have specified a runtime version equivalent to my local version of Tensorflow (which we used to export the model). Your version may be different.

This step may take some time. While we wait, let’s download an image for testing:

[Image: Kick.JPG]

wget https://upload.wikimedia.org/wikipedia/commons/9/9b/Kick.JPG

You can use the code provided by Google for performing an online prediction (found here), but we first need to convert our image into an encoded string.
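The helper it relies on, predict_json, looks roughly like the following (a sketch based on Google's sample, so treat the details as approximate rather than verbatim):

import googleapiclient.discovery

def predict_json(project, model, instances, version=None):
    """Send a JSON prediction request to a model deployed on ML Engine."""
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)

    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])

    return response['predictions']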

For a quick test, you can add the following to the script provided by Google:

import base64
import sys

if __name__ == '__main__':
    instances = []
    for image_path in sys.argv[1:]:
        with open(image_path, "rb") as image_file:
            encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
            instances.append({"b64": encoded_string})
    # replace these placeholders with your own project and model names
    responses = predict_json("MY_PROJECT_NAME", "MY_MODEL_NAME", instances)
    for response in responses:
        print(response.keys())

Which will return:

dict_keys(['detection_boxes', 'detection_classes', 'raw_detection_scores', 'detection_scores', 'num_detections', 'raw_detection_boxes'])

The boxes provide the locations of the detected objects, the classes provide a unique key that identifies the detected object, and the scores provide the confidence for each detected object.

And if I am interested in knowing which classes were applied with a particular confidence level:

for c, s in zip(response['detection_classes'], response['detection_scores']):
	if s > .80:
		print(c, s)

Which, for the ‘kick’ image seen above, will return:

11.0 0.8621717691421509
80.0 0.8211425542831421
80.0 0.8173160552978516

To figure out which classes these keys indicate, I need to download the AVA concept mappings. These can be found here. The relevant labels:


label {
name: "sit"
label_id: 11
label_type: PERSON_MOVEMENT
}

label {
name: "watch (a person)"
label_id: 80
label_type: PERSON_INTERACTION
}

So, we did not detect any kicking with any confidence. There are two kick-related labels (35 and 71), but neither appears even in an unfiltered list of the detected classes.

At this point, if my intention were to create a sports-specific action detector that performs well on photos of martial arts events, I might train my own object detection model. I might also decide to use the AVA model as a baseline for transfer learning.

Local testing and Google Cloud Functions

Back in 2017, I wrote about local testing and AWS Lambda. At some point, I will update that post with details on how to use AWS SAM to invoke automated local tests. In this article, I will pivot away from AWS to talk about Google’s equivalent service, Cloud Functions, and again will focus on local testing.

I am using the Python runtime, which makes use of Flask to handle incoming requests. If you are already familiar with Flask patterns, you will find a lot to like about Google Cloud Functions. And as before, I am using Python 3’s unittest to discover and run my tests.

The following is an attempt to document a problem I encountered, and the solution that I settled on. There are likely other/better ways. If you have a suggestion, please leave me a comment!

Here’s where I started.

My project structure:

  • gcf-testing-demo
      • count
        • __init__.py
        • main.py
        • counter.py
      • tests
        • test_count.py

main.py

import os
import json
from counter import Count

def document_count(request):

    headers = {
        'Content-Type': 'application/json'
    }

    try:
        request_json = request.get_json()
        document = request_json['document']
        c = Count()
        count = c.tok_count(document)
        response_body = {}
        response_body['document_count'] = count
        response = (json.dumps(response_body), 200, headers)

    except Exception as error:
        # can't parse JSON
        response_body = {}
        response_body['message'] = str(error)
        response = (json.dumps(response_body), 400, headers)

    return response

counter.py

class Count:

    def tok_count(self, mystring):
        tokens = mystring.split()
        return len(tokens)

test_count.py

from count.main import document_count
import unittest
import json
from unittest.mock import Mock

class MyTests(unittest.TestCase):
    def test_count(self):
        data = {"document":"This is a test document"}
        count = 5
        request = Mock(get_json=Mock(return_value=data), args=data)
        response = document_count(request)[0]
        self.assertTrue(json.loads(response)['document_count'] == 5)

if __name__ == '__main__':
    unittest.main()

This function takes an input:

{"document":"this is a test document"}

And returns a token count of the input ‘document’:

{"document_count":5}

A few details worth highlighting:

From main.py:

request_json = request.get_json()

This is an example of how Google is reusing familiar Flask patterns. The method get_json will pull any JSON out of the incoming Flask request object. We can then directly pick out properties, e.g. request_json['document'].

Also, from main.py:

response = (json.dumps(response_body), 200, headers)

Each Google Cloud Function is essentially an API method, and we must provide not just the response body but also the HTTP status code and any headers. We return all three as a tuple, and GCF will proxy this to the user appropriately.

The code for multiple Google Cloud Functions can be maintained in a single main.py. You will see how this looks when we deploy this function below. We can also import external modules, as we have done here with counter.py, but to enable this functionality we must include an __init__.py in our function directory.

Note that you could also keep your modules in a subdirectory, provided that subdirectory also has an __init__.py. In either case, the __init__.py can be empty.

In test_count.py you will find a single test which ensures that the document counter returns the correct value for some input. I am using unittest.mock library to construct a Mock object equivalent to the object expected by our Google Cloud Function. Note the get_json method, which stores the contents of my test document.

request = Mock(get_json=Mock(return_value=data), args=data)

Okay, things look good. Let’s run our test:

python3 -m unittest discover

Which results in the error:

ERROR: tests.test_count (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: tests.test_count
Traceback (most recent call last):
  File "/usr/lib/python3.5/unittest/loader.py", line 428, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib/python3.5/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "/home/dfox/gcf-testing-demo/tests/test_count.py", line 1, in 
    from count.main import document_count
  File "/home/dfox/gcf-testing-demo/count/main.py", line 3, in 
    from counter import Count
ImportError: No module named 'counter'
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (errors=1)

So, what’s happening? When I invoke my test script, it searches in local and system paths for counter.py, and finds…nothing! I have a few options at this point:

  • I could manually update my PYTHONPATH to ensure that my module directory is included.
  • I could move my test script into my module directory.
  • Or I can switch from using relative to absolute paths in my module code.

Let’s try this last option, as it seems like the solution that will be easiest on any future testers/developers. Update main.py as follows:

from count.counter import Count

When we run our unittest again, we should get back a successful report:

Ran 1 test in 0.001s
OK

Great! Let’s deploy our function. I’m using the gcloud command-line utility. More info on setting this up here. I’m also using the beta client. Change directories into your module dir and execute the following (note that the name ‘document_count’ refers to the function in main.py and not to main.py itself):

gcloud beta functions deploy document_count --runtime python37 --trigger-http

Which will return:

ERROR: (gcloud.beta.functions.deploy) OperationError: code=3, message=Function failed on loading user code. Error message: Code in file main.py can't be loaded.
Did you list all required modules in requirements.txt?
Detailed stack trace: Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 256, in check_or_load_user_function
    _function_handler.load_user_function()
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 166, in load_user_function
    spec.loader.exec_module(main)
  File "", line 728, in exec_module
  File "", line 219, in _call_with_frames_removed
  File "/user_code/main.py", line 3, in 
    from count.counter import Count
ModuleNotFoundError: No module named 'count'

Oh boy. Looks like my genius plan to use absolute paths will not fly with our Google Cloud Function deployment. Which makes sense, as the resulting function has no knowledge of our local project structure. It is dynamically building a package out of main.py, any imported modules, and any dependencies we may have listed in our requirements.txt.

At this point, I was a bit unsure of how to proceed. I decided to take a look at some of Google’s own sample Python applications. Here is a sample application for a Slack ‘slash command’. Let’s peek at the project structure:

  • slack/
    • README.md
    • config.json
    • main.py
    • main_test.py
    • requirements.txt

It is not a like-for-like example, since there are no modules being imported into main.py, but notice that the test script is simply in the same directory as main.py. This was an approach I had considered and discarded, simply because I was concerned about mingling my tests with function code. But if it is good enough for Google, who am I to argue?

So, let’s restructure things:

  • gcf-testing-demo
      • count
        • __init__.py
        • main.py
        • counter.py
        • test_count.py

And then switch back to relative imports in main.py:

from counter import Count

I also need to update the import in test_count.py:

from main import document_count

Now, I should be able to cd into count/ and execute my test:

python3 -m unittest discover
Ran 1 test in 0.000s
OK

Next, I will confirm that I can deploy this function as a GCF:

gcloud beta functions deploy document_count --runtime python37 --trigger-http
Deploying function (may take a while - up to 2 minutes)...done.
availableMemoryMb: 256
entryPoint: document_count
httpsTrigger:
  url: ###
labels:
  deployment-tool: cli-gcloud
name: ###
runtime: python37
serviceAccountEmail: ###
sourceUploadUrl:###
status: ACTIVE
timeout: 60s
updateTime: '2019-01-23T15:06:23Z'
versionId: '1'

It worked!

Let’s log into the console, browse the Cloud Functions resource and view our function. If I look at the ‘Source’ tab I’ll see main.py, counter.py (this is good!), and also test_count.py (this is less good). This is why I did not want to mingle my tests with my code:

[Screenshot: tests_in_source]

Fortunately, Google provides a method to filter out those files we don’t wish to incorporate into our Cloud Function package. We need to create a .gcloudignore file (equivalent to the .gitignore you may be familiar with) and add it to the module directory. I only need one line to filter out my tests, but I may as well also filter out __pycache__, *.pyc, and .gcloudignore itself:

.gcloudignore:

*.pyc
__pycache__/
test_*.py
.gcloudignore

After I redeployed this function, the source code looks much cleaner:

[Screenshot: filtered_source]

Now, I can finally make a live test of the deployed function:

[Screenshot: live_test]

Success!

Hierarchical multi-label classification of news content using machine learning

There is no shortage of beginner-friendly articles about text classification using machine learning, for which I am immensely grateful. In general, these posts attempt to classify some set of text into one or more categories: email or spam, positive or negative sentiment, a finite set of topical categories (e.g. sports, arts, politics). This last example can be described as a multi-class problem. Here’s a definition of multi-class taken from the scikit-learn documentation:

Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.

This is certainly fine for a simple classification task such as slotting a news article into a broad vertical like ‘Travel’ or ‘Weather,’ but if our taxonomy is even a bit wider or deeper we will find ourselves struggling to assign each piece of text to a single category. Take, for example, the following article:

Dancer badly injured in hit-and-run returns to the stage

PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage.

Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.

How should we classify this document? Is it about dance? Or about car accidents? Or perhaps about sports injuries? If we look for inspiration in the IPTC Media Topics taxonomy, we might end up with the following topics:

accident and emergency incident http://cv.iptc.org/newscodes/mediatopic/20000139
ballet http://cv.iptc.org/newscodes/mediatopic/20000008

This kind of scenario, where a single sample can be associated with multiple targets (accident and ballet), is called multi-label classification. Let’s crib one more time from the scikit-learn documentation:

Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.

And if we look a bit closer at these topics, we might notice that ‘ballet’ is a child of ‘dance’, which is itself a child of ‘arts and entertainment’. The full hierarchy of both terms can be expressed as the following:

  • arts, culture and entertainment
    • arts and entertainment
      • dance
        • ballet
  • disaster, accident and emergency incident
    • accident and emergency incident

We’ve quickly transitioned from a ‘simple’ multi-class classification problem to a multi-label classification problem that is further complicated by a set of hierarchically structured targets. Should we only apply the narrowest of topics in our taxonomy? Do we create a classifier for all topics, broad and narrow, and does the application of one mean anything for the other?

Sadly, I was not able to find many beginner-friendly articles written about hierarchical multi-label classification. I wish I could tell you that this will be that very article, but I can’t and it won’t. Maybe if I outline the problem, someone else will be inspired to write that article. And then we all benefit!

A simple example of multi-label classification

Let’s table the discussion of hierarchy for now and start with the simplest implementation of multi-label classification we can find.

The two main methods for approaching multi-label classification are problem transformations and algorithm adaptations. You will find a good overview of the two approaches here and here. Problem transformation techniques convert the multi-label task into a set of binary classification tasks, somewhat simplifying the problem. For each label in the training data we create a binary classifier, and the set of binary classifiers is then evaluated in concert. This is also referred to as a one-vs.-rest classifier. Let’s walk through a simple example.

Our training set:

example: PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.
label: ballet

example: PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.
label: accident and emergency incident


You’ll notice that in the training data I have repeated the example text on two rows, one per label. Not knowing how many labels an example might have, and therefore how many columns I’d need for a single row display, this seemed the best way to encode the information. You might start with something a bit different. Regardless of where you start, we need to make some modifications before training a multi-label model.

Essentially, we need to end up here:

example: PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.
labels: [ballet, accident and emergency incident]


Where our ‘labels’ value is an array of label strings. Here is how I transformed my data:


import pandas as pd

path_to_csv = 'training_data.csv'
dataset = pd.read_csv(path_to_csv, usecols=["label", "example"])

# modify dataset for multi-label
grouped = dataset.groupby('example')
df = grouped['label'].aggregate(lambda x: list(x)).reset_index(name="labels")

This will group my data by example and then pull all of the related labels into an array. This is great, but we are not quite done. Though we can intuitively understand the meaning of our lists of strings, they will be too cumbersome for our model to process. We need to convert these arrays into the expected multi-label format, a binary matrix indicating the presence (or absence) of a label. We do this using scikit-learn’s MultiLabelBinarizer:


from sklearn.preprocessing import MultiLabelBinarizer

X = df['example']
y = df['labels']
y = MultiLabelBinarizer().fit_transform(y)

Now that our data is in the correct format, we can train a model. The OneVsRestClassifier allows us to use the binary classifier of our choice. Let’s start with LinearSVC:


from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# split data into test and train
random_state = np.random.RandomState(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=random_state)

# our pipeline transforms our text into a vector and then applies OneVsRest using LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC()))
])

pipeline.fit(X_train, y_train)

That, I think, is the simplest approach to multi-label classification. Of course, my results were (seemingly) abysmal. More on evaluation metrics later.

Other methods

Another problem transformation technique is the classifier chains method. This approach is similar to one-vs.-rest, seen above, in that it consists of several binary classifiers. But in a classifier chain the output of each classifier is passed on to the next classifier in the chain (along with the original input, in our case the news text). This approach is intended to improve our classifier by taking label dependencies/co-occurrences into consideration.

Our working example, classified with ‘ballet’ and ‘accident and emergency incident,’ is perhaps not the best representation of label interdependence, since these two topics will not co-occur with great frequency (we hope!). However, if we browse our favorite news site, mentally classifying each article into a set of topics, we should come up with a few sets of commonly co-occurring topics. ‘Elections’ and ‘campaign finance.’ ‘Football’ and ‘sports injuries.’ ‘Coal mining’ and ‘environment.’ (For the purpose of these examples, I am inventing my own news topics, rather than looking to IPTC.)

There are a few frequently cited papers on the subject of classifier chains (such as Classifier Chains for Multi-label Classification) and a scikit-learn implementation, described here. In the example, the order of the chains is random. The documentation notes:

Because the models in each chain are arranged randomly there is significant variation in performance among the chains. Presumably there is an optimal ordering of the classes in a chain that will yield the best performance. However we do not know that ordering a priori. Instead we can construct a voting ensemble of classifier chains by averaging the binary predictions of the chains and apply a threshold of 0.5.

Since we have an implicit order in our hierarchical taxonomy, I wonder if this can be used to improve performance. Of course, there is no guarantee that co-occurrence will be limited to labels in the same taxonomy branch. At any rate, I have yet to implement this method.
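If I do, a minimal sketch with scikit-learn's ClassifierChain might look like the following (reusing the train/test split from earlier; the TF-IDF step is pulled out of the pipeline because ClassifierChain expects a feature matrix):

from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# ClassifierChain works on a feature matrix, so vectorize the text first.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# A single chain with a random label ordering; the ensemble approach described
# above would average the predictions of several such chains.
chain = ClassifierChain(LinearSVC(), order='random', random_state=0)
chain.fit(X_train_tfidf, y_train)
y_pred_chain = chain.predict(X_test_tfidf)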

Yet another problem transformation technique is the label powerset method. In this approach, each combination of labels in the training set is considered as a unique class. So, instead of two classes for our example, ‘ballet’ and ‘accident and emergency incident,’ we would have a single class ‘ballet, accident and emergency incident.’ If you are starting with something like IPTC’s media topics, a rather large taxonomy which may also be applied in an unexpected fashion (e.g. lots of cross-hierarchy cooccurrences), the resulting set of classes may be too large. Also, we cannot guarantee that our training set will have an example for every potential combination of labels. Mostly for this latter reason, I don’t think this method is appropriate for news classification.
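Still, to make the idea concrete, the transformation itself is simple. Here is a sketch using the df built earlier (joining the sorted labels into a string is just one way to encode the combined class):

# Label powerset as a data transformation: every unique combination of labels
# becomes a single class of its own.
df['powerset_class'] = df['labels'].apply(lambda labels: '|'.join(sorted(labels)))
# e.g. ['ballet', 'accident and emergency incident'] ->
#      'accident and emergency incident|ballet'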

Algorithm adaptations

This brings us to algorithm adaptations, methods that modify an existing algorithm so it can directly cope with a multi-label dataset.

There are several scikit-learn libraries that are described as having support for multi-label classification, a list which includes decision tree, k-neighbor, random forest, and ridge classifiers. Decision trees seem to have some promise for the problem, especially considering the issue of hierarchy. There has been some research on the subject, see Decision Trees for Hierarchical Multi-label Classification.

In addition, several algorithms are available from the scikit-multilearn library, which is built on top of scikit-learn and expressly designed for multi-label classification.

Next steps

Again, there are not enough examples of applying these methods to text classification, at least not enough at my level (novice). I think the scikit-multilearn library is likely the obvious next step, as it implements several algorithms from commonly cited articles in the literature. That said, it may also be worthwhile to run through all of the available scikit-learn multi-label-compliant algorithms, just to see if there are any easy wins to be had.

Notes on accuracy

After fitting the simple OneVsRestClassifier seen above, I was disappointed by the low accuracy score. Little did I know that the evaluation metrics I was used to using were not appropriate for a multi-label scenario. Here’s a note from the OneVsRestClassifier documentation regarding accuracy:

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

That is pretty harsh. What we need is a metric that will reflect partial accuracy. For instance, we apply ‘ballet’ correctly, but we also apply ‘weather’ incorrectly to the same example text. This is not a wholly inaccurate classification; it is only partially inaccurate. Luckily, we have a few options.

  • Hamming loss
    • the fraction of the wrong labels to the total number of labels
    • a loss function, so the optimal value is 0
    • scikit-learn implementation of hamming loss
  • Jaccard similarity coefficient score
    • the size of the intersection divided by the size of the union of the sample sets
    • ranges from 0 to 1, and 1 is the optimal score
    • scikit-learn implementation of jaccard similarity
  • Coverage error measure
    • average “depth” (how far we need to go through the ranked scores) to cover all true labels
    • the optimal value is the average number of true labels.
    • scikit-learn implementation of coverage error
  • Averaged (micro and macro) F1 scores
    • I’m having trouble understanding this one, so I’ll just point you to a seemingly useful StackOverflow post.
    • scikit-learn implementation of F1 score (see note for ‘average’ param)

An example of hamming loss and jaccard similarity using scikit-learn:


from sklearn.metrics import hamming_loss
from sklearn.metrics import jaccard_similarity_score

y_pred = pipeline.predict(X_test)

print(hamming_loss(y_test, y_pred))
print(jaccard_similarity_score(y_test, y_pred))

Returning:

  • 0.0107019250706
  • 0.391602082404

Above are my results using the simple OneVsRestClassifier described earlier. The hamming loss seems good? The jaccard similarity, less so.
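For completeness, the averaged F1 scores mentioned above can be computed in the same way (a small sketch; 'micro' and 'macro' are the relevant values of the average parameter):

from sklearn.metrics import f1_score

print(f1_score(y_test, y_pred, average='micro'))
print(f1_score(y_test, y_pred, average='macro'))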

On hierarchy

One approach to dealing with my hierarchical taxonomy would be to simply flatten it, ignoring the hierarchy entirely. And this is exactly what I’ve done so far. This seems to be a fine approach for the short term, as I have yet to explore all of the available multi-label algorithms described above. Perhaps, the ‘flat’ approach will be good enough. Perhaps not.

If not, there are some novel ideas out there that use the hierarchy to the advantage of the classifier. Unfortunately, most of these ideas are described in academic papers, with few including any bootstrap code.

One approach which seemed interesting is described in a PyData talk by Jurgen Van Gael: Hierarchical Text Classification using Python (and friends). There is a lot to chew on here, but essentially this approach uses a set of Naïve Bayes classifiers to route a document through the branches of our hierarchical tree, and then individual classifiers for each node in the branch. Using IPTC Media Topics as our example again, we might have a set of Naïve Bayes classifiers to route a document to one of the top-level terms (arts, culture and entertainment, education, environment, politics, society, sport, etc.) and then a different classifier for any subsequent nodes in the tree. I’m assuming there would be a set of Naïve Bayes classifiers for any hierarchical level where multiple paths can be followed. Van Gael also notes that if a training example is associated with a class that is 5 levels deep, that training example is copied to each of that class’s ancestors. It seems like a promising approach, but it requires a lot of orchestration, as well as several more classifiers than the simple flattened approach.
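I have not implemented this, but a toy sketch of the routing idea might look like the following. Note that train_texts, train_top_labels, and per_branch_training_data are hypothetical placeholders, not data structures from this project:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def make_clf():
    return Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# Top-level router: picks a branch such as 'sport' or 'arts, culture and entertainment'.
top_level = make_clf()
top_level.fit(train_texts, train_top_labels)

# One classifier per branch, each trained only on that branch's examples.
branch_clfs = {}
for branch, (texts, labels) in per_branch_training_data.items():
    clf = make_clf()
    clf.fit(texts, labels)
    branch_clfs[branch] = clf

def classify(text):
    branch = top_level.predict([text])[0]
    return branch, branch_clfs[branch].predict([text])[0]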

On imbalance

Another potential issue with a corpus tagged with a relatively deep taxonomy is that many of the deepest labels will have fewer examples. The more granular a concept, the less broadly it can be applied. If we look at a news corpus that has been tagged with the IPTC Media Topics taxonomy we will likely find plenty of examples for ‘health,’ but far fewer for ‘dietary supplements’ (which is 4 levels down from ‘health’).

Generally, our classification models are better served by having a balanced number of examples across the target classes. Given a large enough corpus we may be able to ensure that all classes are equally represented, but it is inevitable that some will lag behind.

A few options:

We could modify the individual binary classifiers we’ve wrapped with the one-vs.-rest classifier. For example, the LinearSVC classifier (shown above) has a ‘class_weight’ parameter which purports (if set to ‘balanced’) to automatically adjust weights inversely proportional to class frequencies in the input data. So, our instances of ‘dietary supplements’ which appear less frequently should be weighted appropriately.
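In the pipeline above, that change is a one-liner (a sketch; I have not measured its effect here):

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced')))
])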

We could use the imbalanced-learn library. In particular, the RandomUnderSampler method can be easily added to our pipeline to equalize the number of examples per class before training begins. However, it is not clear if this will work in a multi-label scenario.

Or, if we have decided to use one of the adapted algorithms provided by scikit-multilearn (described above), we could follow their suggestion to use a k-fold cross-validation approach with sklearn.model_selection.KFold. The scikit-multilearn folks also mention:

If your data set exhibits a strong label co-occurrence structure you might want to use a label-combination based stratified k-fold.

But this method uses the label powerset approach, in which cooccurring labels are combined into unique classes. This would have the same drawbacks described above, in that our training becomes more expensive (many more classes) and we may be unable to accurately tag content in the future as the classifier can only tag content with label combinations it saw during training.
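For the plain KFold approach, a minimal sketch using the earlier pipeline might look like this (scoring with micro-averaged F1 is my own choice, not a recommendation from the scikit-multilearn docs):

from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=kf, scoring='f1_micro')
print(scores.mean())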

Ideas?

If you have any thoughts about where I should go next, or regarding any false assumptions I might have made above, I’d love to hear from you!

More resources:

Multi-label Classification: A Guided Tour

Comparative Study of Supervised Multi-Label Classification Models for Legal Case Topic Classification

Learning Hierarchical Multi-label Classification Trees from Network Data


Getting started with Serverless Framework and Python (Part 2)

[Image: Radio tower at Stax Museum of American Soul Music]

This is a continuation of my previous post, which offered some tips for setting up Serverless Framework and concluded with generating a template service in Python. By this point you should have a file, serverless.yml, that will allow you to define your service. Again, the documentation is quite good. I suggest reading from Services down to Workflow. This gives a good overview and should enable you to start hacking away at your YML file and adding your functions, but I’ll call out a few areas where I had some trouble.

Testing

Lambda functions are hard to test outside the context of AWS, but any testing within AWS is going to cost you something (even if it is pennies). The Serverless folks suggest that you “write your business logic so that it is separate from your FaaS provider (e.g., AWS Lambda), to keep it provider-independent, reusable and more easily testable.” Ok! This separation, if we can create it, would allow us to write typical unit tests for each discrete function.

All of my previous Lambdas contained the handler function as well as any other functions required to process the incoming event. I was not even aware that it was possible to import local functions into my handler, but you can! And it works great!

Here’s my handler:

from getDhash import image_to_dhash

def dhash(event, context):
    image = event['image']
    dhash = image_to_dhash(image)
    return dhash

The handler accepts an image file as a string, passes this to the imported image_to_dhash function, and returns the resulting dhash.

And here is the image_to_dhash function, which I’ve stored separately in getDhash.py:

from PIL import Image
import imagehash
from io import BytesIO
	
def image_to_dhash(image):
    return str(imagehash.dhash(Image.open(BytesIO(image))))

Now, I can simply write my tests against getDhash.py and ignore the handler entirely. For my first test I have a local image (test/image.jpg) and a Python script, test.py, containing my unit tests:

import unittest
from getDhash import image_to_dhash

class TestLambdaFunctions(unittest.TestCase):
    
    with open('test/image.jpg', 'rb') as f:
        image  = f.read()
        
    def testGetDhash(self):
        self.assertEqual(image_to_dhash(self.image), 'db5b513373f26f6f')
        
if __name__ == '__main__':
    unittest.main()

Running test.py should return some testing results:

(myenv) dfox@dfox-VirtualBox:~/myservice$ python test.py 
.
----------------------------------------------------------------------
Ran 1 test in 0.026s

OK

Environment variables

AWS Lambdas support the use of environment variables. These variables can also be encrypted, in case you need to store some sensitive information along with your code. In other cases, you may want to use variables to supply slightly different information to the same Lambda function, perhaps corresponding to a development or production environment. Serverless makes it easy to supply these environment variables at the time of deployment. And making use of Serverless’ feature-rich variable system we have a few options for doing so.

Referencing local environment variables:

functions:
  dhash:
    handler: handler.dhash
    environment:
      MYENVVAR: ${env:MYENVVAR}

Or, referencing a different .yml file:

functions:
  dhash:
    handler: handler.dhash
    environment:
      MYENVVAR: ${file(./serverless.env.yml):${opt:stage}.MYENVVAR}

The above also demonstrates how to reference CLI options, in this case the stage we provided with our deploy command:

serverless deploy --stage dev

And for completeness sake, the serverless.env.yml file:

dev:
    MYENVVAR: "my environment variable is the best"

Dependencies

In the past, I found that dealing with Python dependencies and Lambda could be a real pain. Inevitably, my deployment package would be incorrectly compiled. Or, I’d build the package fine, but the unzipped contents would exceed the size limitations imposed by Lambda. Using Serverless along with a plugin, Serverless Python Requirements, makes life (specifically your Lambda-creating life) a lot easier. Here’s how it works.

Get your requirements ready:

pip freeze > requirements.txt

In my case, this produced something like the following:

certifi==2017.4.17
chardet==3.0.4
idna==2.5
ImageHash==3.4
numpy==1.13.0
olefile==0.44
Pillow==4.1.1
PyWavelets==0.5.2
requests==2.17.3
scipy==0.19.0
six==1.10.0
urllib3==1.21.1

Call the plugin in your serverless.yml file:

plugins:
  - serverless-python-requirements

And that’s it. 🙂

Now, if you have requirements like mine, you’re going to hit the size limitation (note the inclusion of Pillow, numpy, and scipy). So, take advantage of the built in zip functionality, by adding the following to your serverless.yml file:

custom:
  pythonRequirements:
    zip: true

This means your dependencies will remain zipped. It also means you need to unzip them when your handler is invoked.

When you run the deploy service command, the Python requirements plugin will add a new script to your directory called unzip_requirements.py. This script will extract the required dependencies when they are needed by your Lambda functions. You will have to import this module before all of your other imports. For example:

import unzip_requirements
from PIL import Image

There does seem to be a drawback here, however. Until you run the deploy command, unzip_requirements.py will not be added to your directory, and therefore all of your local tests will fail with an ImportError:

ImportError: No module named unzip_requirements

Of course, I may be doing something wrong here.
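One workaround I have seen (a standard Python pattern, not specific to Serverless) is to guard the import so it is a no-op when the module is absent:

try:
    import unzip_requirements  # present only in the deployed Lambda package
except ImportError:
    pass

from PIL import Image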

Questions

  • There are actually two Python requirements plugins for Serverless. Am I using the best one?
  • As I add functions to my service, do I reuse the existing handler.py? Or do I create new handler scripts for each function?

Getting started with Serverless Framework and Python (Part 1)

[Image: Water tower of the former Edison laboratories]

For a while now I’ve been working with various AWS solutions (Lambda, Data Pipelines, CloudWatch) through the console, and sometimes using homegrown scripts that take advantage of the CLI. This approach has many limitations, but I’d actually recommend it if you’re just getting started with AWS as I found it to be a great way to learn.

But if you’re ready to truly embrace the buzzy concept of serverless architecture and let your functions fly free in the rarefied air of ‘the cloud,’ well then you’ll want to make use of a framework that is designed to make development and deployment a whole heck of a lot easier. Here’s a short list of serverless frameworks: Chalice, Serverless, Zappa, and Apex.

There are many more, I’m sure, but these seem to be the popular choices. And each one has something to recommend it. Chalice is the official AWS client. Serverless is widely used. Zappa has some neat dependency packaging solutions. Apex is clearly at the apex of serverless framework technology.

I selected Serverless for a few reasons. It does seem to be used quite a bit, so there is a lot of discussion on Stackoverflow and quite a few code examples on Github. As I need all the help I can get, these are true benefits. Also, there are numerous plugins available for serverless, which seems to indicate there is an active developer community. And I knew right off the bat that I would take advantage of at least two of these plugins:

  • Serverless Python Requirements
    • Coping with Python dependencies when deploying a Lambda is one of the more challenging aspects for beginners (like me). I appreciate that someone figured out how to do it well and made that method available to me with a few extra lines in the config
  • Serverless Step Functions
    • I’m looking forward to making use of this relatively new service and none of the other frameworks had anything built in yet for Step Functions

The installation guide for Serverless is pretty good, actually, but I’ll call out a few things that might need some extra attention.

Once you’ve installed Serverless:

sudo npm install -g serverless

Your next concern will be authentication. Serverless provides some pretty good documentation on the subject, but as they describe a few different scenarios, here’s my recommendation.

Follow their instructions for generating your client key/secret, then authenticate using the CLI:

aws configure --profile serverless-user

This will walk you through entering your key, your secret, and your region. The reason I suggest authenticating using the CLI is that the CLI is darned useful. For example, you may want to ‘get-function’ just to see if serverless is doing what you think it is doing.

Also note that I have provided a profile to the command. I find profiles useful, you may not. But if you do like to use profiles, Serverless will let you take full advantage of them. For example, you can deploy with a particular profile:

serverless deploy --aws-profile serverless-user

Or check out this nice idea for per stage profiles.

I found the idea of stages in AWS a bit confusing at first. This blog post does a good job of explaining the concept and how to implement it.

Installing the plugins was dead simple:

npm install --save serverless-python-requirements
npm install --save serverless-step-functions

And setting up a new service environment ain’t much harder:

serverless create --template aws-python --path myAmazingService

Now we are ready to dig in and start writing our functions. In my next post I’ll write a bit about Python dependencies, unit testing, and anything else that occurs to me in the meantime.

Generating new triples in Marklogic with SPARQL CONSTRUCT (and INSERT)

SPARQL is known mostly as a query language, but it also has the capability—via the CONSTRUCT operator—to generate new triples. This can be useful for delivering a custom snippet of RDF to a user, but it can also be used to write new data back to the database, enriching what was already there. Marklogic’s triple store supports the SPARQL standard, including CONSTRUCT queries, and the results can be easily incorporated back into the data set using the XQuery Semantics API. Here’s a quick demo.

I have a set of geography terms which have already been linked to the Geonames dataset. Here’s an example:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> ;
 <http://www.w3.org/2004/02/skos/core#inScheme> <http://cv.ap.org/a#geography> ;
 <http://www.w3.org/2003/01/geo/wgs84_pos#long> "-70.76255"^^xs:decimal ;
 <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "43.07176"^^xs:decimal ;
 <http://www.w3.org/2004/02/skos/core#exactMatch> <http://sws.geonames.org/5091383/> ;
 <http://www.w3.org/2004/02/skos/core#broader> <http://cv.ap.org/id/9531546082C6100487B5DF092526B43E> ;
 <http://www.w3.org/2004/02/skos/core#prefLabel> "Portsmouth"@en .

If we look at the same term via the New York Times’ Linked Open Data service we’ll see a set of equivalent terms, including the Geonames resource for Portsmouth:


<http://data.nytimes.com/10237454346559533021> <http://www.w3.org/2002/07/owl#sameAs> <http://data.nytimes.com/portsmouth_nh_geo> ,
 <http://dbpedia.org/resource/Portsmouth%2C_New_Hampshire> ,
 <http://rdf.freebase.com/ns/en.portsmouth_new_hampshire> ,
 <http://sws.geonames.org/5091383/> .

Oh, hey, we have the same Geonames URI. Guess what we can do with that? More links!

After ingesting the NYTimes data into Marklogic, I was able to write a SPARQL query to begin connecting the two datasets using the Geonames URI as glue.


 PREFIX cts: <http://marklogic.com/cts#>
 PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX owl: <http://www.w3.org/2002/07/owl#>
 SELECT ?s ?n 
 WHERE
 {
 ?s skos:inScheme <http://cv.ap.org/a#geography> .
 ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
 ?s skos:exactMatch ?gn .
 ?n owl:sameAs ?gn .
 } 
 LIMIT 2

Returning:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://data.nytimes.com/10237454346559533021>
<http://cv.ap.org/id/662030807D5B100482BDC076B8E3055C> <http://data.nytimes.com/10616800927985096861>

Now, if we want to generate triples instead of SPARQL results, we simply swap out our SELECT for a CONSTRUCT operator, like so:


PREFIX cts: <http://marklogic.com/cts#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?s skos:exactMatch ?n .}
WHERE
 {
 ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
 ?s skos:inScheme <http://cv.ap.org/a#geography> .
 ?s skos:exactMatch ?gn .
 ?n owl:sameAs ?gn .
 } 
LIMIT 2

Returning:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10237454346559533021> .
<http://cv.ap.org/id/662030807D5B100482BDC076B8E3055C> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10616800927985096861> .

We have a few options for writing our newly generated triples back to the database, but let’s start with Marklogic’s XQuery Semantics API, in particular the sem:rdf-insert function. Here’s a bit of XQuery to run the SPARQL query above and insert them into the <geography> graph in the database:


import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
                PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                PREFIX owl: <http://www.w3.org/2002/07/owl#>
                CONSTRUCT { ?s skos:exactMatch ?n .}
                WHERE
                {
                ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
                ?s skos:inScheme <http://cv.ap.org/a#geography> .
                ?s skos:exactMatch ?gn .
                ?n owl:sameAs ?gn .
                } "

let $triples := sem:sparql($sparql, (),(),())                                          

return
(
sem:rdf-insert($triples,("override-graph=geography"))
)

Now if we look at the triples for my original term we should see an additional skos:exactMatch for the NYTimes:


<http://cv.ap.org/id/F1818B152CFC464EBAAF95E407DD431E> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> ;
<http://www.w3.org/2004/02/skos/core#inScheme> <http://cv.ap.org/a#geography> ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> "-70.76255"^^xs:decimal ; 
<http://www.w3.org/2003/01/geo/wgs84_pos#lat> "43.07176"^^xs:decimal ; 
<http://www.w3.org/2004/02/skos/core#exactMatch> <http://sws.geonames.org/5091383/> ;
<http://www.w3.org/2004/02/skos/core#exactMatch> <http://data.nytimes.com/10237454346559533021> ;
<http://www.w3.org/2004/02/skos/core#broader> <http://cv.ap.org/id/9531546082C6100487B5DF092526B43E> ;
<http://www.w3.org/2004/02/skos/core#prefLabel> "Portsmouth"@en .

Another option for writing the new triples back to the database is SPARQL itself. The most recent version, SPARQL 1.1, defines an update language which includes the useful operator INSERT. We can modify our earlier SPARQL query like so:


PREFIX cts: <http://marklogic.com/cts#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT
{ GRAPH <geography> { ?s skos:exactMatch ?n .} }
WHERE
{
  GRAPH <nytimes>
    {
    ?n skos:inScheme <http://data.nytimes.com/elements/nytd_geo> .
    ?n owl:sameAs ?gn .
    } .
 GRAPH <geography>
    {
    ?s skos:inScheme <http://cv.ap.org/a#geography> .
    ?s skos:exactMatch ?gn . 
    }.
}

The multiple GRAPH statements allow me to query across two graphs, but only write to one. And if we wanted to replace an existing skos:exactMatch triple, rather than append to our existing statements we would precede our INSERT statement with a DELETE. This DELETE/INSERT operation is described in detail here.

Marklogic 8, not yet released, will include support for the SPARQL 1.1 Update query language (among other new semantic capabilities). Since I am lucky enough to be part of the Early Access program for Marklogic 8 I was able to run the query above and see that it generated the new triples correctly.

Both CONSTRUCT and INSERT are not exactly new technologies, but it’s great to see how they might be used within the context of a Marklogic application. For my own work cleaning and enriching vocabulary data these methods have proved to be quite valuable and I look forward to digging into the rest of the SPARQL 1.1 features coming to Marklogic 8 in the near future.

Searching RDF vocabulary data in Marklogic 7

Recently, I’ve been experimenting with Marklogic’s new(ish) semantic capabilities (here’s a quick overview of what Marklogic is offering with their semantics toolkit). In particular, I’ve been trying to build a simple interface for searching across vocabulary data in RDF. This turned out to be an interesting exercise since Marklogic’s current semantic efforts are targeted at an intersection of “documents, data, and now RDF triples.”

It’s fairly easy to set up the triple store; the quick start documentation should get you up and running in short order. For the purposes of this brief document I’m using a small content set from DBPedia derived from the following SPARQL query:

DESCRIBE ?s 
WHERE {
?s rdf:type <http://dbpedia.org/class/yago/ProfessionalMagicians> .
}

I downloaded my set in RDF/XML (using DBPedia’s SPARQL endpoint) and loaded them into Marklogic using mlcp:

mlcp.bat import -host localhost -port 8040 -username [name] -password [pass] -input_file_path C:\data\magicians.rdf -mode local -input_file_type RDF -output_collections magician -output_uri_prefix  /triplestore/

Now if you open up QConsole and ‘explore’ the data you’ll see that all of our triples have been packaged up into discrete documents:

/triplestore/1105189df46c20c7-0-11170.xml
/triplestore/1105189df46c20c7-0-11703.xml
/triplestore/1105189df46c20c7-0-12614.xml
/triplestore/1105189df46c20c7-0-13346.xml

Each one contains 100 triples, and each triple looks something like:

<sem:triple>
<sem:subject>http://dbpedia.org/resource/Criss_Angel</sem:subject>
<sem:predicate>http://dbpedia.org/property/birthDate</sem:predicate>
<sem:object datatype="http://www.w3.org/2001/XMLSchema#date">1967-12-18+02:00</sem:object>
</sem:triple>

Available query structures fall into three categories:

  • CTS queries (cts:*)
  • SPARQL queries (sem:*)
  • Hybrid CTS/SPARQL

The documentation for the available queries is here.

But before we dig into some sample queries, let’s try the Search API. It has some appeal as a solution, since it can provide easy pagination, result counts, and all the nice features of CTS (stemming/lemmatization) to boot.

Let’s search for terms which mention the word ‘paranormal’:

search:search('paranormal')

This returns the familiar results, but you will quickly realize that these will not be particularly helpful if what you are interested in is subject matches rather than document matches.

<search:response snippet-format="snippet" total="21" start="1" page-length="10" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="" xmlns:search="http://marklogic.com/appservices/search">
    <search:result index="1" uri="/triplestore/ad02c265f21a9295-0-55.xml" path="fn:doc("/triplestore/ad02c265f21a9295-0-55.xml")" score="208896" confidence="0.7033283" fitness="0.8208094">
        <search:snippet>
            <search:match path="fn:doc("/triplestore/ad02c265f21a9295-0-55.xml")/sem:triples/sem:triple[97]/sem:object">http://dbpedia.org/resource/Category:<search:highlight>Paranormal</search:highlight>_investigators</search:match>
            <search:match path="fn:doc("/triplestore/ad02c265f21a9295-0-55.xml")/sem:triples/sem:triple[100]/sem:object">...and scientific skeptic best known for his challenges to <search:highlight>paranormal</search:highlight> claims and pseudoscience. Randi is the founder of the...</search:match>
        </search:snippet>
    </search:result>
</search:response>

Let’s try the same search with a direct CTS query, but this time we’ll allow for wildcarding:

cts:search(collection(),
cts:and-query((cts:collection-query("magician"), 
cts:word-query("paranormal*","wildcarded")))
)

This is nice, but it again returns the entire document containing matches. And since our triples were arbitrarily added to these documents via MLCP, we have subjects in our results that we don’t care about.

Let’s try the same query in pure SPARQL:

DESCRIBE ?s
WHERE{ 
?s ?p ?o.
FILTER regex(?o, "paranormal*", "i")
}

This works pretty well and returns all the triples for those subjects we are interested in, but it’s rather slow. I’m guessing the pure SPARQL FILTER query here is not particularly optimized. As a comparison, we can actually insert some CTS into our SPARQL query if we wish, like so:

PREFIX cts: <http://marklogic.com/cts#>
DESCRIBE ?s 
WHERE{ 
?s ?p ?o .
 FILTER cts:contains(?o, cts:word-query("paranormal")) 
}

Compared to the previous query this is blazing fast, though not a sub-second query yet. We can speed things up a bit by using a hybrid CTS/SPARQL approach where we pass a cts:query as an option to sem:sparql. This reduces the set of documents in our search scope before executing the SPARQL, and so may offer a boost to performance. Of course, to continue to drill down to only the relevant subjects (not documents) we need to execute the query twice:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $query := cts:word-query('paranormal',"case-insensitive")
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

If we want to dynamically generate our SPARQL queries we can send a $bindings map to sem:sparql containing our variables. Here’s a more dynamic version of the above:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                DESCRIBE ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }"
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:rdf-serialize($results,'rdfxml')
)

There’s an interesting byproduct of this approach, however. Once you have filtered the set of documents using the CTS query option, you have also potentially limited the triples available to your SPARQL query. So, if your subject has triples spanning two documents (which happens due to the arbitrary way MLCP batches triples into documents) and your CTS query only matches the first, any triples from the second document that you expect your SPARQL to return will appear to be missing.

So, our approach is flawed, but let us press on anyway. What can I do with those triples in a search interface? There are a few options here. Certainly, we can flesh out our SPARQL query to use SELECT and whatever array of properties we need for our display (label, description, etc.) and then pass the results through sem:query-results-serialize to generate SPARQL XML:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
let $sparql := "PREFIX cts: <http://marklogic.com/cts#>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
               SELECT DISTINCT ?s ?c
                WHERE{ 
                   ?s ?p ?o .
                   ?s rdfs:comment ?c .
                   FILTER ( lang(?c) = 'en' )
                   FILTER cts:contains(?o, cts:word-query('paranormal')) 
                }"
let $results := sem:sparql($sparql,(),("default-graph=magician"),())  
return
(
sem:query-results-serialize($results)
)

Or, if you’d rather serialize the resulting triples in a particular format such as RDF/XML:

sem:rdf-serialize($results,'rdfxml')

I mentioned pagination earlier. Certainly, this would be a lot easier with the Search API, but it can be done with pure SPARQL and a bit of imagination:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at
"/MarkLogic/semantics.xqy";
declare namespace sparql = "http://www.w3.org/2005/sparql-results#";
let $q := 'paranormal'
let $query := cts:word-query($q,"case-insensitive")

let $search-page-size := 2
let $search-start := 1
let $bindings := map:map()
let $put := map:put($bindings,"q",$q)
let $sparql :=  fn:concat(
                "PREFIX cts: <http://marklogic.com/cts#>
                SELECT DISTINCT ?s 
                WHERE{ 
                   ?s ?p ?o .
                   FILTER cts:contains(?o, cts:word-query(?q)) 
                }",
                "LIMIT ",
                $search-page-size,
                " OFFSET ",
               $search-start
               )
let $results := sem:sparql($sparql,($bindings),("default-graph=magician"),($query))  
return
(
sem:query-results-serialize($results)
)

One last issue I encountered with this approach is that relying on SPARQL means every user granted access to this search interface needs the sem:sparql execute privilege. SPARQL 1.1 allows updates to be made to the database via queries. Though this feature is not currently included in Marklogic’s implementation of SPARQL, it might be in version 8. Does this mean that SPARQL privileges are not something you’d want to hand out to read-only users? Perhaps.

After building my own search interface using some of the approaches described above, I feel that Marklogic is at its best when it’s used as a document store. So, perhaps the best approach is to mirror the construction of a document repository, with each term as its own document carrying embedded RDF triples. Something like this abbreviated and modified LC record for Harry Houdini:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://id.loc.gov/authorities/names/n79096862">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Houdini,
            Harry, 1874-1926</skos:prefLabel>
        <skos:exactMatch rdf:resource="http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:inScheme rdf:resource="http://id.loc.gov/authorities/names"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"/>
        <skos:altLabel xml:lang="en" xmlns:skos="http://www.w3.org/2004/02/skos/core#">Weiss,
            Ehrich, 1874-1926</skos:altLabel>
    </rdf:Description>
    <sem:triples xmlns:sem="http://marklogic.com/semantics">
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
            <sem:object>http://www.w3.org/2004/02/skos/core#Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#prefLabel</sem:predicate>
            <sem:object xml:lang="en">Houdini, Harry, 1874-1926</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#exactMatch</sem:predicate>
            <sem:object>http://viaf.org/viaf/sourceID/LC%7Cn+79096862#skos:Concept</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#inScheme</sem:predicate>
            <sem:object>http://id.loc.gov/authorities/names</sem:object>
        </sem:triple>
        <sem:triple>
            <sem:subject>http://id.loc.gov/authorities/names/n79096862</sem:subject>
            <sem:predicate>http://www.w3.org/2004/02/skos/core#altLabel</sem:predicate>
            <sem:object xml:lang="en">Weiss, Ehrich, 1874-1926</sem:object>
        </sem:triple>
    </sem:triples>
</rdf:RDF>

I’ll be trying this approach next.

Custom conversions of XML to JSON in XSLT

There are already several resources online devoted to converting XML to JSON with XSLT. Unfortunately, most of these resources describe only how to generate a quick JSON view closely resembling the original XML. This might be all you need, and if so you are in luck. I’ve seen several very useful XSLT templates on the web to do just this. But what if you would like to tweak your JSON output a little? Supposedly, JSON is preferred by developers as a simpler and more straightforward standard than their old foe XML. Why then should we slavishly copy the mistakes of the past into the future?

Let’s take some sample XML I pulled from the World Heritage Centre describing a cultural heritage site in Brazil (I also edited it slightly):

<site>
    <date_inscribed>1983</date_inscribed>
    <http_url>http://whc.unesco.org/en/list/275</http_url>
    <id_number>275</id_number>
    <image_url>http://whc.unesco.org/uploads/sites/site_275.jpg</image_url>
    <iso_code>ar,br</iso_code>
    <latitude>-28.5433333300</latitude>
    <location>State of Rio Grande do Sul, Brazil; Province of Misiones, Argentina</location>
    <longitude>-54.2658333300</longitude>
    <region>Latin America and the Caribbean</region>
    <short_description>&lt;p&gt;The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil,
        and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and
        Santa Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They
        are the impressive remains of five Jesuit missions, built in the land of the Guaranis during
        the 17th and 18th centuries. Each is characterized by a specific layout and a different
        state of conservation.&lt;/p&gt;</short_description>
    <states>Argentina,Brazil</states>
</site>

Now let’s pass it through one of the many XML to JSON XSLT templates I came across online. This one is from Convert XML to JSON using XSLT and the output looks like:

{"site": {
    "date_inscribed": "1983",
    "http_url": "http://whc.unesco.org/en/list/275",
    "id_number": "275",
    "image_url": "http://whc.unesco.org/uploads/sites/site_275.jpg",
    "iso_code": "ar,br",
    "latitude": "-28.5433333300",
    "location": "State of Rio Grande do Sul, Brazil; Province of Misiones, Argentina",
    "longitude": "-54.2658333300",
    "region": "Latin America and the Caribbean",
    "short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.",
    "states": "Argentina,Brazil"
}}

This XML is fairly simple and does not contain any attributes, but I think the example conveys what is possible with XSLT. We now have a reasonably accurate rendition of our original XML object.

But looking at this output there are a few things I’d like to change:

  • ‘location’, ‘states’ and ‘iso_code’ contain multiple values; those should really be arrays.
  • I would like to use the GeoJSON standard for encoding latitude and longitude.
  • The coordinates should be numbers, not strings.

With the changes described above, our JSON would end up looking something like:

{"site": {
    "date_inscribed": "1983",
    "http_url": "http://whc.unesco.org/en/list/275",
    "id_number": "275",
    "image_url": "http://whc.unesco.org/uploads/sites/site_275.jpg",
    "iso_codes": [
        "ar",
        "br"
    ],
    "geometry": {
        "type": "Point",
        "coordinates": [
            -54.26583333,
            -28.54333333
        ]
    },
    "locations": [
        "State of Rio Grande do Sul, Brazil",
        "Province of Misiones, Argentina"
    ],
    "region": "Latin America and the Caribbean",
    "short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.",
    "states": [
        "Argentina",
        "Brazil"
    ]
}}

If I were feeling really confident, I might want to associate the country names with their codes and perhaps group all of the location data in its own object. But for now let’s be happy with the changes we have made. We are now using a common standard for our latitude and longitude values, which will give developers a leg up on consuming our data. We have also parsed those multi-value fields into arrays, which will allow for better searching across these fields. Note that in the source the delimiter for ‘iso_code’ and ‘states’ is a comma while in ‘location’ it is a semi-colon; we’ve ironed out that discrepancy and made parsing this data a lot easier on our JSON users.
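
To see what these changes buy a consumer, here is a minimal sketch (in Python, purely for illustration) of reading the reworked JSON; the file name site.json is an assumption:

import json

# Read the reworked JSON produced above; 'site.json' is a hypothetical filename.
with open('site.json') as f:
    site = json.load(f)['site']

# The GeoJSON geometry can be handed straight to mapping tools.
# GeoJSON orders coordinates as [longitude, latitude].
lon, lat = site['geometry']['coordinates']
print('Point at longitude %s, latitude %s' % (lon, lat))

# The multi-value fields are now plain lists, so no ad hoc splitting is needed.
for state in site['states']:
    print(state)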

Of course, now we need to write our own XSLT, and that’s mostly what I wanted to discuss here. It may seem that <> is not too far from {}, but there are syntactical differences that make a transform challenging – especially when we wish to dig into the data and make some changes.

When writing XSLT one can employ a push method where the source tree is pushed through a set of templates, a pull method where specific nodes are retrieved and employed in the desired fashion, or a hybrid of the two. To generate the JSON above we’ll definitely need to use the hybrid approach. Here’s an XSLT that will translate our source data into that JSON:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" method="text" encoding="utf-8" media-type="application/json"/>
    <xsl:template match="site">
        <xsl:text>{"site":{</xsl:text>
        
        <xsl:apply-templates/>
        <xsl:text>"geometry": {"type":"Point","coordinates":[</xsl:text>
        <xsl:value-of select="latitude"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="longitude"/>
        <xsl:text>]}</xsl:text>
        <xsl:text>}}</xsl:text>
    </xsl:template>
    
    <!-- String values from /site -->
    <xsl:template match="date_inscribed|http_url|id_number|image_url|region|short_description">
        <xsl:text>"</xsl:text>
        <xsl:value-of select="local-name()"/>
        <xsl:text>":"</xsl:text>
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>",</xsl:text>
        <xsl:apply-templates/>
    </xsl:template>
    
    <!-- comma-separated array values from /site -->
    <xsl:template match="iso_code|states">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,','))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="local-name()='iso_code'">
                <xsl:text>iso_codes</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="local-name()"/>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:text>":[</xsl:text>
        <xsl:for-each select="$tokens">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>],</xsl:text>
    </xsl:template>
    
    <!-- semi-colon-separated array values from /site -->
    <xsl:template match="location">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,';'))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="local-name()='location'">
                <xsl:text>locations</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="local-name()"/>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:text>":[</xsl:text>
        <xsl:for-each select="$tokens">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>],</xsl:text>
    </xsl:template>
    
    <!-- Whenever you match any node or any attribute -->
    <xsl:template match="node()|@*">        
        <!-- Including any attributes it has and any child nodes -->
        <xsl:apply-templates select="@*|node()"/>
    </xsl:template>
</xsl:stylesheet>

The source data uses inconsistent pluralization, so I’ve adjusted some of the local-names. It also uses two different separator tokens (a comma and a semi-colon), as mentioned above, so I’ve had to create a few near-duplicate templates. Everything in the main template is representative of the ‘pull’ methodology. There is really no true source in the original document for the GeoJSON we want to output, so we construct it within the ‘site’ template.

This transform will output empty values if one of the expected elements in the source XML is missing. Go ahead and comment out latitude and see what happens; you should end up with a missing coordinate (and invalid JSON) in your output. This is only an issue for values created in the aforementioned ‘pull’ style in the main template.

We could go ahead and add an if statement here to test for the presence of an element before converting, but then we run the risk of introducing a trailing comma which would invalidate the JSON output. And in fact, the more we add to the ‘pull’ section of the transform the higher the risk of creating invalid, or at least not useful, JSON.
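
Whichever style you choose, one cheap safeguard is to run the transform output through a JSON parser before publishing it; a trailing comma or a missing value will fail immediately. A minimal sketch in Python, assuming the transform result has been written to a hypothetical site.json:

import json
import sys

# Sanity check: a trailing comma or an empty value in the generated output
# will raise a ValueError here. 'site.json' is a hypothetical output path.
try:
    with open('site.json') as f:
        json.load(f)
except ValueError as e:
    sys.exit('Transform produced invalid JSON: %s' % e)
print('JSON output is well formed')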

One strategy I would recommend is to split your transform into two steps. In the first step you would generate an XML view of your desired JSON, and in the second you would parse this secondary ‘JSON-ML’ into actual JSON. Here’s an example of what I mean using a JSON-ML standard I’ve borrowed from Marklogic:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:template match="site">
        <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:element name="site" namespace="http://marklogic.com/xdmp/json/basic">
                <xsl:attribute name="type">
                    <xsl:text>object</xsl:text>
                </xsl:attribute>
                <xsl:apply-templates/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
    <!-- String values from /site -->
    <xsl:template match="date_inscribed|http_url|id_number|image_url|region|short_description">
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            <xsl:value-of select="normalize-space(.)"/>
        </xsl:element>
    </xsl:template>
    <!-- comma-separated array values from /site -->
    <xsl:template match="iso_code|states">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,','))"/>
        <xsl:variable name="object-name">
            <xsl:choose>
                <xsl:when test="local-name()='iso_code'">
                    <xsl:text>iso_codes</xsl:text>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="local-name()"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:variable>
        <xsl:element name="{$object-name}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>array</xsl:text>
            </xsl:attribute>
            <xsl:for-each select="$tokens">
                <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
                    <xsl:attribute name="type">
                        <xsl:text>string</xsl:text>
                    </xsl:attribute>
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
    <!-- semi-colon-separated array values from /site -->
    <xsl:template match="location">
        <xsl:variable name="tokens" select="distinct-values(tokenize(.,';'))"/>
        <xsl:variable name="object-name">
            <xsl:choose>
                <xsl:when test="local-name()='iso_code'">
                    <xsl:text>iso_codes</xsl:text>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="local-name()"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:variable>
        <xsl:element name="{$object-name}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>array</xsl:text>
            </xsl:attribute>
            <xsl:for-each select="$tokens">
                <xsl:element name="json" namespace="http://marklogic.com/xdmp/json/basic">
                    <xsl:attribute name="type">
                        <xsl:text>string</xsl:text>
                    </xsl:attribute>
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
    
    <!-- Whenever you match any node or any attribute -->
    <xsl:template match="node()|@*">
        <!-- Including any attributes it has and any child nodes -->
        <xsl:apply-templates select="@*|node()"/>
    </xsl:template>
</xsl:stylesheet>

Which will generate the following JSON-ML:

<json xmlns="http://marklogic.com/xdmp/json/basic">
   <site type="object">
      <date_inscribed type="string">1983</date_inscribed>
      <http_url type="string">http://whc.unesco.org/en/list/275</http_url>
      <id_number type="string">275</id_number>
      <image_url type="string">http://whc.unesco.org/uploads/sites/site_275.jpg</image_url>
      <iso_codes type="array">
         <json type="string">ar</json>
         <json type="string">br</json>
      </iso_codes>
      <locations type="array">
         <json type="string">State of Rio Grande do Sul, Brazil</json>
         <json type="string">Province of Misiones, Argentina</json>
      </locations>
      <region type="string">Latin America and the Caribbean</region>
      <short_description type="string">&lt;p&gt;The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the Guaranis during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation.&lt;/p&gt;</short_description>
      <states type="array">
         <json type="string">Argentina</json>
         <json type="string">Brazil</json>
      </states>
   </site>
</json>

And finally, here is a generic stylesheet for transforming any JSON-ML document to valid JSON:

<xsl:stylesheet version="2.0" xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:appl="http://ap.org/schemas/03/2005/appl"
    xmlns:json="http://marklogic.com/xdmp/json/basic" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsl:output omit-xml-declaration="yes" method="text" encoding="UTF-8"
        media-type="application/json"/>
    
    <xsl:template match="json:json">
        <xsl:text>{</xsl:text>
        <xsl:for-each select="child::*[not(string-length(.)=0)]">
            <xsl:choose>
                <xsl:when test="normalize-space()=''"/>
                <xsl:otherwise>
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
        <xsl:text>}</xsl:text>
    </xsl:template>
    
    <xsl:template name="recurse">
        <xsl:choose>
            <xsl:when test="@type='string'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>"</xsl:text>
                <xsl:value-of select="."/>
                <xsl:text>"</xsl:text>
            </xsl:when>
            <xsl:when test="@type='number' or @type='boolean'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:value-of select="."/>
            </xsl:when>
            <xsl:when test="@type='object'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>{</xsl:text>
                <xsl:for-each select="child::*[not(string-length(.)=0)]">
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:for-each>
                <xsl:text>}</xsl:text>
            </xsl:when>
            <xsl:when test="@type='array'">
                <xsl:choose>
                    <xsl:when test="not(local-name()='json')">
                        <xsl:text>"</xsl:text>
                        <xsl:value-of select="local-name()"/>
                        <xsl:text>":</xsl:text>
                    </xsl:when>
                    <xsl:otherwise/>
                </xsl:choose>
                <xsl:text>[</xsl:text>
                <xsl:for-each select="child::*[not(string-length(.)=0)]">
                    <xsl:call-template name="recurse"/>
                    <xsl:if test="not(position()=last())">
                        <xsl:text>,</xsl:text>
                    </xsl:if>
                </xsl:for-each>
                <xsl:text>]</xsl:text>
            </xsl:when>
            <xsl:otherwise/>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

This approach eliminates any danger of outputting an unnecessary trailing comma, and also nicely filters out any null values, but it obviously adds some processing time to your transform. It may also be unnecessary for your purposes, if the push style transform is sufficient for your desired data model.
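
If you want to automate the two-step pipeline, a small driver script can chain the transforms. Here is a rough sketch using Saxon-HE from the command line (an XSLT 2.0 processor is required for the tokenize() and distinct-values() calls above); the jar, stylesheet, and file names are all assumptions:

import subprocess

# Chain the two transforms with Saxon-HE. All file names here are assumptions.
SAXON_JAR = 'saxon9he.jar'

def transform(source, stylesheet, output):
    # Saxon command-line options: -s: source, -xsl: stylesheet, -o: output
    subprocess.check_call(['java', '-jar', SAXON_JAR,
                           '-s:' + source, '-xsl:' + stylesheet, '-o:' + output])

# Step 1: source XML -> intermediate JSON-ML; step 2: JSON-ML -> JSON text.
transform('site.xml', 'xml-to-jsonml.xsl', 'site-jsonml.xml')
transform('site-jsonml.xml', 'jsonml-to-json.xsl', 'site.json')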

I’ll end with a few other templates which may be useful to the would-be JSON XSLT writer.

In my example above, the XML that is encoded in short_description is already serialized, but if you need to do the work of serialization yourself, try the following:

Source XML:

    <short_description>
        <p>The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa
            Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are
            the impressive remains of five Jesuit missions, built in the land of the Guaranis during
            the 17th and 18th centuries. Each is characterized by a specific layout and a different
            state of conservation.</p>
    </short_description>

XSLT:

   <!-- serialize xml to string -->
    
    <xsl:template match="*" mode="serialize">
        <xsl:text>&lt;</xsl:text>
        <xsl:value-of select="name(.)"/>
        <xsl:text>&gt;</xsl:text>
        <xsl:apply-templates mode="serialize"/>
        <xsl:text>&lt;/</xsl:text>
        <xsl:value-of select="name(.)"/>
        <xsl:text>&gt;</xsl:text>
    </xsl:template>

    <xsl:template match="short_description">
        <xsl:variable name="desc">
            <xsl:apply-templates select="./*" mode="serialize"/>
        </xsl:variable>
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            <xsl:value-of select="normalize-space($desc)"/>
        </xsl:element>
    </xsl:template>

And finally a few templates for escaping characters that would have potentially ill effects on your JSON output:

Source XML:

    <short_description>
        <p>The ruins of S&amp;atilde;o Miguel das Miss&amp;otilde;es in Brazil, and those of San
            Ignacio Min&amp;iacute;, Santa Ana, Nuestra Se&amp;ntilde;ora de Loreto and Santa
            Mar&amp;iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are
            the impressive remains of five Jesuit missions, built in the land of the "Guaranis" during
            the 17th and 18th centuries. Each is characterized by a specific layout and a different
            state of conservation.</p>
    </short_description>

XSLT:

<!-- Escape the backslash (\) before everything else. -->
    <xsl:template name="escape-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <xsl:when test="contains($s,'\')">
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'\'),'\\')"/>
                </xsl:call-template>
                <xsl:call-template name="escape-string">
                    <xsl:with-param name="s" select="substring-after($s,'\')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="$s"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <!-- Escape the double quote ("). -->
    <xsl:template name="escape-quot-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <xsl:when test="contains($s,'&quot;')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'&quot;'),'\&quot;')"/>
                </xsl:call-template>
                <xsl:call-template name="escape-quot-string">
                    <xsl:with-param name="s" select="substring-after($s,'&quot;')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="$s"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template name="encode-string">
        <xsl:param name="s"/>
        <xsl:choose>
            <!-- tab -->
            <xsl:when test="contains($s,'	')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'	'),'\t',substring-after($s,'	'))"/>
                </xsl:call-template>
            </xsl:when>
            <!-- line feed -->
            <xsl:when test="contains($s,'
')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'
'),'\n',substring-after($s,'
'))"/>
                </xsl:call-template>
            </xsl:when>
            <!-- carriage return -->
            <xsl:when test="contains($s,'
')">
                <xsl:call-template name="encode-string">
                    <xsl:with-param name="s" select="concat(substring-before($s,'
'),'\r',substring-after($s,'
'))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$s"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
<xsl:template match="short_description">
        <xsl:variable name="desc">
            <xsl:apply-templates select="./*" mode="serialize"/>
        </xsl:variable>
        <xsl:element name="{local-name()}" namespace="http://marklogic.com/xdmp/json/basic">
            <xsl:attribute name="type">
                <xsl:text>string</xsl:text>
            </xsl:attribute>
            
            <xsl:call-template name="escape-string">
                <xsl:with-param name="s" select="normalize-space($desc)"/>
            </xsl:call-template>
        </xsl:element>
    </xsl:template>

Which will ultimately produce:

"short_description": "<p>The ruins of S&atilde;o Miguel das Miss&otilde;es in Brazil, and those of San Ignacio Min&iacute;, Santa Ana, Nuestra Se&ntilde;ora de Loreto and Santa Mar&iacute;a la Mayor in Argentina, lie at the heart of a tropical forest. They are the impressive remains of five Jesuit missions, built in the land of the \"Guaranis\" during the 17th and 18th centuries. Each is characterized by a specific layout and a different state of conservation."

There are perhaps better technologies for generating JSON from XML, but if XSLT is your preferred tool then by all means you should use it.

Mock objects and APIs

Since I do not have a computer science background (I have a BA in English Lit and an MA in Information Science), I sometimes think I’ve uncovered something entirely new which turns out to be common practice among programmers. The latest such ‘discovery’ is apparently called, by those who know better, a ‘mock object.’ Wikipedia says that a ‘programmer typically creates a mock object to test the behavior of some other object.’ Oh right, that’s exactly what I’ve done. Ok, then.

In the last year I have had to write two applications which accessed an API that did not exist yet. The reasons for this are obscure and perhaps best left unremarked upon, but I did learn something in the process. When building similar applications in the past I had found that it was incredibly useful to, you know, actually have an API to run them against, so it was suggested that I mock up the (currently non-existent) API. In fact, this turned out to be so useful that if ever I have to build another application that accesses an API I will repeat this procedure. A mock API allows you to test any and all responses that might be returned, and actually having the responses to test against is far more productive (at least it was for me) than simply reading about them in the documentation.

Having now read through the Wikipedia page on mock objects, I know that my own mock API is actually more of a ‘fake API.’ This is because I am not really testing the request itself or the submitted data. Instead I simply return a particular HTTP status code along with a bit of sample JSON in the body. Regardless of my imprecision with the terminology, if you’d like to learn how I generated my mock/fake API, read on.

My mock/fake API in Python

Setting up a simple web server is quite easy using Python’s BaseHTTPServer module. The following code will create a ‘things’ endpoint at which we can GET a particular thing via its Id:

import BaseHTTPServer
import re

class MyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        a = re.compile("\/things\/[a-zA-Z0-9]*")
        if a.match(self.path):
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(open('data.json').read())

server_class = BaseHTTPServer.HTTPServer
httpd = server_class(('', 1234), MyHandler)
httpd.serve_forever()

To test this code you also need to create a sample JSON file, named ‘data.json,’ in the same folder as the above Python code. Now you can access the URL http://localhost:1234/things/1234, which should return whatever snippet of JSON you’ve stored in data.json. The regex in the do_GET handler can be altered to accommodate whatever call you wish to emulate; in this case a ‘thing’ Id can be any combination of digits and upper- or lower-case letters.
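
To exercise the endpoint from a client, something like the following should work (using urllib2 to match the Python 2 code above; the Id is arbitrary):

import urllib2

# Hit the fake 'things' endpoint started above and print the canned response.
response = urllib2.urlopen('http://localhost:1234/things/abc123')
print response.getcode()   # 200
print response.read()      # contents of data.json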

It is similarly easy to handle other kinds of requests, such as PUT, POST and HEAD. Here’s a sample POST handler for the ‘things’ endpoint that returns an HTTP status code of 404 and a snippet of JSON stored as ‘error.json’:

import BaseHTTPServer

class MyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/things':
            self.send_response(404)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(open('error.json').read())

server_class = BaseHTTPServer.HTTPServer
httpd = server_class(('', 1234), MyHandler)
httpd.serve_forever()

Should you wish to test the content of this POST and make this more of an actual mock object, you can read the contents of the submitted data using the following:

postLen = int(self.headers['Content-Length'])
postData = self.rfile.read(postLen)

Of course, this approach requires the presence of a ‘Content-Length’ header, but there are probably more direct methods you could try. Lastly, I occasionally found it useful to randomly return different status codes:

self.send_response(random.choice([200,404]))

The server code above could certainly be more dynamic, but for my use case it was easy enough to make manual edits to return a different code temporarily, or to alter the response body in some way.
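
Putting those pieces together, here is a rough sketch of a POST handler that reads the submitted body and flips between success and error responses; it reuses the data.json and error.json files from the earlier examples:

import BaseHTTPServer
import random

class MyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/things':
            # Read the submitted body (requires a Content-Length header).
            length = int(self.headers['Content-Length'])
            posted = self.rfile.read(length)
            print 'received:', posted
            # Randomly succeed or fail to exercise the client's error handling.
            status = random.choice([200, 404])
            self.send_response(status)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            body = 'data.json' if status == 200 else 'error.json'
            self.wfile.write(open(body).read())

server_class = BaseHTTPServer.HTTPServer
httpd = server_class(('', 1234), MyHandler)
httpd.serve_forever()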

Getting started with Tesseract

Setup

The specifics of installing Tesseract on your machine will depend on your operating system, but in general there are three basic steps required:

  • Installing the Leptonica library which handles the image analysis and processing required by Tesseract. Files are located here: http://www.leptonica.com/download.html
  • Installing Tesseract source; files can be downloaded here.
  • Installing the Tesseract language data, available from the same download link above. It is possible to download multiple languages and call them simultaneously (more on this below). For my OCR testing process I downloaded and used the English data files.

Languages

If you’ve installed the Tesseract language data as described above you’ve already seen the quite substantial list of supported languages. Once you’ve copied the languages to the tessdata folder they can be invoked on the command line with the -l parameter. For example:

tesseract example.jpg out -l eng

In my tests I only used the English language training data, but what if your corpus contains texts with mixed languages? You should be able to invoke two languages at once like so:

tesseract example.jpg out -l eng+spa

This obviously requires that you either have a list of expected languages that you invoke each time, or that you choose the language at the time of processing, perhaps based on the metadata of an object. I expect there will be an impact to processing times if you invoke Tesseract with a long list of languages since it would have to check each token against several dictionaries. That’s just a guess, though.

Outputs

By default Tesseract produces plain text files. I did, however, confirm that the software can produce hOCR. To achieve this, you need to create a config file in the same directory as Tesseract with the single line:

tessedit_create_hocr 1

Then you can invoke tesseract with:

tesseract input.tiff output -l eng +myconfig

The hOCR output might be a helpful step in generating ‘smart’ PDFs from your images. There are even several scripts out there to do just that; for example, hocr-pdf.

Image preprocessing

There is not much documentation available for Tesseract, but there is some anecdotal information on sites like StackOverflow. In general, it is suggested that you do some preprocessing of your images before running them through Tesseract. For example, full color magazine pages might benefit from ‘grayscale’ processing, and certain more vintage documents, where the text is often blurred, might benefit from added contrast. A popular solution for this kind of image conversion is ImageMagick. A simple grayscale conversion with ImageMagick’s convert tool, writing the result to a new file, looks like:

convert example.jpg -type grayscale example_gray.jpg

If you want to adjust the image’s DPI (say to 300, which is often recommended for optimal OCR) try:

convert example.jpg -density 300 -type grayscale example_gray.jpg

Also recommended is a script called textcleaner available from Fred’s ImageMagick Scripts. Much of what is contained in this script eludes me, but it seems to merge a dozen or so different conversions all with the intent of producing the most readable text possible. I saw Fred’s work cited often within OCR discussions. Here’s how to invoke textcleaner with the added grayscale step:

textcleaner -g $i
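
If you are processing an entire directory, it is easy to wrap the preprocessing and OCR steps in a small script. Here is a rough sketch using Python’s subprocess module; the directory layout and file names are assumptions:

import glob
import os
import subprocess

# Batch pipeline sketch: grayscale at 300 DPI with ImageMagick, then OCR with
# Tesseract. Assumes 'pages/', 'cleaned/' and 'text/' directories exist.
for source in glob.glob('pages/*.jpg'):
    base = os.path.splitext(os.path.basename(source))[0]
    cleaned = os.path.join('cleaned', base + '.png')
    subprocess.check_call(['convert', source, '-density', '300',
                           '-type', 'grayscale', cleaned])
    # Tesseract appends .txt to the output base name it is given.
    subprocess.check_call(['tesseract', cleaned, os.path.join('text', base), '-l', 'eng'])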

Summary

As I mentioned above, Tesseract itself does not come with much documentation, and the more ephemeral community-produced literature is fragmented and often hard to find. So, I hope this is helpful to someone who is just getting started with Tesseract.