There is no shortage of beginner-friendly articles about text classification using machine learning, for which I am immensely grateful. In general, these posts attempt to classify some set of text into one or more categories: email or spam, positive or negative sentiment, a finite set of topical categories (e.g. sports, arts, politics). This last example can be described as a multi-class problem. Here’s a definition of multi-class taken from the scikit-learn documentation:
Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
This is certainly fine for a simple classification task such as slotting a news article into a broad vertical such as ‘Travel,’ or ‘Weather,’ but if our taxonomy is even a bit wider or deeper we will find ourselves struggling to assign each piece of text to a single category. Take for example, the following article:
PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage.
Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.
How should we classify this document? Is it about dance? Or about car accidents? Or perhaps about sports injuries? If we look for inspiration in the IPTC Media Topics taxonomy, we might end up with the following topics:
|accident and emergency incident||http://cv.iptc.org/newscodes/mediatopic/20000139|
This kind of scenario, where a single sample can be associated with multiple targets (accident and ballet), is called multi-label classification. Let’s crib one more time from the scikit-learn documentation:
Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
And if we look a bit closer at these topics, we might notice that ‘ballet’ is a child of ‘dance’, which is itself a child of ‘arts and entertainment’. The full hierarchy of both terms can be expressed as the following:
- arts, culture and entertainment
- arts and entertainment
- arts and entertainment
- disaster, accident and emergency incident
- accident and emergency incident
We’ve quickly transitioned from a ‘simple’ multi-class classification problem to a multi-label classification problem that is further complicated by a set of hierarchically structured targets. Should we only apply the narrowest of topics in our taxonomy? Do we create a classifier for all topics, broad and narrow, and does the application of one mean anything for the other?
Sadly, I was not able to find many beginner-friendly articles written about hierarchical multi-label classification. I wish I could tell you that this will be that very article, but I can’t and it won’t. Maybe if I outline the problem, someone else will be inspired to write that article. And then we all benefit!
A simple example of multi-label classification
Let’s table the discussion of hierarchy for now and start with the simplest implementation of multi-label classification we can find.
The two main methods for approaching multi-label classification are problem transformations and algorithm adaptations. You will find a good overview of the two approaches here and here. Problem transformation techniques convert the multi-label task into a set of binary classification tasks, somewhat simplifying the task. For each label in the training data we create a single binary classifier and then the set of binary classifiers are then evaluated in concert. This is also referred to as a one-vs.-rest classifier. Let’s walk through a simple example.
Our training set:
|PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.||ballet|
|PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.||accident and emergency incident
You’ll notice that in the training data I have repeated the example text on two rows, one per label. Not knowing how many labels an example might have, and therefore how many columns I’d need for a single row display, this seemed the best way to encode the information. You might start with something a bit different. Regardless of where you start, we need to make some modifications before training a multi-label model.
Essentially, we need end up here:
|PROVIDENCE, R.I. (AP) — A ballet dancer who was seriously injured in a Rhode Island hit-and-run over the summer has returned to the stage. Festival Ballet Providence dancer Jordan Nelson was riding his bike in June when he was struck by a car. He suffered skull fractures and a broken clavicle. WLNE-TV reports doctors told Nelson he’d never dance again but he wouldn’t accept that as an answer.||[ballet, accident and emergency incident]|
Where our ‘labels’ value is an array of label strings. Here is how I transformed my data:
import pandas as pd path_to_csv = ‘training_data.csv' dataset = pd.read_csv(path_to_csv,usecols=["label","example"]) #modify dataset for multilabel grouped = dataset.groupby('example') df = grouped['label'].aggregate(lambda x: list(x)).reset_index(name="labels")
This will group my data by example and then pull all of the related labels into an array. This is great, but we are not quite done. Though we can intuitively understand the meaning of our lists of strings, they will be too cumbersome for our model to process. We need to convert these arrays into the expected multi-label format, a binary matrix indicating the presence (or absence) of a label. We do this using scikit-learn’s MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer X = df['example'] y = df['labels'] y = MultiLabelBinarizer().fit_transform(y)
Now that our data is in the correct format, we can train a model. The OneVsRestClassifier allows us to use the binary classifier of our choice. Let’s start with LinearSVC:
from sklearn.model_selection import train_test_split import numpy as np from sklearn.multiclass import OneVsRestClassifier from sklearn.svm import LinearSVC from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import TfidfVectorizer #split data into test and train random_state = np.random.RandomState(0) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,random_state=random_state) #our pipeline transforms our text into a vector and then applies OneVsRest using LinearSVC pipeline = Pipeline([ ('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(LinearSVC())) ]) pipeline.fit(X_train,y_train)
That, I think, is the simplest approach to multi-label classification. Of course, my results were (seemingly) abysmal. More on evaluation metrics later.
Another problem transformation technique is the classifier chains method. This approach is similar to one-vs.-rest, seen above, in that it is comprised of several binary classifiers. But in a classifier chain the output of each classifier is passed on to the next classifier in the chain (along with the original input, in our case the news text). This approach is intended to improve our classifier by taking label dependencies/co-occurrences into consideration.
Our working example classed with ‘ballet’ and ‘accident and emergency incident’ is perhaps not the best represenation of label interdependence, since these two topics will not co-occur with great frequency (we hope!). However, if we browse our favorite news site, mentally classifying each article into a set of topics, we should come up with a few sets of commonly co-occurring topics. ‘Elections’ and ‘campaign finance.’ ‘Football’ and ‘sports injuries.’ ‘Coal mining’ and ‘environment.’ (For the purpose of these examples, I am inventing my own news topics, rather than looking to IPTC).
There are a few, frequently cited, papers on the subject of classifier chains (such as, Classifier Chains for Multi-label Classification) and a scikit-learn implementation, described here. In the example, the order of the chains is random. The documentation notes:
Because the models in each chain are arranged randomly there is significant variation in performance among the chains. Presumably there is an optimal ordering of the classes in a chain that will yield the best performance. However we do not know that ordering a priori. Instead we can construct an voting ensemble of classifier chains by averaging the binary predictions of the chains and apply a threshold of 0.5.
Since we have an implicit order in our hierarchical taxonomy, I wonder if this can be used to improve performance. Of course, there is no guarantee that co-occurrence will be limited to labels in the same taxonomy branch. At any rate, I have yet to implement this method.
Yet another problem transformation technique is the label powerset method. In this approach, each combination of labels in the training set is considered as a unique class. So, instead of two classes for our example, ‘ballet’ and ‘accident and emergency incident,’ we would have a single class ‘ballet, accident and emergency incident.’ If you are starting with something like IPTC’s media topics, a rather large taxonomy which may also be applied in an unexpected fashion (e.g. lots of cross-hierarchy cooccurrences), the resulting set of classes may be too large. Also, we cannot guarantee that our training set will have an example for every potential combination of labels. Mostly for this latter reason, I don’t think this method is appropriate for news classification.
This brings us to algorithm adaptations, methods that modify an existing algorithm so it can directly cope with a multi-label dataset.
There are several scikit-learn libraries that are described as having support for multi-label classification, a list which includes decision tree, k-neighbor, random forest, and ridge classifiers. Decision trees seem to have some promise for the problem, especially considering the issue of hierarchy. There has been some research on the subject, see Decision Trees for Hierarchical Multi-label Classification.
- MLkNN – a multi-label adapted kNN classifier
- Described in the paper, ML-KNN: A lazy learning approach to multi-label learning
- MLARAM – A Multi-Label Hierarchical ARAM Neural Network
- Described in the paper, HARAM: A Hierarchical ARAM Neural Network for Large-Scale Text Classification
Again, there are not enough examples of applying these methods for text classification, at least not enough at my level (novice). I think the scikit-multilearn library is likely the obvious next step as it implements several algorithms from commonly cited articles in the literature. Although, it may also be worthwhile to run through all the available scikit-learn multi-label-compliant algorithms, just to see if there are any easy wins to be had.
Notes on accuracy
After fitting the simple OneVsRestClassifier seen above, I was disappointed by the low accuracy score. Little did I know that the evaluation metrics I was used to using were not appropriate for a multi-label scenario. Here’s a note from the OneVsRestClassifier documentation regarding accuracy:
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
That is pretty harsh. What we need is a metric that will reflect partial accuracy. For instance, we apply ‘ballet’ correctly, but we also apply ‘weather’ incorrectly to the same example text. This is not a wholly inaccurate classification, it is only partially inaccurate. Luckily, we have a few options.
- Hamming loss
- the fraction of the wrong labels to the total number of labels
- a loss function, so the optimal value is 0
- scikit-learn implementation of hamming loss
- Jaccard similarity coefficient score
- the size of the intersection divided by the size of the union of the sample sets
- ranges from 0 to 1, and 1 is the optimal score
- scikit-learn implementation of jaccard similarity
- Coverage error measure
- average “depth” (how far we need to go through the ranked scores) to cover all true labels
- the optimal value is the average number of true labels.
- scikit-learn implementation of coverage error
- Averaged (micro and macro) F1 scores
An example of hamming loss and jaccard similarity using scikit-learn:
from sklearn.metrics import hamming_loss from sklearn.metrics import jaccard_similarity_score y_pred = pipeline.predict(X_test) print(hamming_loss(y_test,y_pred)) print(jaccard_similarity_score(y_test, y_pred))
Above are my results using the simple OneVsRestClassifier described earlier. The hamming loss, seems good? The jaccard similarity, less so.
One approach to dealing with my hierarchical taxonomy would be to simply flatten it, ignoring the hierarchy entirely. And this is exactly what I’ve done so far. This seems to be a fine approach for the short term, as I have yet to explore all of the available multi-label algorithms described above. Perhaps, the ‘flat’ approach will be good enough. Perhaps not.
If not, there are some novel ideas out there that use the hierarchy to the advantage of the classifier. Unfortunately, most of these ideas are described in academic papers, with few including any bootstrap code.
One approach which seemed interesting is described in a PyData talk by Jurgen Van Gael: Hierarchical Text Classification using Python (and friends). There is a lot to chew on here, but essentially this approach uses a set of Naïve Bayes classifiers to route a document through the branches of our hierarchical tree, and then individual classifiers for each node in the branch. Using IPTC Media Topics as our example again, we might have a set of Naïve Bayes classifiers to route a document to one of the top level terms (arts, culture and entertainment, education, environment, politics, society, sport, etc.) and then a different classifier for any subsequent nodes in the tree. I’m assuming there would be a set of Naïve Bayes classifiers for any hierarchical level where multiple paths can be followed. Van Gael also notes that if a training example is associated with a class that is 5 levels deep, that training example is copied to each of that class’s ancestors. It seems like a promising approach, but it requires a lot of orchestration, also several more classifiers than just the simple flattened approach.
Another potential issue with a corpus tagged with a relatively deep taxonomy, is that many of the deepest labels will have less examples. The more granular a concept the less broadly it can be applied. If we look at a news corpus that has been tagged with the IPTC Media Topics taxonomy we will likely find plenty of examples for ‘health,’ but far fewer for ‘dietary supplements’ (which is 4 levels down from ‘health’).
Generally, our classification models are better served by having a balanced number of examples across the target classes. Given a large enough corpus we may be able to ensure that all classes are equally represented, but it is inevitable that some will lag behind.
A few options:
We could modify the individual binary classifiers we’ve wrapped with the one-vs.-rest classifier. For example, the LinearSVC classifier (shown above) has a ‘class_weight’ parameter which purports (if set to ‘balanced’) to automatically adjust weights inversely proportional to class frequencies in the input data. So, our instances of ‘dietary supplements’ which appear less frequently should be weighted appropriately.
We could use the imbalanced-learn library. In particular, the method RandomUnderSample can be easily added to our pipeline to equalize the number of examples per class before training begins. However, it is not clear if this will work in a multi-label scenario.
Or, if we have decided to use one of the adapted algorithms provided by scikit-multilearn (described above), we could follow their suggestion to use a k-fold cross-validation approach with sklearn.model_selection.KFold. The scikit-multilearn folks also mention:
If your data set exhibits a strong label co-occurrence structure you might want to use a label-combination based stratified k-fold.
But this method uses the label powerset approach, in which cooccurring labels are combined into unique classes. This would have the same drawbacks described above, in that our training becomes more expensive (many more classes) and we may be unable to accurately tag content in the future as the classifier can only tag content with label combinations it saw during training.
If you have any thoughts about where I should go next, or regarding any false assumptions I might have made above, I’d love to hear from you!