using fastText to classify phone call dialogue

fastText

fastText is a useful tool for training supervised and unsupervised models on text data. While the same is possible without fastText using sklearn, spaCy, etc., there are plenty of guides on those and not much information on fastText. You can use it very straightforwardly from the command line, or through the included Python library. Part of its appeal is that it is incredibly quick, simple, and doesn’t require much knowledge to use. To quote the official website:

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.

Use cases

I’m not going to do much with the actual model in this post, but the ability to tag lines from a salesperson or customer is very useful. I can imagine call centers being able to much more easily dissect and learn about their customer service reps, and get better ideas on how to de-escalate aggressive calls or expedite a phone call when a sale is a sure thing. Other aspects such as outcome and mannerisms could be pinpointed and the dialogue better analysed.

Setup

To install fastText into a Python project, you can either install it from the git repo or clone it into your folder and then run pipenv install -e fastText/

Cleaning the data

Like all machine learning problems, the data doesn’t come in the correct form to feed to the fastText Python module. While the official docs use sed/awk, which is much faster, preprocessing the data to work with fastText can be convoluted if your data doesn’t closely fit what is expected.
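As a rough Python equivalent of that kind of command-line normalization (not the exact commands from the docs), lowercasing and padding punctuation with spaces might look like:

```python
import re


def normalize(line):
    # lowercase and pad punctuation with spaces so "manager!" and "manager"
    # end up as the same token for fastText
    line = line.lower()
    return re.sub(r"([.!?,])", r" \1 ", line).strip()


print(normalize('I want to speak with a MANAGER!'))
# i want to speak with a manager !
```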

To set the data up for fastText we are going to do as little preprocessing as possible, since the dataset I am using has too many nuanced issues to dive into every normalization/cleaning step we could apply. We are just going to lowercase the lines and create the labels from the column that holds pretagged (and sometimes multiple) labels. Ideally this stuff would be cleaned prior to ingestion into the database, since processing the data with pandas can become a bit unrealistic at certain sizes.

import pandas as pd
import numpy as np
import csv


def read_data(file='./voiceops_query.csv'):
    df = pd.read_csv(file, sep='|')
    return df


def add_label(s):
    # keep only the category name, dropping any description after ':' or '('
    if ':' in s:
        s = s.split(':')[0]
    if '(' in s:
        s = s.split('(')[0]
    # strip() avoids a trailing hyphen when a '(' was preceded by a space
    return '__label__' + s.strip().lower().replace(' ', '-')


def get_labels(df, column='category_label', none_label='__label__none'):
    '''
    Convert the pretagged category column into fastText __label__ tags,
    filling empty rows with a "none" label.
    '''
    # replace all empty/NaN cells with the none label
    df = df.where(pd.notnull(df), none_label)
    for idx, row in df[column].items():
        if row != none_label:
            if '*' in row:
                # some rows carry multiple labels separated by '*'
                row = ' '.join([add_label(r) for r in row.split('*')])
            else:
                row = add_label(row)
            df.loc[idx, column] = row
    return df


def save_df(df, columns=['category_label', 'content']):
    # i couldn't figure out how to use pd.to_csv in this manner without the
    # quotes or without an error when trying to use quoting=csv.QUOTE_NONE,
    # if you know how to properly do this i would love to know! please contact me!
    df = df[columns].values
    with open('labeled_data.txt', 'w') as f:
        for line in df:
            f.write(f"{line[0]} {line[1].lower()}\n")
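To sanity-check the label format, we can run add_label on a couple of raw category strings (the category names here are invented for illustration, and this copy of the function includes a .strip() so a category like "Objection (price)" doesn't end up with a trailing hyphen):

```python
def add_label(s):
    # keep only the category name, dropping any description after ':' or '('
    if ':' in s:
        s = s.split(':')[0]
    if '(' in s:
        s = s.split('(')[0]
    return '__label__' + s.strip().lower().replace(' ', '-')


print(add_label('Places on hold: rep asks customer to wait'))  # __label__places-on-hold
print(add_label('Objection (price)'))                          # __label__objection
print(add_label('Hold start'))                                 # __label__hold-start
```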

Training the model

Fortunately most of the functionality is built into fastText, although there are a few big issues with the Python library that have open PRs but haven’t been merged (see: https://github.com/facebookresearch/fastText/pull/552 and https://github.com/facebookresearch/fastText/pull/517)

In a separate file we will train the model without doing much:

import numpy as np

import fastText
from fastText import train_supervised


class Model:
    def __init__(self, train_data, pretrained=None):
        if pretrained:
            pass  # placeholder for loading a pretrained model
        self.train_data = train_data

    def train(self, params=None):
        self.model = train_supervised(self.train_data)

    def create_ft_matrix(self, ft_matrix=None):
        self.ft_words = self.model.get_words()
        self.word_frequencies = dict(zip(*self.model.get_words(include_freq=True)))
        self.ft_matrix = ft_matrix
        if self.ft_matrix is None:
            self.ft_matrix = np.empty((len(self.ft_words), self.model.get_dimension()))
            for i, word in enumerate(self.ft_words):
                self.ft_matrix[i, :] = self.model.get_word_vector(word)

    def find_nearest_neighbor(self, query, vectors, n=10, cossims=None):
        if cossims is None:
            cossims = np.matmul(vectors, query)

        # divide the dot products by |query| * |vector| to get cosine similarity
        norms = np.sqrt((query**2).sum() * (vectors**2).sum(axis=1))
        cossims = cossims / norms
        # skip index 0 of the ranking since it is the query word itself
        result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
        return list(zip(result_i, cossims[result_i]))

    def nearest_words(self, word, n=10, word_freq=None):
        result = self.find_nearest_neighbor(
            self.model.get_word_vector(word), self.ft_matrix, n=n)
        if word_freq:
            # only keep neighbours that occur at least word_freq times
            return [(self.ft_words[r[0]], r[1]) for r in result
                    if self.word_frequencies[self.ft_words[r[0]]] >= word_freq]
        return [(self.ft_words[r[0]], r[1]) for r in result]

    def predict_sentence(self, sent, n=5):
        return self.model.predict(sent, k=n)


if __name__ == '__main__':
    model = Model('labeled_data.txt')
    model.train()
    print(model.predict_sentence('im super upset this is bullshit'))
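The nearest-neighbour lookup above is just cosine similarity between a query vector and every row of the embedding matrix. With a toy matrix (the words and vector values here are made up), the same math reduces to:

```python
import numpy as np

# toy 4-word "embedding matrix", one row per word (values are invented)
words = ['hold', 'wait', 'angry', 'upset']
matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])


def nearest(query_idx, n=2):
    query = matrix[query_idx]
    # same math as find_nearest_neighbor: dot products scaled by the norms
    cossims = matrix @ query
    norms = np.sqrt((query ** 2).sum() * (matrix ** 2).sum(axis=1))
    cossims = cossims / norms
    # drop the top hit, which is the query word itself
    order = np.argsort(-cossims)
    return [words[i] for i in order[1:n + 1]]


print(nearest(0))  # the neighbours of 'hold'
```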

Baseline

While we are not worrying about model validation/testing in this example (it’s important, just not always the most interesting process), the model isn’t particularly good.

The results out of the box, with this limited dataset and a huge number of none labels, mean pretty much anything that should obviously be tagged comes out as __label__none, with the rest incorrectly tagged as well. For instance, an angry customer saying something about talking to a manager, which you might expect to be labeled __label__speak-with-manager, gives:

>>> print(model.predict_sentence('im super upset this is bullshit'))
(('__label__none',
  '__label__places-on-hold',
  '__label__hold-start',
  '__label__hold-end',
  '__label__objection'),
  array([9.94461238e-01, 1.71548547e-03, 6.53617783e-04, 5.64708549e-04, 5.18437242e-04]))

Because the majority of labels are none, you would realistically use some sampling technique (e.g. StratifiedKFold), or have one model predict which lines should be tagged at all and then train another on just the labels that aren’t none. Both would be quite straightforward and could offer significant model improvements. For my example, I’m not sure the dataset is even correctly tagged throughout, so I am not going to do anything extra.
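If you did want to try the second approach, the filtering step is trivial: drop any training line whose tag is the none label before feeding the file to train_supervised. A minimal sketch (the sample lines here are invented):

```python
def keep_labeled(lines, none_label='__label__none'):
    # drop lines whose tag is the none label so the second model only
    # ever sees real categories
    return [l for l in lines if not l.startswith(none_label)]


lines = [
    '__label__none thanks for calling',
    '__label__speak-with-manager i want to talk to your manager',
    '__label__objection that price is too high',
]
print(keep_labeled(lines))
```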

How we could improve the model

While in this model we are labeling everything together, it is very obvious that who is speaking in the dialogue plays an important role in which labels apply. Handling this would require us to split the data into customer and customer-service lines, and would probably give much more useful results. Since we aren’t using any model validation, there’s no way to show for certain whether our model is actually getting better, but a few things we can do are use n-grams and pretrained vectors.
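Splitting by speaker could look something like the sketch below, writing one training file per side of the call; the speaker field and the sample rows are guesses, since the real schema isn't shown:

```python
# hypothetical rows of (speaker, label, content) -- the schema is a guess
rows = [
    ('customer', '__label__customer-expresses-frustration/anger', 'this is taking forever'),
    ('rep', '__label__places-on-hold', 'let me place you on a brief hold'),
    ('customer', '__label__none', 'okay'),
]


def split_by_speaker(rows):
    # group the fastText-formatted lines per speaker so each side of the
    # call can get its own labeled_data file and its own model
    by_speaker = {}
    for speaker, label, content in rows:
        by_speaker.setdefault(speaker, []).append(f'{label} {content}')
    return by_speaker


files = split_by_speaker(rows)
print(sorted(files))  # the speakers found in the data
```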

The pretrained vectors are available on the fastText website or you can just run

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip; unzip crawl-300d-2M.vec.zip;

and the only thing we change is a few parameters passed into train_supervised:

train_supervised(
            self.train_data,
            epoch=100,
            wordNgrams=2,
            loss="ns",
            pretrainedVectors="./crawl-300d-2M.vec")

Now, just by making the model parameters more extensive and using pretrained vectors, we get pretty good results for the same sentence (if we continue ignoring __label__none):

print(model.predict_sentence('im super upset this is bullshit'))
(('__label__none',
  "__label__customer-disagrees-with-rep's-claim",
  '__label__rep-is-angry-or-aggressive',
  '__label__customer-expresses-frustration/anger',
  '__label__customer-is-angry-or-aggressive'),
  array([1.00001001e+00, 1.00358202e-05, 1.00074540e-05, 1.00002699e-05, 1.00001935e-05]))

Code

The code and dataset are available here: https://gitlab.com/besiktas/callFastText but may change as I would like to integrate fastText into a more robust and interesting project.

Other

Interested in implementing this or something similar into a system you have? I’d love to help, please get in contact with graham(dot)annett(at)gmail(dot)com