Using fastText to classify phone call dialogue
fastText
fastText is a useful tool for training supervised and unsupervised models on text data. While the same thing is possible without fastText using sklearn, spaCy, etc., there are plenty of guides for those libraries and not much information on fastText. You can use it very straightforwardly from the command line, or through the included Python library. Part of its appeal is that it is incredibly quick, straightforward, and doesn't require much background knowledge to use. To quote the official website:
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
Use cases
I’m not going to do much with the actual model in this post; instead I’ll give an example where we tag lines from a call transcript. This could be used in a variety of applications, from call centers to team meetings. Using this tagged output we could further analyze the data for outcomes, call mannerisms, and so on.
Setup
To install fastText into a Python project, you can either install it from the git repository, or clone it into your folder and then install it with something such as pipenv: pipenv install -e fastText/
Cleaning the data
Like most machine learning problems, the data I am using isn’t by default in the correct form to feed to the fastText Python module. The official docs use sed/awk, which is much faster, but preprocessing the data to work with fastText can get convoluted if your data doesn’t fit very closely to what is expected.
To set the data up for fastText we are going to do as little preprocessing as possible, since the dataset I am using has too many nuanced issues to dive through every normalization/cleaning step we could do. We are just going to lowercase the words and create the labels from the column that holds the pretagged labels (sometimes more than one per row).
import csv

import numpy as np
import pandas as pd


def read_data(file='./voiceops_query.csv'):
    df = pd.read_csv(file, sep='|')
    return df


def add_label(s):
    # anything after ':' or '(' is descriptive text we drop
    if ':' in s:
        s = s.split(':', 1)[0]
    if '(' in s:
        s = s.split('(', 1)[0]
    return '__label__' + s.lower().replace(' ', '-')


def get_labels(df, column='category_label', none_label='__label__none'):
    '''
    set the data to be read
    '''
    # replace all empty cells with the none label
    df = df.where((pd.notnull(df)), none_label)
    for idx, row in df[column].iteritems():
        if row != none_label:
            if '*' in row:
                # cells can hold multiple labels separated by '*'
                row = ' '.join([add_label(r) for r in row.split('*')])
            else:
                row = add_label(row)
            df.loc[idx, column] = row
    return df


def save_df(df, columns=['category_label', 'content']):
    # i couldn't figure out how to use pd.to_csv in this manner without the quotes
    # or without an error when trying to use quoting=csv.QUOTE_NONE; if you know how
    # to properly do this i would love to know! please contact me!
    df = df[columns].values
    with open('labeled_data.txt', 'w') as f:
        for line in df:
            f.write(f"{line[0]} {line[1].lower()}\n")
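To make the labeling scheme concrete, here is a quick standalone check of the transformation above. The sample category strings are hypothetical, not rows from the real dataset:

```python
# Standalone check of the labeling scheme used in add_label/get_labels above.
# The sample strings are made up for illustration.
def add_label(s):
    # anything after ':' or '(' is descriptive text we drop
    if ':' in s:
        s = s.split(':', 1)[0]
    if '(' in s:
        s = s.split('(', 1)[0]
    return '__label__' + s.lower().replace(' ', '-')

print(add_label('Places on hold'))    # __label__places-on-hold
print(add_label('Objection: price'))  # __label__objection

# multi-label cells are separated by '*'
cell = 'Hold Start*Hold End'
print(' '.join(add_label(r) for r in cell.split('*')))
# __label__hold-start __label__hold-end
```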
Training the model
Fortunately most of the functionality is built into fastText.
In a separate file we will train the model without doing much:
import numpy as np

import fastText
from fastText import train_supervised


class Model:
    def __init__(self, train_data, pretrained=None):
        if pretrained:
            pass
        self.train_data = train_data

    def train(self, params=None):
        self.model = train_supervised(self.train_data)

    def create_ft_matrix(self, ft_matrix=None):
        self.ft_words = self.model.get_words()
        self.word_frequencies = dict(zip(*self.model.get_words(include_freq=True)))
        self.ft_matrix = ft_matrix
        if self.ft_matrix is None:
            self.ft_matrix = np.empty((len(self.ft_words), self.model.get_dimension()))
            for i, word in enumerate(self.ft_words):
                self.ft_matrix[i, :] = self.model.get_word_vector(word)

    def find_nearest_neighbor(self, query, vectors, n=10, cossims=None):
        if cossims is None:
            cossims = np.matmul(vectors, query)
        norms = np.sqrt((query ** 2).sum() * (vectors ** 2).sum(axis=1))
        cossims = cossims / norms
        # skip index 0: the nearest vector is the query word itself
        result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
        return list(zip(result_i, cossims[result_i]))

    def nearest_words(self, word, n=10, word_freq=None):
        result = self.find_nearest_neighbor(
            self.model.get_word_vector(word), self.ft_matrix, n=n)
        if word_freq:
            return [(self.ft_words[r[0]], r[1]) for r in result
                    if self.word_frequencies[self.ft_words[r[0]]] >= word_freq]
        return [(self.ft_words[r[0]], r[1]) for r in result]

    def predict_sentence(self, sent, n=5):
        return self.model.predict(sent, k=n)


if __name__ == '__main__':
    model = Model('labeled_data.txt')
    model.train()
    print(model.predict_sentence('im super upset this is bullshit'))
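The math inside find_nearest_neighbor can be sketched on its own with a few made-up 2-D vectors: cosine similarity between a query vector and every row of the word-vector matrix, then argpartition to pull out the top n matches while skipping the query word's own best hit:

```python
import numpy as np

# Toy illustration of the nearest-neighbor math above. The 2-D vectors are
# made up; in the real class the rows come from model.get_word_vector.
vectors = np.array([[1.0, 0.0],   # pretend this row is the query word itself
                    [0.0, 1.0],
                    [0.9, 0.1]])
query = np.array([1.0, 0.0])

# unnormalized dot products, then divide by the product of the norms
cossims = vectors @ query
norms = np.sqrt((query ** 2).sum() * (vectors ** 2).sum(axis=1))
cossims = cossims / norms

n = 2
# partially sort by descending similarity; [1:n+1] skips the query itself
result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
print(list(zip(result_i, cossims[result_i])))  # row 2 first, then row 1
```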
Baseline
The results out of the box, with this limited dataset and a huge number of none labels, mean that pretty much anything that should obviously be tagged ends up as __label__none,
and is sometimes incorrectly tagged as well. For instance, for an angry customer saying something about talking to a manager, you might expect the label __label__speak-with-manager,
but:
>>> print(model.predict_sentence('im super upset this is experience'))
(('__label__none',
  '__label__places-on-hold',
  '__label__hold-start',
  '__label__hold-end',
  '__label__objection'),
 array([9.94461238e-01, 1.71548547e-03, 6.53617783e-04, 5.64708549e-04, 5.18437242e-04]))
Because the majority of labels are none, you would realistically use some sampling technique (e.g. StratifiedKFold), have one model predict which lines should be tagged at all and then train on just the labels that aren’t none, or use class weights. Either approach would be quite straightforward and could offer significant model improvements.
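As a sketch of the rebalancing idea, here is what downsampling the dominant none label might look like. Everything below is hypothetical: four stand-in lines take the place of the real labeled_data.txt:

```python
import random

# Hypothetical sketch of rebalancing: keep only as many __label__none lines
# as there are genuinely tagged lines, so the model can't win by always
# predicting none. These lines stand in for the real labeled_data.txt.
random.seed(0)
lines = [
    '__label__none thanks for calling',
    '__label__none one moment please',
    '__label__none okay',
    '__label__hold-start let me place you on a brief hold',
]
none_lines = [l for l in lines if l.startswith('__label__none ')]
tagged = [l for l in lines if not l.startswith('__label__none ')]

# keep roughly as many none lines as tagged lines
kept_none = random.sample(none_lines, min(len(none_lines), len(tagged)))
balanced = tagged + kept_none
print(len(tagged), len(kept_none), len(balanced))  # 1 1 2
```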
How we could improve the model
While with this model we are labeling everything together, it is very obvious that who is speaking in the dialogue plays an important role in which labels apply. Since we aren’t using any model validation, there is no way to show for certain that our model is actually getting better, but a few things we can try are n-grams and pretrained vectors.
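Before comparing models, a minimal hold-out split would at least let us score them with fastText's built-in model.test() instead of eyeballing predictions. This is only a sketch with generated stand-in lines; in practice you'd read labeled_data.txt and write the two splits out to files:

```python
import random

# Minimal hold-out split sketch. The generated lines stand in for the real
# labeled_data.txt; in practice you'd read that file, write train.txt and
# valid.txt, then call train_supervised('train.txt') and model.test('valid.txt').
random.seed(42)
lines = [f'__label__none stand-in line {i}' for i in range(10)]
random.shuffle(lines)

cut = int(0.8 * len(lines))  # 80/20 split
train, valid = lines[:cut], lines[cut:]
print(len(train), len(valid))  # 8 2
```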
The pretrained vectors are available on the fastText website, or you can just run:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip; unzip crawl-300d-2M.vec.zip;
and the only thing we changed is the parameters passed into train_supervised:
train_supervised(
    self.train_data,
    epoch=100,
    wordNgrams=2,
    loss="ns",
    pretrainedVectors="./crawl-300d-2M.vec")
Now, just by making the model parameters more extensive and using pretrained vectors, we are able to get pretty good results (via manual verification only) for the same sentence, if we continue ignoring __label__none:
print(model.predict_sentence('im super upset this is experience'))
(('__label__none',
"__label__customer-disagrees-with-rep's-claim",
'__label__rep-is-angry-or-aggressive',
'__label__customer-expresses-frustration/anger',
'__label__customer-is-angry-or-aggressive'),
array([1.00001001e+00, 1.00358202e-05, 1.00074540e-05, 1.00002699e-05, 1.00001935e-05]))