Using fastText to classify phone call dialogue
This is the first part of using fastText to tag lines of dialogue from sales-call transcripts. fastText is an open-source library for supervised and unsupervised learning of text representations and classifiers, and the thing I like about it is how fast and simple it is to get a model going compared to something like sklearn or spaCy. Here I walk through setting it up, getting the call data into the format it wants, training a supervised model, and a few things that helped the results.
Background
The official fastText site describes it as “lightweight” and able to run on “standard, generic hardware,” with the potential to optimize models for mobile devices. I first heard about it as a no-fuss way to train word embeddings and text classifiers. fastText uses character n-grams, so it can handle words it hasn’t seen before, and because pretrained vectors are available for a ton of languages, it’s a common starting point for classification, sentiment, and language identification tasks without needing much compute.
What I’m using it for
The goal for us is tagging individual lines of dialogue from call transcripts with a category. Once the lines are tagged, you can start pulling out things like how a call went or where it turned, which is handy for call-center transcripts or even just meeting notes.
Setting up fastText
To use fastText from Python, you can install it straight from the GitHub repo, or clone it into your project and install it with something like pipenv (pipenv install -e fastText/).
Data preparation
While the fastText docs suggest using CLI tools like sed and awk for speed, I’m doing the prep in Python, since that is what the rest of my pipeline already uses and I expected the data to be complex enough that CLI tools alone would become cumbersome. The goal is minimal preprocessing, just enough to get the text into the format fastText expects with the right labels. The script below normalizes the text and builds the labels:
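import pandas as pd
import numpy as np


def read_data(file='./data_query.csv'):
    # The export is pipe-delimited rather than comma-delimited.
    return pd.read_csv(file, sep='|')


def add_label(s):
    # Keep only the category name: drop anything after ':' or '(',
    # and strip whitespace so a trailing space does not become a hyphen.
    s = s.split(':', 1)[0].split('(', 1)[0].strip()
    return '__label__' + s.lower().replace(' ', '-')


def get_labels(df, column='category_label', none_label='__label__none'):
    # Missing values (in any column) are replaced with the "none" label.
    df = df.where(pd.notnull(df), none_label)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for idx, row in df[column].items():
        if row != none_label:
            # A '*' separates multiple categories on one line of dialogue.
            row = ' '.join(add_label(r) for r in row.split('*')) if '*' in row else add_label(row)
            df.loc[idx, column] = row
    return df


def save_df(df, columns=('category_label', 'content')):
    rows = df[list(columns)].values
    # fastText expects one example per line: labels first, then the text.
    with open('labeled_data.txt', 'w') as f:
        for line in rows:
            f.write(f"{line[0]} {line[1].lower()}\n")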
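To make the target format concrete, here is a small self-contained sketch of what one raw row turns into. It restates the label normalization from the script (stripping stray whitespace before hyphenating). The category strings and the to_fasttext_line helper are made up for illustration and are not from the real call data.

```python
def add_label(raw):
    # Keep the category name only: drop anything after ':' or '('.
    name = raw.split(':', 1)[0].split('(', 1)[0].strip()
    return '__label__' + name.lower().replace(' ', '-')


def to_fasttext_line(category, content, none_label='__label__none'):
    # Hypothetical helper: builds one training line in "labels text" form.
    if not category:
        labels = none_label
    else:
        # '*' separates multiple categories on the same line of dialogue.
        labels = ' '.join(add_label(c) for c in category.split('*'))
    return f"{labels} {content.lower()}"


if __name__ == '__main__':
    # Invented example rows.
    print(to_fasttext_line('Pricing Objection (late)', 'That is more than we budgeted for'))
    print(to_fasttext_line('Complaint*Escalation: manager', 'I want to talk to your manager'))
    print(to_fasttext_line(None, 'Okay, one moment'))
```

Each printed line starts with one or more __label__ tokens followed by the lowercased text, which is the layout fastText's supervised mode reads.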
Model training
Training is the easy part with fastText. I wrapped it in a small class that handles creating and training the model, plus a couple of helper methods for poking at the learned embeddings:
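import numpy as np
# Newer fastText releases expose the Python module as lowercase `fasttext`;
# adjust this import if `fastText` is not found.
from fastText import train_supervised


class Model:
    def __init__(self, train_data, pretrained=None):
        self.train_data = train_data
        # Optional path to a .vec file of pretrained word vectors.
        self.pretrained = pretrained

    def train(self, params=None):
        # Extra keyword arguments (epoch, wordNgrams, ...) are passed straight through.
        params = dict(params or {})
        if self.pretrained:
            params['pretrainedVectors'] = self.pretrained
        self.model = train_supervised(self.train_data, **params)
        self.prepare_embeddings()

    def prepare_embeddings(self):
        # get_words(include_freq=True) returns two parallel sequences: words and counts.
        words, freqs = self.model.get_words(include_freq=True)
        self.ft_words = words
        self.word_frequencies = dict(zip(words, freqs))
        self.ft_matrix = np.array([self.model.get_word_vector(w) for w in words])

    def nearest_words(self, word, n=10):
        query = self.model.get_word_vector(word)
        # Cosine similarity between the query and every vocabulary vector.
        cossims = np.dot(self.ft_matrix, query) / (np.linalg.norm(self.ft_matrix, axis=1) * np.linalg.norm(query))
        # Order the top n+1 matches and skip the first, which is normally the query word itself.
        result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
        return [(self.ft_words[i], cossims[i]) for i in result_i]

    def predict_sentence(self, sentence, n=5):
        # Returns a (labels, probabilities) pair for the top n labels.
        return self.model.predict(sentence, k=n)


if __name__ == '__main__':
    model = Model('labeled_data.txt')
    model.train()
    print(model.predict_sentence('I am very upset this is unacceptable'))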
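For a single string, predict returns a pair: a tuple of label strings and a matching sequence of probabilities. A small helper can make that easier to read. This is only a sketch: format_predictions is a hypothetical name, and the labels and scores in the usage lines are invented stand-ins for real model output.

```python
def format_predictions(labels, probs, prefix='__label__', min_prob=0.0):
    """Pair each label with its probability, dropping the fastText prefix."""
    results = []
    for label, prob in zip(labels, probs):
        if prob < min_prob:
            continue  # skip labels below the chosen threshold
        name = label[len(prefix):] if label.startswith(prefix) else label
        results.append((name, round(float(prob), 3)))
    return results


if __name__ == '__main__':
    # Invented values standing in for what predict_sentence(...) returns.
    fake_output = (('__label__complaint', '__label__none'), [0.75, 0.25])
    print(format_predictions(*fake_output))
    print(format_predictions(*fake_output, min_prob=0.5))
```

When feeding transcript lines to predict, strip newline characters first, since it handles one line of text at a time.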
Improving the model
Out of the box the model tends to get dragged down by how many lines fall under the neutral / none label. Stratified sampling or class weighting helps with that imbalance. Beyond that, n-grams and pretrained vectors are the two things that gave me the most improvement.
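One simple way to act on the imbalance is to resample the training file before training. The sketch below does random downsampling of the dominant label, which is a cruder cousin of stratified sampling and not necessarily what was used for the results in this post. The function name, the max_ratio parameter, and the seed are all assumptions made for illustration.

```python
import random


def downsample_label(lines, label='__label__none', max_ratio=1.0, seed=13):
    """Cap lines whose first label is `label` at max_ratio times all other lines."""
    target, others = [], []
    for line in lines:
        # Only the first label on the line decides which bucket it goes in.
        first = line.split(' ', 1)[0]
        (target if first == label else others).append(line)

    cap = int(len(others) * max_ratio)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if len(target) > cap:
        target = rng.sample(target, cap)

    kept = target + others
    rng.shuffle(kept)
    return kept


if __name__ == '__main__':
    # Invented toy data: 90 neutral lines and 10 complaint lines.
    lines = ['__label__none hi there'] * 90 + ['__label__complaint this is bad'] * 10
    balanced = downsample_label(lines)
    print(len(balanced))
```

The resampled lines would then be written back out to a new training file, leaving any validation file untouched so the scores still reflect the real label mix.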
To use pretrained vectors, download them from the fastText site and pass them in during training:
train_supervised(
    self.train_data,
    epoch=100,
    wordNgrams=2,
    loss="ns",
    dim=300,  # must match the 300-dimensional pretrained vectors
    pretrainedVectors="./crawl-300d-2M.vec")
Tuning these (epochs, n-grams, and the pretrained vectors) is what moved the accuracy the most for me.
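If you want to try several settings systematically, a small grid generator is enough to get started. This is a sketch under assumptions: param_grid is a hypothetical helper, and the values in the grid are illustrative rather than the ones used here. The keys (epoch, wordNgrams, lr) are real train_supervised arguments.

```python
from itertools import product


def param_grid(options):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == '__main__':
    # Illustrative values only, not the settings used in the post.
    grid = {'epoch': [25, 100], 'wordNgrams': [1, 2], 'lr': [0.1, 0.5]}
    for params in param_grid(grid):
        # Each dict can be passed on as train_supervised(path, **params)
        # and scored on a held-out file before picking a winner.
        print(params)
```

Scoring each candidate with the model's test method on a held-out file (it reports the number of examples, precision, and recall) keeps the comparison honest.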
More data cleaning
The prep above is pretty minimal (lowercasing and building labels). If you want to push the model further, a few more cleaning steps can help:
- Removing Stop Words: Stop words such as “the”, “is”, and “at” often add little value to text classification models. Removing these can help reduce the dimensionality of the model and focus on more meaningful words.
- Stemming and Lemmatization: Reducing words to their root forms helps in generalizing different forms of the same word, enabling the model to learn from the base meaning rather than specific derivations.
- Handling Typos and Variants: Implementing spell check or synonym replacement can normalize the vocabulary, ensuring that variations in spelling do not affect the training process.
Example of implementing additional cleaning:
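from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires the NLTK stop word list: run nltk.download('stopwords') once beforehand.
# Both objects are built once at import time rather than on every call.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


def clean_text(text):
    # Remove stop words (case-insensitively) and stem the remaining words
    words = text.split()
    filtered_words = [STEMMER.stem(word) for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)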
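The third item in the list, handling typos and variants, is not covered by clean_text. A minimal sketch of one approach is a hand-maintained lookup table applied token by token. The VARIANTS entries and the normalize_variants name are hypothetical, and a real table would come from reading through the transcripts.

```python
import re

# Hypothetical variant map; real entries would come from the transcripts.
VARIANTS = {
    'ok': 'okay',
    'thx': 'thanks',
    'acct': 'account',
}

# Tokens are runs of lowercase letters, digits, and straight apostrophes.
_TOKEN = re.compile(r"[a-z0-9']+")


def normalize_variants(text, variants=VARIANTS):
    """Lowercase the text, drop punctuation, and map known variants."""
    tokens = _TOKEN.findall(text.lower())
    return ' '.join(variants.get(tok, tok) for tok in tokens)


if __name__ == '__main__':
    print(normalize_variants('Thx, my acct is locked'))
```

Note that this also strips punctuation, which may or may not be what you want if punctuation carries signal in your transcripts.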
That covers the basic pipeline for part one: label the transcript lines, train a fastText classifier, and do a bit of cleanup to squeeze out more accuracy. I’d like to follow this up with a closer look at how well it actually does on the call data and where it falls over.