Using fastText to Classify Phone Call Dialogue

June 26, 2018

fastText: A Short Guide

fastText is an open-source library for learning text representations (unsupervised) and training text classifiers (supervised). Compared with general-purpose tools such as scikit-learn or spaCy, it is known for its training speed and ease of use, which make it approachable even for those with a limited technical background. The official fastText website describes it as “lightweight” and able to run on “standard, generic hardware,” and notes that models can later be reduced in size to fit on mobile devices.

Practical Applications

In this article, we will explore a practical scenario: tagging dialogue lines from call transcripts. This functionality is highly relevant for applications ranging from customer service in call centers to record-keeping in team meetings. By analyzing these tagged transcripts, businesses can extract insights on call outcomes, customer service quality, and communication patterns.

Setting Up fastText

To use fastText from your Python project, you can either install it directly from its GitHub repository, or clone the repository into your project directory and install it with a package manager such as pipenv:

pipenv install -e fastText/

Data Preparation

Effective machine learning models require well-prepared data. While the fastText documentation suggests using UNIX tools like sed/awk for speed, here we present a Python approach for those working within a different ecosystem or with more complex data sets. Our aim is minimal preprocessing, adjusting only as necessary to meet fastText’s input requirements. Below is a Python script that prepares a dataset by normalizing text and creating appropriate labels:

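import pandas as pd
import numpy as np  # used by the Model class further down

def read_data(file='./data_query.csv'):
    # The export is pipe-delimited rather than comma-delimited.
    return pd.read_csv(file, sep='|')

def add_label(s):
    # Keep only the category name: drop anything after ':' or '('.
    # strip() avoids a trailing '-' when the name is followed by ' ('.
    s = s.split(':', 1)[0].split('(', 1)[0].strip()
    return '__label__' + s.lower().replace(' ', '-')

def get_labels(df, column='category_label', none_label='__label__none'):
    # Rows without a category get the explicit "none" label.
    df = df.where(pd.notnull(df), none_label)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for idx, row in df[column].items():
        if row != none_label:
            # Multiple categories are separated by '*'; emit one label per category.
            df.loc[idx, column] = ' '.join(add_label(r) for r in row.split('*'))
    return df

def save_df(df, columns=('category_label', 'content')):
    rows = df[list(columns)].values
    with open('labeled_data.txt', 'w') as f:
        for label, content in rows:
            # One example per line: labels first, then the lowercased text.
            f.write(f"{label} {content.lower()}\n")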
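
To make the target format concrete: in supervised mode, fastText reads one example per line, with every label prefixed by __label__ and followed by the text. The sketch below follows the add_label logic above (with an added strip() so a space before a parenthesis does not leave a trailing hyphen) and builds a single training line. The helper to_fasttext_line and the sample category string are hypothetical, made up for illustration.

```python
def add_label(raw):
    """Normalize one raw category name into a fastText label."""
    name = raw.split(':', 1)[0].split('(', 1)[0].strip()
    return '__label__' + name.lower().replace(' ', '-')


def to_fasttext_line(raw_categories, content, sep='*'):
    """Build one training line: all labels first, then the lowercased text."""
    labels = ' '.join(add_label(part) for part in raw_categories.split(sep))
    # fastText treats each line as one example, so collapse embedded newlines.
    text = ' '.join(content.lower().split())
    return f'{labels} {text}'


# Hypothetical raw values; real category names depend on your data.
line = to_fasttext_line('Complaint: billing*Follow Up (agent)',
                        'I was charged twice\nthis month')
print(line)
# __label__complaint __label__follow-up i was charged twice this month
```

Collapsing whitespace is a choice made for this sketch: a newline inside the dialogue text would otherwise split one example into two lines.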

Model Training

The beauty of fastText lies in its simplicity for model training. Here, we encapsulate the training process in a Python class, focusing on model instantiation, training, and utility methods to explore the learned embeddings:

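import numpy as np
# The bindings were published as `fastText` when this was written;
# newer releases use the lowercase module name `fasttext`.
from fastText import train_supervised

class Model:
    def __init__(self, train_data, pretrained=None):
        self.train_data = train_data
        self.pretrained = pretrained  # optional path to a .vec file

    def train(self, params=None):
        params = dict(params or {})
        if self.pretrained:
            params.setdefault('pretrainedVectors', self.pretrained)
        self.model = train_supervised(self.train_data, **params)
        self.prepare_embeddings()

    def prepare_embeddings(self):
        # get_words(include_freq=True) returns a (words, frequencies) pair.
        self.ft_words, freqs = self.model.get_words(include_freq=True)
        self.word_frequencies = dict(zip(self.ft_words, freqs))
        self.ft_matrix = np.array([self.model.get_word_vector(w) for w in self.ft_words])

    def nearest_words(self, word, n=10):
        query = self.model.get_word_vector(word)
        cossims = np.dot(self.ft_matrix, query) / (np.linalg.norm(self.ft_matrix, axis=1) * np.linalg.norm(query))
        # Sort by descending similarity. Position 0 is normally the query
        # word itself (when it is in the vocabulary), so skip it.
        result_i = np.argsort(-cossims)[1:n+1]
        return [(self.ft_words[i], cossims[i]) for i in result_i]

    def predict_sentence(self, sentence, n=5):
        return self.model.predict(sentence, k=n)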

if __name__ == '__main__':
    model = Model('labeled_data.txt')
    model.train()
    print(model.predict_sentence('I am very upset this is unacceptable'))
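
The nearest_words method ranks vocabulary words by cosine similarity to a query vector. As a dependency-free illustration of that ranking step, the sketch below applies the same idea to a handful of toy vectors. The vectors and words are invented; real fastText vectors have 100 or more dimensions and would come from get_word_vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(query_word, vectors, n=2):
    """Return the n words most similar to query_word, excluding the word itself."""
    query = vectors[query_word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != query_word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-d vectors, invented for illustration only.
toy_vectors = {
    'upset': [1.0, 0.1],
    'angry': [0.9, 0.2],
    'refund': [0.1, 1.0],
    'thanks': [-1.0, 0.0],
}
neighbors = nearest_words('upset', toy_vectors)
print([word for word, _ in neighbors])
# ['angry', 'refund']
```

Excluding the query word by name, rather than by dropping the first result, also behaves sensibly when the query is not the top-ranked entry.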

Model Evaluation and Improvement

Initially, the model’s performance may be limited by the over-representation of neutral or “none” labels. Strategies such as stratified sampling or class rebalancing (for example, downsampling the majority label) can address this imbalance. Further improvements may come from word n-grams, which capture short phrases, and from pretrained vectors, which give the model a better starting point for words that are rare in the training data.
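
One simple way to reduce the dominance of the “none” label is to downsample it before training. The following is a minimal sketch, assuming the labeled_data.txt line format produced earlier; the function name downsample_label, the keep_ratio value, and the toy sentences are illustrative choices, not part of fastText.

```python
import random


def downsample_label(lines, label='__label__none', keep_ratio=0.25, seed=0):
    """Keep every line that carries another label, and a fraction of the rest."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = []
    for line in lines:
        labels = [tok for tok in line.split() if tok.startswith('__label__')]
        only_majority = labels == [label]
        # Lines with any other label are always kept.
        if not only_majority or rng.random() < keep_ratio:
            kept.append(line)
    return kept


# Toy data in the labeled_data.txt format; the sentences are invented.
toy_lines = (
    ['__label__none okay one moment'] * 8
    + ['__label__complaint this is unacceptable',
       '__label__complaint __label__none i want a refund']
)
balanced = downsample_label(toy_lines, keep_ratio=0.25)
print(len(toy_lines), '->', len(balanced))
```

The right keep_ratio depends on how skewed the data is, so it is worth trying a few values and comparing results on a held-out set that keeps the original label distribution.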

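To use pretrained vectors, download them from the fastText website and pass the file path at training time. The model dimension must match the dimension of the vectors, so dim is set to 300 for the 300-dimensional crawl vectors. Inside Model.train, the call becomes:

train_supervised(
    self.train_data,
    epoch=100,
    wordNgrams=2,
    loss="ns",
    dim=300,
    pretrainedVectors="./crawl-300d-2M.vec")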

By adjusting these parameters, you can often improve the model’s accuracy and obtain more reliable and nuanced text classification. Measure each change on held-out data to confirm that it actually helps.
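
To compare parameter settings, you need a metric on held-out data. As far as we know, the fastText Python bindings report precision and recall at k through the model's test method; the sketch below computes the same kind of pooled precision and recall by hand, so the definition is explicit. It is an independent illustration rather than fastText's own code, and the gold and predicted labels are invented.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    n_correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    # Pool counts over all examples before dividing.
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Invented labels for three held-out dialogue lines.
gold = [{'complaint'}, {'billing', 'refund'}, {'none'}]
predicted = [['complaint', 'billing'], ['refund', 'complaint'], ['complaint', 'none']]

p1, r1 = precision_recall_at_k(gold, predicted, k=1)
p2, r2 = precision_recall_at_k(gold, predicted, k=2)
print(round(p1, 3), round(r1, 3))  # 0.667 0.5
print(round(p2, 3), round(r2, 3))  # 0.5 0.75
```

Raising k trades precision for recall, which matters here because a single dialogue line can carry several labels.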

Expanded Data Preparation

Additional Cleaning Techniques

Our initial approach to data cleaning involves minimal preprocessing: lowercasing the text and creating structured labels. Further refinement can sometimes improve model performance, although each step should be checked against held-out data. This might include:

Removing Stop Words: Stop words such as “the”, “is”, and “at” often add little value to text classification models. Removing these can help reduce the dimensionality of the model and focus on more meaningful words.
Stemming and Lemmatization: Reducing words to their root forms helps in generalizing different forms of the same word, enabling the model to learn from the base meaning rather than specific derivations.
Handling Typos and Variants: Implementing spell check or synonym replacement can normalize the vocabulary, ensuring that variations in spelling do not affect the training process.

Example of implementing additional cleaning:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires a one-time nltk.download('stopwords').
# Build these once rather than on every call.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_text(text):
    # Remove stop words, then stem what remains.
    # Lowercase first: the NLTK stop-word list is all lowercase.
    words = text.lower().split()
    return ' '.join(STEMMER.stem(word) for word in words if word not in STOP_WORDS)
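
The example above covers stop words and stemming but not typos. One lightweight way to handle those is to map each unknown word to its closest match in a known vocabulary using the standard library's difflib. This is a rough sketch under stated assumptions: the function normalize_typos, the cutoff of 0.8, and the tiny vocabulary are all illustrative, and a production system would more likely use a dedicated spell checker.

```python
import difflib


def normalize_typos(text, vocabulary, cutoff=0.8):
    """Replace each out-of-vocabulary word with its closest vocabulary entry, if any."""
    vocab = sorted(vocabulary)
    fixed = []
    for word in text.lower().split():
        if word in vocabulary:
            fixed.append(word)
            continue
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)
    return ' '.join(fixed)


# Hypothetical vocabulary; in practice this could be the frequent words in your corpus.
known_words = {'refund', 'account', 'cancel', 'my', 'please'}
print(normalize_typos('please cancle my acount', known_words))
# please cancel my account
```

Words with no sufficiently close match are left unchanged, so rare but legitimate terms are not rewritten. Note that fastText's subword features already give some robustness to misspellings when character n-grams are enabled, so this step may matter less than it would for a purely word-level model.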

Through this guide, we’ve covered the essential aspects of using fastText for practical text classification tasks, setting a foundation for further exploration and refinement of your machine learning models.