DeepQA with Customer Service Reps


The dataset we will be using is quite special: this type of data is usually proprietary, and for good reason. While it may include a multitude of different

Preparing the data

While we are not going to do much beyond running the data through a preliminary DeepQA repo to see what happens, we still need to format the data. This will also give us a grasp of how mediocre the data is off the bat (lots of inaudible segments and random cut-offs, not much information).
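
One rough way to put a number on that impression is to count how many utterances are very short or contain a placeholder marker. This is only a minimal sketch: it assumes the transcript text lives in a `content` column and that `[/]` marks a removed or inaudible span, and both the `quality_summary` name and the three-word threshold are hypothetical choices.

```python
import pandas as pd

def quality_summary(df, marker='[/]', min_words=3):
    """Return rough quality indicators for a transcript DataFrame."""
    # missing utterances are treated as empty strings
    text = df['content'].fillna('').astype(str)
    # regex=False so the brackets in the marker are matched literally
    has_marker = text.str.contains(marker, regex=False)
    # assumption: fewer than min_words words counts as a cut-off line
    is_short = text.str.split().str.len() < min_words
    return {
        'lines': len(df),
        'with_marker': int(has_marker.sum()),
        'short': int(is_short.sum()),
    }

sample = pd.DataFrame({'content': [
    'Hi, my name is Mico [/].',
    'Hi Mico. How are you?',
    '[/].',
    None,
]})
summary = quality_summary(sample)
```

Running the same function on the full DataFrame gives a quick feel for how much of each call is usable before any training is attempted.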

First Steps

The data itself is a bit overwhelming. I was interested in using OpenNMT and various other frameworks to look at the problem and find useful insight related to callers, sentiment, phrases, etc. This is a pretty comprehensive dataset with lots of good data.

first_name      content                                            is_customer  ranker
Raymond Marion  Thank you for calling Government Vacation Rewa...  f            1
Raymond Marion  Hi, my name is Mico [/].                           t            2
Raymond Marion  Hi Mico. How are you?                              f            3
Raymond Marion  I'm great. How are you?                            t            4
Raymond Marion  All right. What's the phone number on your acc...  f            5
Raymond Marion  I believe it's [/]. I believe that's still I h...  t            6
Raymond Marion  Le me go cut it. I got you. Can I get you to v...  f            7
Raymond Marion  [/].                                               t            8
Raymond Marion  All right. Cool. Let me see here. All right. I...  f            9
Raymond Marion  Okay. So I haven't logged in for a while and ...   t            10
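
In the sample above, `is_customer` appears to hold `t`/`f` flags. Reading `t` as "the customer is speaking" is an assumption based only on these rows. Under that assumption, a small sketch like the following turns the flag into a boolean and adds a speaker label; the `add_speaker` helper and the `speaker` column are hypothetical names.

```python
import pandas as pd

def add_speaker(df):
    """Return a copy with a boolean customer flag and a speaker label."""
    out = df.copy()
    # assumption: 't' means the customer is speaking, 'f' means the rep
    out['is_customer'] = out['is_customer'].map({'t': True, 'f': False})
    out['speaker'] = out['is_customer'].map({True: 'customer', False: 'rep'})
    return out

sample = pd.DataFrame({
    'content': ['Hi Mico. How are you?', "I'm great. How are you?"],
    'is_customer': ['f', 't'],
    'ranker': [3, 4],
})
labeled = add_speaker(sample)
# number of turns per speaker
turns = labeled['speaker'].value_counts().to_dict()
```

Having an explicit speaker label makes it easier to check that the turns alternate as expected before the calls are converted.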

Converting the data for DeepQA format

The data has a lot of discrepancies, because the majority of the calls seem to have been transcribed by non-native English speakers, with few statistical measures to verify what is correct. Fortunately, the data is in chronological order. This allows us to use the DeepQA format and a simple script.

import pandas as pd

def load_data():
    data_file = '../data/data.csv'
    df = pd.read_csv(data_file, sep=',')
    return df

def prepare_deepqa(df, save=True):
    filepath = 'DeepQA.txt'
    sep = '==='  # marks the end of one conversation
    new_list = []

    for call_id, group in df.groupby('call_id'):
        prev_val = 0
        recurring = 0
        for _, row in group.iterrows():
            if row.ranker > prev_val:
                prev_val = row.ranker
            else:
                # ranker did not increase: a repeated or out-of-order line
                recurring += 1
            # assumption: one utterance per line, in the order the rows appear
            new_list.append(str(row.content))
        new_list.append(sep)
        # probably will want to look into this
        print('call_id {} had {} recurring ranker lines'.format(call_id, recurring))

    if save:
        with open(filepath, 'w') as file_handler:
            for item in new_list:
                file_handler.write('{}\n'.format(item))
    else:
        print('processed but didnt save')
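
The comment in the script flags the ranker sequence as something to look into. A loop-free way to run that check is sketched below. It assumes the `call_id` and `ranker` columns used above and that rows are already in chronological order within each call; the `repeated_ranker_counts` name is hypothetical.

```python
import pandas as pd

def repeated_ranker_counts(df):
    """Count, per call, the rows whose ranker fails to increase."""
    # diff() is NaN on the first row of each call, which compares as False
    step = df.groupby('call_id')['ranker'].diff()
    not_increasing = step <= 0
    return not_increasing.groupby(df['call_id']).sum().astype(int).to_dict()

sample = pd.DataFrame({
    'call_id': [1, 1, 1, 2, 2, 2],
    'ranker':  [1, 2, 3, 1, 1, 2],
})
counts = repeated_ranker_counts(sample)
```

Calls with a non-zero count are the ones worth inspecting by hand before they are written out for training.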

Setting up tensorboard on the server

While it is not necessary, it's nice to have TensorBoard, Jupyter, etc. on the server for data munging and similar tasks.

Fortunately this is quite straightforward: just run jupyter notebook --ip --no-browser and you will be given a token.