using generative neural nets in keras to create ‘on-the-fly’ dialogue
By graham a
Updates
6/6/23 A question was recently asked about this post, but the original code here is no longer functional. It was created with limited programming experience and remains only for historical reference.
4/10/17 Most python modules have changed significantly since this was written, causing issues with youtube-dl and keras. Anyone working on or possessing an updated version is encouraged to get in touch.
Introduction
Generative neural nets have seen interesting developments, though there have yet to be many business applications. This example is not the most advanced, but it demonstrates how generative neural nets can serve as impactful training or augmentation tools.
Start
I began with a simple yet powerful insight: YouTube video subtitles represent an untapped gold mine of conversational dialogue data. While not perfect, these subtitles capture something special that traditional text sources often miss: the natural rhythm, inflections, and quirks of human conversation. This makes them an ideal training dataset for machine learning models focused on natural language processing.
To build my training set, I specifically targeted YouTube videos featuring sales calls and customer service interactions; two embedded example videos in the original post proved particularly useful.
The video selection criteria were straightforward: long duration, phone-based dialogue, and pre-existing subtitles. Caption accuracy varies, but even with some transcription errors I was able to compile a substantial corpus. For reference, a minimum dataset size of around 500,000 characters is recommended for effective training.
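If you wanted to automate that filtering, a quick sketch looks like the following. The helper name and the 20-minute cutoff here are my own arbitrary choices for illustration, not part of the original pipeline:

import youtube_dl

# hypothetical helper: keep only long, english-subtitled videos
def passes_criteria(url, min_seconds=1200):
    opts = {'writesubtitles': True,
            'writeautomaticsub': True,
            'subtitleslangs': ['en']}
    with youtube_dl.YoutubeDL(opts) as ydl:
        # metadata only, nothing is downloaded
        info = ydl.extract_info(url, download=False)
    has_subs = bool(info.get('requested_subtitles'))
    return info.get('duration', 0) >= min_seconds and has_subs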
The data collection process was streamlined using a Python script that leverages the youtube-dl and pysrt libraries to automatically fetch and process the subtitle data.
import urllib.request

import youtube_dl, pysrt
import numpy as np

class audio_source(object):
    def __init__(self, url):
        self.url = url
        self.ydl_opts = {
            'subtitleslangs': ['en'],
            'writesubtitles': True,
            'writeautomaticsub': True}
        self.subtitlesavailable = self.are_subs_available()
        if self.subtitlesavailable:
            self.grab_auto_subs()

    def are_subs_available(self):
        # query metadata only; nothing is downloaded here
        with youtube_dl.YoutubeDL(self.ydl_opts) as ydl:
            subs = ydl.extract_info(self.url, download=False)
        if subs['requested_subtitles']:
            self.title = subs['title']
            self.subs_url = subs['requested_subtitles']['en']['url']
            return True
        else:
            return False

    def grab_auto_subs(self):
        """
        Grabs subs or closed captions depending on what's available;
        I think it grabs both if real subtitles are available. There is
        an issue with ydl_opts, but it doesn't bother me.
        """
        try:
            urllib.request.urlretrieve(
                self.subs_url, 'youtube-dl-texts/' + self.title + '.srt')
            print("subtitles saved directly from youtube\n")
            text = pysrt.open('youtube-dl-texts/' + self.title + '.srt')
            # flatten the srt entries into one space-separated string
            self.text = text.text.replace('\n', ' ')
        except IOError:
            print("\n *** saving subs didn't work *** \n")

with open('other/url_list', 'r') as datafile:
    url_list = datafile.read().splitlines()

total_text = []
for u in url_list:
    try:
        total_text.append(audio_source(url=u).text)
    except AttributeError:
        # no subtitles were available for this url, so skip it
        pass

total_text = ' '.join(total_text).lower()
print(len(total_text))
>>>
Training the generative neural net
At this point you have a mass of text that, if you were to actually read it, would look quite incoherent and useless. (Also notice I am not creating a separation between texts like many others have; that separation would probably be very useful for discerning when a conversation should be ended, etc.) There is hopefully enough data to create an end result for the time being, and the errors will “regress to the mean”.
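If you did want that separation, one minimal tweak (sketched here; I haven't done this) is to join the transcripts with a dedicated boundary character the model can learn:

# hypothetical: mark conversation boundaries with a character that
# never appears in the transcripts, instead of a plain space join
SEP = '\x03'
total_text = (' ' + SEP + ' ').join(total_text).lower()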
Here’s an example of the last 260 chars of the dialogue I have, taken from slightly less than 1 MB worth of text from videos:
print(total_text[-260:])
>>>'more information about those meetings and travel make sure to fax it to this number at the bottom and are you into the grand prize drawing weeks stay at intercontinental resort Tahiti be sure to fax in that form you all right thank you feel you have a great day'
To train the model we first need to do a bit of preprocessing, since the generative neural net consumes sequential data character by character (well, in windowed steps, but one character at a time within each step). A fair amount of this is adapted from the Keras LSTM text-generation example.
# map each distinct character to an index and back
# (sorted so the mapping is reproducible between runs)
chars = sorted(set(total_text))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text into overlapping windows of maxlen characters,
# each labeled with the single character that follows it
maxlen = 20
step = 1
sentences = []
next_chars = []
for i in range(0, len(total_text) - maxlen, step):
    sentences.append(total_text[i: i + maxlen])
    next_chars.append(total_text[i + maxlen])
print('nb sequences:', len(sentences))

# one-hot encode the windows (X) and their next characters (y)
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
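To make the windowing concrete, here is the same loop run on a toy string with a smaller maxlen of 5; each five-character window is paired with the character that follows it:

# toy illustration of the sliding window with maxlen=5, step=1
toy = 'hello there'
for i in range(0, len(toy) - 5, 1):
    print(repr(toy[i:i + 5]), '->', repr(toy[i + 5]))
# 'hello' -> ' '
# 'ello ' -> 't'
# 'llo t' -> 'h'
# ...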
LSTM training
For the NN library I am using Keras; it is so far my favorite Python NN library due to how modular and easy to understand it is (and the creator and contributors seem incredibly smart), which makes quick prototyping and experimentation easy. For my example, an LSTM-based RNN architecture was, from my experimentation, the most effective way to generate useful results.
One of the better models I found:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM

# note: this uses the old Keras 0.x layer signature, where the first
# two arguments are the input and output dimensions
model = Sequential()
model.add(LSTM(len(chars), 512, return_sequences=True))
model.add(Dropout(0.20))
# use 20% dropout on all LSTM layers: http://arxiv.org/abs/1312.4569
model.add(LSTM(512, 512, return_sequences=True))
model.add(Dropout(0.20))
model.add(LSTM(512, 256, return_sequences=True))
model.add(Dropout(0.20))
# the last LSTM returns only its final output, to feed the dense layer
model.add(LSTM(256, 256, return_sequences=False))
model.add(Dropout(0.20))
model.add(Dense(256, len(chars)))
model.add(Activation('softmax'))

# compile, or load saved weights and then compile, depending
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(X, y, nb_epoch=50)
>>> Epoch 0
>>> 7744/285648 [>.............................] - ETA: 4717s - loss: 3.0232
If you would like to see a rudimentary visualization of the architecture, Keras can plot the model graph to an image file.
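Exactly which utility does this depends on your Keras version; around the 1.x era it looked like the following (newer versions expose keras.utils.plot_model instead):

# Keras 1.x era; requires pydot/graphviz to be installed
from keras.utils.visualize_util import plot
plot(model, to_file='model_architecture.png')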
This will probably take quite a while to train, but GPU-based training dramatically speeds up the .fit() process. The right loss is hard to say for certain, but a minimum that levels off around 0.5 is ideal.
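Given how long the fit takes, it's worth saving weights so you can reload them instead of retraining; this is what the “compile or load weights” comment in the model definition is about. A minimal sketch, with an arbitrary filename:

# after training, persist the learned weights to hdf5
model.save_weights('dialogue_lstm_weights.h5', overwrite=True)
# later: rebuild the identical architecture, then load and compile
model.load_weights('dialogue_lstm_weights.h5')
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')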
At this point, a model is trained and we are ready to generate some recommended dialogue.
Generate some text
The final part of this is being able to speak to your computer (or potentially have the computer listen to what you or someone else is saying in some app or extension), get that speech into text form, and then generate a suggestion of what to follow that sentence with.
There are a few ways to do this, but the easiest is to register and get an API key for Google’s speech-to-text service, and install some libraries so you can use the Python speech_recognition module.
Use a personal key in the Recognizer to avoid abusing the built-in API token. You need to subscribe to a mailing list and then enable the API, but it only takes about two minutes.
You can incorporate whatever was spoken into the model as well, but that’s for a later date. Right now, all I will do is set it up so you speak to it for a moment, and then it generates some text and prints it out.
import speech_recognition as sr

myKey = 'YOUR-GOOGLE-SPEECH-API-KEY'  # personal key, per the note above
recognizer = sr.Recognizer(key=myKey)  # old speech_recognition API

def speech2text(r=recognizer):
    # speak to the microphone, use the google api, return text
    input('press enter then speak: \n' + '------' * 5)
    with sr.Microphone() as source:
        audio = r.listen(source)
    try:
        print('\nprocessing...\n')
        return r.recognize(audio).lower()
    except LookupError:
        # speech was unintelligible; caller gets None
        pass

def gentext():
    seed_text = speech2text()
    if not seed_text:
        print('could not understand the audio, try again')
        return
    generated = '' + seed_text
    print('------' * 5 + '\nyou said: \n' + '"' + seed_text + '"')
    print('------' * 5 + '\n generating...\n' + '------' * 5)
    for iteration in range(50):
        # one-hot encode the current seed to predict from
        # (assumes every spoken character appeared in the training text)
        x = np.zeros((1, len(seed_text), len(chars)))
        for t, char in enumerate(seed_text):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        # greedy decoding: always take the most likely next character
        next_index = np.argmax(preds)
        next_char = indices_char[next_index]
        generated += next_char
        # slide the window forward by one character
        seed_text = seed_text[1:] + next_char
    print('\n\nfollow up with: ' + generated)
Here’s one of the better single examples I encountered with this model after fitting:
press enter then speak:
------------------------------
processing...
------------------------------
you said:
"i would like to talk to you about a house i saw that you had for sale"
------------------------------
generating...
------------------------------
follow up with:
i would like to talk to you about a house i saw that you had for sale tell me what was its price though and i can reall
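A side note on gentext(): it uses np.argmax, i.e. greedy decoding, which can get stuck repeating common phrases. The Keras text-generation example instead samples from the softmax with a temperature, which you could swap in like this (0.5 is just a starting point to experiment with):

def sample(preds, temperature=0.5):
    # reweight the softmax distribution and draw one index from it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# in gentext(), replace np.argmax(preds) with:
# next_index = sample(preds, temperature=0.5)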
Performance Note
Training and prediction with GPU acceleration provide an approximately 3-4x speedup compared to the CPU on a 2013 MacBook Pro.
Conclusion
This approach demonstrates the potential for real-time dialogue generation from audio sources. The system could be extended to:
- Process live phone calls or chat conversations
- Work with specialized domain-specific datasets
- Combine multiple models to handle different speakers or dialogue styles
- Distinguish between different speakers in conversations
If you’re interested in exploring these possibilities further or have questions about implementing similar systems, please reach out.