Using generative neural nets in Keras to create ‘on-the-fly’ dialogue

Note 4/10/17: almost all the Python modules have changed quite a bit since this original post, and there are issues with youtube-dl and Keras. If you would like to work on an updated version, or already have one, please let me know!


A few cool things have been done with generative neural nets so far, but to my knowledge very few of them have found a useful, publicly discussed business application. What follows is by no means the best or most interesting use, but it is an interesting idea and a potential starting point for using generative neural nets for training or as an augmentative tool.

There’s a lot of potential for this and similar technologies, and I’d love to collaborate with others on something. If you are interested, please contact me at the email listed at the bottom of this post.


My initial plan started with the idea that the subtitles available on YouTube videos are a good database of conversational dialogue. This is a massive dataset that could potentially be used to train many different machine learning models. It may not be perfect, but it does seem to carry many of the inflections and peculiarities of human speech that the written word does not always capture, along with a variety of ways in which conversations flow.

The training set for this model is just a collection of YouTube videos that deal with sales or call-oriented dialogue. For instance, here are a couple of videos with subtitles:

example 1 example 1

There is no particular reason I used any of these videos other than that they are very long, may have phone dialogue, and already have subtitles/closed captions. I tried to find videos whose captions seemed somewhat accurate, but there are obvious errors, and their effect on the training set is noticeable.

From here, all that is necessary is to create a fairly large corpus (roughly 500k characters at minimum).

A Python script using youtube-dl and pysrt to grab the subtitles/closed captions gives a quick and streamlined pipeline for collecting subtitles from a lot of videos.

import urllib.request

import youtube_dl, pysrt
import numpy as np

class audio_source(object):
    def __init__(self, url):
        self.url = url
        self.ydl_opts = {
            'subtitles': 'en',
            'writesubtitles': True,
            'writeautomaticsub': True}

        self.subtitlesavailable = self.are_subs_available()

        # only download when youtube-dl reports subtitles for this video
        if self.subtitlesavailable:
            self.grab_auto_subs()

    def are_subs_available(self):
        with youtube_dl.YoutubeDL(self.ydl_opts) as ydl:
            subs = ydl.extract_info(self.url, download=False)
        if subs['requested_subtitles']:
            self.title = subs['title']
            self.subs_url = subs['requested_subtitles']['en']['url']
            return True
        return False

    def grab_auto_subs(self):
        """
        grabs subs or cc depending on what's available,
        think it grabs both if subtitles are available
        issue with ydl_opts but doesn't bother me
        """
        try:
            # the youtube-dl-texts/ directory must already exist
            urllib.request.urlretrieve(
                self.subs_url, 'youtube-dl-texts/' + self.title + '.srt')
            print("subtitles saved directly from youtube\n")
            text = pysrt.open('youtube-dl-texts/' + self.title + '.srt')
            self.text = text.text.replace('\n', ' ')
        except IOError:
            print("\n *** saving subs didn't work *** \n")

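# assumes other/url_list holds one video URL per line
with open('other/url_list','r') as datafile:
    url_list = datafile.read().splitlines()

total_text = []

for u in url_list:
    try:
        total_text.append(audio_source(url=u).text)
    except AttributeError:
        # no subtitles were saved for this video, so .text was never set
        pass
total_text = ' '.join(total_text).lower()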
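
To make the flattening step concrete without pysrt, here is a minimal standard-library sketch of turning SRT content into one line of lowercase text. It is not what pysrt does internally; the helper name `srt_to_text` and the sample cues are made up for illustration, and it assumes well-formed SRT input (an index line, a timestamp line, then one or more text lines per cue).

```python
import re

# Matches an SRT timestamp line such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_text(srt_content):
    """Flatten SRT subtitle content into one lowercase line (hypothetical helper)."""
    kept = []
    for line in srt_content.splitlines():
        line = line.strip()
        # Skip blank separators, numeric cue indices, and timestamp lines.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept).lower()

# Made-up sample cues, only for illustration.
sample = """1
00:00:01,000 --> 00:00:04,000
Thank you for calling,
how can I help you?

2
00:00:04,500 --> 00:00:06,000
I saw a house for sale
"""

print(srt_to_text(sample))
```

One limitation of this sketch: a caption line made up only of digits (a spoken year, say) is mistaken for a cue index and dropped.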



Training the generative neural net

At this point you have a mass of text that, if you were to actually read it, would look quite incoherent and useless. (Also notice that I am not creating a separation between texts like many others have; doing so would probably be very useful for determining when a conversation should end, etc.) Hopefully there is enough data to produce a usable result for the time being, and the errors will “regress to the mean”.
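
One way to add that kind of separation is to end each transcript with a marker character that never occurs in the subtitles, so the model has a chance to learn where a conversation stops. This is only a sketch of the idea, not something the model in this post was trained with; `END_MARK` and `join_with_marker` are hypothetical names, and the choice of `|` assumes that character does not carry meaning in the corpus.

```python
END_MARK = "|"  # assumed not to appear anywhere in the subtitle text

def join_with_marker(texts, marker=END_MARK):
    """Lowercase each transcript and end each one with the marker."""
    cleaned = []
    for t in texts:
        # Remove stray marker characters so the marker stays unambiguous.
        t = t.lower().replace(marker, " ")
        cleaned.append(t.strip() + marker)
    return " ".join(cleaned)

# Made-up transcripts, only for illustration.
transcripts = ["Thanks for calling", "Have a great day"]
corpus = join_with_marker(transcripts)
print(corpus)
```

At generation time, the prediction loop could then stop as soon as the marker character is produced.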

Here are the last 260 characters of the dialogue I have, from slightly less than 1 MB of text from videos:

>>>'more information about those meetings and travel make sure to fax it to this number at the bottom and are you into the grand prize drawing weeks stay at intercontinental resort Tahiti be sure to fax in that form you all right thank you feel you have a great day'

To train the model we first need to do a bit of preprocessing, since the generative neural net consumes the data sequentially, character by character (in steps, but character by character within each step). A fair amount of this is from the Keras LSTM text generation example.

chars = sorted(set(total_text))  # sorted so character indices stay the same between runs

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

maxlen = 20
step = 1
sentences = []
next_chars = []
for i in range(0, len(total_text) - maxlen, step):
    sentences.append(total_text[i: i + maxlen])
    next_chars.append(total_text[i + maxlen])
print('nb sequences:', len(sentences))

# builtin bool is used because the np.bool alias was removed from recent NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
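
To see what the windowing loop produces, here is the same logic run on a toy string with a much smaller window. The function name `make_windows` and the sample text are made up for illustration; the loop body mirrors the one above.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) built the same way as the loop above."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: maxlen characters
        targets.append(text[i + maxlen])     # label: the character that follows
    return windows, targets

windows, targets = make_windows("hello there", maxlen=4, step=1)
for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))
```

With `step = 1` the number of training sequences is simply `len(text) - maxlen`, so the size of `X` grows linearly with the corpus.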

LSTM training

For the NN library I am using Keras. It is so far my favorite Python NN library because of how modular and easy to understand it is (and the creator and contributors seem incredibly smart), and quick prototyping and experimentation helps. For my example, an LSTM-based RNN architecture was the most effective way, in my experimentation, to generate useful results.

One of the better models I found:

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM

# layers use the (input_dim, output_dim) signature of the Keras release this post was written against
model = Sequential()
model.add(LSTM(len(chars), 512, return_sequences=True))
# use 20% dropout on all LSTM layers
model.add(Dropout(0.2))

model.add(LSTM(512, 512, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(512, 256, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(256, 256, return_sequences=False))
model.add(Dropout(0.2))

model.add(Dense(256, len(chars)))
model.add(Activation('softmax'))

# compile or load weights then compile depending
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(X, y, nb_epoch=50)

>>> Epoch 0
>>> 7744/285648 [>.............................] - ETA: 4717s - loss: 3.0232

If you would like to see a rudimentary visualization of the architecture: nn arch
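
For a rough sense of how large this architecture is, the weights per layer can be counted by hand. This is a back-of-the-envelope sketch that assumes the standard LSTM formulation (three gates plus the cell candidate, no peephole connections); the vocabulary size of 40 is a made-up value, since the real number depends on the corpus.

```python
def lstm_params(n_in, n_out):
    # Four weight blocks (three gates plus the cell candidate), each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * (n_in * n_out + n_out * n_out + n_out)

def dense_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out

vocab = 40  # hypothetical number of distinct characters
layers = [
    lstm_params(vocab, 512),
    lstm_params(512, 512),
    lstm_params(512, 256),
    lstm_params(256, 256),
    dense_params(256, vocab),
]
print(layers, sum(layers))
```

Under these assumptions the total is about 4.5 million weights, almost all of them in the first two LSTM layers, which is part of why training takes so long on a CPU.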

This will probably take quite a while to train, but GPU-based training dramatically speeds up the .fit() process. It is hard to name a definite target for the loss, but one that levels off around 0.5 is ideal.
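
To put those loss values in perspective: categorical cross-entropy is the average negative natural log of the probability the model gives to the correct next character, so `exp(-loss)` is the (geometric) mean of that probability. The sketch below just does that arithmetic; the helper names are made up, and the vocabulary size of 40 is a hypothetical value.

```python
import math

def mean_correct_prob(loss):
    """Geometric-mean probability assigned to the true next character."""
    return math.exp(-loss)

def uniform_loss(n_chars):
    """Loss of a model that guesses uniformly over n_chars characters."""
    return math.log(n_chars)

print(round(mean_correct_prob(3.0232), 3))  # loss from the training log above
print(round(mean_correct_prob(0.5), 3))     # the target mentioned above
print(round(uniform_loss(40), 3))           # 40 is a made-up vocabulary size
```

A loss near 3.0 means the correct character gets roughly a 5% chance on average, which is not far from guessing; a loss of 0.5 corresponds to about 61%.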

At this point, a model is trained and we are ready to generate some recommended dialogue.

Generate some text

The final part is being able to speak something to your computer (or, potentially, the computer would listen to what you or someone else is saying in some app or extension), convert that speech to text, and generate a suggestion of what to follow that sentence with.

There are a few ways to do this, but the easiest is to register for a Google speech-to-text API key and install the libraries needed for the Python speech recognition module.

Use a personal key in the Recognizer to avoid abusing the built-in API token. You need to subscribe to a mailing list and then enable the API, but it takes about 2 minutes.

You can incorporate whatever was spoken into the model as well, but that’s for a later date. Right now, all I will do is set it up so you speak to it for a moment, and then it generates some text and prints it out.

import speech_recognition as sr

recognizer = sr.Recognizer(key=myKey)

def speech2text(r=recognizer):

    # speak to microphone, use google api, return text
    input('press enter then speak: \n'+'------'*5)
    with sr.Microphone() as source:
        audio = r.listen(source)
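        try:
            return r.recognize(audio).lower()
        except LookupError:
            # speech was not understood, so ask again (assumed handling)
            print("could not understand audio, try again")
            return speech2text(r)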

def gentext():

    seed_text = speech2text()
    generated = '' + seed_text
    print('------'*5+'\nyou said: \n'+'"' + seed_text +'"')

    print('------'*5+'\n generating...\n'+ '------'*5)
    for iteration in range(50):
        # create x vector from seed to predict off of
        x = np.zeros((1, len(seed_text), len(chars)))
        for t, char in enumerate(seed_text):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = np.argmax(preds)
        next_char = indices_char[next_index]

        generated += next_char
        seed_text = seed_text[1:] + next_char
    print('\n\nfollow up with: ' + generated)
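
The loop above always takes the most likely character (`np.argmax`), which tends to produce repetitive text. A common alternative is to sample from the predicted distribution after rescaling it with a "temperature", similar in spirit to the sampling helper in the Keras LSTM text generation example. The sketch below is pure Python and is not part of the model in this post; `sample_index` and the three-value `preds` list are made up for illustration.

```python
import math
import random

def sample_index(preds, temperature=1.0, rng=random):
    """Draw an index from preds after rescaling by temperature."""
    # Work in log space; the small constant avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in preds]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(weights)
    r = rng.random() * total
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(weights) - 1  # guard against rounding at the upper edge

preds = [0.1, 0.6, 0.3]  # made-up softmax output over three characters
random.seed(0)
print([sample_index(preds, temperature=1.0) for _ in range(10)])
print([sample_index(preds, temperature=0.01) for _ in range(10)])
```

A low temperature behaves almost like argmax, while a temperature of 1.0 samples from the model's distribution as-is. In `gentext`, the `np.argmax(preds)` line could be swapped for a call like `sample_index(preds, temperature=0.5)` to experiment with this.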

Here’s one of the better single examples I encountered with this model after fitting:


press enter then speak:


you said:
"i would like to talk to you about a house i saw that you had for sale"

follow up with:
i would like to talk to you about a house i saw that you had for sale tell me what was its price though and i can reall


In terms of speed, training/predicting with a GPU is about 3-4x faster than with the CPU on my 2013 MacBook Pro.


With something like this, it’s very easy to see how you could splice in audio from a phone call, or a text chat, which this would carry over to very well. Given the right datasets, there are tons of potential uses. Along with this, there are also ways to stack and blend models together to separate different dialogue or differentiate people within a dialogue. If you are interested in hearing more about this type of stuff, contact me at the email posted below:

contact me