downloading files from kaggle

May 17, 2016

Often it can be a hassle to get files from Kaggle onto a remote server. This is a quick walkthrough of a script I wrote using RoboBrowser to log into Kaggle and download a competition’s files from the command line. You could also use this for other sites where you need to log in with a username/password to download a file.

Previously I had used either cookies from Chrome or a Python command-line module that let me specify the competition, but both approaches were clunky and prone to breaking. I had been meaning to write a script for this for some time but had never gotten around to it. Once I got it working, I thought it would be worth a post.

Overview:

The script itself follows a straightforward process:

↳ logging in
    ↳ getting competition download files
        ↳ downloading files

Logging in

Probably the easiest part was figuring out how to log in with RoboBrowser. Simply open the kaggle.com/account/login page with RoboBrowser, fill in the login form, and submit it.

from robobrowser import RoboBrowser

browser = RoboBrowser(history=True)
base = 'https://www.kaggle.com'
browser.open('/'.join([base, 'account/login']))

# username and password are supplied by the caller (e.g. from the command line)
login_form = browser.get_form(action='/account/login')
login_form['UserName'] = username
login_form['Password'] = password
browser.submit_form(login_form)

Getting competition download files

This part was easy as well. It consists of opening the competition's data page, calling get_links, and keeping only the links whose text contains ‘.zip’ (you could also match other file types, or filter by file size, at this step).

browser.open('/'.join([base, 'c', competition, 'data']))
files = []
for a_href in browser.get_links():
    if '.zip' in a_href.text:
        files.append(a_href)
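
To pick up other file types as mentioned above, the check can be pulled into a small helper. This is only a sketch: the name has_wanted_extension and the example link texts are made up for illustration, and it tests the end of the link text rather than using the looser `in` check from the snippet above.

```python
def has_wanted_extension(link_text, extensions=('.zip',)):
    """Return True if the link text ends with one of the given extensions.

    Hypothetical helper: leading/trailing whitespace is stripped and the
    comparison is case-insensitive.
    """
    lowered = tuple(ext.lower() for ext in extensions)
    return link_text.strip().lower().endswith(lowered)


# Example link texts (invented for illustration)
link_texts = ['train.zip', 'sample_submission.csv', ' test.ZIP ', 'images.7z']
wanted = [t for t in link_texts if has_wanted_extension(t, ('.zip', '.7z'))]
```

In the loop above this would replace the condition, e.g. `if has_wanted_extension(a_href.text):`.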

Downloading the files

The most difficult part was figuring out how to download the files, as RoboBrowser has no obvious method for it. After trying a few different approaches, the best way I found (if there’s a better way let me know!) is to call browser.session.get with stream=True, which reuses the logged-in session, and then simply write the stream to a file.

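for f in files:
    # stream=True defers the body so it can be written in chunks
    response = browser.session.get(base + f.attrs['href'], stream=True)
    response.raise_for_status()
    with open(f.attrs['name'], "wb") as zip_file:
        for chunk in response.iter_content(chunk_size=8192):
            zip_file.write(chunk)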
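
The loop above relies on each link having a name attribute. If you adapt this to a site where that attribute is missing, one option is to fall back to the last path segment of the href. The sketch below assumes that fallback is acceptable; filename_for is a hypothetical helper and the example hrefs are invented.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_for(href, name=None):
    """Pick a local file name for a download link.

    Hypothetical helper: prefer the link's name attribute when present,
    otherwise use the last path segment of the href (query string dropped).
    """
    if name:
        return name
    # Decode percent-escapes first so an encoded slash cannot sneak a
    # directory component into the result.
    path = unquote(urlsplit(href).path)
    return posixpath.basename(path)


# Invented hrefs, for illustration only
print(filename_for('/c/some-competition/download/train.zip'))
print(filename_for('/files/test%20set.zip?version=2'))
```

In the download loop this could be called as `filename_for(f.attrs['href'], f.attrs.get('name'))`.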

Conclusion:

The final script can be found here and run with the following (assuming draper-satellite-image-chronology is the competition):

python3 main.py --competition draper-satellite-image-chronology --username my_username --password my_password
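
For reference, the three flags in that command could be parsed with argparse roughly as follows. This is a sketch of one way to do it, not necessarily how the linked script does; the function name parse_args and the help strings are my own.

```python
import argparse


def parse_args(argv=None):
    """Parse the --competition, --username and --password flags."""
    parser = argparse.ArgumentParser(
        description='Download a Kaggle competition\'s files.')
    parser.add_argument('--competition', required=True,
                        help='competition slug as it appears in the URL')
    parser.add_argument('--username', required=True,
                        help='Kaggle username')
    parser.add_argument('--password', required=True,
                        help='Kaggle password')
    # argv=None makes argparse read sys.argv[1:]; a list can be passed for testing
    return parser.parse_args(argv)


# Example with an explicit argument list instead of the real command line
args = parse_args(['--competition', 'draper-satellite-image-chronology',
                   '--username', 'my_username',
                   '--password', 'my_password'])
print(args.competition)
```

Keep in mind that a password passed as a flag ends up in your shell history; prompting with getpass.getpass() is a common alternative.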