Downloading files from Kaggle

Issue

There is often no simple way to get files from Kaggle onto a remote server. Previously I had used either a cookies extension or a Python command-line module that let me specify the competition, but for various reasons neither of these works efficiently, or at all, anymore. I had been meaning to write a script for this for some time but had never gotten around to it. I finally did, and here it is. The same approach is also useful for downloading files that sit behind a login and password on other sites. For this, I am using RoboBrowser.

Overview:

The script itself consists of logging in -> getting competition download files -> downloading files.

Logging in

Probably the easiest part of this was figuring out how to log in with RoboBrowser. Simply open the kaggle.com/account/login page with RoboBrowser, fill in the form, and submit it.

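from robobrowser import RoboBrowser

# username and password are supplied by the caller (e.g. command-line arguments)
browser = RoboBrowser(history=True)
base = 'https://www.kaggle.com'
browser.open('/'.join([base, 'account/login']))

# Find the login form by its action, fill it in, and submit it
login_form = browser.get_form(action='/account/login')
login_form['UserName'] = username
login_form['Password'] = password
browser.submit_form(login_form)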

Getting competition download files

This part was easy as well. It consists of opening the competition's data page, calling get_links, and keeping only the links whose text contains '.zip' (other file types, or a filter on file size, could be added here as well).

browser.open('/'.join([base, 'c', competition, 'data']))
files = []
for a_href in browser.get_links():
    if '.zip' in a_href.text:
        files.append(a_href)
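
As a rough sketch of the "other types" idea, the filter could be pulled into a small helper. The name select_downloads and its extensions parameter are made up for illustration, and it matches on the end of the link text rather than on a substring as the loop above does. The sample links are stand-ins for what browser.get_links() would return.

```python
from types import SimpleNamespace


def select_downloads(links, extensions=('.zip',)):
    """Return the links whose visible text ends with one of the extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    return [link for link in links
            if link.text.strip().lower().endswith(wanted)]


# Stand-in link objects; real ones would come from browser.get_links()
sample_links = [
    SimpleNamespace(text='train.zip'),
    SimpleNamespace(text=' sample_submission.csv '),
    SimpleNamespace(text='Data description'),
]
zip_only = select_downloads(sample_links)
zip_and_csv = select_downloads(sample_links, extensions=('.zip', '.csv'))
```

Only the .text attribute is used, so anything that exposes it (including the tags RoboBrowser returns) should work with this helper.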

Downloading the files

Probably the hardest part was figuring out how to download the files with RoboBrowser, since I couldn't find any obvious method for it. The best way I found (if there's a better way, let me know!) is to call browser.session.get with stream=True and then write the stream to a file.

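for f in files:
    # stream=True defers fetching the body so it can be written in chunks
    response = browser.session.get(base + f.attrs['href'], stream=True)
    with open(f.attrs['name'], "wb") as zip_file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            zip_file.write(chunk)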
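
The loop above relies on each link carrying a name attribute to use as the file name. On a page where that attribute is missing, one possible fallback is to take the last segment of the href. This is only a sketch: filename_from_href, its default value, and the example URL are all hypothetical and not part of the original script.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_href(href, default='download.bin'):
    """Derive a local file name from the last path segment of a URL."""
    path = urlsplit(href).path  # drops any query string or fragment
    # Decode percent-escapes first, then keep only the final path segment
    name = posixpath.basename(unquote(path))
    return name or default


# Hypothetical href, for illustration only
example_name = filename_from_href('/c/some-competition/download/train.zip?raw=1')
```

Taking the basename after decoding means an escaped slash in the URL cannot smuggle a directory component into the local file name.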

Conclusion:

The final script can be found here and run with the following command (assuming draper-satellite-image-chronology is the competition):

python3 main.py --competition draper-satellite-image-chronology --username my_username --password my_password
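
The script itself is not reproduced in this post, so the following is only a guess at how the three flags in that command might be parsed with argparse. The build_parser name and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the command above."""
    parser = argparse.ArgumentParser(
        description='Download competition files from Kaggle.')
    parser.add_argument('--competition', required=True,
                        help='competition slug as it appears in the URL')
    parser.add_argument('--username', required=True, help='Kaggle user name')
    parser.add_argument('--password', required=True, help='Kaggle password')
    return parser


# An explicit list is parsed here for illustration; a real script would
# call parse_args() with no arguments to read sys.argv instead
args = build_parser().parse_args([
    '--competition', 'draper-satellite-image-chronology',
    '--username', 'my_username',
    '--password', 'my_password',
])
```

The parsed values (args.competition, args.username, args.password) would then feed the login and download steps described earlier.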