Downloading files from Kaggle
Getting files from Kaggle onto a remote server can be a hassle.
Previously I either used cookies exported from Chrome to download files, or a Python command-line module that let me specify the competition. Neither works efficiently, and both are clunky and prone to breaking. I had been meaning to write a script for this for some time but never got around to it. I finally did, and here it is. The same approach is also useful for downloading files that sit behind a username/password login on other sites. For this, I am using RoboBrowser.
Overview:
The script itself consists of logging in -> getting competition download files -> downloading files.
Logging in
Probably the easiest part was figuring out how to log in with RoboBrowser. Simply open the kaggle.com/account/login page with RoboBrowser, fill in the form fields, and submit.
from robobrowser import RoboBrowser

# username and password come from the command-line arguments
browser = RoboBrowser(history=True)
base = 'https://www.kaggle.com'
browser.open('/'.join([base, 'account/login']))
# Find the login form by its action and fill in the credentials
login_form = browser.get_form(action='/account/login')
login_form['UserName'] = username
login_form['Password'] = password
browser.submit_form(login_form)
Getting competition download files
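This part was easy as well. It consists of opening the competition's data page, calling get_links, and keeping only the links that point to '.zip' files (the code checks for '.zip' in the link text; other file types, or a file-size filter, could be added at this step).
# competition is the competition name taken from the command line
browser.open('/'.join([base, 'c', competition, 'data']))
files = []
for a_href in browser.get_links():
    # Keep only links whose text mentions a .zip file
    if '.zip' in a_href.text:
        files.append(a_href)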
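The parenthetical about adding other file types can be sketched as a small filter function. This is only an illustration under assumptions: `select_files` is a hypothetical helper that is not part of the original script, it takes plain (link text, href) pairs rather than RoboBrowser link objects, and it matches on the end of the link text rather than anywhere in it.

```python
# Hypothetical helper, not part of the original script.
def select_files(links, extensions=(".zip",)):
    """Return the (text, href) pairs whose link text ends with a wanted extension."""
    wanted = tuple(ext.lower() for ext in extensions)
    selected = []
    for text, href in links:
        # Normalise whitespace and case before comparing
        if text.strip().lower().endswith(wanted):
            selected.append((text, href))
    return selected


# Made-up sample data shaped like the pairs collected from a data page.
# With RoboBrowser the pairs could be built as:
#   [(a.text, a.attrs['href']) for a in browser.get_links()]
links = [
    ("train.zip", "/c/some-competition/download/train.zip"),
    ("sample_submission.csv", "/c/some-competition/download/sample_submission.csv"),
    ("Rules", "/c/some-competition/rules"),
]
zip_only = select_files(links)
zip_and_csv = select_files(links, extensions=(".zip", ".csv"))
```

Passing the extensions as a tuple keeps the default behaviour (zip files only) while letting other types be added without touching the loop.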
Downloading the files
Probably the hardest part was figuring out how to download the files with RoboBrowser, since I couldn't find any obvious method for it. The best way I found (if there's a better way, let me know!) is to call browser.session.get with stream=True and write the response body to a file.
for f in files:
    # browser.session is the underlying requests session, so it carries the login cookies
    request = browser.session.get(base + f.attrs['href'], stream=True)
    # Save the file under the name given by the link's name attribute
    with open(f.attrs['name'], "wb") as zip_file:
        zip_file.write(request.content)
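One thing to note about the loop above: reading `request.content` pulls the whole file into memory before it is written, even with stream=True. For large archives, a possible variant is to write the response in chunks. The sketch below assumes a hypothetical helper named `save_chunks`, which is not part of the original script; it accepts any iterable of byte chunks, such as the one returned by the `iter_content` method of a requests response.

```python
import os
import tempfile


# Hypothetical helper, not part of the original script.
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


# With a real streamed response this would look like:
#   save_chunks(request.iter_content(chunk_size=1024 * 1024), f.attrs['name'])
# Here a list of byte strings stands in for the network stream.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_chunks([b"abc", b"", b"defg"], demo_path)
```

Only one chunk is held in memory at a time, so memory use stays roughly constant regardless of the file size.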
Conclusion:
The final script can be found here and run with the following command (assuming draper-satellite-image-chronology is the competition):
python3 main.py --competition draper-satellite-image-chronology --username my_username --password my_password
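The argument handling inside main.py is not shown in this post, so the following is only a guess at how the three flags in the command above could be parsed with argparse. The function name `build_parser` and the help strings are assumptions.

```python
import argparse


# Hypothetical sketch; the real main.py may parse its arguments differently.
def build_parser():
    parser = argparse.ArgumentParser(
        description="Download competition data files from Kaggle."
    )
    parser.add_argument("--competition", required=True,
                        help="competition name as it appears in the URL")
    parser.add_argument("--username", required=True,
                        help="Kaggle account user name")
    parser.add_argument("--password", required=True,
                        help="Kaggle account password")
    return parser


# Parse the same flags as the example command, passed as a list for illustration.
args = build_parser().parse_args([
    "--competition", "draper-satellite-image-chronology",
    "--username", "my_username",
    "--password", "my_password",
])
```

The parsed values (`args.competition`, `args.username`, `args.password`) would then feed the login and download steps described earlier.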