A few years ago, when Flickr was new, we made the somewhat silly decision to store our images only on Flickr. That did make moving between computers easier and freed up a little drive space, but we've since decided that we'd like to pull those pictures back onto our own system.
In the past, I've tried several Flickr Downloadr (missing 'e' intended as a pun) programs, and every one of them choked or did strange things. Last night, right after I crawled into bed, I decided that I knew a better way.
I crawled out of bed, did a bit of Googling, and found an excellent Python Flickr library, along with someone who had written a Python script to back up Flickr pictures to Amazon's S3. In 15 minutes or so, I had a solution that would page through our public pictures, check whether each one was already downloaded, and store them in year and month folders.
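The last step described above, filing each picture under a year and month folder, can be tried on its own, separately from the script itself. What follows is only a minimal Python 3 sketch of that one step. The helper name `photo_path`, the temporary root folder, and the sample date and photo id are made up for illustration, and the date string is assumed to start with the year and month, as the slicing in the script implies.

```python
import os
import tempfile


def photo_path(root, datetaken, photo_id):
    """Return root/YYYY/MM/<photo_id>.jpg, creating the folders if needed."""
    year = datetaken[0:4]    # same slices the script uses
    month = datetaken[5:7]
    folder = os.path.join(root, year, month)
    # makedirs creates the year and month levels in one call;
    # exist_ok=True makes it a no-op when they already exist
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, "%s.jpg" % photo_id)


# Usage with a throwaway root folder and a made-up date and photo id
root = tempfile.mkdtemp()
path = photo_path(root, "2005-03-14 09:26:53", "12345")
print(path)
```

Because `os.makedirs` builds every missing level, the two separate "does the folder exist?" checks collapse into a single call, and `os.path.join` avoids hand-writing backslashes.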
import flickr
import urllib
import os.path
import os

page = 1
total_photos = found_photos = 0
while True:
    # Fetch one page (up to 100) of the account's public photos
    photos = flickr.people_getPublicPhotos('68432331@N00', 100, page)
    if not len(photos):
        break
    for photo in photos:
        total_photos = total_photos + 1
        # Year and month are sliced out of the date-taken string
        photoYear = photo.datetaken[0:4]
        photoMonth = photo.datetaken[5:7]
        photoURL = photo.getURL('Original', 'source')
        photoPath = r"C:\FlickrPics\%s\%s\%s.jpg" % (photoYear, photoMonth, photo.id)
        # Only download pictures that aren't already on disk
        if not os.path.exists(photoPath):
            if not os.path.exists(r"C:\FlickrPics\%s" % photoYear):
                os.mkdir(r"C:\FlickrPics\%s" % photoYear)
            if not os.path.exists(r"C:\FlickrPics\%s\%s" % (photoYear, photoMonth)):
                os.mkdir(r"C:\FlickrPics\%s\%s" % (photoYear, photoMonth))
            urllib.urlretrieve(photoURL, photoPath)
            found_photos += 1
    page = page + 1
    print " Moving to page %s" % page

print "Found %s photos, saved %s new photos" % (total_photos, found_photos)
While running the script, I noticed that the obvious slow part of the process was downloading the images. I wanted a way to download them in parallel, and found what I needed in an excellent Python thread pool implementation.
I added the following method to that code, which tells me the number of jobs still pending in the thread pool:
# new method in the ThreadPool class
def getWaitingTaskCount(self):
    # Hold the task lock so the count isn't read mid-update
    self.__taskLock.acquire()
    count = len(self.__tasks)
    self.__taskLock.release()
    return count
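Since the ThreadPool code itself isn't shown here, the idea behind that method (read the length of the task list while holding the same lock that guards it) can be illustrated with a self-contained Python 3 sketch. `TinyTaskQueue` and its method names are hypothetical stand-ins, not part of the linked thread pool.

```python
import threading


class TinyTaskQueue(object):
    """Hypothetical stand-in for the task list inside a thread pool."""

    def __init__(self):
        self._tasks = []
        self._lock = threading.Lock()

    def queue_task(self, task):
        with self._lock:
            self._tasks.append(task)

    def next_task(self):
        # Called by a worker; returns None when nothing is waiting
        with self._lock:
            return self._tasks.pop(0) if self._tasks else None

    def waiting_task_count(self):
        # Same idea as getWaitingTaskCount: read the length under the lock
        with self._lock:
            return len(self._tasks)


queue = TinyTaskQueue()
queue.queue_task("a.jpg")
queue.queue_task("b.jpg")
queue.next_task()                  # a worker picks up one task
print(queue.waiting_task_count())  # one task is still waiting
```

The `with` statement releases the lock even if something inside the block raises, which is a little safer than paired `acquire()` and `release()` calls.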
I had to rewrite my code a little to put the download code into a method, and then I let 'er rip. Downloading now happened 3 at a time. It took about 2 hours to download our 5GB collection of approx 3,300 pictures, and the job was done!
I don't bother with authenticated requests (all of our pictures are public), and I don't do any error checking. It is a one-off script, after all. :)
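For comparison, the three-at-a-time download pattern can also be sketched with `concurrent.futures` from the Python 3 standard library. This is a swapped-in technique, not the thread pool used for this job, and the names `download`, `download_all`, and the `example.invalid` URLs are invented for illustration. The dry run at the bottom passes a fake fetch function, so nothing is actually downloaded.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def download(job):
    """Fetch one (url, path) pair and return the path that was written."""
    url, path = job
    urllib.request.urlretrieve(url, path)
    return path


def download_all(jobs, fetch=download, workers=3):
    """Run fetch over every job, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands jobs to idle threads and yields results in input order;
        # leaving the with-block waits for all workers to finish
        return list(pool.map(fetch, jobs))


# Dry run with a fake fetch function, so nothing touches the network
jobs = [("http://example.invalid/%d.jpg" % n, "%d.jpg" % n) for n in range(5)]
saved = download_all(jobs, fetch=lambda job: job[1])
print(saved)
```

One difference worth noting: `pool.map` queues every job up front, so this sketch has no throttling of the kind a pending-task counter makes possible. For a few thousand small tuples that is unlikely to matter.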
I love Python. I really do.
import flickr
import urllib
import os.path
import os
import threading
from time import sleep

# Code from ThreadPool not shown.
# Copied from the linked solution, with the addition of the new method listed above.
pool = ThreadPool(3)

def getPicture(data):
    # data is the (photoURL, photoPath) tuple queued by the main loop
    photoURL = data[0]
    photoPath = data[1]
    urllib.urlretrieve(photoURL, photoPath)
    print photoPath

page = 1
total_photos = found_photos = 0
while True:
    photos = flickr.people_getPublicPhotos("68432331@N00", 100, page)
    if not len(photos):
        break
    for photo in photos:
        total_photos = total_photos + 1
        photoYear = photo.datetaken[0:4]
        photoMonth = photo.datetaken[5:7]
        photoURL = photo.getURL('Original', 'source')
        photoPath = r"C:\FlickrPics\%s\%s\%s.jpg" % (photoYear, photoMonth, photo.id)
        if not os.path.exists(photoPath):
            if not os.path.exists(r"C:\FlickrPics\%s" % photoYear):
                os.mkdir(r"C:\FlickrPics\%s" % photoYear)
            if not os.path.exists(r"C:\FlickrPics\%s\%s" % (photoYear, photoMonth)):
                os.mkdir(r"C:\FlickrPics\%s\%s" % (photoYear, photoMonth))
            # Insert tasks into the queue and let them run
            pool.queueTask(getPicture, (photoURL, photoPath))
            found_photos += 1
    page = page + 1
    # Don't get too far ahead of the download threads
    while pool.getWaitingTaskCount() > 10:
        sleep(1)
    print " Moving to page %s" % page

# When all tasks are finished, allow the threads to terminate
pool.joinAll()
print "Found %s photos, saved %s new photos" % (total_photos, found_photos)