Tuesday, February 14, 2012

Download files from an online folder using axel.

EDIT: Thanks to Adithya Shriram: I forgot to mention that this is really easy to do with wget. But I wanted multithreaded downloads, and I wanted to do some scripting.
There are a LOT of ways to do the same thing.

I had to visit this page to download all those compressed files from the website. Now, axel being my favorite downloader, I badly wanted to download all of them using axel. But axel does not support downloading multiple files from a website (in short, site grabbing). I thought of writing a shell script, but Python has been on my mind for quite a long time now. So I decided to go ahead and write a Python script.

The code goes like this, with explanations wherever required. Most of the descriptions of the modules and functions are taken directly from the manual.


import sys

This module is imported to read the arguments from the command line.

import subprocess

This module lets us start a subprocess from the Python script and make the script wait until the subprocess has finished.

import urllib

This module lets us open a URL, fetch the page, and work with its contents.
I'll use the page contents to extract the links.

args= sys.argv
url=args[1]


Now, when I store the arguments supplied by the user in args, its contents will be something like: ['downloadlinks.py', 'http://blahblah.com/folder1/listing/'].
I store the second item of this list in url with url=args[1].
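
If the script is run without a URL, args[1] raises an IndexError. A minimal sketch of a friendlier check follows; the helper name get_url is my own and not part of the original script, and it is written for Python 3.

```python
import sys

def get_url(argv):
    """Return the listing URL from an argv-style list, or exit with a usage message."""
    if len(argv) != 2:
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit("usage: downloadlinks.py <listing-url>")
    return argv[1]

# Demo with a hand-written list; in the real script you would pass sys.argv
print(get_url(['downloadlinks.py', 'http://blahblah.com/folder1/listing/']))
```

In the script itself this would replace the two lines above with url=get_url(sys.argv).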

f=urllib.urlopen(url)

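I open the URL and store the returned page object in f. (urllib.urlopen is the Python 2 name; in Python 3 the same function is urllib.request.urlopen.)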
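
For anyone on Python 3, here is a rough equivalent of that step. The function name fetch_lines and the utf-8 default are my assumptions, not something from the original script.

```python
import urllib.request

def fetch_lines(url, encoding="utf-8"):
    """Download url and return its body as a list of text lines."""
    with urllib.request.urlopen(url) as response:
        # read() returns bytes in Python 3, so decode before splitting
        text = response.read().decode(encoding, errors="replace")
    return text.splitlines()
```

Calling fetch_lines(url) gives a list of strings that can be looped over the same way as f.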

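for i in f:
        words=i.split(' ')
        for word in words:
                if 'href="' in word:
                        # keep what follows href=" ; lstrip('href="') strips a set of
                        # characters and would also eat leading letters of the file name
                        word= word.split('href="', 1)[1]
                        # cut at the closing quote to get the bare file name
                        word= word.split('"')[0]
                        word= url+word
                        # pass the option and its value as separate arguments
                        subprocess.call(["axel","-n","60", word])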
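
The axel call can also be pulled into small helpers, which makes the connection count easy to change. This is only a sketch: axel_command and download are names I made up, and it assumes axel is installed and on the PATH.

```python
import subprocess

def axel_command(link, connections=60):
    """Build the argument list for one axel download."""
    # Each option and its value is a separate list element,
    # the same way the shell would split "axel -n 60 <link>"
    return ["axel", "-n", str(connections), link]

def download(link, connections=60):
    """Run axel and wait for it; return its exit status (0 means success)."""
    return subprocess.call(axel_command(link, connections))
```

Because subprocess.call waits for axel to exit, the files are fetched one after another, each with its own set of connections.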


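The rest is fairly simple: split each line on spaces, find the word that contains href, cut out the file name between the quotes, add it to the URL and hand the result to axel. The way to pick out the links properly will change from page to page. The page I used had rows like this, so it worked: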


<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="infobox_property_definitions_en.nq.bz2">infobox_property_definitions_en.nq.bz2</a></td><td align="right">10-Aug-2011 10:09  </td><td align="right">1.1M</td><td>&nbsp;</td></tr>
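
Splitting on spaces only works while the page is laid out like the row above. A less fragile approach is to let an HTML parser find the links. The sketch below uses html.parser and urljoin from the Python 3 standard library; the LinkCollector class is my own name, and the base URL is the placeholder one from earlier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative names against the listing URL
                self.links.append(urljoin(self.base_url, value))

row = ('<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>'
       '<td><a href="infobox_property_definitions_en.nq.bz2">'
       'infobox_property_definitions_en.nq.bz2</a></td></tr>')

collector = LinkCollector("http://blahblah.com/folder1/listing/")
collector.feed(row)
print(collector.links)
```

Note that a real directory listing usually also has links for sorting and for the parent directory, so the collected list would still need filtering (for example, keeping only names ending in .bz2).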



I was jumping around in joy because this is the first code I have written in Python that parses text and does something meaningful with it. Python is simple and powerful. I bow to its power.
   
