SecurityTubeBeta
Watch ... Learn ... Contribute
SecurityTube Questions - a Q&A section for Infosec and Hacking launched!!!
 

Crawling the Web for Fun and Profit

 
 

With over a couple of billion web pages on the Internet, it is tempting to see how much of this information one can mine for fun or for profit. In this video, I walk you through programming a web crawler that fetches pages and parses their content so that it can be converted into a useful format.

The web crawler we create in this tutorial consists of two main parts:
  1. Document fetching engine: fetches the raw HTML page data from a website.
  2. Document parsing engine: uses an HTML DOM parser to parse the page and derive useful information from it.
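
The two parts listed above can be sketched roughly as follows. This is not the code from the video: it is a minimal illustration that uses only the Python standard library (`urllib` and `html.parser`) as a stand-in for a DOM parser, and the names `fetch_page`, `LinkExtractor`, `parse_page` and the `User-Agent` string are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Fetching engine (hypothetical helper): return the raw HTML of `url` as text."""
    request = Request(url, headers={"User-Agent": "tutorial-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Parsing engine (hypothetical helper): collect the title and all link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html, base_url):
    """Turn raw HTML into a small dictionary that is easy to store or analyse."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"url": base_url, "title": parser.title.strip(), "links": parser.links}
```

A crawl step would then be `parse_page(fetch_page(url), url)`, with the returned links queued as the next pages to fetch. Keeping fetching and parsing in separate functions means the parser can be exercised on saved HTML without touching the network.
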
Once you have learned how to parse the data, the next step is to store it in a database. This will allow you to run further analysis on the data and derive interesting insights.
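
As a rough sketch of the storage step, assuming SQLite as the database (the page does not say which database to use), parsed pages and their links could be saved and queried like this. The table layout and the helper names `open_store`, `save_page` and `most_linked` are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "source TEXT, target TEXT, UNIQUE(source, target))"
    )
    return conn


def save_page(conn, url, title, links):
    """Store one parsed page; crawling the same URL again updates its row."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (url, title),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO links (source, target) VALUES (?, ?)",
            [(url, target) for target in links],
        )


def most_linked(conn, limit=5):
    """Example analysis: which targets are linked to from the most pages?"""
    rows = conn.execute(
        "SELECT target, COUNT(*) AS n FROM links "
        "GROUP BY target ORDER BY n DESC, target LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

Passing a file path instead of `":memory:"` to `open_store` keeps the data between runs, so analysis queries such as `most_linked` can be run long after the crawl has finished.
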

We shall use the Python language and the BeautifulSoup DOM parser to pull this off. The video is very interactive, and I use a "type as you go" methodology to help you understand the programming techniques.
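
For a feel of what parsing with BeautifulSoup looks like, here is a short sketch. It assumes the current `bs4` package (BeautifulSoup 4) on Python 3, which may differ from the version used in the video, and the function name `extract_with_soup` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_with_soup(html):
    """Return the page title and the href of every link in `html`."""
    # "html.parser" is Python's built-in parser, so no extra parser library is needed.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # href=True skips <a> tags that have no href attribute at all.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links
```

The link values come back exactly as written in the page, so relative links still need to be resolved against the page URL before they are fetched.
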

The code for this tutorial is available for download. 



Author
Vivek-Ramachandran

Vivek Ramachandran is a security evangelist and has been working in computer security related fields for the past 7 years. In 2007, Vivek spoke at the world-renowned conferences Defcon (WEP Cloaking Exposed) and Toorcon (The Caffe Latte Attack). The discovery of the Caffe Latte Attack was covered by CBS5 news, BBC online, Network World and other news outlets. In 2006, Vivek was announced as one of the winners of the Microsoft Security Shootout contest held in India among 65,000 participants. He has also been a recipient of a Team Achievement Award at Cisco Systems for his work on the 802.1x and Port Security modules on the Catalyst 6500 switches. Currently he spends all of his time maintaining Security-Freak.Net and SecurityTube.Net, and is the co-founder of Axonize. Vivek holds a Bachelor's degree in Electronics and Communications Engineering from the prestigious Indian Institute of Technology, Guwahati. You can contact him at vivek[at]securitytube.net

 
©2007 Freak Labs