SecurityTube Beta
Watch ... Learn ... Contribute
Crawling the Web for Fun and Profit
With more than a couple of billion web pages on the Internet, it is tempting to see how much of this information one can mine for fun or for profit. In this video, I walk you through programming a web crawler that fetches pages and parses their content so that it can be converted into a useful format.
The web crawler we create in this tutorial consists mainly of two parts:
- Document fetching engine: fetches the raw HTML page data from a website.
- Document parsing engine: uses an HTML DOM parser to parse the page and derive useful input from it.
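The two parts listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the video: the names `fetch_page`, `LinkExtractor`, and `extract_links` and the `User-Agent` string are hypothetical, and the event-based `html.parser` module stands in here for a full DOM parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Document fetching engine: return the raw HTML of `url` as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Document parsing engine: collect the target of every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse `html` and return the absolute URLs of its links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawl step would then be `extract_links(fetch_page(url), url)`, with the returned links queued for later fetching.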
Once you have learned how to parse the data, the next step is to store it in a database. This will allow you to run further analysis on the data and derive interesting insights.
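As a rough sketch of the storage step, the snippet below uses Python's built-in `sqlite3` module. The choice of SQLite, the `pages` table layout, and the function names `open_store`, `save_page`, and `count_pages` are assumptions made for illustration; the tutorial does not say which database it uses.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, html TEXT)"
    )
    return conn


def save_page(conn, url, title, html):
    """Insert a crawled page, replacing any earlier copy of the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title, html) VALUES (?, ?, ?)",
            (url, title, html),
        )


def count_pages(conn):
    """Return how many distinct URLs have been stored."""
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Keying the table on the URL means that re-crawling a page updates its row instead of adding a duplicate, which keeps later analysis queries simple.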
We shall use the Python language and the BeautifulSoup DOM parser to pull this off. The video is very interactive, and I use a "type as you go" methodology to help you understand the programming techniques.
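For a feel of what parsing with BeautifulSoup looks like, here is a small sketch. It assumes the current `beautifulsoup4` package (imported as `bs4`), which may differ from the release used in the video, and the helper name `parse_page` is hypothetical.

```python
from bs4 import BeautifulSoup


def parse_page(html):
    """Return the page title and the href of every link in `html`."""
    # "html.parser" selects Python's built-in parser, so no extra
    # parser library is needed for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None
    # href=True skips anchors that have no href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links
```

Calling `parse_page` on a fetched page gives a title to store and a list of links to crawl next.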
The code for this tutorial is available for download.
Author
Vivek Ramachandran is a security evangelist and has been working in computer security related fields for the past 7 years. In 2007, Vivek spoke at the world-renowned conferences Defcon (WEP Cloaking Exposed) and Toorcon (The Caffe Latte Attack). The discovery of the Caffe Latte Attack was covered by CBS5 news, BBC online, Network World, and other news agencies. In 2006, Vivek was announced as one of the winners of the Microsoft Security Shootout contest, held in India among 65,000 participants. He has also been a recipient of a Team Achievement award at Cisco Systems for his work on 802.1x and Port Security modules on the Catalyst 6500 switches. He currently spends all of his time maintaining Security-Freak.Net and SecurityTube.Net, and is the co-founder of Axonize. Vivek holds a Bachelor's degree in Electronics and Communications Engineering from the prestigious Indian Institute of Technology, Guwahati. You can contact him at vivek[at]securitytube.net