Spider or Ant
According to the philosophies of computer science a spider is the same as a ant. Yes! that, in fact is true because this almost jargon is talking about web crawling or web scuttering which is further used for automatic indexing and web scraping. I used the spider algorithm to get data from a few sites. The first hack was using lynx and Beautiful Soup and, later, a better and elegant solution using Python?s urlparse and urllib instead of lynx. Beautiful Soup is indeed beautiful, as beautiful as the morning Sun shining on the dew covered grass. Excuse the poetess in me! So, coming back, I have used Beautiful Soup extensively till now and it makes getting data into an accessible format much, much easier. I cannot imagine a more elegant way to get and process data. Through this project I learnt the intricacies that the Google Search Bot must have done through and solved using their googliness. Also the research paper ?The Anatomy of a Large-Scale Hypertextual Web Search Engine’ by Sergey Brin and Lawrence Page is a good read.
A spider algorithm or a web crawler is an internet bot that crawls a site for information usually in the form of text. A spider works by first going to the seminal link and then parsing the text on that web page to find more links that direct it to other web pages. These freshly found links are added to a stack or queue. All the links or urls in this stack or queue are traversed by the spider one by one.
So, from the initial web page the spider lands to another web page and continues its journey forth, recursively. In this way it crawls throughout the web. This means that from one page the spider should be able to access all the pages that are present on the World Wide Web, but my spider does not do that because I have limited it to access only the links present on the initial web page.
Spiders are usually used for web indexing. Other uses are to update web content or indexes. Spiders have the ability to copy all the content of a web page (if the page permits). All this data will be processed by a search engine at a later stage for indexing so that searching and information retrieval is rendered faster and easier.
In order to get data from a site I usually use one of the following four methods:
- curl command - Tool to transfer data from or to a server using http/https/ftp
- lynx command - World Wide Web (WWW) client/browser for users running terminals.
- wget command - Enables non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.
- w3m command - It is a text based Web browser and pager.
These are not the only way to get links or urls from a site. An easier and direct way is to use Python’s parsing libraries namely urlparse, urllib along with Beautiful Soup. Another alternate way is to use HTTP get and post requests.
The file robots.txt provides permission to the crawler to get the data from the page. It is essential that scraping be done only for those web pages that allow it. In order to read this file we can employ robotparser of Python. Then if the permission is granted the spider will create a separate thread and go to this site to parse it.
The biggest advantage of using Beautiful Soup is its soup objects. These soup objects are flexible enough to be used capriciously. I have used them by sorting according to tags. This enables me to differentiate between the various HTML tags which is very important because this entails the basis of all web pages. I have used ?href? tags to find the url of the next page that will be appended to the list of the urls. If instead I had used the ?a? tag, then the url would turn out to be only a page and no hyperlink would exist, so a better option was to use the ?href? tags.
The implementation of the spider algorithm can be found on GitHub.
blog comments powered by Disqus