Spiderbots

Spiders

We use Google (and other search engines) to find things on the Web, but how does Google know where all of these pages are located and which ones are relevant to our queries?

Google uses virtual robots (called spiders) to explore and index the Web. Spiders aren’t really robots; they are simply small programs that follow a very simple routine (sketched in code after the list below):

  1. Visit a series of web pages
  2. Gather all of the links on each page visited
  3. Add the links to its list of pages to visit in the future
  4. Repeat until it has visited all pages in its list
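
To make the routine concrete, here is a minimal sketch of that loop in plain Java (Processing, the language the SpiderBot sketch below is written in, is built on Java). The “web” here is a tiny, made-up map of page names to links rather than real URLs, so the sketch shows only the worklist-and-visited-set shape of the routine, not actual page fetching:

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;
    import java.util.Set;

    public class ToySpider {
        public static void main(String[] args) {
            // A tiny, made-up "web": each page maps to the links found on it.
            Map<String, List<String>> web = new HashMap<>();
            web.put("pageA", List.of("pageB", "pageC"));
            web.put("pageB", List.of("pageA", "pageC"));
            web.put("pageC", List.of("pageD"));
            web.put("pageD", List.of());

            Queue<String> worklist = new ArrayDeque<>(); // pages queued for a future visit
            Set<String> visited = new HashSet<>();       // pages already indexed
            worklist.add("pageA");                       // the worklist starts with one page

            while (!worklist.isEmpty()) {                // 4. repeat until the list is empty
                String page = worklist.remove();         // 1. visit the next page
                if (!visited.add(page)) {
                    continue;                            // already indexed: never visit twice
                }
                for (String link : web.get(page)) {      // 2. gather all links on the page
                    if (!visited.contains(link)) {
                        worklist.add(link);              // 3. add them to the worklist
                    }
                }
                System.out.println("Indexed " + page + "  (indexed: " + visited.size()
                        + ", queued: " + worklist.size() + ")");
            }
        }
    }

Notice that the worklist starts with a single page and grows only with links discovered along the way, and that a page is never indexed twice; these correspond to the Pages Queued and Pages Indexed statistics you will see in SpiderBot’s interface.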

Watch the video below to see how these spiders create Google’s index of the web and affect your search results:

https://www.youtube.com/embed/BNHR6IQJGZs

Assignment

Now you will observe a spider in action! Download this Processing sketch for a toy spiderbot.

Click to Download: SpiderBot

This is not a very robust program; it only works with webpages that use simple HTML. But it is very small, and it will let you see how a spider traverses the Web. Unlike a search engine spider, which indexes every page it encounters for later retrieval, SpiderBot just counts: pages visited, words seen, and so on.

All of the spidering work is done in the function named processURL. Have a look at the code and note how it carries out what a spiderbot does (a rough sketch of such a function follows the list below):

  1. visit webpages,
  2. gather all of the links on each page visited, and
  3. add them to its list of pages to visit in the future.
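
SpiderBot’s real processURL is in the download above; the sketch below is only a rough guess at the general shape of such a function, written in plain Java. The naive href regex, the word-counting rule, and the test URL (https://example.com/, a convenient single-page site) are all assumptions made for illustration, but the regex-based link extraction does show why only simple HTML pages work:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ProcessUrlSketch {
        // Matches href="http..." attributes (absolute links only, for simplicity);
        // this naive pattern is why only simple HTML pages work.
        static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

        static Queue<String> worklist = new ArrayDeque<>(); // pages queued to visit
        static Set<String> visited = new HashSet<>();       // pages already indexed
        static long wordsSeen = 0;

        // Roughly what a processURL-style function does: fetch one page,
        // count its words, and queue any links it contains.
        static void processURL(String address) throws Exception {
            if (!visited.add(address)) return;               // never index a page twice
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(address).openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Very rough word count: strip tags, then split on whitespace.
                    String text = line.replaceAll("<[^>]*>", " ").trim();
                    if (!text.isEmpty()) wordsSeen += text.split("\\s+").length;

                    Matcher m = HREF.matcher(line);          // gather the links on this line
                    while (m.find()) {
                        String link = m.group(1);
                        if (!visited.contains(link)) worklist.add(link);
                    }
                }
            }
            System.out.println(address + " -> words so far: " + wordsSeen
                    + ", queued: " + worklist.size());
        }

        public static void main(String[] args) throws Exception {
            // A hard-coded test page; SpiderBot's real starting point is START_HERE.
            processURL("https://example.com/");
        }
    }

A full crawl would keep calling a function like this on entries pulled off the worklist, exactly as in the loop sketched earlier.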

When you execute SpiderBot.pde, you are presented with a small interface that displays indexing statistics as the bot works, along with a few controls to tailor the output. The following annotated images explain each of the statistics and controls in the SpiderBot interface.

1: These are the dynamically generated statistics of the indexing process. Pages Indexed indicates how many pages the spider has processed. Once a page is processed, the spider will not visit it again, even if another link to the page is found later. Pages Queued shows how many pages have been added to the worklist. The worklist begins with just the starting page; every other entry is a URL found while spidering.

2: Processing Speed lets you slow down or speed up the crawl. Note that you can follow the program’s progress in the Processing console.

3: START begins the spidering process; STOP halts it and presents the final report.

4: These controls fine-tune the final report. Top # of Words selects how many of the highest-frequency terms to report. Min. Word Length limits the report to terms of at least a certain length; for example, setting the minimum length to two skips all one-letter words. Stop Words? is a checkbox that, when checked, skips commonly used words in the report (see the sketch after the report description below). The list of these words lives in a file called stopwords.txt in the SpiderBot sketch folder; you may add words to it or remove words from it if you like.

5: The spider begins at this URL. You can change the starting URL by editing the value of START_HERE at the top of the Processing source. However, remember that many sites, particularly those that rely on more than simple HTML, will not work with this simple program.

The final report is printed in the Processing console. You may click and drag the divider between the source code window and the console window to change the height of the console.

6: This is a summary of the dynamic statistics gathered during the web crawl.

7: A table of the highest-frequency terms is generated when either STOP is pressed or the Pages Queued count falls to zero. This report is configurable using the controls outlined in the figure above.
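
As a rough illustration of what might go into that report (this is not SpiderBot’s actual code), the sketch below builds a frequency table from a stand-in block of text and applies the three report controls described above. The names topN, minLength, and stopWords are hypothetical, and the small stop-word set stands in for the contents of stopwords.txt:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class ReportSketch {
        public static void main(String[] args) {
            // Stand-in for the text gathered during a crawl.
            String indexedText = "the spider visits a page the spider counts words "
                    + "a page links to a page and the spider follows the links";

            int topN = 5;                              // "Top # of Words"
            int minLength = 4;                         // "Min. Word Length"
            // Stand-in for the contents of stopwords.txt.
            Set<String> stopWords = Set.of("the", "and", "to", "a");

            // Count how often each qualifying word appears.
            Map<String, Integer> counts = new HashMap<>();
            for (String word : indexedText.toLowerCase().split("\\s+")) {
                if (word.length() < minLength) continue;   // skip short words
                if (stopWords.contains(word)) continue;    // skip stop words
                counts.merge(word, 1, Integer::sum);
            }

            // Print the highest-frequency terms, most frequent first.
            counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(topN)
                    .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
        }
    }

Re-running the same report with a different minimum length or a different stop-word list changes which terms rise to the top, which is exactly the kind of variation the assignment below asks you to compare.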

Submission

You must submit a report (≥1 page) outlining the following:

  • An introduction to the spidering process. Include a synopsis of how a spider gets its worklist of pages to index.
  • A comparison of SpiderBot runs on two websites with similar subjects. Wikipedia pages work well because they are formatted fairly simply and include a lot of outgoing links; for example, https://en.wikipedia.org/wiki/Android_(operating_system) and https://en.wikipedia.org/wiki/IOS.
    • Include tables of high-frequency terms generated with different options selected (such as Min. Word Length and Stop Words?).
    • Include a summary of the pages, lines, words, and characters indexed for each site. To facilitate comparison, run the spider on each site for the same amount of time at the same speed.
  • A brief analysis of the comparison. Consider the following questions:
    • Is it easy to tell which lists of high-frequency terms belong to which site? How might adjusting the stop word list help you distinguish between them?
    • As the process progresses, note the URLs that are being crawled by the spider. Do they have the same domain name as the starting URL? How does this affect the report?

Robot or Not?

Given how easy it is to write a simple program that autonomously crawls the Web and reaches even the deepest corners of the Internet without any human supervision, how much Web traffic comes from actual humans, and how much is the work of these bots?