ECEN 1200 - Telecommunications 1

Peter Mathys, Fall 2006, 9/15/06


Homework/Computer Lab 3: Databases, Searching the Web


How to Submit Solutions for this Homework/Computer Lab
Due Date: Friday 9-22-06
E-mail To: no-spam e-mail
Subject Line: HWCL 03
1'st Line: Your name and student number

This computer lab is written for Windows 9x/2000/NT/XT based PCs, with a Mozilla/Firefox Browser or an Internet Explorer. Those portions of the lab that are not platform-specific can also be run on a Mac.

Quick Links


Goals of this Lab

The goals of this computer lab are:


Major Search Engines

It is estimated that there are several billion documents (1 billion = 109 = 1,000,000,000) on the Internet, and this number can be expected to continue growing in the forseeable future. Thus, the need for tools to find documents and other resources on the WWW is quite obvious. What is commonly (and somewhat imprecisely) referred to as search engines can be divided into three categories:

A human-powered directory is usually hierarchically organized and, most importantly, compiled by human editors. Prominent examples are the Yahoo! Directory and the dmoz open directory project which is also accessible (with enhanced relevance ranking) at Google Directory.

A crawler-based search engine is a collection of URIs (uniform resource identifiers) that has been collected by a program that systematically "crawls" or "spiders" the Web, i.e., searches the IP address space for documents, and then follows all the links contained in these documents, and the links found in the linked documents, etc. The pages and other resources found in this way are analyzed in terms of their content, keywords, and number and quality of links to other items. The results of this analysis are stored in a database and indexed for later retrieval by search queries. In many cases a crawler-based search engine indexes the full text of the documents that it finds, so that it is later possible to search for any word or combination of words that ocurred anywhere in a document. Examples of crawler-based search engines are Google and Yahoo! Search, which both contain several billions of indexed documents.

A meta search engine, finally, does not do any searching or indexing by itself, but rather sends queries out to several search engines and then processes and rearranges the answers before returning them to the user. Examples of meta search engines are Dogpile and info.com.

The following table lists some of the major search engines and their main classification:

Name URL Remarks
AllTheWeb http://www.alltheweb.com/ Crawler-based, acquired by Overture (now a Yahoo company) in April 2003. Powered by Yahoo since 2003.
AltaVista http://www.altavista.com/ Crawler-based, opened in December 1995. Was for several years the "Google" of its day. Now owned by Yahoo and using Yahoo's search engine.
AOL Search http://search.aol.com/ Human-powered, supplemented by Google.
Ask Jeeves http://www.ask.com/ Uses crawler-based technology to provide results to its users. Uses Teoma search engine which it purchased in September 2001.
Dogpile http://www.dogpile.com/ Meta search engine. Displays results from Google, Yahoo, MSN Search, Ask Jeeves, LookSmart, About, and others. Owned by InfoSpace.
Excite http://www.excite.com/ Meta search engine. Acquired by InfoSpace in 2002.
Gigablast http://www.gigablast.com/ Small (compared to Google) crawler-based search engine. Experimental but dependable, operated by a single person (Matt Wells).
Google http://www.google.com/ Crawler-based, ranking by popularity plus sponsored results. By far the most popular search engine
Google AdWords https://adwords.google.com/ The Google AdWords program places paid listings within Google's search results as well as some other sites that carry its listings.
Google Directory http://directory.google.com/ Human-powered from Open Directory and integrated with Google's sophisticated search technology.
HotBot http://www.hotbot.com/ Provides easy access to the web's three major crawler-based search engine: Yahoo, Google, and Teoma. Similar to a meta search engine, but shows results from different search engines on separate pages.
info.com http://www.info.com/ Meta search engine. Displays results from Google, Ask Jeeves, Yahoo, Altavista, Overture, Inktomi, LookSmart, Open Directory, and others.
LookSmart http://www.looksmart.com/ Primarily human-powered, supplemented by WiseNut crawler.
Lycos http://www.lycos.com/ Provides access to human-powered results from LookSmart and crawler-based results from Yahoo.
MetaCrawler http://www.metacrawler.com/ One of the oldest meta search services. Owned by InfoSpace.
MSN Search http://search.msn.com/ Uses its own crawler-based technology in addition to ite own human-powered directory results.
Netscape Search http://search.netscape.com/ Crawler-based, using Google for its main listings. Owned by AOL Time Warner.
Open Directory http://dmoz.org/ Web directory which uses volunteer editors to catalog the Web. Launched in June 1998. Now owned by AOL Time Warner.
Overture www.overture.com Oldest major paid placement search engine, distributing listings to a wide range of search engines. Purchased by Yahoo in October 2003.
Teoma http://www.teoma.com/ Crawler-based, with intelligent search refining, owned by AskJeeves since September 2001.
WebCrawler http://www.webcrawler.com/ Was one of the first crawler-based search engines. Acquired by InfoSpace in 2002 it is now a meta search engine. Offers fast and clean ad-free interface.
WiseNut http://www.wisenut.com/ Crawler-based, owned by LookSmart.
Yahoo http://search.yahoo.com/ Crawler-based search engine using Yahoo's own search technology.
Yahoo Directory http://dir.yahoo.com/ Human-powered (about 150 editors) web directory, in existence since late 1994.
Zeal http://www.zeal.com/ Human-powered directory owned by LookSmart.

Three of the biggest indexed databases, which are often used as backup supplements by human-powered Web directories, are Google, Yahoo, and Teoma.


Databases

The goal of a large database, which tries to organize the documents of an evolving medium such as the World-Wide Web, is to have as much flexibility as possible, to list every relationship between items exactly once, to use the minimum amount of storage, and to achieve short access times for queries. To understand the ideas behind the relational database model, which is the one most commonly used for modern database management systems (DBMS), it is best to look at a small example. The following figure shows the cards in a conventional card file for an imaginary "Mini Zoo" with seven animals.

Card File of Mini Zoo

One disadvantage of such a card file that one can see quite quickly is the potential for inconsistency because the same information appears in more than one place. For example, if Joe changes his schedule from 10 am - 4 pm to 9 am - 3 pm, then all the cards of the animals for which Joe is the caretaker must be updated. What if only the Duck and Fish cards are updated, but the Frog card is forgotten? Someone might come later and only look at the Frog card and think that Joe still works from 10 am - 4 pm.

A straightforward way to convert the card file to a format that can be used for information storage and retrieval on a computer is to use a table with as many columns as there are properties for each animal and as many rows as there are animals. Since some animals can live on land or in water, two columns are needed for the Habitat properties. Similarly, three columns ar needed for the Color properties. This leads to the flat file database shown in the following figure.

Flat File Database for Mini Zoo

Because the number of columns is fixed, this is potentially wasting a lot of space. Moreover, the rectangular table structure makes future changes and upgrades difficult, especially as the number of entries (rows) becomes large. Suppose, for example, that an exotic bird with 5 different colors is added to the Mini Zoo. Now the whole data base needs to be expanded by adding two more Color columns. Searching is also not particularly easy in a flat file database. Suppose you wanted to know how many slow animals that can live on land are in the Mini Zoo. You need to go sequentially through all records (rows in the table) to answer this question.

In a modern relational database each record of the flat file database is broken up into smaller pieces such that each fundamental relationship between an entity and its properties occurs once and only once, and such that searches can be done much more quickly than reading sequentially through all records. The following figure shows how the Mini Zoo data can be arranged using 5 tables, each with only 2 columns, in the form of a relational database.

Relational Database for Mini Zoo

Now the number of colors an animal can have is not restricted by the number of columns the table has, there will just simply be as many entries (rows) in the Color Table as the animal has different colors. Similarly, if one of the caretakers changes their schedule, that change needs to made in only one place, namely the Caretaker Schedule Table.

To prepare the database for efficient searching, e.g., for all animals that are slow and live on land, each table in the database must be preprocessed by sorting it with respect to the second column. The result of this for the Mini Zoo relational database is is shown in the next picture.

Relational Database After Sorting wrt 2'nd Column

Now the fundamental relationships that are stored in the database can be used to logically derive more complicated relationships using the mathematical theory of sets and Boolean operators (AND, OR, NOT, see next section), in order to deliver the results for specific queries. For example, to answer the question which animals have Joe as caretaker and are slow, the set of all animals that have Joe as caretaker ({Duck,Fish,Frog}) and the set of all animals that are slow ({Duck,Frog,Ladybug}) are first determined. Then the intersection or "AND" operation of the two sets is computed, resulting in the answer {Duck,Frog}. If the question is which animals have Joe as caretaker or are slow, then the union or "OR" operation of the sets Joe and slow is computed, with the result {Duck,Fish,Frog,Ladybug}.


Sets and Boolean Operators

One of the key ideas behind the relational database model is the use of a well-defined mathematical foundation to formulate queries for the retrieval of documents from the database. Every document and all properties of documents are regarded as objects that can be grouped into sets. For example, in the Mini Zoo database the set of all animals that are grey forms a set, namely Grey = {Dolphin,Duck}, and the set of all animals that can live on land forms a set Land = {Duck,Frog,Horse,Ladybug,Lion}. Using these sets, one can easily answer the question "Which animals can live on land and are grey?" as Duck. Using Boolean operators (AND, OR, NOT), one obtains Grey AND Land = {Duck}. The AND operator is also called the intersection between two sets. Instead of AND one can also use OR and ask for all animals in the Mini Zoo that can live on land or are grey. The result is the set Grey OR Land ={Dolphin,Duck,Frog,Horse,Ladybug,Lion}. Note that Duck only appears once in the result, even though it appears in both the Grey and the Land sets. The OR operation of two or more sets is also called the union of the sets. The meaning of AND and OR is shown graphically in the form of Venn diagrams in the following figure using the two sets A and B.

Boolean relations for AND, OR

There is a fundamental relationship between set addition, denoted by +, and the operators AND and OR, which says that

A OR B = A + B - (A AND B).

Knowledge of this relationship will allow you, for instance, to test whether a search engine interprets the AND and OR operations properly. In terms of the Grey and Land sets defined above, one would write

Grey + Land = {Dolphin,Duck,Duck,Frog,Horse,Ladybug,Lion} (note Duck appearing twice)
Grey AND Land = {Duck}
==> Grey + Land - (Grey AND Land) = Grey OR Land = {Dolphin,Duck,Frog,Horse,Ladybug,Lion}

where Duck appears only once in the last set.

Similarly, looking only at the size of the sets, denoted by |set|, one finds that

|Grey| = 2 , |Land| = 5
|Grey AND Land| = 1
==> |Grey| + |Land| - |Grey AND Land| = 2 + 5 - 1 = 6 = |Grey OR Land|

The NOT operator is useful for excluding objects from a set, or building the complement of a set. In the Mini Zoo, for example, one can ask for the set of all animals that are not fast, i.e., NOT Fast = {Duck,Frog,Ladybug}. Since in this case an animal can only be either fast or slow, NOT Fast = Slow and the set of fast animals is said to be the complement of the set of slow animals. This is shown graphically in the following figure.

Boolean relations for NOT

One can now ask such (seemingly meaningless) questions as "What is the set of all animals that are fast AND NOT fast?" or "What is the set of all animals that are fast OR NOT fast?" The answers are quite clearly:

Fast AND (NOT Fast) = Fast AND Slow = {} (empty set, no animal can be fast and slow simultaneously)
Fast OR (NOT Fast) = Fast OR Slow = Everything (each animal is something, either fast or slow).

Statements like these can be used to analyze the behavior and certain properties, like the total size, of search engines.


Relational Database Example

Below are the tables of the relational database of another "Mini Zoo" with the following 12 animals:

Bear, Beaver, Crocodile, Dragonfly, Eagle, Elephant,
Giraffe, Mouse, Octopus, Parrot, Rabbit, Slamander

Note that the entries in the tables are sorted alphabetically with respect to the first column.

+--------------------+   +--------------------------+   +--------------------+
| Animal Speed Table |   | Caretaker Schedule Table |   | Animal Color Table |
|--------------------|   |--------------------------|   |--------------------|
| Bear          Slow |   | Anne         6 am - 8 am |   | Bear         Black |
| Beaver        Fast |   | Bill         8 am - 9 am |   | Bear         Brown |
| Crocodile     Slow |   | Bill         5 pm - 6 pm |   | Beaver       Brown |
| Dragonfly     Slow |   | Nancy        7 am - 9 am |   | Crocodile    Green |
| Eagle         Fast |   | Nancy        3 pm - 5 pm |   | Dragonfly     Blue |
| Elephant      Slow |   | Paul         1 pm - 3 pm |   | Dragonfly   Orange |
| Giraffe       Fast |   | Rob         10 am - 3 pm |   | Dragonfly   Yellow |
| Mouse         Fast |   +--------------------------+   | Eagle        Black |
| Octopus       Slow |                                  | Eagle        White |
| Parrot        Slow |                                  | Elephant      Grey |
| Rabbit        Slow |                                  | Giraffe      Brown |
| Salamander    Fast |                                  | Giraffe     Yellow |
+--------------------+       +----------------------+   | Mouse         Grey |
                             | Animal Habitat Table |   | Mouse        White |
+------------------------+   |----------------------|   | Octopus      Black |
| Animal Caretaker Table |   | Bear           Land  |   | Octopus      Brown |
|------------------------|   | Beaver         Land  |   | Octopus     Orange |
| Bear               Rob |   | Beaver         Water |   | Octopus        Red |
| Beaver            Anne |   | Crocodile      Land  |   | Octopus     Yellow |
| Beaver            Paul |   | Crocodile      Water |   | Parrot       Black |
| Crocodile         Anne |   | Dragonfly      Air   |   | Parrot        Blue |
| Crocodile         Paul |   | Dragonfly      Land  |   | Parrot       Green |
| Dragonfly        Nancy |   | Eagle          Air   |   | Parrot         Red |
| Eagle            Nancy |   | Eagle          Land  |   | Parrot       White |
| Elephant           Rob |   | Elephant       Land  |   | Parrot      Yellow |
| Giraffe            Rob |   | Giraffe        Land  |   | Rabbit       Black |
| Mouse             Bill |   | Mouse          Land  |   | Rabbit       Brown |
| Octopus           Anne |   | Octopus        Water |   | Rabbit       White |
| Octopus           Paul |   | Parrot         Air   |   | Salamander   Black |
| Parrot           Nancy |   | Parrot         Land  |   | Salamander  Yellow |
| Rabbit            Bill |   | Rabbit         Land  |   +--------------------+
| Salamander        Anne |   | Salamander     Land  |
| Salamander        Paul |   | Salamander     Water |
+------------------------+   +----------------------+
    

Practice your understanding of databases by answering the following questions:

  1. Which animals can live on land and are red?
  2. Which animals are slow and either black or grey?
  3. Which animals are fast and can live on land and are not yellow?
  4. In a conventional cardfile, what would the card for the Salamander look like?

Features of Search Engines

An important question to ask when using a search engine is how the user interface works and how a query that is entered is interpreted by the search engine. Some particular features of interest are:

  1. Can the Boolean operators AND, OR and AND NOT be used? Can such questions as recipes AND chicken OR beef be asked if one is interested in recipes for cooking either chicken or beef, or dolphins AND NOT nfl if one is interested in those dolphins that live in water? Many search engines allow you to use AND, OR, and NOT (usually capitalized) directly. In some cases one needs to use Advanced Search and then choose from a pulldown menu. Instead of AND some search engines use + (the plus sign), instead of OR some use ( ) (parentheses), and instead of NOT some use - (the minus sign).
  2. Are "stop words" included in the index? Stop words are words that occur frequently in a language, e.g., in English

    a, the, to, of, in, and, for, an, by, with, from, an, I, it, that, not, can, no, if, we, has, was, up, but

    are the most common stop words. This becomes important in a search for the Who (the rock band) or the famous phrase To be or not to be.
  3. Is the text in the query case sensitive, i.e., are the search results different when asking for internet or for Internet?
  4. Is the plural of a word automatically included in the search or not? For example, when looking for a used car, is it better to search for used car or used cars, or do both queries yield identical results?
  5. Is it possible to search for phrases, such as "refrigerator magnet", thereby avoiding those documents that are concerned with the magnet that seals the refrigerator door when you close it?
  6. Are "wildcard words" allowed in phrases, i.e., can a * (asterisk) be used as a placeholder for one or more unknown words? For instance, when using the search phrase "baby with * bathtub", does the search engine automatically look for phrases like "baby with bathtub", "baby with the bathtub", "baby with water in the bathtub" etc?
  7. What is the actual size of the database, i.e., how many documents are actually indexed and and included when a specific search is performed?
  8. What is the page depth amount, i.e., if a web page is 1 MB (megabyte) long, is the whole text indexed, or does a search engine only look at the first 100 kB (kilobytes), for example?

There are more properties that one could ask for, but in this lab only the ones listed above are considered.


How to Test and Estimate Search Engine Features

Some questions about the features of search engines can be answered by reading help and explanation files provided by the operators of the search engine. However, for reasons that depend on marketing, competition pride, neglect and other not so easily quantifiable driving forces, this information may be incomplete, outdated, or just plain misleading. Another option is to use selected queries and simple statistical analysis methods to determine the actual features of a search engine. Here are some methods that work quite well for the features mentioned above:

  1. Boolean operators. Use the following queries and note the number of documents returned in each case:

    health
    science
    health OR science (note the capitalized OR)
    health AND science (note the capitalized AND)
    health AND NOT science (note the capitalized AND NOT)

    Then, use the numbers in the following formulas and check whether they hold with reasonable accuracy:

    #(health OR science) = #(health) + #(science) - #(health AND science)

    #(health AND science) + #(health AND NOT science) = #(health)

    As a quick check, verify that #(health AND science) is much smaller than #(health OR science) and that #(health AND NOT science) is somewhat smaller than #(health).
  2. Stop words. Try the words a, of and the and see what response you get. You would expect each of these stopwords to occur in roughly two thirds (2/3) of all documents. Note: In some cases you may have to specify a search for stop words as +the or "the" instead of just the.
  3. Case sensitivity. Try the words may and May and look at the difference in the number of returned documents.
  4. Plural inclusion. Try computer and computers and, if possible, computer OR computers and look at the number of returned hits.
  5. Phrases. Try chocolate cookie and "chocolate cookie" and look at the difference in the number of returned documents (if any). The phrase "chocolate cookie" should return much fewer documents than chocolate cookie.
  6. Wildcards in phrases. Assuming that phrases (enclosed in quotes) can be used, try "baby with bathwater" and record the number of hits, which should be quite low. Then try "baby with the bathwater" and record the number of hits. This should be significantly higher than "baby with bathwater". Now try "baby with * bathwater" where * is used for a wildcard word. If "baby with * bathwater" gives a similar number of hits as "baby with bathwater" then wildcards do not work. If "baby with * bathwater" gives a similar number of hits as "baby with the bathwater" then * stands for exactly one wildcard word. If "baby with * bathwater" gives a noticeably larger amount of hits than "baby with the bathwater" then * stands for one or more (and possibly zero) wildcard words.
  7. Actual index size. Suppose you knew the (approximate) percentage of URIs or Web pages that contain a particular keyword, such as the ones given in the following table.

    Keyword Occurrence in Web Pages
    health6.7%
    science4.1%
    digital4.3%
    entertainment4.3%
    medicine1.4%

    Use each of these words in a query for the search engine whose size you want to estimate and record the number of hits returned. Then divide the number of hits by the percentage given for a particular word and multiply the result by 100. This gives an estimate of the size of the database based on a single particular keyword. Do this for several keywords and then average over all the size estimates obtained in this way. This gives a pretty good estimate of the size of the database used by a search engine.
  8. Page depth. Make use of a long file (1 megabyte (MB) or more) that is likely to be indexed in most search engines and which contains distinct phrases. An example is the file humbn10.txt (1.45 MB) which contains the whole text of the novel Of Human Bondage by W. Somerset Maugham. Instead of humbn10.txt you may also see the HTML versions humbn10.txt.html and humbn10.html which are of similar length. But do not make use of any files that are excerpts from this text, conversions to pdf, or other modifications (e.g., part 2 of 5) for this test. Then search for a unique phrase like "Philip parted from Emma with tears" near the beginning of the file. Check that one of the humbn10 files listed above is among the search results. Then make another search, e.g., with the unique phrase "Philip had no idea where the field was" which occurs about 100 kB (kilobytes) into the text. Check again that one of the humbn10 files is among the search results. Repeat with the unique phrases "You can get the Magdalen" (200 kB into the text), "Lawson white with passion now" (430 kB into the text), and "Athelny told Philip that he could easily" (about 1100 kB into the text). Note up to which phrase any of the original humbn10 files still occurs among the hits returned by the search engine. This is a good estimate of the page depth to which the search engine analyzes the documents that it indexes.

Your Task

To obtain credit for this homework/computer lab you need to answer the questions stated below. Send your solution to no-spam e-mail.

Rules for the Submission of Homework/Computerlab Solutions

Format: E-mail your solution as a plain text (ASCII) file. Do not use word processor files like Microsoft Word or "rich-text" HTML files.

Corrections: If you need to make corrections after you submitted your solution, resubmit all your answers (not only the ones that changed) since only your last submission of each homework/computer lab will be graded.

Teamwork: Teamwork is fine for the homework/computer labs, but the solutions must be turned in individually. In particular, copy and paste of entire solutions from other students is not acceptable.

Questions:

  1. Look at the relational database example above to answer the following questions:
    1. Which animals can live on land and are red?
    2. Which animals are slow and either black or grey?
    3. Which animals are fast and can live on land and are not yellow?
    4. In a conventional cardfile, what would the card for the Salamander look like?
  2. For each of the following search engines:

    Google http://www.google.com
    Ask http://search.ask.com
    Yahoo http://search.yahoo.com


    answer the feature questions below (see How to Test and Estimate Search Engine Features above). For all questions make sure that you search the whole Web (not images, directories, mp3 files, etc) and that you use the counts for the number of Web sites, rather than paid directory listings, most popular sites, etc.
    1. Which of the Boolean operators OR, AND, and AND NOT does the search engine interpret correctly?
    2. Can stop words be used in searches or are they discarded?
    3. Can the search engine search for phrases?
    4. Can wildcard words be used in phrase search?
    5. Estimate the actual index size using the keywords given above. Show your computations.
    6. Estimate the page depth amount to which the search engine indexes documents.
  3. Formulate each of the following searches such that you get the relevant answers within the first few pages and include the search phrase and the search engine you used in your solution.
    1. In the 1940's a device called Hush A Phone was sold. Unexpectedly for the designers of this device, it entered the history of Telecommunications in a significant way. Find out what that device was and why people still talk about it now.
    2. Find a reference that explains the meaning of the phrase "rat race".
    3. Who is Jeeves and how is/was he connected with the search engine Ask?
    4. Where does the name Google come from?
  4. Add a clickable list (with anchors of the form <a href="search_engine_URI">search engine</a>) of your 3 favorite search engines to your homepage (in index.html, or make a separate page if you don't want this list to be on your default homepage). Include the URI at which your page can be accessed on the Web in your e-mail.