| How to Submit Solutions for this Homework/Computer Lab | |
|---|---|
| Due Date: | Friday 9-22-06 |
| E-mail To: | ![]() |
| Subject Line: | HWCL 03 |
| 1'st Line: | Your name and student number |
This computer lab is written for Windows 9x/2000/NT/XT based PCs, with a Mozilla/Firefox Browser or an Internet Explorer. Those portions of the lab that are not platform-specific can also be run on a Mac.
The goals of this computer lab are:
It is estimated that there are several billion documents (1 billion = 109 = 1,000,000,000) on the Internet, and this number can be expected to continue growing in the forseeable future. Thus, the need for tools to find documents and other resources on the WWW is quite obvious. What is commonly (and somewhat imprecisely) referred to as search engines can be divided into three categories:
A human-powered directory is usually hierarchically organized and, most importantly, compiled by human editors. Prominent examples are the Yahoo! Directory and the dmoz open directory project which is also accessible (with enhanced relevance ranking) at Google Directory.
A crawler-based search engine is a collection of URIs (uniform resource identifiers) that has been collected by a program that systematically "crawls" or "spiders" the Web, i.e., searches the IP address space for documents, and then follows all the links contained in these documents, and the links found in the linked documents, etc. The pages and other resources found in this way are analyzed in terms of their content, keywords, and number and quality of links to other items. The results of this analysis are stored in a database and indexed for later retrieval by search queries. In many cases a crawler-based search engine indexes the full text of the documents that it finds, so that it is later possible to search for any word or combination of words that ocurred anywhere in a document. Examples of crawler-based search engines are Google and Yahoo! Search, which both contain several billions of indexed documents.
A meta search engine, finally, does not do any searching or indexing by itself, but rather sends queries out to several search engines and then processes and rearranges the answers before returning them to the user. Examples of meta search engines are Dogpile and info.com.
The following table lists some of the major search engines and their main classification:
| Name | URL | Remarks |
|---|---|---|
| AllTheWeb | http://www.alltheweb.com/ | Crawler-based, acquired by Overture (now a Yahoo company) in April 2003. Powered by Yahoo since 2003. |
| AltaVista | http://www.altavista.com/ | Crawler-based, opened in December 1995. Was for several years the "Google" of its day. Now owned by Yahoo and using Yahoo's search engine. |
| AOL Search | http://search.aol.com/ | Human-powered, supplemented by Google. |
| Ask Jeeves | http://www.ask.com/ | Uses crawler-based technology to provide results to its users. Uses Teoma search engine which it purchased in September 2001. |
| Dogpile | http://www.dogpile.com/ | Meta search engine. Displays results from Google, Yahoo, MSN Search, Ask Jeeves, LookSmart, About, and others. Owned by InfoSpace. |
| Excite | http://www.excite.com/ | Meta search engine. Acquired by InfoSpace in 2002. |
| Gigablast | http://www.gigablast.com/ | Small (compared to Google) crawler-based search engine. Experimental but dependable, operated by a single person (Matt Wells). |
| http://www.google.com/ | Crawler-based, ranking by popularity plus sponsored results. By far the most popular search engine | |
| Google AdWords | https://adwords.google.com/ | The Google AdWords program places paid listings within Google's search results as well as some other sites that carry its listings. |
| Google Directory | http://directory.google.com/ | Human-powered from Open Directory and integrated with Google's sophisticated search technology. |
| HotBot | http://www.hotbot.com/ | Provides easy access to the web's three major crawler-based search engine: Yahoo, Google, and Teoma. Similar to a meta search engine, but shows results from different search engines on separate pages. |
| info.com | http://www.info.com/ | Meta search engine. Displays results from Google, Ask Jeeves, Yahoo, Altavista, Overture, Inktomi, LookSmart, Open Directory, and others. |
| LookSmart | http://www.looksmart.com/ | Primarily human-powered, supplemented by WiseNut crawler. |
| Lycos | http://www.lycos.com/ | Provides access to human-powered results from LookSmart and crawler-based results from Yahoo. |
| MetaCrawler | http://www.metacrawler.com/ | One of the oldest meta search services. Owned by InfoSpace. |
| MSN Search | http://search.msn.com/ | Uses its own crawler-based technology in addition to ite own human-powered directory results. |
| Netscape Search | http://search.netscape.com/ | Crawler-based, using Google for its main listings. Owned by AOL Time Warner. |
| Open Directory | http://dmoz.org/ | Web directory which uses volunteer editors to catalog the Web. Launched in June 1998. Now owned by AOL Time Warner. |
| Overture | www.overture.com | Oldest major paid placement search engine, distributing listings to a wide range of search engines. Purchased by Yahoo in October 2003. |
| Teoma | http://www.teoma.com/ | Crawler-based, with intelligent search refining, owned by AskJeeves since September 2001. |
| WebCrawler | http://www.webcrawler.com/ | Was one of the first crawler-based search engines. Acquired by InfoSpace in 2002 it is now a meta search engine. Offers fast and clean ad-free interface. |
| WiseNut | http://www.wisenut.com/ | Crawler-based, owned by LookSmart. |
| Yahoo | http://search.yahoo.com/ | Crawler-based search engine using Yahoo's own search technology. |
| Yahoo Directory | http://dir.yahoo.com/ | Human-powered (about 150 editors) web directory, in existence since late 1994. |
| Zeal | http://www.zeal.com/ | Human-powered directory owned by LookSmart. |
Three of the biggest indexed databases, which are often used as backup supplements by human-powered Web directories, are Google, Yahoo, and Teoma.
The goal of a large database, which tries to organize the documents of an evolving medium such as the World-Wide Web, is to have as much flexibility as possible, to list every relationship between items exactly once, to use the minimum amount of storage, and to achieve short access times for queries. To understand the ideas behind the relational database model, which is the one most commonly used for modern database management systems (DBMS), it is best to look at a small example. The following figure shows the cards in a conventional card file for an imaginary "Mini Zoo" with seven animals.
One disadvantage of such a card file that one can see quite quickly is the potential for inconsistency because the same information appears in more than one place. For example, if Joe changes his schedule from 10 am - 4 pm to 9 am - 3 pm, then all the cards of the animals for which Joe is the caretaker must be updated. What if only the Duck and Fish cards are updated, but the Frog card is forgotten? Someone might come later and only look at the Frog card and think that Joe still works from 10 am - 4 pm.
A straightforward way to convert the card file to a format that can be used for information storage and retrieval on a computer is to use a table with as many columns as there are properties for each animal and as many rows as there are animals. Since some animals can live on land or in water, two columns are needed for the Habitat properties. Similarly, three columns ar needed for the Color properties. This leads to the flat file database shown in the following figure.
Because the number of columns is fixed, this is potentially wasting a lot of space. Moreover, the rectangular table structure makes future changes and upgrades difficult, especially as the number of entries (rows) becomes large. Suppose, for example, that an exotic bird with 5 different colors is added to the Mini Zoo. Now the whole data base needs to be expanded by adding two more Color columns. Searching is also not particularly easy in a flat file database. Suppose you wanted to know how many slow animals that can live on land are in the Mini Zoo. You need to go sequentially through all records (rows in the table) to answer this question.
In a modern relational database each record of the flat file database is broken up into smaller pieces such that each fundamental relationship between an entity and its properties occurs once and only once, and such that searches can be done much more quickly than reading sequentially through all records. The following figure shows how the Mini Zoo data can be arranged using 5 tables, each with only 2 columns, in the form of a relational database.
Now the number of colors an animal can have is not restricted by the number of columns the table has, there will just simply be as many entries (rows) in the Color Table as the animal has different colors. Similarly, if one of the caretakers changes their schedule, that change needs to made in only one place, namely the Caretaker Schedule Table.
To prepare the database for efficient searching, e.g., for all animals that are slow and live on land, each table in the database must be preprocessed by sorting it with respect to the second column. The result of this for the Mini Zoo relational database is is shown in the next picture.
Now the fundamental relationships that are stored in the database can be used to logically derive more complicated relationships using the mathematical theory of sets and Boolean operators (AND, OR, NOT, see next section), in order to deliver the results for specific queries. For example, to answer the question which animals have Joe as caretaker and are slow, the set of all animals that have Joe as caretaker ({Duck,Fish,Frog}) and the set of all animals that are slow ({Duck,Frog,Ladybug}) are first determined. Then the intersection or "AND" operation of the two sets is computed, resulting in the answer {Duck,Frog}. If the question is which animals have Joe as caretaker or are slow, then the union or "OR" operation of the sets Joe and slow is computed, with the result {Duck,Fish,Frog,Ladybug}.
One of the key ideas behind the relational database model is the use of a well-defined mathematical foundation to formulate queries for the retrieval of documents from the database. Every document and all properties of documents are regarded as objects that can be grouped into sets. For example, in the Mini Zoo database the set of all animals that are grey forms a set, namely Grey = {Dolphin,Duck}, and the set of all animals that can live on land forms a set Land = {Duck,Frog,Horse,Ladybug,Lion}. Using these sets, one can easily answer the question "Which animals can live on land and are grey?" as Duck. Using Boolean operators (AND, OR, NOT), one obtains Grey AND Land = {Duck}. The AND operator is also called the intersection between two sets. Instead of AND one can also use OR and ask for all animals in the Mini Zoo that can live on land or are grey. The result is the set Grey OR Land ={Dolphin,Duck,Frog,Horse,Ladybug,Lion}. Note that Duck only appears once in the result, even though it appears in both the Grey and the Land sets. The OR operation of two or more sets is also called the union of the sets. The meaning of AND and OR is shown graphically in the form of Venn diagrams in the following figure using the two sets A and B.
There is a fundamental relationship between set addition, denoted by
+, and the operators AND and OR, which says that
A OR B = A + B - (A AND B).
Knowledge of this relationship will allow you, for instance, to test whether
a search engine interprets the AND and OR operations
properly. In terms of the Grey and Land sets defined
above, one would write
Grey + Land = {Dolphin,Duck,Duck,Frog,Horse,Ladybug,Lion}
(note Duck appearing twice)
Grey AND Land = {Duck}
==> Grey + Land - (Grey AND Land) = Grey OR Land = {Dolphin,Duck,Frog,Horse,Ladybug,Lion}
where Duck appears only once in the last set.
Similarly, looking only at the size of the sets,
denoted by |set|, one finds that
|Grey| = 2 , |Land| = 5
|Grey AND Land| = 1
==> |Grey| + |Land| - |Grey AND Land| = 2 + 5 - 1 = 6 = |Grey OR Land|
The NOT operator is useful for excluding objects from a set, or building the complement of a set. In the Mini Zoo, for example, one can ask for the set of all animals that are not fast, i.e., NOT Fast = {Duck,Frog,Ladybug}. Since in this case an animal can only be either fast or slow, NOT Fast = Slow and the set of fast animals is said to be the complement of the set of slow animals. This is shown graphically in the following figure.
One can now ask such (seemingly meaningless)
questions as "What
is the set of all animals that are fast AND NOT fast?" or
"What is the set of all animals that are fast OR NOT fast?"
The answers are quite clearly:
Fast AND (NOT Fast) = Fast AND Slow = {} (empty set, no animal
can be fast and slow simultaneously)
Fast OR (NOT Fast) = Fast OR Slow = Everything (each animal is
something, either fast or slow).
Statements like these can be used to analyze the behavior and certain
properties, like the total size, of search engines.
Below are the tables of the relational database of another "Mini Zoo" with the following 12 animals:
Bear, Beaver, Crocodile, Dragonfly, Eagle, Elephant, Giraffe, Mouse, Octopus, Parrot, Rabbit, Slamander
Note that the entries in the tables are sorted alphabetically with respect to the first column.
+--------------------+ +--------------------------+ +--------------------+
| Animal Speed Table | | Caretaker Schedule Table | | Animal Color Table |
|--------------------| |--------------------------| |--------------------|
| Bear Slow | | Anne 6 am - 8 am | | Bear Black |
| Beaver Fast | | Bill 8 am - 9 am | | Bear Brown |
| Crocodile Slow | | Bill 5 pm - 6 pm | | Beaver Brown |
| Dragonfly Slow | | Nancy 7 am - 9 am | | Crocodile Green |
| Eagle Fast | | Nancy 3 pm - 5 pm | | Dragonfly Blue |
| Elephant Slow | | Paul 1 pm - 3 pm | | Dragonfly Orange |
| Giraffe Fast | | Rob 10 am - 3 pm | | Dragonfly Yellow |
| Mouse Fast | +--------------------------+ | Eagle Black |
| Octopus Slow | | Eagle White |
| Parrot Slow | | Elephant Grey |
| Rabbit Slow | | Giraffe Brown |
| Salamander Fast | | Giraffe Yellow |
+--------------------+ +----------------------+ | Mouse Grey |
| Animal Habitat Table | | Mouse White |
+------------------------+ |----------------------| | Octopus Black |
| Animal Caretaker Table | | Bear Land | | Octopus Brown |
|------------------------| | Beaver Land | | Octopus Orange |
| Bear Rob | | Beaver Water | | Octopus Red |
| Beaver Anne | | Crocodile Land | | Octopus Yellow |
| Beaver Paul | | Crocodile Water | | Parrot Black |
| Crocodile Anne | | Dragonfly Air | | Parrot Blue |
| Crocodile Paul | | Dragonfly Land | | Parrot Green |
| Dragonfly Nancy | | Eagle Air | | Parrot Red |
| Eagle Nancy | | Eagle Land | | Parrot White |
| Elephant Rob | | Elephant Land | | Parrot Yellow |
| Giraffe Rob | | Giraffe Land | | Rabbit Black |
| Mouse Bill | | Mouse Land | | Rabbit Brown |
| Octopus Anne | | Octopus Water | | Rabbit White |
| Octopus Paul | | Parrot Air | | Salamander Black |
| Parrot Nancy | | Parrot Land | | Salamander Yellow |
| Rabbit Bill | | Rabbit Land | +--------------------+
| Salamander Anne | | Salamander Land |
| Salamander Paul | | Salamander Water |
+------------------------+ +----------------------+
Practice your understanding of databases by answering the following questions:
An important question to ask when using a search engine is how the user interface works and how a query that is entered is interpreted by the search engine. Some particular features of interest are:
There are more properties that one could ask for, but in this lab only the ones listed above are considered.
Some questions about the features of search engines can be answered by reading help and explanation files provided by the operators of the search engine. However, for reasons that depend on marketing, competition pride, neglect and other not so easily quantifiable driving forces, this information may be incomplete, outdated, or just plain misleading. Another option is to use selected queries and simple statistical analysis methods to determine the actual features of a search engine. Here are some methods that work quite well for the features mentioned above:
| Keyword | Occurrence in Web Pages |
|---|---|
| health | 6.7% |
| science | 4.1% |
| digital | 4.3% |
| entertainment | 4.3% |
| medicine | 1.4% |
To obtain credit for this homework/computer lab you need to answer the
questions stated below. Send your solution to
.
Rules for the Submission of Homework/Computerlab Solutions
Format: E-mail your solution as a plain text (ASCII) file. Do not use word processor files like Microsoft Word or "rich-text" HTML files.
Corrections: If you need to make corrections after you submitted your solution, resubmit all your answers (not only the ones that changed) since only your last submission of each homework/computer lab will be graded.
Teamwork: Teamwork is fine for the homework/computer labs, but the solutions must be turned in individually. In particular, copy and paste of entire solutions from other students is not acceptable.
Questions:
| http://www.google.com | |
| Ask | http://search.ask.com |
| Yahoo | http://search.yahoo.com |
©1996-2006, P. Mathys. Last revised: 9-14-06, PM.