Search Engines
Search engines
use software robots to survey the Web and build their
databases. Web documents are retrieved and indexed.
When you enter a query at a search engine website, your
input is checked against the search engine's keyword
indices. The best matches are then returned to you as
hits.
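The retrieve, index, and query cycle described above can be sketched in a few lines. This is only a minimal illustration, not how any particular search engine is built: the page names, the PAGES sample data, and the helper names build_index and search are all made up for the example, and the "ranking" simply counts how many query words a page contains.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for pages a robot has retrieved.
PAGES = {
    "page1.html": "heart disease and diet",
    "page2.html": "heart shaped candy for valentine day",
    "page3.html": "diet tips and recipes",
}

def build_index(pages):
    """Map each lowercase word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return pages ordered by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

INDEX = build_index(PAGES)
```

Calling search(INDEX, "heart diet") lists page1.html first, because it is the only page containing both words.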
There are two primary methods
of text searching--keyword and concept.
Keyword Searching
This is the most common form of
text search on the Web. Most search engines do their
text query and retrieval using keywords.
Unless the author of the Web document
specifies the keywords for her document (this is possible
by using meta tags in the latest version of HTML), it's
up to the search engine to determine them. Essentially,
this means that search engines pull out and index words
that are believed to be significant. Words that are
mentioned towards the top of a document and words that
are repeated several times throughout the document are
more likely to be deemed important.
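As a rough sketch of that idea, the function below scores words by how often they occur and adds a bonus to words that show up near the top of the document. The function name significant_words and the lead_words and lead_bonus parameters are assumptions chosen for illustration; real engines use their own, unpublished weightings.

```python
from collections import Counter

def significant_words(text, top_n=3, lead_words=10, lead_bonus=2):
    """Score words by count, plus a bonus for appearing near the top."""
    words = text.lower().split()
    scores = Counter(words)
    # Words among the first `lead_words` words get an extra boost.
    for word in set(words[:lead_words]):
        scores[word] += lead_bonus
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

For instance, in a text where "diabetes" is both repeated and mentioned first, it comes out ahead of words that appear only once further down.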
Some sites index every word on every
page. Others index only part of the document. For example,
Lycos indexes the title, headings, subheadings and the
hyperlinks to other sites, along with the first 20 lines
of text.
Full-text indexing systems generally
pick up every word in the text except commonly occurring
stop words such as "a," "an," "the,"
"is," "and," "or," and
"www." AltaVista claims to index all words,
even the articles, "a," "an," and
"the." Some of the search engines discriminate
upper case from lower case; others store all words without
reference to capitalization.
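The two choices described above, dropping stop words and ignoring capitalization, can be sketched as switches on a small tokenizer. The stop word list is the one quoted in the text; the function name index_terms and its two flags are hypothetical, invented only for this example.

```python
# Stop words taken from the examples given in the text above.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, fold_case=True, drop_stop_words=True):
    """Return the words an engine might store for a piece of text."""
    terms = []
    for raw in text.split():
        word = raw.strip('.,;:!?"()')  # trim surrounding punctuation
        if not word:
            continue
        if drop_stop_words and word.lower() in STOP_WORDS:
            continue
        terms.append(word.lower() if fold_case else word)
    return terms
```

With fold_case=False the sketch keeps "Oracle" and "oracle" as different terms; with drop_stop_words=False it behaves like an engine that indexes every word, articles included.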
The Problem With Keyword
Searching
Keyword searches have a tough time
distinguishing between words that are spelled the same
way, but mean something different (e.g., hard cider,
a hard stone, a hard exam, and the hard drive on your
computer). This often results in hits that are completely
irrelevant to your query. Some search engines also have
trouble with so-called stemming--that is, if you enter
the word "big," should they return a hit on
the word "bigger"? What about singular and
plural words? What about verb tenses that differ from
the word you entered by only an "s," or an
"ed"?
Search engines also cannot return
hits on keywords that mean the same, but are not actually
entered in your query. A query on heart disease would
not return a document that used the word "cardiac"
instead of "heart."
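To make the stemming question concrete, here is a deliberately crude suffix stripper. It is a sketch only, not the algorithm of any search engine or of a published stemmer: the function name naive_stem and its suffix list are assumptions, and it will mangle plenty of real words. It does nothing for the synonym problem, since "cardiac" and "heart" share no stem.

```python
def naive_stem(word):
    """Strip a few common English suffixes; intentionally simplistic."""
    word = word.lower()
    for suffix in ("est", "er", "ing", "ed", "es", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "bigger" -> "bigg" -> "big".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word
```

An engine that stored naive_stem(word) instead of the word itself would treat "big" and "bigger", or "search" and "searched", as the same keyword.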
Concept-Based Searching
Unlike keyword search systems, concept-based
search systems try to determine what you mean, not just
what you say. In the best circumstances, a concept-based
search returns hits on documents that are "about"
the subject/theme you're exploring, even if the words
in the document don't precisely match the words you
enter into the query.
Excite is currently the best-known
general-purpose search engine site on the Web that relies
on concept-based searching.
This is also known as clustering
-- which essentially means that words are examined in
relation to other words found nearby.
How does it work? There are various
methods of building clustering systems, some of which
are highly complex, relying on sophisticated linguistic
and artificial intelligence theory that we won't even
attempt to go into here. Excite sticks to a numerical
approach. Excite's software determines meaning by calculating
the frequency with which certain important words appear.
When several words or phrases that are tagged to signal
a particular concept appear close to each other in a
text, the search engine concludes, by statistical analysis,
that the piece is "about" a certain subject.
For example, the word heart, when
used in the medical/health context, would be likely
to appear with such words as coronary, artery, lung,
stroke, cholesterol, pump, blood, attack, and arteriosclerosis.
If the word heart appears in a document with other
words such as flowers, candy, love, passion, and valentine,
a very different context is established, and the search
engine returns hits on the subject of romance.
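The heart example can be sketched as a simple count of signal words found near the keyword. This is a toy illustration and makes no claim to be Excite's actual method: the two word lists are the ones given in the text, while the concept labels, the function name guess_concept, and the window size are assumptions.

```python
# Signal words from the example above; the concept labels are made up.
CONCEPT_WORDS = {
    "medicine": {"coronary", "artery", "lung", "stroke", "cholesterol",
                 "pump", "blood", "attack", "arteriosclerosis"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def guess_concept(text, keyword="heart", window=5):
    """Pick the concept whose signal words most often appear near keyword."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    counts = {concept: 0 for concept in CONCEPT_WORDS}
    for i, word in enumerate(words):
        if word != keyword:
            continue
        # Look at the words within `window` positions on either side.
        nearby = words[max(0, i - window): i + window + 1]
        for concept, signals in CONCEPT_WORDS.items():
            counts[concept] += sum(1 for w in nearby if w in signals)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

When no signal word appears near the keyword, the sketch returns None rather than guessing.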
Warning: This often works better
in theory than in practice. Concept-based indexing is
a good idea, but it's far from perfect. The results
are best when you enter a lot of words, all of which
roughly refer to the concept you're seeking information
about.
Refining Your Search
Most sites offer two different types
of searches--"basic" and "refined."
In a "basic" search, you just enter a keyword
without sifting through any pulldown menus of additional
options. Depending on the engine, though, "basic"
searches can be quite complex.
Search refining options differ from
one search engine to another, but some of the possibilities
include the ability to search on more than one word,
to give more weight to one search term than you give
to another, and to exclude words that might be likely
to muddy the results. You might also be able to search
on proper names, on phrases, and on words that are found
within a certain proximity to other search terms.
Some search engines also allow you
to specify what form you'd like your results to appear
in, and whether you wish to restrict your search to
certain fields on the Internet (e.g., Usenet or the
Web) or to specific parts of Web documents (e.g., the
title or URL).
Many, but not all, search engines
allow you to use so-called Boolean operators to refine
your search. These are the logical terms AND, OR, NOT,
and the so-called proximal locators, NEAR and FOLLOWED
BY.
Boolean AND means that all the terms
you specify must appear in the documents, e.g., "heart"
AND "attack." You might use this if you wanted
to exclude common hits that would be irrelevant to your
query.
Boolean OR means that at least one
of the terms you specify must appear in the documents,
e.g., bronchitis, acute OR chronic. You might use this
if you didn't want to rule out too much.
Boolean NOT means that the term
following NOT must not appear in the
documents. You might use this if you anticipated results
that would be totally off-base, e.g., nirvana AND Buddhism,
NOT Cobain.
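One way to picture AND, OR, and NOT is as set operations on the documents that contain each term. The sketch below assumes a tiny made-up collection (DOCS) and a hypothetical helper docs_with; it shows the logic only, not any engine's query syntax.

```python
# Hypothetical three-document collection, keyed by document number.
DOCS = {
    1: "nirvana is a central goal in buddhism",
    2: "kurt cobain sang for nirvana",
    3: "buddhism teaches the middle way",
}

def docs_with(term):
    """Return the set of document numbers whose text contains term."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}

# AND is set intersection, OR is set union, NOT is set difference.
and_hits = docs_with("nirvana") & docs_with("buddhism")
or_hits = docs_with("nirvana") | docs_with("buddhism")
not_hits = (docs_with("nirvana") & docs_with("buddhism")) - docs_with("cobain")
```

Here and_hits contains only document 1, or_hits contains all three documents, and not_hits again contains only document 1.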
Not quite Boolean (+ and -): Some search
engines use the characters + and - instead of Boolean
operators to include and exclude terms.
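A sketch of how such a query might be interpreted follows. The exact rules differ from engine to engine, so treat the behavior here as an assumption: terms prefixed with + are required, terms prefixed with - are excluded, and unmarked terms are optional. The function names parse_plus_minus and matches are hypothetical.

```python
def parse_plus_minus(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, excluded, optional

def matches(text, query):
    """True if text satisfies the + and - constraints of query."""
    words = set(text.lower().split())
    required, excluded, optional = parse_plus_minus(query)
    if any(term not in words for term in required):
        return False
    if any(term in words for term in excluded):
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or any(term in words for term in optional)
```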
NEAR means that the terms you enter
should be within a certain number of words of each other.
FOLLOWED BY means that one term must directly follow
the other. ADJ, for adjacent, serves the same function.
A search engine that will allow you to search on phrases
uses, essentially, the same method (i.e., determining
adjacency of keywords).
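Proximity and adjacency tests both come down to comparing word positions. The sketch below assumes a default NEAR distance of 10 words, which is an arbitrary choice for illustration, and the function names are hypothetical.

```python
def positions(words, term):
    """Return every index at which term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]

def near(text, a, b, distance=10):
    """True if a and b occur within `distance` words of each other."""
    words = text.lower().split()
    return any(abs(i - j) <= distance
               for i in positions(words, a)
               for j in positions(words, b))

def followed_by(text, a, b):
    """True if b occurs immediately after a."""
    words = text.lower().split()
    return any(j - i == 1
               for i in positions(words, a)
               for j in positions(words, b))

def contains_phrase(text, phrase):
    """True if the words of phrase appear consecutively, in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))
```

A phrase search is then just a chain of adjacency checks, which is why contains_phrase looks for the whole run of words in order.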
Phrases: The ability to query on
phrases is very important in a search engine. Those
that allow it usually require that you enclose the phrase
in quotation marks, e.g., "space the final frontier."
Capitalization: This is essential
for searching on proper names of people, companies or
products. Unfortunately, many words in English are used
both as proper and common nouns--Bill, bill, Gates,
gates, Oracle, oracle, Lotus, lotus, Digital, digital--the
list is endless.
All the search engines have different
methods of refining queries. The best way to learn them
is to read the help files on the search engine sites
and practice!
Most of the search engines return
results with confidence or relevancy rankings. In other
words, they list the hits according to how closely they
think the results match the query. However, these lists
often leave users shaking their heads in confusion,
since, to the user, the results often seem completely
irrelevant.
Why does this happen? Basically
it's because search engine technology has not yet reached
the point where humans and computers understand each
other well enough to communicate clearly.
Most search engines use search term
frequency as a primary way of determining whether a
document is relevant. If you're researching diabetes
and the word "diabetes" appears multiple times
in a Web document, it's reasonable to assume that the
document will contain useful information. Therefore,
a document that repeats the word "diabetes"
over and over is likely to turn up near the top of your
list.
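Ranking purely by term frequency can be sketched as counting occurrences and sorting. The function name rank_by_frequency and the sample documents used to try it are assumptions for illustration.

```python
def rank_by_frequency(docs, term):
    """Order documents by how many times term appears in each."""
    term = term.lower()
    counts = {name: text.lower().split().count(term)
              for name, text in docs.items()}
    hits = [name for name, n in counts.items() if n > 0]
    # Most occurrences first; ties are broken by name for a stable order.
    return sorted(hits, key=lambda name: (-counts[name], name))
```

Note that a page which merely repeats the word outranks a page that mentions it once in a useful sentence, which is exactly the weakness of this approach.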
If your keyword is a common one,
or if it has multiple other meanings, you could end
up with a lot of irrelevant hits. And if your keyword
is a subject about which you desire information, you
don't need to see it repeated over and over--it's the
information about that word that you're interested in,
not the word itself.
Some search engines consider both
the frequency and the positioning of keywords to determine
relevancy, reasoning that if the keywords appear early
in the document, or in the headers, this increases the
likelihood that the document is on target. For example,
Lycos ranks hits according to how many times your keywords
appear in their indices of the document and in which
fields they appear (i.e., in headers, titles or text).
It also takes into consideration whether the documents
that emerge as hits are frequently linked to by other documents
on the Web, reasoning that if other folks consider them
important, you should, too.
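Combining frequency, field position, and link popularity might look something like the sketch below. The weights, the link_bonus value, and the document layout (a dict with title, headers, text, and inbound_links) are all assumptions made for the example; this is not Lycos's actual formula.

```python
# Assumed weights: a keyword in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "text": 1.0}

def score(doc, keywords, link_bonus=0.5):
    """Weight keyword counts by field, then add a bonus per inbound link."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(k.lower()) for k in keywords)
    if total == 0:
        return 0.0  # links alone never make an unrelated page a hit
    return total + link_bonus * doc.get("inbound_links", 0)
```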
If you use the advanced query form
on AltaVista, you can assign relevance weights to your
query terms before conducting a search. Although this
takes some practice, it essentially allows you to have
a stronger say in what results you will get back.
As far as the user is concerned,
relevancy ranking is critical, and becomes more so as
the sheer volume of information on the Web grows. Most
of us don't have the time to sift through scores of
hits to determine which hyperlinks we should actually
explore. The more clearly relevant the results are,
the more we're likely to value the search engine.
Information On Meta Tags
Some search engines are now indexing Web documents
by the meta tags in the documents' HTML (at the beginning
of the document in the so-called "head" tag).
What this means is that the Web page author can have
some influence over which keywords are used to index
the document, and even in the description of the document
that appears when it comes up as a search engine hit.
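To illustrate what an indexer can read from the head of a page, the sketch below pulls out the title and the keywords and description meta tags using Python's standard html.parser module. The sample page, its tag contents, and the class name HeadMetaParser are invented for the example; how a given search engine actually treats these tags varies.

```python
from html.parser import HTMLParser

class HeadMetaParser(HTMLParser):
    """Collect the title plus the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            name = attrs["name"].lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page used only to exercise the parser.
SAMPLE_PAGE = """<html><head>
<title>How Search Engines Work</title>
<meta name="description" content="A plain-language guide to search engines.">
<meta name="keywords" content="search engines, keyword searching, meta tags">
</head><body><h1>Search Engines</h1></body></html>"""

parser = HeadMetaParser()
parser.feed(SAMPLE_PAGE)
keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]
```

After parsing, parser.title holds the page title, parser.meta["description"] holds the summary text, and keywords is the list of author-supplied keywords.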
This is obviously very important
if you are trying to draw people to your website based
on how your site ranks in search engine hit lists.
There is no perfect way to ensure
that you'll receive a high ranking. Even if you do get
a great ranking, there's no assurance that you'll keep
it for long. For example, in April 1999 one of our
Spider's Apprentice pages was the number-one ranked hit
on AltaVista for the phrase "how search engines
work." A few months later, however, it had dropped
down in the listings.
There is a lot of conflicting information
out there on meta-tagging. If you're confused, it may
be because different search engines look at meta tags
in different ways. Some rely heavily on meta tags, others
don't use them at all. The general opinion in early
2002 is that meta tags are less useful than they were
a few years ago, largely because of the high rate of
spamdexing (web authors using false and misleading keywords
in the meta tags).
It seems to be generally agreed
that the "title" and the "description"
meta tags are important to write effectively, since
several major search engines use them in their indices.
Use relevant keywords in your title, and vary the titles
on the different pages that make up your website, in
order to target as many keywords as possible. As for
the "description" meta tag, some search engines
will use it as their short summary of your URL, so make
sure your description is one that will entice surfers
to your site.
In the keyword tag, list a few keywords,
synonyms for keywords, or foreign translations of keywords
(if you anticipate traffic from foreign surfers). Make
sure the keywords refer to, or are directly related
to, the subject or material on the page. Do NOT use
false or misleading keywords in an attempt to gain a
higher ranking for your pages.
The "keyword" meta tag
has been abused by some webmasters. For example, a recent
ploy has been to put such words as "sex" or "mp3"
into keyword meta tags, in hopes of luring searchers
to one's website by using popular keywords.
The search engines are aware of
such deceptive tactics, and have devised various methods
to circumvent them, so be careful. Use keywords that
are appropriate to your subject, and make sure they
appear in the top paragraphs of actual text on your
webpage. Many search engine algorithms score the words
that appear towards the top of your document more highly
than the words that appear towards the bottom. Words
that appear in HTML header tags (H1, H2, H3, etc.) are
also given more weight by some search engines. It sometimes
helps to give your page a file name that makes use of
one of your prime keywords, and to include keywords
in the "alt" attributes of image tags.
One thing you should not do is use
some other company's trademarks in your meta tags. Some
website owners have been sued for trademark violations
because they've used other company names in the meta
tags. I have, in fact, testified as an expert witness
in such cases. Believe me, you do not want the expense
of being sued!
Remember that all the major search
engines have slightly different policies. If you're
designing a website and meta-tagging your documents,
we recommend that you take the time to check out what
the major search engines say in their help files about
how they each use meta tags. You might want to optimize
your meta tags for the search engines you believe are
sending the most traffic to your site.
Copied from the Spider's Apprentice