
Web Publishing/Search Best Practices
When working with the Google Search Appliance, use these tips
and guidelines to improve the search experience for users trying to find
your content.
Make web pages for users, not for search engines
Create a useful, information-rich site. Write pages that clearly and
accurately describe your content. Don't load pages with irrelevant
words. Think about the words users would type to find your pages, and
make sure that your site actually includes those words.
Focus on text
Focus on the text on your site. Make sure that your TITLE tags and ALT
attributes are descriptive and accurate. Since the Google crawler doesn't
recognize text contained in images, avoid using graphical text and
instead place the information in the ALT text of the image and in the
anchor text of any link that uses it. When linking to non-HTML
documents, use descriptive anchor text that tells users what the linked
document contains.
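One way to check a page for images without descriptive text is sketched
below. It is a minimal illustration using Python's standard html.parser
module, not a tool supplied with the Google Search Appliance; the class
name AltTextAuditor and the function images_missing_alt are hypothetical
names chosen for this example.

```python
from html.parser import HTMLParser


class AltTextAuditor(HTMLParser):
    """Collects the src of every <img> whose alt attribute is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # A valueless attribute is reported as None, so fall back to "".
        alt = attributes.get("alt") or ""
        if not alt.strip():
            self.missing_alt.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images that have no usable alt text."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    return auditor.missing_alt
```

Running images_missing_alt over the HTML of each page gives a list of
images to review. It only detects missing or empty text; whether the
text is actually descriptive still needs a human reader.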
Make your site easy to navigate
Make a site with a clear hierarchy of
hypertext links. Every page should be reachable from at least one hypertext
link. Offer a site map to your
users with hypertext links that point to the important parts of your
site. Keep the links on a given page to a reasonable number (fewer than
100).
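The link-count guideline above can be checked with a short script. The
following is a sketch under stated assumptions: it counts only A tags
that carry an href attribute, and the names LinkCounter, MAX_LINKS, and
has_reasonable_link_count are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser

# Threshold taken from the guideline above (fewer than 100 links per page).
MAX_LINKS = 100


class LinkCounter(HTMLParser):
    """Records the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def has_reasonable_link_count(html, limit=MAX_LINKS):
    """Return True when the page contains fewer than `limit` links."""
    counter = LinkCounter()
    counter.feed(html)
    return len(counter.hrefs) < limit
```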
Ensure that your site is linked
Ensure that your site is linked from all relevant sites within your
network. Interlinking between sites and within sites gives the Google
crawler more ways to find content and improves the quality of the
search results.
Make sure that the Google crawler can read your content
Validate all
HTML content to ensure that the HTML is well-formed. Use a text browser
such as Lynx to examine your site, because most search
engine spiders see your site much as Lynx would. If extra features such
as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you
from seeing all of your site in a text browser, then search engine crawlers
may have trouble crawling your site.
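If a text browser is not at hand, a rough approximation of the text-only
view can be produced with a few lines of Python. This is only a sketch:
it is neither Lynx nor the Google crawler, it simply drops the contents
of SCRIPT and STYLE elements and keeps the remaining text. The names
TextOnlyView and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Keeps page text while ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep non-blank text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a script-unaware reader would see on the page."""
    view = TextOnlyView()
    view.feed(html)
    view.close()
    return " ".join(view.chunks)
```

If important content or navigation is missing from the result, it is
probably generated by script and may be invisible to crawlers as well.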
Allow search bots to crawl your sites without session IDs or arguments
that track their path through the site. These techniques are useful for
tracking individual user behavior, but the access pattern of bots is
entirely different. Using these techniques may result in multiple copies
of the same document being indexed for your site, as crawl robots will
see each unique URL (including session ID) as a unique document.
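The duplicate-document problem can be illustrated with a short sketch
that removes session parameters from URLs before comparing them. The
parameter names in SESSION_PARAMS are assumptions chosen for this
example; substitute whatever names your own application uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust to match your site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Return the URL with session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same page look like two documents to a crawler...
visits = [
    "http://example.edu/page.html?id=5&sid=abc123",
    "http://example.edu/page.html?id=5&sid=xyz789",
]
# ...but collapse to a single URL once the session ID is removed.
unique_pages = {strip_session_params(url) for url in visits}
```

Here visits holds two distinct URLs while unique_pages holds only one,
which is the page a crawler should ideally see.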
Ensure that your site's internal link structure provides a hypertext
link path to all of your pages. The Google search engine follows hypertext
links from one page to the next, so pages that are not linked to by others
may be missed. Additionally, you should consult the administrator
of your Google Search Appliance to ensure that your site's home page is
accessible to the search engine.
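To see why unlinked pages are missed, consider the sketch below. It
models a site as a dictionary that maps each page to the pages it links
to, then follows links from a starting page in the way a crawler would.
The function name unreachable_pages and the sample site map are
hypothetical, and a real crawler is considerably more involved.

```python
from collections import deque


def unreachable_pages(links, start):
    """Return pages that cannot be reached by following links from `start`.

    `links` maps each page to the list of pages it links to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Every page mentioned anywhere, either as a source or as a target.
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_pages - seen)


# Hypothetical site: /orphan links out, but nothing links to it.
site = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": [],
    "/orphan": ["/"],
}
orphans = unreachable_pages(site, "/")
```

In this example orphans contains only "/orphan": the page exists and
links to the home page, but no other page links to it, so a crawl that
starts at the home page never finds it.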
Understand why some documents may be missing from the index
Each time that the Google Search Appliance updates its database of
web pages, the documents in the index can change. Here are a few examples
of reasons why pages may not appear in the index.
- Your content pages may have been intentionally blocked by a robots.txt
file or ROBOTS meta tags.
- Your web site was inaccessible when the crawl robot attempted to
access it, because of a network or server outage. If this happens, the
Google Search Appliance will retry multiple times; but if the site
cannot be crawled, it will not be included in the index.
- The Google crawl robot cannot find a path of links to your site from
the starting points it was given.
- Your content pages may not be considered relevant to the query you
entered. Ensure that the query terms exist on your target page.
- Your content pages contain invalid HTML code.
- Your content pages were manually removed from the index by the
Google Search Appliance administrator.
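The first cause in the list above, a robots.txt rule, can be checked
with Python's standard urllib.robotparser module. The sketch below is
illustrative only: the function name is_blocked is hypothetical, and
"example-crawler" is a placeholder user agent. Ask your Google Search
Appliance administrator which user agent name the crawler actually
sends.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url):
    """Return True when the given robots.txt text disallows `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Sample rules: every crawler is kept out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

blocked = is_blocked(
    robots_txt, "example-crawler", "http://example.edu/private/report.html"
)
allowed = not is_blocked(
    robots_txt, "example-crawler", "http://example.edu/public/index.html"
)
```

This covers robots.txt only. ROBOTS meta tags inside individual pages
have to be checked separately.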
Don't use frames!
In addition to being against UWM web policy, frames tend to cause
problems with search engines, bookmarks, e-mail links and so on,
because frames don't fit the conceptual model of the web (where every
document corresponds to a single URL).
Avoid placing content and links in script code
Most search engines do not read any information found in SCRIPT tags
within an HTML document. This means that content within script code
will not be indexed, and hypertext links within script code will not be
followed when crawling. When using a scripting language, make sure that
your content and links are outside SCRIPT tags.
The content above was provided by Google.