“Noarchive” Meta Tag To Disable Search Engine Caching

Filed Under (Search Engine Crawling) by SEOmaster on January 01, 2008


The “noarchive” meta tag is used when you want to prevent search engines from caching your pages, or to have existing cached copies removed. It is known to work with all major search engines, including Google, Yahoo, MSN and Ask.com. The tag does NOT affect your search engine rankings or indexing; it only determines whether search engines keep a cached copy of the crawled content.

The “noarchive” meta tag is useful when a website publisher charges a fee for access to its content and wants to prevent content theft, but would still like the content to be indexed and ranked by search engines. It is also useful when content changes frequently and it is undesirable for search engines to keep stale copies available to human visitors. The tag is sometimes exploited by blackhat SEOs to hide their cloaking techniques.

Sites that currently use the “noarchive” meta tag to protect against search engine caching include:

  • http://www.webmasterworld.com
  • http://www.nytimes.com

These sites are indexed in Google but not cached because they carry the “noarchive” meta tag for all robots.

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
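
To check whether a page actually carries the tag, something like the following Python sketch could be used. It only inspects HTML you pass in as a string; the class name RobotsMetaParser and the function has_noarchive are made up for this illustration and are not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from robots-style meta tags (hypothetical helper)."""

    def __init__(self, bot_names=("robots", "googlebot")):
        super().__init__()
        # Meta names to inspect; compared case-insensitively.
        self.bot_names = {name.lower() for name in bot_names}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        if attr_map.get("name", "").lower() not in self.bot_names:
            return
        # CONTENT may hold several comma-separated directives.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noarchive(html_text):
    """Return True if any inspected meta tag contains the noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" in parser.directives


sample_page = '<html><head><META NAME="ROBOTS" CONTENT="NOARCHIVE"></head></html>'
print(has_noarchive(sample_page))
```

The check is case-insensitive, so the upper-case form shown above and a lower-case form are treated the same way.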

What Determines Search Engine’s Crawl Rate?

Filed Under (Search Engine Crawling) by SEOmaster on December 27, 2007


The web is an ever-growing and dynamically changing world. Given its gigantic scale and dynamic nature, maintaining an up-to-date search engine index for the entire web is a nearly insurmountable task. Therefore, one of the most important tasks of a search engine is to determine which parts of the web are important and thus worth crawling more often, and to reflect updated content in its index in a timely manner. From a webmaster’s point of view, our mission is then to become a search engine’s favorite and to make search engine robots visit our site more often than others. Here is a (non-exhaustive) list of factors that can affect a search engine’s crawl rate.

1. Relevant and Authoritative Backlinks

It’s a well-known fact that backlinks help major search engines’ crawlers find your site and can give your site greater visibility in their search results. Links from relevant content and authoritative sources in particular are treated as a more powerful vote by search engines, and are therefore more likely to bring search engine robots to your website. Submitting your site to reputable, well-categorized web directories or major social networking sites helps expose it to crawlers.

2. Content Update and Pinging

Regular and frequent content updates are another important factor that attracts search engine robots. For example, the purpose of Google’s fresh crawl is to detect content updates and reflect the changes in the search results promptly.

If your site is a blog, you can try existing pinging services such as pingomatic.com or Google’s Blog Search pinging service to proactively inform search engine robots of new posts and content changes.
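
As a rough illustration, blog ping services are commonly called over XML-RPC with a weblogUpdates.ping method that takes the blog name and URL. The Python sketch below builds such a request with the standard xmlrpc.client module. The endpoint is left as a parameter because the exact URL of any given service is an assumption you should verify with that service, and the blog name and example.com address are placeholders.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(endpoint, blog_name, blog_url):
    """Send the ping to an XML-RPC endpoint (needs network access).

    The endpoint URL is supplied by the caller; check the ping service's
    own documentation for the correct address.
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


# Build (but do not send) a request for a placeholder blog.
payload = build_ping_request("Example Blog", "http://www.example.com/")
print(payload)
```

Most blogging platforms send these pings automatically after each post, so a manual call like send_ping is usually only needed for hand-built sites.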

3. Internal Link Structure

Another factor that affects a search engine’s crawl rate is how a page is linked from other pages within the same domain. Search engines determine the relative importance of a page based on the site’s overall internal link structure. Pages that are heavily linked internally (e.g., pages linked site-wide) are considered important by search engines, and therefore receive more frequent visits from spiders.

4. Sitemap and Robots.txt

Creating a search engine sitemap helps your site get indexed more deeply and more frequently. With Google, you can create an XML- or TXT-formatted sitemap and submit it through your Google webmaster tools account. A typical sitemap contains a list of URLs for the crawler to retrieve. If the sitemap is formatted in XML, you can specify extra information for crawlers, such as how frequently the content changes, the last modification date, or the relative importance of a page.
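
A minimal Python sketch of generating an XML sitemap is shown below. It uses the element names from the sitemaps.org protocol (loc, lastmod, changefreq, priority); the function name build_sitemap, the example.com URLs, and the dates are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from a list of dicts (hypothetical helper).

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional and written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        for field in ("lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages for illustration only.
example_pages = [
    {"loc": "http://www.example.com/", "lastmod": "2008-01-01",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/about", "changefreq": "monthly"},
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

The optional fields are hints rather than commands; a crawler may or may not honor them.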

While a sitemap tells crawlers which pages to retrieve, robots.txt does the opposite: it keeps spiders from retrieving all or part of a website that is otherwise publicly accessible to human visitors. As webmasters become more SEO-savvy, they use robots.txt more actively (e.g., to eliminate duplicate content). At the same time, this increases the chance that they will fumble robots.txt and unwittingly block search engine spiders. To prevent costly mistakes, keep up with the robots.txt syntax recommended by major search engines such as Google and Yahoo, and watch Google’s crawl error reports.
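
One way to catch an accidental block before a search engine does is to test your rules against a list of URLs you expect to be crawlable. The Python sketch below does this with the standard urllib.robotparser module. The rules, the URLs, and the function name blocked_urls are invented for the example, and the standard-library parser may not interpret every syntax extension the way a given search engine does, so treat the result as a first check only.

```python
from urllib.robotparser import RobotFileParser

# Example rules for illustration; substitute your own robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /tmp/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules forbid for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Placeholder URLs that should (or should not) be crawlable.
urls_to_check = [
    "http://www.example.com/",
    "http://www.example.com/print/page1",
    "http://www.example.com/articles/seo",
]
blocked = blocked_urls(ROBOTS_TXT, urls_to_check)
print(blocked)
```

If a page you want indexed shows up in the returned list, the rule that matches it is the one to revisit.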

5. Server Speed

To avoid interfering with search engine crawling, the web server that hosts your site should respond to requests within a reasonable time. Fast response times give visitors a good browsing experience, and the same logic applies to search engine robots. Given that a search engine’s primary role is to provide users with a good search experience, hosting your website on a fast web server helps your site get indexed faster and updated more frequently.

6. Set Crawl Rate Feature in Google Webmaster Tools

In your Google webmaster tools account, you can choose among three crawl speeds for your website: faster, normal, or slower. The set crawl rate option is available only for a top-level domain or a sub-domain, not for individual pages or folders. A requested crawl rate needs to be renewed every 90 days. However, it’s reported that this feature does not guarantee an immediate effect on Google’s crawl rate.