The web is an ever-growing, constantly changing world. Given its gigantic scale and dynamic nature, maintaining an up-to-date search engine index of the entire web is a nearly insurmountable task. One of a search engine’s most important jobs is therefore to determine which parts of the web are important enough to crawl more often, and to reflect updated content in its index in a timely manner. From a webmaster’s point of view, our mission is to become a search engine’s favorite, so that its robots visit our site more often than others. Here is a (non-exhaustive) list of factors that can affect a search engine’s crawl rate.
1. Relevant and Authoritative Backlinks
It’s a well-known fact that backlinks help major search engines’ crawlers find your site and can give it greater visibility in their search results. Links from relevant content and authoritative sources in particular are treated by search engines as a more powerful vote, and are therefore more likely to bring search engine robots to your website. Submitting your site to reputable, well-categorized web directories or major social networking sites gives it more exposure to crawlers.
2. Content Update and Pinging
Regular and frequent content updates are another important factor that attracts search engine robots. For example, the purpose of Google’s fresh crawl is to detect content updates and reflect the changes in the search results as quickly as possible.
If your site is a blog, you can try existing pinging services such as pingomatic.com or Google’s Blog Search pinging service to proactively inform search engine robots of new posts and content changes.
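Ping services of this kind have traditionally accepted an XML-RPC call named weblogUpdates.ping that carries the blog's name and URL. The following is a minimal sketch of such a call using Python's standard library. The function names are illustrative, and the endpoint you pass to send_ping is an assumption: check the current documentation of the service you use for its actual address and for whether it is still in operation.

```python
import xmlrpc.client


def build_ping_request(blog_name: str, blog_url: str) -> str:
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps(
        (blog_name, blog_url), methodname="weblogUpdates.ping"
    )


def send_ping(endpoint: str, blog_name: str, blog_url: str):
    """Send the ping to a ping service (performs a network request).

    `endpoint` is the service's XML-RPC URL, supplied by the caller.
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


if __name__ == "__main__":
    # Inspect the request body without contacting any server.
    print(build_ping_request("My Blog", "http://example.com/"))
```

Calling send_ping right after publishing a post is the usual pattern; most blog platforms can be configured to do this automatically.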
3. Internal Link Structure
Another factor that affects a search engine’s crawl rate is how a given page is linked from other pages within the same domain. Search engines determine the relative importance of each page on a website based on the site’s overall internal link structure. Pages that are heavily linked to internally (e.g., pages linked site-wide) are considered important by search engines, and therefore receive more frequent visits from spiders.
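To get a rough picture of which pages your own internal link structure emphasizes, you can count how many of your pages link to each URL. The sketch below assumes you already have the HTML of each page in a dictionary keyed by URL; the helper names are hypothetical, and a raw inbound-link count is only a crude stand-in for however search engines actually weigh internal links.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_inbound_links(pages):
    """Map each internal URL to the number of distinct pages linking to it.

    `pages` maps a page URL to that page's HTML source.
    """
    inbound = Counter()
    for page_url, html in pages.items():
        host = urlparse(page_url).netloc
        parser = LinkExtractor()
        parser.feed(html)
        targets = set()
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(page_url, href))
            # Keep same-host links only, and ignore self-links.
            if urlparse(target).netloc == host and target != page_url:
                targets.add(target)
        inbound.update(targets)
    return inbound
```

Pages that you consider important but that show a low count here are candidates for more prominent internal linking.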
4. Sitemap and Robots.txt
Creating a search engine sitemap helps your site get indexed more deeply as well as more frequently. With Google, you can create an XML- or TXT-formatted sitemap and submit it through your Google webmaster tools account. A typical sitemap contains a list of URLs for the crawler to retrieve. If the sitemap is formatted in XML, you can specify extra information for crawlers, such as how frequently a page’s content changes, its last modification date, or its importance relative to other pages on your site.
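As an illustration of the XML format, here is a minimal sketch that generates a sitemap with Python's standard library. The element names (loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol; the function name, the dictionary layout, and the example URLs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return an XML sitemap as a string.

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq", and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://example.com/", "lastmod": "2024-01-15",
         "changefreq": "daily", "priority": 1.0},
        {"loc": "http://example.com/about", "changefreq": "monthly",
         "priority": 0.3},
    ]))
```

Save the output as a file such as sitemap.xml on your site before submitting it. Crawlers treat the optional fields as hints rather than commands.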
While a sitemap tells crawlers which pages to retrieve, robots.txt does the opposite. That is, robots.txt tells spiders not to retrieve all or part of your website that is otherwise publicly accessible to human visitors. As webmasters become more SEO-savvy, they start to use robots.txt more actively (e.g., to eliminate duplicate content). At the same time, this increases the chance that they will fumble robots.txt and unwittingly block search engine spiders. To prevent such costly mistakes, always keep up with the robots.txt syntax recommended by major search engines such as Google and Yahoo, and look out for Google’s crawl error reports.
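One way to catch an accidental block before a crawler does is to test a draft robots.txt against the URLs you expect to be crawlable. The sketch below uses Python's standard urllib.robotparser; the function name and sample rules are hypothetical, and the standard-library parser may interpret some rules (wildcards, for instance) differently from a particular search engine, so treat the result as a first check only.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    draft = "User-agent: *\nDisallow: /print/\n"
    important = [
        "http://example.com/blog/post",
        "http://example.com/print/page1",
    ]
    for url in blocked_urls(draft, important):
        print("Blocked:", url)
```

If a page you want indexed shows up in the blocked list, fix the rule before uploading the file.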
5. Server Speed
So as not to interfere with a search engine’s crawling, the web server that hosts your site should respond to each request within a reasonable time. A fast response time gives visitors a good browsing experience, and the same logic applies to search engine robots. Given that a search engine’s primary role is to give users a good search experience, hosting your website on a fast web server helps it get indexed faster and updated more frequently by search engines.
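A simple way to keep an eye on this is to time how long your server takes to return a page. The following is a minimal sketch using Python's standard library; the function name is hypothetical, and a single measurement from one location only approximates what a crawler experiences, so repeat it several times and from different networks if you can.

```python
import time
import urllib.request


def time_request(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return the seconds taken to fetch `url` and read its full body.

    `opener` defaults to urllib.request.urlopen; it is a parameter so
    the function can be exercised without network access.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace the placeholder URL with a page on your own site.
    elapsed = time_request("http://example.com/")
    print(f"Fetched in {elapsed:.3f} seconds")
```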
6. Set Crawl Rate Feature in Google Webmaster Tools
In your Google webmaster tools account, you can choose among three crawl speeds for your website: faster, normal, or slower. The crawl rate setting is available only for top-level domains or sub-domains, not for individual pages or folders. Once set, a custom crawl rate needs to be renewed every 90 days. However, it’s reported that this feature does not guarantee an immediate effect on Google’s crawl rate.