How to Create a Bing robots.txt file

You can use a robots.txt file to tell a Robots Exclusion Protocol (REP)-compliant search engine crawler (also known as a robot or bot) which directories and files on your web server it is not permitted to visit, that is, which sections should not be crawled. Keep in mind that blocking a page from crawling does not necessarily keep it out of the index; a URL can still be indexed if other pages link to it.
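At its simplest, a robots.txt file is just a plain text file made up of User-agent lines and directive lines. For example (the blocked path here is purely illustrative):

  User-agent: *
  Disallow: /private/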

Steps:

Identify which directories and files on your web server you want to block from the crawler

  1. Examine your web server for published content that you do not want to be visited by search engines.
  2. Create a list of the accessible files and directories on your web server that you want to disallow. Example: You might want bots to skip crawling site directories such as /cgi-bin, /scripts, and /tmp (or their equivalents, if they exist in your server architecture).

Identify whether you need to specify additional instructions for a particular search engine bot beyond the generic set of crawling directives

  • Examine your web server’s referrer logs to see if there are bots crawling your site that you want to block beyond the generic directives that apply to all bots.
NOTE
Bingbot, upon finding a specific set of instructions for itself, will ignore the directives listed in the generic section, so you will need to repeat all of the generic directives, in addition to the Bingbot-specific ones, in Bingbot's own section of the file.
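For example, if the generic section blocks /tmp/, a Bingbot-specific section must repeat that directive before adding anything of its own:

  User-agent: *
  Disallow: /tmp/

  User-agent: bingbot
  # Repeat the generic directives here, or Bingbot will not apply them
  Disallow: /tmp/
  # Bingbot-specific directives can then follow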


Use a text editor to create the robots.txt file and add REP directives to block content from being visited by the bots. The text file should be saved in ASCII or UTF-8 encoding.

  1. Bots are referenced as user-agents in the robots.txt file. At the beginning of the file, start the first section of directives, applicable to all bots, by adding this line: User-agent: *
  2. Create a list of Disallow directives for the content you want blocked. Example: Using the directory examples from earlier, the set of directives would look like this:
    • User-agent: *
    • Disallow: /cgi-bin/
    • Disallow: /scripts/
    • Disallow: /tmp/
    NOTE
    • You cannot list multiple content references in one line, so you’ll need to create a new Disallow: directive for each pattern to be blocked. You can, however, use wildcard characters. Note that each URL pattern starts with the forward slash, representing the root of the current site.
    • You can also use an Allow: directive for files stored in a directory whose contents will otherwise be blocked (see the example after this list).
    • For more information on using wildcards and on creating Disallow and Allow directives, see the Webmaster Center blog article Prevent a bot from getting “lost in space”.


  3. If you want to add customized directives for specific bots that are not appropriate for all bots, such as crawl-delay:, add them in a custom section after the first, generic section, changing the User-agent reference to a specific bot. For a list of applicable bot names, see the Robots Database.
    NOTE
    Adding sets of directives customized for individual bots is not a recommended strategy. The typical need to repeat directives from the generic section complicates file maintenance tasks. Furthermore, omissions in properly maintaining these customized sections are often the source of crawling problems with search engine bots. 
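For illustration only, a file that combines these options might look like the sketch below; the directory, file name, wildcard pattern, and delay value are invented for this example, and the generic directives are repeated in the bingbot section for the reason noted earlier:

  User-agent: *
  Disallow: /scripts/
  # A file inside an otherwise blocked directory can still be allowed
  Allow: /scripts/public/info.html
  # An example wildcard pattern (see the blog article linked above for the supported syntax)
  Disallow: /*.tmp

  User-agent: bingbot
  Disallow: /scripts/
  Allow: /scripts/public/info.html
  Disallow: /*.tmp
  # Bot-specific directive asking this crawler to reduce its request rate
  Crawl-delay: 10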


Optional: Add a reference to your sitemap file (if you have one)

  • If you have created a Sitemap file listing the most important pages on your site, you can point the bot to it by referencing it in its own line at the end of the file.
  • Example: A Sitemap file is typically saved to the root directory of a site. The Sitemap directive line would look like this:
  • Sitemap: http://www.your-url.com/sitemap.xml

Check for errors by validating your robots.txt file
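There is no prescribed tool for this step. One simple way to spot-check the file, assuming Python 3 is available and your draft is saved locally as robots.txt, is the standard library's urllib.robotparser module; the domain and file names below are placeholders, and /cgi-bin/ is the directory blocked in the earlier example:

  from urllib.robotparser import RobotFileParser

  # Load the local draft rather than a live URL so it can be checked before upload
  with open("robots.txt") as f:
      rules = RobotFileParser()
      rules.parse(f.read().splitlines())

  # A blocked path should be reported as disallowed for all user-agents
  print(rules.can_fetch("*", "http://www.your-url.com/cgi-bin/test.cgi"))  # expected: False

  # An unblocked page should still be crawlable
  print(rules.can_fetch("*", "http://www.your-url.com/index.html"))  # expected: True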

Upload the robots.txt file to the root directory of your site

NOTE
  • You do not need to submit your new robots.txt file to the search engines. Search engine bots regularly and automatically look for a file named robots.txt in the root directory of your site and, if it is found, read it first to see which directives, if any, pertain to them. Note that search engines cache your robots.txt for at least a few hours, so changes can take a few hours to be reflected in their crawl behavior.