robots.txt: Controlling Search Engine Crawlers

A complete guide to robots.txt, including syntax, common patterns, and how to use it to control which pages search engines can access.

HandyUtils December 17, 2025 6 min read

The robots.txt file tells search engine crawlers which pages they can and cannot access on your website. The convention dates back to 1994 and was formalized as RFC 9309 in 2022, making it one of the oldest web standards still in daily use, and understanding it is essential for SEO and site management.

What Is robots.txt?

A plain text file at the root of your website that provides instructions to web crawlers:

https://example.com/robots.txt

Important: robots.txt is a suggestion, not a security measure. Well-behaved crawlers follow it; malicious bots ignore it.

Basic Syntax

User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml

Directives

Directive     Purpose
User-agent    Specifies which crawler the rules apply to
Disallow      Paths the crawler should not access
Allow         Paths the crawler can access (the most specific matching rule overrides Disallow)
Sitemap       Location of your XML sitemap
Crawl-delay   Seconds between requests (ignored by Google; honored by some other crawlers)
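To see how these directives fit together, here is a small Python sketch that assembles a robots.txt file from a rules mapping. The `build_robots` helper and its signature are invented for this illustration:

```python
def build_robots(rules: dict[str, list[str]], sitemaps: list[str]) -> str:
    """Assemble robots.txt text from {user-agent: [disallowed paths]}
    plus sitemap URLs. Hypothetical helper for illustration only."""
    blocks = []
    for agent, paths in rules.items():
        # An empty path list becomes a bare "Disallow:", which allows everything.
        lines = [f"User-agent: {agent}"] + [f"Disallow: {p}" for p in (paths or [""])]
        blocks.append("\n".join(lines))
    blocks += [f"Sitemap: {s}" for s in sitemaps]
    return "\n\n".join(blocks) + "\n"

print(build_robots({"*": ["/private/"]}, ["https://example.com/sitemap.xml"]))
```

Groups are separated by blank lines, which is how real parsers delimit them.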

Common Examples

Allow Everything (Default)

User-agent: *
Disallow:

Or simply an empty file. Most crawlers assume they can access everything unless told otherwise.

Block Everything

User-agent: *
Disallow: /

This blocks all compliant crawlers from your entire site.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Block Specific Files

User-agent: *
Disallow: /config.php
Disallow: /secret-page.html

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$

Allow Within a Blocked Directory

User-agent: *
Disallow: /admin/
Allow: /admin/public-report/

The more specific rule takes precedence.
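You can verify this with Python's standard-library parser. One caveat: `urllib.robotparser` applies rules in file order rather than Google's longest-match precedence, so the `Allow` line is placed first in this sketch:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse rules directly instead of fetching them. Allow is listed first
# because the stdlib parser uses the first matching rule.
rp.parse("""\
User-agent: *
Allow: /admin/public-report/
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("*", "https://example.com/admin/secret.html"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public-report/"))  # True
```
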

Different Rules for Different Bots

# Rules for Google
User-agent: Googlebot
Disallow: /google-specific-block/

# Rules for Bing
User-agent: Bingbot
Disallow: /bing-specific-block/

# Rules for everyone else
User-agent: *
Disallow: /private/
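Note that groups do not combine: a crawler that matches a specific User-agent group ignores the `*` group entirely. A quick check with Python's standard-library parser, using the rules above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /google-specific-block/

User-agent: Bingbot
Disallow: /bing-specific-block/

User-agent: *
Disallow: /private/
""".splitlines())

# Googlebot matched its own group, so the * rules do not apply to it.
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))    # True
# DuckDuckBot has no group of its own and falls back to *.
print(rp.can_fetch("DuckDuckBot", "https://example.com/private/page"))  # False
```

If you want a specific bot to inherit the general restrictions, you must repeat them in its group.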

Common Crawler User-Agents

User-agent        Crawler
Googlebot         Google's main crawler
Googlebot-Image   Google Images
Googlebot-News    Google News
Bingbot           Microsoft Bing
Slurp             Yahoo
DuckDuckBot       DuckDuckGo
Baiduspider       Baidu (Chinese search)
YandexBot         Yandex (Russian search)
Facebot           Facebook
Twitterbot        Twitter
rogerbot          Moz
AhrefsBot         Ahrefs SEO tool
SemrushBot        Semrush SEO tool

Pattern Matching

Wildcards (*)

The asterisk matches any sequence of characters:

# Block all PDF files
User-agent: *
Disallow: /*.pdf

# Block files starting with "temp"
Disallow: /temp*

# Block query parameters
Disallow: /*?*

End Anchors ($)

The dollar sign anchors to the end of the URL:

# Block only .php files (not .php.bak)
Disallow: /*.php$

# Block URLs ending with /
Disallow: /*/$

Examples

# Block: /foo/bar.pdf
# Allow: /foo/bar.pdf.bak
Disallow: /*.pdf$

# Block: /page?id=1
# Block: /page?id=2&sort=asc
Disallow: /*?

# Block: /folder/anything/here
# Allow: /folder
Disallow: /folder/
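Python's built-in `urllib.robotparser` does not understand the `*` and `$` extensions, but the matching logic is easy to sketch by translating a pattern into a regular expression. `rule_matches` below is a hypothetical helper, not part of any library:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Test a robots.txt path pattern against a URL path, treating
    '*' as a wildcard and a trailing '$' as an end anchor."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"       # restore the end anchor
    return re.match(regex, path) is not None

print(rule_matches("/*.pdf$", "/foo/bar.pdf"))      # True
print(rule_matches("/*.pdf$", "/foo/bar.pdf.bak"))  # False
print(rule_matches("/*?", "/page?id=1"))            # True
```

A full implementation would also apply longest-match precedence between Allow and Disallow rules, which this sketch omits.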

Complete robots.txt Template

# robots.txt for example.com

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /*?print=
Disallow: /*?preview=
Disallow: /*.json$

# Allow CSS and JS for rendering
Allow: /*.css$
Allow: /*.js$

# Specific rules for Google
User-agent: Googlebot
Allow: /
Disallow: /internal/

# Block AI training bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Sitemap location
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

What to Block

Good Candidates for Blocking

  • Admin areas: /admin/, /wp-admin/, /backend/
  • Internal APIs: /api/internal/
  • Search results pages: /search?, /results?
  • Duplicate content: Print versions, sorted versions
  • Staging/test areas: /staging/, /test/
  • Thank you pages: /thank-you, /confirmation
  • Cart/checkout: /cart, /checkout (debatable)
  • User-specific pages: /my-account/, /dashboard/

What NOT to Block

  • CSS and JavaScript: Crawlers need these to render pages
  • Images (usually): Unless you don't want image search traffic
  • Content you want indexed: Obviously
  • Your sitemap: Some people accidentally block it

Common Mistakes

1. Blocking CSS/JS

# BAD - prevents proper rendering
User-agent: *
Disallow: /css/
Disallow: /js/

# GOOD
User-agent: *
Disallow: /admin/
Allow: /admin/css/
Allow: /admin/js/

2. Blocking the Entire Site Accidentally

# DANGER - blocks everything!
User-agent: *
Disallow: /

# Did you mean this?
User-agent: *
Disallow: /private/

3. Wrong File Location

robots.txt MUST be at the root:

  • ✅ https://example.com/robots.txt
  • ❌ https://example.com/pages/robots.txt (ignored by crawlers)
  • ❌ https://example.com/blog/robots.txt (ignored by crawlers)

For subdomains, each needs its own robots.txt:

  • https://blog.example.com/robots.txt
  • https://shop.example.com/robots.txt

4. Using robots.txt for Security

# This doesn't hide sensitive content!
Disallow: /secret-admin-password.txt

If a URL is public, it can still be:

  • Directly accessed
  • Linked from elsewhere
  • Found through other means

For actual security: Use authentication, not robots.txt.

5. Forgetting About HTTPS/HTTP

Search engines treat these as different sites:

  • http://example.com/robots.txt
  • https://example.com/robots.txt

Make sure both exist, or redirect HTTP to HTTPS properly.

Testing robots.txt

Google Search Console

  1. Go to Search Console
  2. Choose your property
  3. Settings → robots.txt report (the standalone Tester tool has been retired)
  4. Enter URLs to test

Manual Testing

Check if a URL is blocked:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

# Test whether a given user-agent may fetch each URL
print(rp.can_fetch("*", "https://example.com/page"))
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))

Note that Python's built-in parser does not support the * and $ wildcard extensions, so its answers can differ from Google's for patterns that use them.

Online Tools

  • Google's robots.txt report (in Search Console)
  • Bing Webmaster Tools
  • Various online validators

robots.txt vs Meta Robots

Feature              robots.txt   Meta robots
Scope                Site-wide    Per page
Location             Root URL     HTML head
Prevents crawling    Yes          No
Prevents indexing    No           Yes
Removes from index   No           Yes

Important distinction:

  • robots.txt Disallow = "Don't crawl this"
  • <meta name="robots" content="noindex"> = "Don't index this"

A page can be blocked from crawling but still appear in search results (with limited info) if it has external links.
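In practice this means noindex only takes effect if the crawler can fetch the page and read the tag. Here is a minimal sketch of extracting the per-page directive (`robots_meta` is an illustrative helper; a production crawler would use a real HTML parser rather than a regex):

```python
import re

def robots_meta(html: str) -> list[str]:
    """Extract the directive list from a <meta name="robots"> tag.
    Illustrative helper only; handles the simple attribute order shown."""
    m = re.search(r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']',
                  html, re.IGNORECASE)
    return [d.strip().lower() for d in m.group(1).split(",")] if m else []

html = '<head><meta name="robots" content="noindex, follow"></head>'
print(robots_meta(html))  # ['noindex', 'follow']
```

If robots.txt blocks the URL, the crawler never sees this tag, which is exactly why Disallow cannot be used to enforce noindex.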

Blocking AI Crawlers

With AI training concerns, many sites now block AI-specific crawlers:

# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Anthropic
User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

# Common Crawl (used for AI training)
User-agent: CCBot
Disallow: /

# Google AI (Bard training)
User-agent: Google-Extended
Disallow: /

Note: This only works for compliant crawlers. Scrapers that ignore robots.txt won't be affected.

Summary

robots.txt Checklist

✅ Place at domain root: /robots.txt
✅ Use proper syntax (paths are case-sensitive)
✅ Test with Search Console
✅ Include sitemap location
✅ Allow CSS and JavaScript
✅ Don't use for security
✅ Review periodically

Quick Reference

# Allow all
User-agent: *
Disallow:

# Block directory
Disallow: /folder/

# Block file
Disallow: /file.html

# Block file type
Disallow: /*.pdf$

# Block query strings
Disallow: /*?

# Add sitemap
Sitemap: https://example.com/sitemap.xml

Need help creating your robots.txt? Try our Robots.txt Generator!

