robots.txt: Controlling Search Engine Crawlers

A complete guide to robots.txt, including syntax, common patterns, and how to use it to control which pages search engines can access.

HandyUtils December 17, 2025 6 min read

The robots.txt file tells search engine crawlers which pages they can and cannot access on your website. The convention dates back to 1994 and was formalized as RFC 9309 in 2022, making it one of the oldest web standards still in daily use, and understanding it is essential for SEO and site management.

What Is robots.txt?

A plain text file at the root of your website that provides instructions to web crawlers:

https://example.com/robots.txt

Important: robots.txt is a suggestion, not a security measure. Well-behaved crawlers follow it; malicious bots ignore it.

Basic Syntax

User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml

Directives

Directive     Purpose
User-agent    Specifies which crawler the rules apply to
Disallow      Paths the crawler should not access
Allow         Paths the crawler can access (the most specific matching rule overrides Disallow)
Sitemap       Location of your XML sitemap
Crawl-delay   Seconds between requests (ignored by Google; honored by some other crawlers)
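To see how these directives fit together, here is a small Python sketch that assembles a robots.txt file from a rules mapping. The `build_robots` helper and its signature are invented for this illustration:

```python
def build_robots(rules: dict[str, list[str]], sitemaps: list[str]) -> str:
    """Assemble robots.txt text from {user-agent: [disallowed paths]}
    plus sitemap URLs. Hypothetical helper for illustration only."""
    blocks = []
    for agent, paths in rules.items():
        # An empty path list becomes a bare "Disallow:", which allows everything.
        lines = [f"User-agent: {agent}"] + [f"Disallow: {p}" for p in (paths or [""])]
        blocks.append("\n".join(lines))
    blocks += [f"Sitemap: {s}" for s in sitemaps]
    return "\n\n".join(blocks) + "\n"

print(build_robots({"*": ["/private/"]}, ["https://example.com/sitemap.xml"]))
```

Groups are separated by blank lines, which is how real parsers delimit them.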

Common Examples

Allow Everything (Default)

User-agent: *
Disallow:

Or simply an empty file. Most crawlers assume they can access everything unless told otherwise.

Block Everything

User-agent: *
Disallow: /

This blocks all compliant crawlers from your entire site.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Block Specific Files

User-agent: *
Disallow: /config.php
Disallow: /secret-page.html

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$

Allow Within a Blocked Directory

User-agent: *
Disallow: /admin/
Allow: /admin/public-report/

The more specific rule takes precedence.
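You can verify this with Python's standard-library parser. One caveat: `urllib.robotparser` applies rules in file order rather than Google's longest-match precedence, so the `Allow` line is placed first in this sketch:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse rules directly instead of fetching them. Allow is listed first
# because the stdlib parser uses the first matching rule.
rp.parse("""\
User-agent: *
Allow: /admin/public-report/
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("*", "https://example.com/admin/secret.html"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public-report/"))  # True
```
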

Different Rules for Different Bots

# Rules for Google
User-agent: Googlebot
Disallow: /google-specific-block/

# Rules for Bing
User-agent: Bingbot
Disallow: /bing-specific-block/

# Rules for everyone else
User-agent: *
Disallow: /private/
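Note that groups do not combine: a crawler that matches a specific User-agent group ignores the `*` group entirely. A quick check with Python's standard-library parser, using the rules above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /google-specific-block/

User-agent: Bingbot
Disallow: /bing-specific-block/

User-agent: *
Disallow: /private/
""".splitlines())

# Googlebot matched its own group, so the * rules do not apply to it.
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))    # True
# DuckDuckBot has no group of its own and falls back to *.
print(rp.can_fetch("DuckDuckBot", "https://example.com/private/page"))  # False
```

If you want a specific bot to inherit the general restrictions, you must repeat them in its group.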

Common Crawler User-Agents

User-agent        Crawler
Googlebot         Google's main crawler
Googlebot-Image   Google Images
Googlebot-News    Google News
Bingbot           Microsoft Bing
Slurp             Yahoo
DuckDuckBot       DuckDuckGo
Baiduspider       Baidu (Chinese search)
YandexBot         Yandex (Russian search)
Facebot           Facebook
Twitterbot        Twitter
rogerbot          Moz
AhrefsBot         Ahrefs SEO tool
SemrushBot        Semrush SEO tool

Pattern Matching

Wildcards (*)

The asterisk matches any sequence of characters:

# Block all PDF files
User-agent: *
Disallow: /*.pdf

# Block files starting with "temp"
Disallow: /temp*

# Block query parameters
Disallow: /*?*

End Anchors ($)

The dollar sign anchors to the end of the URL:

# Block only .php files (not .php.bak)
Disallow: /*.php$

# Block URLs ending with /
Disallow: /*/$

Examples

# Block: /foo/bar.pdf
# Allow: /foo/bar.pdf.bak
Disallow: /*.pdf$

# Block: /page?id=1
# Block: /page?id=2&sort=asc
Disallow: /*?

# Block: /folder/anything/here
# Allow: /folder
Disallow: /folder/
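Python's built-in `urllib.robotparser` does not understand the `*` and `$` extensions, but the matching logic is easy to sketch by translating a pattern into a regular expression. `rule_matches` below is a hypothetical helper, not part of any library:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Test a robots.txt path pattern against a URL path, treating
    '*' as a wildcard and a trailing '$' as an end anchor."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"       # restore the end anchor
    return re.match(regex, path) is not None

print(rule_matches("/*.pdf$", "/foo/bar.pdf"))      # True
print(rule_matches("/*.pdf$", "/foo/bar.pdf.bak"))  # False
print(rule_matches("/*?", "/page?id=1"))            # True
```

A full implementation would also apply longest-match precedence between Allow and Disallow rules, which this sketch omits.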

Complete robots.txt Template

# robots.txt for example.com

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /*?print=
Disallow: /*?preview=
Disallow: /*.json$

# Allow CSS and JS for rendering
Allow: /*.css$
Allow: /*.js$

# Specific rules for Google
User-agent: Googlebot
Allow: /
Disallow: /internal/

# Block AI training bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Sitemap location
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

What to Block

Good Candidates for Blocking

  • Admin areas: /admin/, /wp-admin/, /backend/
  • Internal APIs: /api/internal/
  • Search results pages: /search?, /results?
  • Duplicate content: Print versions, sorted versions
  • Staging/test areas: /staging/, /test/
  • Thank you pages: /thank-you, /confirmation
  • Cart/checkout: /cart, /checkout (debatable)
  • User-specific pages: /my-account/, /dashboard/

What NOT to Block

  • CSS and JavaScript: Crawlers need these to render pages
  • Images (usually): Unless you don't want image search traffic
  • Content you want indexed: Obviously
  • Your sitemap: Some people accidentally block it

Common Mistakes

1. Blocking CSS/JS

# BAD - prevents proper rendering
User-agent: *
Disallow: /css/
Disallow: /js/

# GOOD
User-agent: *
Disallow: /admin/
Allow: /admin/css/
Allow: /admin/js/

2. Blocking the Entire Site Accidentally

# DANGER - blocks everything!
User-agent: *
Disallow: /

# Did you mean this?
User-agent: *
Disallow: /private/

3. Wrong File Location

robots.txt MUST be at the root:

  • ✅ https://example.com/robots.txt
  • ❌ https://example.com/pages/robots.txt (ignored by crawlers)
  • ❌ https://example.com/blog/robots.txt (ignored by crawlers)

For subdomains, each needs its own robots.txt:

  • https://blog.example.com/robots.txt
  • https://shop.example.com/robots.txt

4. Using robots.txt for Security

# This doesn't hide sensitive content!
Disallow: /secret-admin-password.txt

If a URL is public, it can still be:

  • Directly accessed
  • Linked from elsewhere
  • Found through other means

For actual security: Use authentication, not robots.txt.

5. Forgetting About HTTPS/HTTP

Search engines treat these as different sites:

  • http://example.com/robots.txt
  • https://example.com/robots.txt

Make sure both exist, or redirect HTTP to HTTPS properly.

Testing robots.txt

Google Search Console

  1. Go to Search Console
  2. Choose your property
  3. Settings → robots.txt report (the standalone Tester tool has been retired)
  4. Enter URLs to test

Manual Testing

Check if a URL is blocked:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

# Test whether a given user-agent may fetch each URL
print(rp.can_fetch("*", "https://example.com/page"))
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))

Note that Python's built-in parser does not support the * and $ wildcard extensions, so its answers can differ from Google's for patterns that use them.

Online Tools

  • Google's robots.txt report (in Search Console)
  • Bing Webmaster Tools
  • Various online validators

robots.txt vs Meta Robots

Feature              robots.txt   Meta robots
Scope                Site-wide    Per page
Location             Root URL     HTML head
Prevents crawling    Yes          No
Prevents indexing    No           Yes
Removes from index   No           Yes

Important distinction:

  • robots.txt Disallow = "Don't crawl this"
  • <meta name="robots" content="noindex"> = "Don't index this"

A page can be blocked from crawling but still appear in search results (with limited info) if it has external links.
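In practice this means noindex only takes effect if the crawler can fetch the page and read the tag. Here is a minimal sketch of extracting the per-page directive (`robots_meta` is an illustrative helper; a production crawler would use a real HTML parser rather than a regex):

```python
import re

def robots_meta(html: str) -> list[str]:
    """Extract the directive list from a <meta name="robots"> tag.
    Illustrative helper only; handles the simple attribute order shown."""
    m = re.search(r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']',
                  html, re.IGNORECASE)
    return [d.strip().lower() for d in m.group(1).split(",")] if m else []

html = '<head><meta name="robots" content="noindex, follow"></head>'
print(robots_meta(html))  # ['noindex', 'follow']
```

If robots.txt blocks the URL, the crawler never sees this tag, which is exactly why Disallow cannot be used to enforce noindex.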

Blocking AI Crawlers

With AI training concerns, many sites now block AI-specific crawlers:

# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Anthropic
User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

# Common Crawl (used for AI training)
User-agent: CCBot
Disallow: /

# Google AI (Bard training)
User-agent: Google-Extended
Disallow: /

Note: This only works for compliant crawlers. Scrapers that ignore robots.txt won't be affected.

Summary

robots.txt Checklist

✅ Place at domain root: /robots.txt
✅ Use proper syntax (paths are case-sensitive)
✅ Test with Search Console
✅ Include sitemap location
✅ Allow CSS and JavaScript
✅ Don't use for security
✅ Review periodically

Quick Reference

# Allow all
User-agent: *
Disallow:

# Block directory
Disallow: /folder/

# Block file
Disallow: /file.html

# Block file type
Disallow: /*.pdf$

# Block query strings
Disallow: /*?

# Add sitemap
Sitemap: https://example.com/sitemap.xml

Need help creating your robots.txt? Try our Robots.txt Generator!

