The robots.txt file tells search engine crawlers which pages they can and cannot access on your website. It is one of the oldest web standards, dating back to 1994 and formalized as RFC 9309 in 2022, and understanding it is essential for SEO and site management.
What Is robots.txt?
A plain text file at the root of your website that provides instructions to web crawlers:
https://example.com/robots.txt
Important: robots.txt is a suggestion, not a security measure. Well-behaved crawlers follow it; malicious bots ignore it.
Basic Syntax
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
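The grammar above is simple enough to parse by hand. A minimal Python sketch (illustrative only — real crawlers use hardened parsers, and grouping of consecutive User-agent lines is simplified here; the function name is my own):

```python
def parse_robots(text):
    """Group Allow/Disallow rules under each User-agent and collect
    Sitemap lines, which apply to the whole file."""
    groups, sitemaps, current = {}, [], None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # comments start with '#'
        if ":" not in line:
            continue  # skip blank/invalid lines
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            current = groups.setdefault(value, [])
        elif field in ("allow", "disallow") and current is not None:
            current.append((field.capitalize(), value))
        elif field == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps

sample = """User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml"""

groups, sitemaps = parse_robots(sample)
print(groups)    # {'*': [('Disallow', '/private/'), ('Allow', '/public/')]}
print(sitemaps)  # ['https://example.com/sitemap.xml']
```

Note that rules belong to the nearest preceding `User-agent` group, while `Sitemap` is a standalone directive valid anywhere in the file.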
Directives
| Directive | Purpose |
|---|---|
| User-agent | Specifies which crawler the rules apply to |
| Disallow | Paths the crawler should not access |
| Allow | Paths the crawler can access (overrides Disallow) |
| Sitemap | Location of your XML sitemap |
| Crawl-delay | Seconds between requests (not always honored) |
Common Examples
Allow Everything (Default)
User-agent: *
Disallow:
Or simply an empty file. Most crawlers assume they can access everything unless told otherwise.
Block Everything
User-agent: *
Disallow: /
This blocks all compliant crawlers from your entire site.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Block Specific Files
User-agent: *
Disallow: /config.php
Disallow: /secret-page.html
Block Specific File Types
User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Allow Within a Blocked Directory
User-agent: *
Disallow: /admin/
Allow: /admin/public-report/
When rules conflict, most crawlers (including Google) apply the most specific rule — the one with the longest matching path — and Allow wins ties.
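That resolution order can be sketched in Python (a simplified model assuming literal prefix rules with no wildcards; the function name is my own):

```python
def is_allowed(path, rules):
    """Google-style conflict resolution: among all rules whose pattern
    is a prefix of the path, the longest pattern wins; Allow beats
    Disallow on a tie. No matching rule means the path is allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if path.startswith(pattern):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return best is None or best[1]

rules = [("Disallow", "/admin/"), ("Allow", "/admin/public-report/")]
print(is_allowed("/admin/settings", rules))            # False
print(is_allowed("/admin/public-report/2024", rules))  # True
print(is_allowed("/blog/post", rules))                 # True (no rule matches)
```

The tuple comparison encodes the precedence: longer patterns sort higher, and on equal length `True` (Allow) sorts above `False` (Disallow).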
Different Rules for Different Bots
# Rules for Google
User-agent: Googlebot
Disallow: /google-specific-block/
# Rules for Bing
User-agent: Bingbot
Disallow: /bing-specific-block/
# Rules for everyone else
User-agent: *
Disallow: /private/
Common Crawler User-Agents
| User-agent | Crawler |
|---|---|
| Googlebot | Google's main crawler |
| Googlebot-Image | Google Images |
| Googlebot-News | Google News |
| Bingbot | Microsoft Bing |
| Slurp | Yahoo |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu (Chinese search) |
| YandexBot | Yandex (Russian search) |
| facebot | Facebook link previews |
| Twitterbot | Twitter/X link previews |
| rogerbot | Moz |
| AhrefsBot | Ahrefs SEO tool |
| SemrushBot | Semrush SEO tool |
Pattern Matching
Wildcards (*)
The asterisk matches any sequence of characters:
# Block all PDF files
User-agent: *
Disallow: /*.pdf
# Block files starting with "temp"
Disallow: /temp*
# Block query parameters
Disallow: /*?*
End Anchors ($)
The dollar sign anchors to the end of the URL:
# Block only .php files (not .php.bak)
Disallow: /*.php$
# Block URLs ending with /
Disallow: /*/$
Examples
# Block: /foo/bar.pdf
# Allow: /foo/bar.pdf.bak
Disallow: /*.pdf$
# Block: /page?id=1
# Block: /page?id=2&sort=asc
Disallow: /*?
# Block: /folder/anything/here
# Allow: /folder
Disallow: /folder/
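These matching rules translate directly into a regular expression: `*` becomes `.*`, a trailing `$` becomes an end anchor, and everything else is a literal prefix. A sketch (the helper name is hypothetical; real crawlers also normalize percent-encoding and other edge cases):

```python
import re

def rule_matches(pattern, path):
    """Robots.txt pattern matching: '*' matches any run of characters,
    a trailing '$' anchors the end of the URL path; otherwise the
    pattern only needs to match a prefix of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each '*' into '.*'
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.pdf$", "/foo/bar.pdf"))      # True
print(rule_matches("/*.pdf$", "/foo/bar.pdf.bak"))  # False (anchored)
print(rule_matches("/*?", "/page?id=1"))            # True
print(rule_matches("/folder/", "/folder"))          # False (no trailing /)
```

Escaping before substituting `*` is the key step — otherwise characters like `.` and `?` in the pattern would be misread as regex operators.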
Complete robots.txt Template
# robots.txt for example.com
# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /*?print=
Disallow: /*?preview=
Disallow: /*.json$
# Allow CSS and JS for rendering
Allow: /*.css$
Allow: /*.js$
# Specific rules for Google
User-agent: Googlebot
Allow: /
Disallow: /internal/
# Block AI training bots
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
# Sitemap location
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
What to Block
Good Candidates for Blocking
- Admin areas: `/admin/`, `/wp-admin/`, `/backend/`
- Internal APIs: `/api/internal/`
- Search results pages: `/search?`, `/results?`
- Duplicate content: print versions, sorted versions
- Staging/test areas: `/staging/`, `/test/`
- Thank-you pages: `/thank-you`, `/confirmation`
- Cart/checkout: `/cart`, `/checkout` (debatable)
- User-specific pages: `/my-account/`, `/dashboard/`
What NOT to Block
- CSS and JavaScript: Crawlers need these to render pages
- Images (usually): Unless you don't want image search traffic
- Content you want indexed: Obviously
- Your sitemap: Some people accidentally block it
Common Mistakes
1. Blocking CSS/JS
# BAD - prevents proper rendering
User-agent: *
Disallow: /css/
Disallow: /js/
# GOOD
User-agent: *
Disallow: /admin/
Allow: /admin/css/
Allow: /admin/js/
2. Blocking the Entire Site Accidentally
# DANGER - blocks everything!
User-agent: *
Disallow: /
# Did you mean this?
User-agent: *
Disallow: /private/
3. Wrong File Location
robots.txt MUST be at the root:
- ✅ `https://example.com/robots.txt`
- ❌ `https://example.com/pages/robots.txt`
- ❌ `https://example.com/blog/robots.txt`

For subdomains, each needs its own robots.txt:

- `https://blog.example.com/robots.txt`
- `https://shop.example.com/robots.txt`
4. Using robots.txt for Security
# This doesn't hide sensitive content!
Disallow: /secret-admin-password.txt
If a URL is public, it can still be:
- Directly accessed
- Linked from elsewhere
- Found through other means
For actual security: Use authentication, not robots.txt.
5. Forgetting About HTTPS/HTTP
Search engines treat these as different sites:
- `http://example.com/robots.txt`
- `https://example.com/robots.txt`
Make sure both exist, or redirect HTTP to HTTPS properly.
Testing robots.txt
Google Search Console
1. Open Search Console and choose your property
2. Go to Settings → robots.txt to view the robots.txt report (Google retired the standalone robots.txt Tester in 2023)
3. Use the URL Inspection tool to check whether a specific URL can be crawled
Manual Testing
Check whether a URL is blocked using Python's standard library. Note that `urllib.robotparser` implements the original robots.txt spec and does not support the `*` and `$` wildcards described above:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# Test URLs
print(rp.can_fetch("*", "https://example.com/page"))
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))
Online Tools
- Google's robots.txt Tester (in Search Console)
- Bing Webmaster Tools
- Various online validators
robots.txt vs Meta Robots
| Feature | robots.txt | Meta robots |
|---|---|---|
| Scope | Site-wide | Per page |
| Location | Root URL | HTML head |
| Prevents crawling | Yes | No |
| Prevents indexing | No | Yes |
| Removes from index | No | Yes |
Important distinction:
- `Disallow` in robots.txt = "Don't crawl this"
- `<meta name="robots" content="noindex">` = "Don't index this"
A page blocked from crawling can still appear in search results (as a bare URL with little or no snippet) if other sites link to it. To keep a page out of the index entirely, use noindex and leave the page crawlable so the directive can be seen.
Blocking AI Crawlers
With AI training concerns, many sites now block AI-specific crawlers:
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
# Anthropic
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Common Crawl (used for AI training)
User-agent: CCBot
Disallow: /
# Google AI (Gemini training; formerly Bard)
User-agent: Google-Extended
Disallow: /
Note: This only works for compliant crawlers. Scrapers that ignore robots.txt won't be affected.
Summary
robots.txt Checklist
✅ Place at domain root: /robots.txt
✅ Use proper syntax (case-sensitive for paths)
✅ Test with Search Console
✅ Include sitemap location
✅ Allow CSS and JavaScript
✅ Don't use for security
✅ Review periodically
Quick Reference
# Allow all
User-agent: *
Disallow:
# Block directory
Disallow: /folder/
# Block file
Disallow: /file.html
# Block file type
Disallow: /*.pdf$
# Block query strings
Disallow: /*?
# Add sitemap
Sitemap: https://example.com/sitemap.xml
Need help creating your robots.txt? Try our Robots.txt Generator!