
Draft Robots.txt Rules Without Blocking Public Pages

A practical robots.txt checklist for sitemap discovery, private routes, AI crawler sections, and avoiding accidental blocks on public tools.

seo, robots.txt, crawler, publishing

Introduction

Robots.txt is a crawl access file, not a privacy layer. It helps crawlers discover sitemaps and avoid selected paths, but a disallowed URL can still surface elsewhere, including in search results, if it is linked, logged, or shared.

The Robots.txt Generator helps you draft a clear file for common launch patterns: public pages allowed, admin routes blocked, sitemap declared, and crawler-specific sections kept readable.

Real-world scenario

You are launching a tools site with public pages, account pages, admin pages, and an XML sitemap. A simple starting point might look like this:

User-agent: *
Allow: /
Disallow: /admin
Disallow: /sign-in
Disallow: /api

Sitemap: https://example.com/sitemap.xml

The important part is not the number of rules. It is making sure public routes like /tools, /docs, and /blog are not caught by a broad disallow pattern.
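One way to make that check concrete is a small script that evaluates sample paths against the rules. The sketch below is a simplified take on the longest-match behavior described in RFC 9309: the most specific matching prefix wins, and Allow wins a tie. The rules are copied from the sample file above. The is_allowed helper and the test paths are hypothetical, and wildcards (*) and end anchors ($) are deliberately not handled.

```python
# Hypothetical helper: simplified longest-match evaluation in the style of
# RFC 9309. Wildcards (*) and end anchors ($) are deliberately not handled.

RULES = [
    ("allow", "/"),
    ("disallow", "/admin"),
    ("disallow", "/sign-in"),
    ("disallow", "/api"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under plain prefix rules."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        # Longer prefixes are more specific; on a tie, allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


# Illustrative paths only; swap in routes from your own site.
for path in ["/tools", "/docs/start", "/blog", "/admin/users", "/api-docs"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Note the last path: because rules are prefix matches, Disallow: /api also catches /api-docs. If a public route shares a prefix with a blocked one, a trailing slash on the rule (Disallow: /api/) is usually the narrower choice.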

What to check

Sitemap line. Include the final production sitemap URL.

Public paths. Confirm that tools, docs, blog, facts, and policy pages remain crawlable if they should be indexed.

Private or account paths. Block obvious admin, dashboard, checkout, and sign-in paths from crawling where appropriate.

Separate page-level indexing. Use robots meta for noindex decisions on individual pages instead of trying to solve everything in robots.txt. A crawler has to fetch a page to see its noindex directive, so leave those pages crawlable.

Common mistakes

Blocking too broadly. Disallow rules match by path prefix, so a rule like Disallow: /tools blocks /tools and everything under it, removing your main library from crawl access.

Using robots.txt for secrets. Do not put sensitive URLs in robots.txt as if that hides them.

Forgetting production hostnames. Sitemap URLs should use the canonical host, not localhost or staging.
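The hostname mistake is easy to catch automatically. Here is a minimal sketch that pulls the Sitemap lines out of a robots.txt string and flags any that are not absolute URLs on the expected host. The sitemap_problems function is a hypothetical helper, and example.com stands in for whatever your canonical production host is.

```python
from urllib.parse import urlparse

# Assumption: example.com stands in for the canonical production host.
CANONICAL_HOST = "example.com"


def sitemap_problems(robots_txt, canonical_host=CANONICAL_HOST):
    """Return a list of problems found in Sitemap lines."""
    problems = []
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    if not sitemaps:
        problems.append("no Sitemap line found")
    for url in sitemaps:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems.append(f"not an absolute URL: {url}")
        elif parsed.hostname != canonical_host:
            problems.append(f"unexpected host: {url}")
    return problems
```

An empty list means every Sitemap line points at the expected host. A localhost or staging URL comes back as an "unexpected host" entry.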

Practical QA pass

Read the final robots.txt as a crawler would: from top to bottom, one directive per line, grouped by user agent. A file that looks readable in source but ships as one long line is hard for humans to inspect and may not be parsed as intended, because directives are line-based. Keep the production response plain and predictable.
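A line-by-line lint can approximate that read-through. The sketch below flags lines that lack a field: value shape, use a field outside a small known set, or appear to hold several directives at once. The lint_lines helper and the KNOWN_FIELDS set are assumptions for illustration. Some crawlers accept extra fields, so treat "unrecognized field" as a prompt to look, not as an error.

```python
# Assumption: only these four fields are expected in this site's file.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def lint_lines(robots_txt):
    """Return (line_number, message) pairs for lines that look malformed."""
    issues = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, sep, value = line.partition(":")
        if not sep:
            issues.append((number, "missing 'field: value' separator"))
            continue
        if field.strip().lower() not in KNOWN_FIELDS:
            issues.append((number, f"unrecognized field: {field.strip()}"))
        # A second directive name inside the value suggests collapsed lines.
        lowered = value.lower()
        if any(f"{name}:" in lowered for name in KNOWN_FIELDS):
            issues.append((number, "several directives on one line"))
    return issues
```

Run against the sample file from earlier, this returns an empty list. Run against the same rules collapsed onto one line, it reports line 1 as holding several directives.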

Then test representative paths instead of only reviewing the file. Check that /tools, /blog, /docs, /facts, and /zh are not blocked if they are meant to be public. Also check that /admin, /api, and sign-in paths are not being promoted through the sitemap or internal marketing links.
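The sitemap half of that check can be scripted as a rough cross-reference: take the URLs listed in the sitemap and flag any whose path falls under a disallowed prefix. In this sketch the prefixes come from the sample file, while blocked_sitemap_urls and the URLs in the usage example are made up. It looks only at Disallow prefixes and ignores any Allow overrides.

```python
from urllib.parse import urlparse

# Disallow prefixes from the sample file; Allow overrides are not considered.
DISALLOWED_PREFIXES = ["/admin", "/sign-in", "/api"]


def blocked_sitemap_urls(urls, prefixes=DISALLOWED_PREFIXES):
    """Return sitemap URLs whose path starts with a disallowed prefix."""
    flagged = []
    for url in urls:
        path = urlparse(url).path or "/"
        if any(path.startswith(prefix) for prefix in prefixes):
            flagged.append(url)
    return flagged


# Hypothetical sitemap entries for illustration.
print(blocked_sitemap_urls([
    "https://example.com/tools",
    "https://example.com/admin/reports",
    "https://example.com/zh",
]))
```

Any URL this returns is being advertised in the sitemap while also being blocked from crawling, which is usually a sign that one of the two files is wrong.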

Limits

Robots.txt guides compliant crawlers. It does not remove indexed URLs by itself, protect private data, or replace access control.

Next steps

Robots.txt Generator — draft crawl rules and sitemap directives
Meta Robots Tag Generator — prepare page-level indexing directives
Sitemap URL Checker — review sitemap URLs before submission
Canonical URL Generator — confirm preferred URLs for indexable pages
SEO Publishing Workflow — verify robots, sitemap, canonical, metadata, and hreflang together before launch

Final practical note

Before deploying robots.txt, check it as plain text. Each directive should be on its own line, and the Sitemap line should appear in the raw response without relying on JavaScript.
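As a rough sketch of that last check, the function below inspects a response you have already fetched, using its Content-Type header and raw body. It assumes the server is expected to send text/plain. The response_problems name is hypothetical, and fetching is left to whatever HTTP client you already use.

```python
def response_problems(content_type, body):
    """Check a fetched robots.txt response; both arguments are strings."""
    problems = []
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"served as {media_type or 'unknown type'}")
    lowered = body.lower()
    if "<html" in lowered or "<script" in lowered:
        problems.append("body contains HTML markup")
    has_sitemap = any(
        line.strip().lower().startswith("sitemap:") for line in body.splitlines()
    )
    if not has_sitemap:
        problems.append("no Sitemap line in the raw body")
    return problems
```

An HTML error page or app shell served at /robots.txt would typically trip all three checks, which is the case this is meant to catch.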