Find and fix indexability issues preventing search engines from crawling and indexing your important pages.
Search engines can't rank pages they can't find. Indexability issues are among the most critical SEO problems because they prevent your content from appearing in search results entirely. NitroShock's Site Audit feature helps you identify and resolve these barriers, ensuring search engines can discover, crawl, and index your most important pages.
Indexability refers to a search engine's ability to analyze and add your web pages to its search index. When a page is indexable, it can appear in search results. When it's not, it's invisible to searchers - no matter how well-optimized your content might be.
Search engines like Google use crawlers (also called bots or spiders) to discover pages on your site. These crawlers follow links, read your content, and add qualifying pages to the search index. However, multiple factors can block this process, either intentionally or accidentally.
Blocked crawling occurs when your site tells search engines not to access certain pages. This might be intentional for admin areas or checkout pages, but accidental blocks on important content create serious SEO issues.
Noindex directives explicitly tell search engines not to include pages in their index. Like crawl blocks, these serve legitimate purposes but cause problems when misapplied to valuable content.
Orphaned pages exist on your site but have no internal links pointing to them. Search engines typically discover pages by following links, so orphaned content remains hidden even if technically indexable.
Canonicalization problems occur when you tell search engines that a different URL is the preferred version of a page. Incorrect canonical tags can accidentally deindex entire sections of your site.
Sitemap omissions mean important pages aren't listed in your XML sitemap, reducing the likelihood that search engines will discover and index them promptly.
NitroShock's Site Audit crawls your website and identifies these issues across all your pages. To run an audit, navigate to your project dashboard, select the Site Audit tab, and choose whether to audit a single page or your entire site. The audit uses credits based on the number of pages checked - you'll see the exact cost before confirming.
The robots.txt file sits in your website's root directory and provides crawling instructions to search engine bots. It's the first place crawlers check when visiting your site, making it a critical control point for indexability.
When a search engine crawler visits your site, it requests https://yourdomain.com/robots.txt before accessing any other pages. The file contains directives that specify which bots can crawl which parts of your site.
A basic robots.txt file looks like this:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml
The User-agent line specifies which crawlers the rules apply to. The asterisk (*) means all crawlers. You can also target specific bots like Googlebot or Bingbot.
The Disallow directive tells crawlers not to access specified paths. In the example above, WordPress admin and core files are blocked - a common and appropriate configuration.
The Allow directive creates exceptions to Disallow rules. Here, admin-ajax.php is explicitly allowed despite being in the blocked /wp-admin/ directory, because many WordPress features require it.
The Sitemap directive tells crawlers where to find your XML sitemap, helping them discover your content more efficiently.
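For example, a hypothetical configuration that gives Googlebot its own rules might look like this (the /internal-search/ and /beta/ paths are placeholders, not WordPress defaults):

User-agent: *
Disallow: /internal-search/

User-agent: Googlebot
Disallow: /internal-search/
Disallow: /beta/

Sitemap: https://yourdomain.com/sitemap.xml

A crawler follows only the most specific group that matches it, so Googlebot would ignore the * group here - any shared rules need to be repeated inside its own group.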
Blocking important content is the most serious error. A misplaced Disallow rule can prevent search engines from accessing your entire blog, product pages, or other critical sections. This often happens when developers copy robots.txt files from staging sites that intentionally block all crawling.
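The classic example is a staging file that blocks everything:

User-agent: *
Disallow: /

If these two lines reach production, search engines stop crawling the entire site.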
NitroShock's Site Audit flags pages that are blocked by robots.txt, showing you exactly which rules are preventing access. Look for these warnings in the SEO category of your audit results.
Blocking CSS and JavaScript files was once common practice, but modern search engines need these resources to properly render and understand your pages. Google explicitly recommends allowing crawler access to CSS, JavaScript, and image files.
Using robots.txt for sensitive content is ineffective because the file is publicly viewable. Anyone can read your robots.txt file, potentially discovering the locations of pages you're trying to hide. Use password protection or authentication for truly sensitive content instead.
Before deploying changes to your robots.txt file, test them to avoid accidentally blocking important pages. Google Search Console and Bing Webmaster Tools both provide tools for reviewing your robots.txt file and confirming whether specific URLs are blocked by your current rules.
For WordPress sites, several plugins help you manage robots.txt safely, adding interfaces that prevent syntax errors and provide warnings before blocking large sections of your site.
While robots.txt controls crawling, meta robots tags control indexing. These HTML tags tell search engines whether to include a page in their index and whether to follow links on the page.
Meta robots tags appear in the <head> section of your HTML:
<meta name="robots" content="noindex, nofollow">
The content attribute accepts several values:
index/noindex controls whether the page should appear in search results. index allows indexing (the default), while noindex prevents it.
follow/nofollow controls whether crawlers should follow links on the page. follow allows link following (the default), while nofollow tells crawlers to ignore the page's links for ranking purposes.
noarchive prevents search engines from storing a cached copy of the page.
nosnippet prevents search engines from showing a text snippet or video preview in search results.
max-snippet:[number] limits the length of text snippets in search results.
max-image-preview:[size] controls the maximum size of image previews.
max-video-preview:[number] sets the maximum video preview length in seconds.
You can combine multiple directives by separating them with commas, such as noindex, nofollow, or use the shorthand none, which is equivalent to noindex, nofollow.
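For instance, a hypothetical page that should be indexed but shown with a limited snippet and large image previews could combine directives like this:

<meta name="robots" content="index, follow, max-snippet:100, max-image-preview:large">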
Thin or duplicate content pages that provide little value should be noindexed to prevent them from diluting your site's overall quality in search engine assessments. This includes tag archives with just a few posts, empty category pages, or thank-you pages with minimal content.
Intermediate pages in conversion funnels don't need to rank in search results. Checkout pages, registration steps, and confirmation screens should typically be noindexed because users shouldn't enter your site at these points.
Internal search results and filtered pages create near-infinite variations of similar content. WordPress sites often generate these automatically, and most should be noindexed to prevent crawl waste and duplicate content issues.
Private or restricted content that logged-in users can access but shouldn't appear in public search results needs noindex tags.
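A common pattern for these page types is to block indexing while still letting crawlers follow the page's links, so crawl paths to linked content stay intact:

<meta name="robots" content="noindex, follow">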
NitroShock's Site Audit identifies pages with noindex tags in the SEO category. Pay special attention to these issues on pages that you expect to rank, such as product pages, key landing pages, and cornerstone blog posts.
To review noindex issues, run a site audit from the Site Audit tab in your project dashboard. The audit uses credits per page analyzed, with the exact cost displayed before you confirm. Filter results by the SEO category to focus on indexability issues.
Meta robots directives can also be sent as HTTP headers instead of HTML tags. The X-Robots-Tag header provides identical functionality and appears in the HTTP response:
X-Robots-Tag: noindex, nofollow
This approach is particularly useful for non-HTML files like PDFs or images, which can't contain HTML meta tags. WordPress typically doesn't use X-Robots-Tag headers by default, but some SEO plugins offer this functionality.
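As an illustration, on an Apache server with mod_headers enabled, a rule like this (adjust the file pattern to your needs) would send the header for every PDF:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>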
NitroShock's Site Audit checks both meta robots tags and X-Robots-Tag headers, ensuring you catch indexability issues regardless of implementation method.
XML sitemaps provide search engines with a complete list of your important pages, along with metadata about each URL. While search engines can discover pages by following links, sitemaps ensure they find your content quickly and understand your site's structure.
A basic XML sitemap looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://yourdomain.com/page-url/</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
The loc element contains the full URL of the page. This is the only required element for each URL entry.
The lastmod element indicates when the page was last modified. Search engines use this to prioritize crawling recently updated content.
The changefreq element suggests how often the page changes (always, hourly, daily, weekly, monthly, yearly, never). Search engines treat this as a hint rather than a directive.
The priority element indicates the relative importance of this page compared to other pages on your site, using values from 0.0 to 1.0. Note that this only affects internal prioritization - it doesn't influence how your pages rank against other sites.
Include only indexable pages in your sitemap. Don't list URLs that have noindex tags, redirect to other pages, or are blocked by robots.txt. Including non-indexable URLs in your sitemap sends conflicting signals to search engines and wastes crawl budget.
Keep sitemaps under 50MB and 50,000 URLs. These are search engine limits for individual sitemap files. Larger sites should use sitemap index files that reference multiple sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://yourdomain.com/sitemap-posts.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://yourdomain.com/sitemap-pages.xml</loc>
<lastmod>2024-01-10</lastmod>
</sitemap>
</sitemapindex>
Update sitemaps automatically when content changes. Most WordPress SEO plugins generate sitemaps dynamically, ensuring they always reflect your current content without manual updates.
Submit sitemaps to search engines through their respective webmaster tools (Google Search Console, Bing Webmaster Tools). You can also reference your sitemap in your robots.txt file using the Sitemap directive.
Sitemap-indexability conflicts occur when your sitemap includes pages with noindex tags, canonical tags pointing elsewhere, or pages blocked by robots.txt. NitroShock's Site Audit identifies these conflicts, which confuse search engines about whether you want pages indexed.
Missing important pages in your sitemap means search engines might not discover that content promptly. Compare your sitemap against your site's navigation to ensure key pages are included.
Outdated lastmod dates can cause search engines to skip crawling pages they believe haven't changed. Ensure your CMS or SEO plugin updates these dates when content is actually modified.
HTTP/HTTPS inconsistencies create problems when your sitemap lists pages with the wrong protocol. Your sitemap should use the same protocol (HTTPS) as your live site.
You can access your WordPress sitemap at https://yourdomain.com/sitemap.xml (the exact URL depends on your SEO plugin). Check that it loads without errors, lists your important pages, excludes noindexed or redirected URLs, and uses the HTTPS URLs of your live site.
NitroShock's Site Audit cross-references your sitemap against your actual pages, identifying conflicts and omissions that could harm your indexability.
Canonical tags tell search engines which version of a page is the primary one when duplicate or very similar content exists at multiple URLs. These tags solve a common problem: your content might be accessible at different addresses, but you only want one version to rank.
The canonical tag appears in the <head> section of your HTML:
<link rel="canonical" href="https://yourdomain.com/preferred-url/" />
This tag tells search engines "treat this page as a duplicate of the URL specified in the href attribute." Search engines then consolidate ranking signals (links, content metrics, etc.) to the canonical URL.
Parameter variations create different URLs for the same content. For example:
https://yourdomain.com/blog/post-title/
https://yourdomain.com/blog/post-title/?ref=twitter
https://yourdomain.com/blog/post-title/?utm_source=newsletter
All three URLs serve the same content, but the tracking parameters create distinct addresses. The canonical tag on all three should point to the clean URL without parameters.
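For example, the page served at the newsletter URL above would carry a canonical tag pointing back to the clean address:

<!-- On https://yourdomain.com/blog/post-title/?utm_source=newsletter -->
<link rel="canonical" href="https://yourdomain.com/blog/post-title/" />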
Pagination splits content across multiple pages. Paginated archives like /blog/page/2/ should typically have self-referencing canonical tags pointing to themselves, not to page 1. Each page in the sequence contains unique content and should be indexed separately.
Print versions and alternate formats of the same content need canonical tags pointing to the main version. If you offer a print-friendly version at /blog/post-title/print/, its canonical tag should point to /blog/post-title/.
HTTP vs HTTPS versions of pages should canonicalize to your preferred protocol. All pages should have canonical tags pointing to the HTTPS version if you've migrated to secure hosting.
WWW vs non-WWW domains need consistent canonicalization. Choose one version as your preferred domain and ensure all canonical tags use it consistently.
Missing canonical tags leave search engines to determine which version of near-duplicate content is primary. Search engines usually make reasonable choices, but explicit canonicals ensure your preference is clear.
Conflicting canonical tags occur when a page's canonical tag points to one URL, but your sitemap, internal links, or other signals suggest a different URL is primary. These conflicts confuse search engines and dilute ranking signals.
Incorrect canonical targets that point to non-existent pages, redirected URLs, or noindexed pages create serious indexability issues. If a page's canonical URL isn't indexable, search engines won't index the page pointing to it either.
Canonical chains happen when page A canonicalizes to page B, which canonicalizes to page C. Search engines may not follow these chains, causing indexability problems. Each page should canonicalize directly to the final destination.
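For example, if an old post canonicalizes to an updated post that in turn canonicalizes to a consolidated guide (hypothetical URLs), the old post should point straight at the final destination:

<!-- On https://yourdomain.com/old-post/ -->
<link rel="canonical" href="https://yourdomain.com/consolidated-guide/" />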
Cross-domain canonicals point to URLs on different domains and tell search engines to treat your content as a duplicate of the other site's content. This is occasionally necessary for legitimate syndication but is often a mistake that deindexes your own content in favor of another site.
NitroShock's Site Audit checks every page's canonical tag and flags the problems described above, including missing canonicals, conflicting signals, canonicals that point to broken, redirected, or noindexed URLs, canonical chains, and unintended cross-domain canonicals.
To review canonical tag issues, run a site audit from your project's Site Audit tab and filter results by the SEO category. The audit uses credits per page analyzed, with costs shown before you confirm.
Every page should have a canonical tag, even if there are no duplicates. Best practice is to use self-referencing canonicals that point to the page's own URL:
<!-- On https://yourdomain.com/about/ -->
<link rel="canonical" href="https://yourdomain.com/about/" />
Self-referencing canonicals explicitly state "this is the primary version of this content," preventing search engines from guessing or choosing an alternate URL you didn't intend.
Most WordPress SEO plugins add self-referencing canonicals automatically, but you should verify this in your site's source code and through NitroShock's Site Audit results.
How often should I run site audits to check indexability?
Run a complete site audit monthly or after significant site changes like theme updates, plugin installations, or major content additions. For active sites that publish new content frequently, consider auditing more often so indexability problems are caught quickly.