Some of us are concerned about the proper way to index every page of our blog, avoiding duplication and keeping certain posts out of the index. By controlling how Googlebot crawls our blog, we can make sure every page is indexed properly in Google. Crawling is the process by which Googlebot (Google's search engine crawler) follows URLs on our blog to reach our content so that Google can refresh and index it. Keep in mind that Googlebot can crawl up to 10 MB per page, so if one of your pages is larger than 10 MB, make sure the important content sits within the portion that gets crawled.
I hope most of us are aware that Google takes action against duplicate content, but did you know that duplication can happen inside our own blog? Although label and archive pages are very useful to our audience, they can still cause duplicate content issues; we can avoid this by controlling Googlebot's access to those particular pages. On this page we will talk about robots.txt and meta robots tags and how to use them to control Googlebot.
Robots.txt
Robots.txt is used to restrict Googlebot from crawling particular pages on our blog; however, it does not tell Google not to index those pages. Sometimes robots.txt is not respected by Google, especially when we try to restrict Googlebot from crawling a huge number of URLs on our blog, such as every page with /search after the domain name. Search engines sometimes assume we made a mistake in disallowing access to that content, especially when many URLs and people are pointing and linking to it. Robots.txt is a good choice when we want to save crawl bandwidth. Bandwidth is not free; Google pays money to crawl and index our blog.
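To see how a Disallow rule like the /search example above behaves, you can test it locally with Python's standard-library robots.txt parser. The domain and URLs below are placeholders, not real pages; this is just a sketch of how a crawler interprets the rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking the duplicate-prone /search pages,
# as discussed above. All other URLs stay crawlable.
rules = """
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Label and archive listings live under /search, so they are blocked...
print(parser.can_fetch("Googlebot", "http://example.com/search/label/SEO"))  # False
# ...but an individual post URL is still allowed.
print(parser.can_fetch("Googlebot", "http://example.com/2014/01/my-post.html"))  # True
```

Note that `can_fetch` only answers the crawling question; as the section above explains, a blocked URL can still end up indexed if other pages link to it.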
Meta Robots
Unlike robots.txt, meta robots tags are respected by search engine crawlers. A meta robots tag is placed in the head of an individual page; by using it we control a single page of our blog, telling Google whether or not to index it. Keep in mind that a meta robots tag does not restrict search engine crawlers from crawling our blog; in other words, we allow them to visit our pages, but they are restricted from indexing them.
Remove a Specific Page from Being Indexed
Let's say you want to remove one of your pages from Google's index. To completely remove it from Google, make sure the page can be crawled, and then add a meta robots tag with noindex in the head of the page. This is better than disallowing it in robots.txt; as I said earlier, disallowing a URL in robots.txt only restricts the crawler from crawling it.
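You can verify that a page actually carries the noindex directive by inspecting its head. Here is a small sketch using Python's standard-library HTML parser; the sample HTML string is illustrative, not a real page fetch.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name='robots'> tag found in the markup."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip().lower() for d in attrs.get("content", "").split(","))

# Hypothetical page head carrying the noindex tag described above.
html = "<head><meta content='noindex,follow' name='robots'/></head>"
parser = RobotsMetaParser()
parser.feed(html)
print("noindex" in parser.directives)  # True
```

In practice you would feed the parser the live HTML of the page you want removed and confirm that "noindex" shows up among the directives.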
To set the robots.txt on your blog, just go to Blogger Dashboard ->> Settings ->> Search preferences ->> Custom robots.txt.
User-Agent: *
Disallow:
The code above simply says that all user agents are allowed to crawl every page on our blog; an empty Disallow rule blocks nothing.
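If instead you want to block the duplicate-prone /search pages mentioned earlier while keeping everything else crawlable, a common custom robots.txt for a Blogger blog looks like the sketch below. The domain and sitemap URL are placeholders; substitute your own blog address.

```
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://example.com/sitemap.xml
```

The Sitemap line is optional but helps Googlebot discover your post URLs directly, rather than through the blocked /search listings.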
After setting your robots.txt, you can prevent search engine crawlers from indexing a page by using a meta robots tag. To set the meta robots tag, go to Blogger Dashboard ->> Template ->> Edit HTML and, just below the <head> tag, put the code provided below.
<b:if cond='data:blog.canonicalUrl == "http://www.internetsmash.com/p/about-us.html"'>
<meta content='noindex,follow' name='robots'/>
</b:if>
The code simply says that if the page's URL is http://www.internetsmash.com/p/about-us.html, then it should not be indexed on Google.
How to Index Posts and Static Pages Only
There are several ways to do this. We can edit our template as shown above, using the code provided below. The code simply says that if the current page is not a post, a static page, or the homepage, then it should not be indexed on Google; otherwise Googlebot should index it.
<b:if cond='data:blog.pageType != "item" and data:blog.pageType != "static_page" and data:blog.url != data:blog.homepageUrl'>
<meta content='noindex,nofollow' name='robots'/>
<b:else/>
<meta content='index,follow' name='robots'/>
</b:if>
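The template conditional above can be sketched as plain Python to make the logic easier to follow. The function and parameter names here (robots_directive, page_type, and so on) are illustrative, not part of Blogger's template language.

```python
# Mirror of the b:if condition: given Blogger's page type and the page URL,
# return the robots meta directive the template would emit.
def robots_directive(page_type, url, homepage_url):
    # Posts ("item"), static pages, and the homepage get indexed;
    # everything else (labels, archives, search pages) gets noindex.
    if page_type != "item" and page_type != "static_page" and url != homepage_url:
        return "noindex,nofollow"
    return "index,follow"

# An archive listing is excluded from the index...
print(robots_directive("archive", "http://example.com/2014_01_archive.html",
                       "http://example.com/"))  # noindex,nofollow
# ...while an individual post is indexed.
print(robots_directive("item", "http://example.com/2014/01/my-post.html",
                       "http://example.com/"))  # index,follow
```

This makes the intent explicit: only the three page kinds you care about reach searchers, and duplicate-prone listing pages stay out of the index while remaining crawlable.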