How to deal with index bloat
Marcus says: "Dealing with index bloat is very actionable. A preventive Panda diet is a common recommendation for websites that have gained a lot of fat throughout the years and are now struggling to increase visibility in Google. A large number of irrelevant pages is the equivalent of empty yoghurt containers being kept and stored because you might still need them in the future. Don't get carried away with this kind of digital compulsive hoarding. Instead, keep your website fresh, tidy, and clean.
You should always ask yourself three simple questions for all your pages: Do I need this page for my users? Does this page also need to be indexed by Google? And if yes, what should this page ideally rank for? Of course, you don't need to delete pages that are not relevant for ranking purposes but still have value for your users. Users can still navigate to these pages; just be more rigorous about what gets indexed.
For example, you might have a faceted navigation for an online shop for shoes, and the targeted keyword is 'Adidas Superstar'. These trainers are available in ten different colours and ten different sizes, so an interested user can navigate to any variation of these 100 different pages to put them in their virtual shopping basket. However, you don't necessarily want to let those 100 product variations get indexed - only the ones which are explicitly being searched for.
You might end up with a catch-all 'Adidas Superstar' page displaying the trainers in the best-selling colour and size combination. If there's a sufficient number of Google searches for a specific combination, such as 'black Adidas Superstars', it also makes sense to grant this option its very own indexable page. This way, you can ensure a good ratio of relevant to irrelevant pages, reduce the possibility of cannibalization, and therefore stay safe from Google's Panda.
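A minimal sketch of this rule, assuming you already have monthly search volumes per query from a keyword tool. The function name `plan_indexing`, the threshold of 500 searches, and the sample volumes are hypothetical, and how a variant is actually kept out of the index is left open.

```python
from itertools import product


def plan_indexing(product_name, colours, sizes, monthly_searches, threshold=500):
    """Map each candidate page (by target query) to True if it should be indexable."""
    # The catch-all product page is always indexable.
    plan = {product_name: True}

    # A colour-level page earns its own indexable URL only with enough demand.
    for colour in colours:
        query = f"{colour} {product_name}"
        plan[query] = monthly_searches.get(query, 0) >= threshold

    # Full colour/size combinations follow the same rule.
    for colour, size in product(colours, sizes):
        query = f"{colour} {product_name} size {size}"
        plan[query] = monthly_searches.get(query, 0) >= threshold

    return plan


# Hypothetical search volumes: only one colour variant is searched for.
volumes = {"black Adidas Superstar": 1200}
plan = plan_indexing("Adidas Superstar", ["black", "white"], [42, 43], volumes)
indexable = sorted(query for query, keep in plan.items() if keep)
print(indexable)
```

Users can still reach every variant through the faceted navigation; the plan only decides which of them are offered to Google.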
If your website is already full of these empty yoghurt containers, it's advisable to preventively take out the trash, ideally using the status code 410 Gone instead of the more common 404. Google revisits 404 pages a couple more times, while with a 410 status code it will only revisit the page one more time to make sure it is really gone. Of course, you will also need to remove any internal links and/or references to these 410 pages.
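As a rough sketch of both steps, assuming you keep a list of deliberately retired paths and a map of internal links from a crawl: the helper names `status_for` and `links_to_retired` and all sample paths are hypothetical.

```python
from http import HTTPStatus


def status_for(path, live_paths, retired_paths):
    """Pick the HTTP status for a request path."""
    if path in retired_paths:
        return HTTPStatus.GONE       # 410: removed on purpose
    if path in live_paths:
        return HTTPStatus.OK         # 200: page still exists
    return HTTPStatus.NOT_FOUND      # 404: unknown path


def links_to_retired(internal_links, retired_paths):
    """Return {source page: [retired targets it still links to]}."""
    leftovers = {}
    for source, targets in internal_links.items():
        if source in retired_paths:
            continue  # the retired page itself is going away
        dead = sorted(t for t in targets if t in retired_paths)
        if dead:
            leftovers[source] = dead
    return leftovers


live = {"/", "/shoes/adidas-superstar"}
retired = {"/tag/misc"}
links = {"/": ["/shoes/adidas-superstar", "/tag/misc"], "/tag/misc": ["/"]}

print(int(status_for("/tag/misc", live, retired)))
print(links_to_retired(links, retired))
```

The second function lists the pages that still need editing before the clean-up is complete.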
If you have amassed loads of these empty yoghurt containers, which now return 410, and you want to get them out of the index as fast as possible, you can consider using a negative sitemap, which contains only pages that return 410. This way, you can remove large numbers of irrelevant pages from Google's index very effectively."
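One way such a negative sitemap could be generated, as a sketch: it uses the standard sitemap XML format and Python's built-in XML library. The function name `build_negative_sitemap` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_negative_sitemap(gone_urls):
    """Return sitemap XML listing only URLs that now answer with 410."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # De-duplicate and sort so the output is stable between runs.
    for url in sorted(set(gone_urls)):
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_negative_sitemap([
    "https://example.com/tag/misc",
    "https://example.com/shoes/old-model",
])
print(xml_text)
```

The sitemap protocol caps a single file at 50,000 URLs, so a very large clean-up would need to be split across several files.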
How does an SEO find these empty yoghurt containers on their website and define what is useful for users, useful for Google, and a good page to rank for?
"Start by taking Google Search Console data and filter for all pages with clicks=0 within the 12 months timeframe. Then you take this list to Google Analytics and see if those pages are getting traffic besides from organic search. Ideally, as a third step, you should also compare this list with Googlebot behaviour using LockFile data to see if Google is still regularly crawling these pages. Since the scheduler which sends off Googlebot to fetch pages prioritises by importance, you can get a good sense of whether Google is finding any of these pages still important - and therefore relevant. If any of these pages receive no organic traffic, or any other natural or paid traffic, and if Google isn't even crawling these anymore, you can just get rid of them to thin out your website easily."
What is a good ratio of useful pages to these irrelevant pages?
"This always depends on your website. There isn't an ideal ratio you should aim for. If something's relevant - it's relevant. It's about sending out everything which is not relevant. Just let the data talk."
How do you define what a user is likely to like? Is it based upon user behaviour, or do you have to have a group of people analyse that page to ensure it's a relevant and appropriate part of the buying cycle?
"What you're referring to is the long click, and this is Google's main goal. They want somebody to click through to your page and stay on your site - not just be a one-hit-wonder on one page. They want you to surf through to your other pages on your site and not go back to the SERP.
A search completion tells Google that the page really fulfilled the intent. But you've got to start way earlier, with the pages that Google is already promoting in the top ten. To see if users find a page relevant, and before you can even evaluate the long click behaviour, you need to understand if they are clicking through to your site - this is the first sign of relevance. You might think you have the perfect page, and you're getting into the top ten, but if nobody's clicking through to your site, it is a clear indication there might be a different intent here. Your content may be great but it's just not great content for this intent.
You need to be thinking of how you can make the best possible page to fulfil this intent. The most important aspect of modern SEO is not just aspiring to rank in the top ten but being the best possible result for that specific query."
What is the most common cause of index bloat?
"The worst is always online shops. They have all these products going in and out of stock, multiple categories, and tag pages. Just using an out-of-the-box CMS might create index bloat problems by introducing categories, as well as tags. And this can happen even to a blog or any site."
You mentioned having a negative sitemap for 410s, but also removing all links to these pages. If you're struggling to remove so many different links from your website, is it enough to redirect the links to something else?
"If I have an out-of-stock product, the standard recommendation is to redirect users to the corresponding category, so they still have an inventory to choose from. I don't think this will help you in the long run because it's just not as relevant as other online shops with the product in stock. It really won't help you with ranking per se. Instead, I'd just get rid of it.
This is exactly the empty yoghurt container problem. You're keeping something because you might need it in the future. It's better to have a crisp, clean, and compact site. The negative sitemap speeds everything up and gives Google an indication that this is everything you want out. If you keep these links, Google might be inclined to go to these pages, so I'd remove all references."
If you're training other marketing professionals on the value of dealing with index bloat, would you typically use your empty yoghurt containers analogy?
"I'm always explaining it this way. We tend to keep a lot of stuff on our websites because we think we might still need it. I think SEOs have created this problem. Fifteen years ago, if somebody came to us and said, my site is 1000 pages, what should I do? A lot of SEOs (me included) would have recommended making it 100,000 pages for every possible keyword combination. That's what you needed back then because Google wasn't that good.
Now that Google is much smarter, you don't need this anymore. You are also bleeding your link juice because you've distributed it across so many pages. These days, it's much more advisable to have a more compact structure. Also, Google had a different objective back then. If you remember, Google was always promoting how many billion pages it had indexed because it was in a rat race with Microsoft and Yahoo.
It's not about the quantity anymore. With social media, they simply can't keep up with the volume of URLs created every day. So now the focus is on quality - and this is why Panda came out."
Suppose an SEO is involved in the original design of a website. Is it better for blog posts and product pages just to have a single category and for any tags associated with each of those pages to be completely non-indexable by Google?
"I would always opt for one indexable structure with categories, but then I would opt out of tags or a structure by dates, such as months. If you also use these things to maximise the internal links to all of your pages, you basically end up linking all over the place, and you don't direct Google to what's really important for you.
As an SEO, I always want to be in control as much as possible. I don't want Google to have to do lots of heavy lifting. I want a structure that tells them what I'm good at and where I provide the most value to the users."
What's one thing an SEO needs to stop doing to spend more time focusing on index bloat?
"Focusing a lot of the time on link building - especially if you are already a big brand with lots of links, and Google already likes you. Any new links are unlikely to have maximum benefit if you have a sub-optimal structure. You don't build a house in the swamp - you need a solid foundation. So always fix the structure before spending a lot of money and time on acquiring new links. With a great structure, these links will have so much more power."
You can find Marcus Tandler over at Ryte.com.