We tried many tools and settled on the Screaming Frog SEO Spider for our pre-move assessments. We ponied up for the Pro version, and removing the 500 URL limit makes it well worth the cost. Let’s look at how to use the SEO Spider before a migration project begins.
First, launch the SEO Spider, crawl your site, and then export your results as a .csv or Excel spreadsheet.
Second, in Excel, click on the Data tab and then click on Filter. You will now see a little drop-down menu for each column. Here is where it gets specific and powerful.
If you are moving from Drupal to WordPress:
First, recognize that HubSpot and Blogger (like TypePad) are essentially hosted, closed systems, whereas Drupal (like Movable Type or Joomla) will let you customize and convolute all you like. Trust me, some people do just that, which makes those projects harder to export and migrate.
I like to expand the column width of Column A (Address) so that I can read most URLs before I begin.
- Click on the Content Filter drop down, in Column B. It is usually found on Row 2.
- Click on Select All to uncheck all boxes.
- Select the boxes that start with text/html; charset=UTF-8. Generally, the other two options visible relate to XML feeds.
- Click OK.
With the content filtered to just text/html, sort the Address column alphabetically. Now just read the URLs as you scroll down and look for patterns.
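If you would rather script this step than click through Excel, here is a minimal sketch in Python. It assumes the export has columns named Address and Content, as described above; your export may label them differently or carry an extra title row, so treat the column names and the sample rows as placeholders.

```python
import csv
import io

def html_addresses(rows, address_col="Address", content_col="Content"):
    """Return the sorted addresses whose content type starts with text/html."""
    return sorted(
        row[address_col]
        for row in rows
        # Missing or empty content types are treated as "not HTML".
        if (row.get(content_col) or "").lower().startswith("text/html")
    )

# Tiny inline sample standing in for a crawl export (hypothetical data).
sample = io.StringIO(
    "Address,Content\n"
    "http://example.com/services/,text/html; charset=UTF-8\n"
    "http://example.com/feed/,text/xml; charset=UTF-8\n"
    "http://example.com/about/,text/html; charset=UTF-8\n"
)

pages = html_addresses(csv.DictReader(sample))
for url in pages:
    print(url)
```

To run it against a real export, you would pass `csv.DictReader(open(path, newline=""))` instead of the inline sample.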
Let’s find out how many pages are on the site.
In the example that I am reviewing, the developers used a unique /category-name/ for each section of the website. Since there are more than a dozen of these, it will be easier to exclude the blog than to include all the category names and their pages.
Now here is an interesting situation. This particular domain has not implemented a canonical URL. This means that the SEO Spider is listing both the www and the non-www version of every page and post. We will have to filter down to just the www version so that the counts are not doubled up. It’s never simple.
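A quick way to confirm this kind of doubling is to group URLs that differ only by a leading www. The sketch below is one possible check, not something the SEO Spider produces for you; the function name and sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit

def host_variant_duplicates(addresses):
    """Map (bare host, path, query) to the set of hosts it was crawled under,
    keeping only entries seen under more than one host."""
    seen = {}
    for url in addresses:
        parts = urlsplit(url)
        host = parts.netloc.lower()
        # Strip a leading "www." so both variants share one key.
        bare = host[4:] if host.startswith("www.") else host
        key = (bare, parts.path, parts.query)
        seen.setdefault(key, set()).add(host)
    return {key: hosts for key, hosts in seen.items() if len(hosts) > 1}

# Hypothetical crawl addresses.
addresses = [
    "http://www.example.com/about/",
    "http://example.com/about/",
    "http://www.example.com/contact/",
]
duplicates = host_variant_duplicates(addresses)
for (host, path, query), hosts in duplicates.items():
    print(path, sorted(hosts))
```

If this returns anything, the raw row counts in the spreadsheet are inflated and need the filter described next.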
- Click on the Address Filter drop down, in Column A. It is usually found on Row 2.
- Click on Text Filters and then choose Custom Filter.
- In the first drop down, select “contains” and type in www.
- In the second drop down, leave And selected, select “does not contain” and type in /blog/.
- Click OK.
Now we can see a much shorter list of just the pages of the website. No images, blog posts, archives and so on. Scrolling down the list, I still see all the categories and I recognize the page names.
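The same two-condition filter can be sketched in a few lines of Python. This is only an approximation of the Excel custom filter: it matches plain substrings, ignoring case, and the function name and sample URLs are assumptions for illustration.

```python
def filter_addresses(addresses, contains="www", excludes="/blog/"):
    """Keep URLs that contain one substring and lack another (case-insensitive)."""
    kept = []
    for url in addresses:
        lowered = url.lower()
        if contains.lower() in lowered and excludes.lower() not in lowered:
            kept.append(url)
    return kept

# Hypothetical addresses standing in for the HTML-only rows of the export.
addresses = [
    "http://www.example.com/services/",
    "http://example.com/services/",
    "http://www.example.com/blog/hello-world/",
    "http://www.example.com/about/",
]
pages = filter_addresses(addresses)
print(len(pages), "pages")
```

As in Excel, a plain "www" match will also keep any URL that happens to contain those letters elsewhere in its path, so it is worth scanning the result.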
You can certainly modify the filter to count only blog posts or images as well. A tool this powerful only gives useful results if you pay close attention to what each filter includes and excludes.
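For example, here is a rough sketch that tallies pages, blog posts and images in one pass. The bucketing rules (a /blog/ path means a blog post, an image/ content type means an image) are assumptions based on the site in this example, and the rows are invented.

```python
from collections import Counter

def classify(address, content_type):
    """Bucket one crawl row; the rules here are assumptions for illustration."""
    content_type = (content_type or "").lower()
    if content_type.startswith("image/"):
        return "image"
    if content_type.startswith("text/html"):
        return "blog post" if "/blog/" in address.lower() else "page"
    return "other"

# Hypothetical (address, content type) pairs from a crawl export.
rows = [
    ("http://www.example.com/about/", "text/html; charset=UTF-8"),
    ("http://www.example.com/blog/hello-world/", "text/html; charset=UTF-8"),
    ("http://www.example.com/images/logo.png", "image/png"),
    ("http://www.example.com/feed/", "text/xml; charset=UTF-8"),
]
counts = Counter(classify(address, content_type) for address, content_type in rows)
for kind, total in counts.items():
    print(kind, total)
```

Note that this counts anything under /blog/ as a post, including archive or index pages, so adjust the rule to match the URL patterns you found while scrolling.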
Always expect surprises when working with websites. These same principles apply to straight HTML sites, hosted solutions like HubSpot, Blogger, Active Rain and TypePad as well as Open Source solutions such as Drupal, Movable Type or Joomla.
This post is the third in a series. Each post has unique information that is relevant no matter what CMS you are working with.
The first in the series is about moving from HubSpot to WordPress and has a nice tip regarding robots.txt and the SEO Spider.
The second in the series focuses on moving from Blogger to WordPress and illustrates counting blog posts.