Make Sure Your Sitemap Is Using Production Data

I’m building a site and have added a sitemap to help the Googles find the pages and give them an idea of when the pages need to be rescanned. I use the excellent sitemap_generator gem, and it works really well.

My typical workflow during development is to run the sitemap generator on my development machine with the sitemap host set to the production server, then check the generated file in and push it to the production server.

This works great to get something up and running, but very quickly your production data will stop matching your development data, particularly the updated_at values that tell crawlers when a page last changed, and the technique starts to break down.
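For reference, a minimal config/sitemap.rb for this kind of workflow might look like the sketch below. The host name, the Post model, and the post_path route helper are hypothetical stand-ins. The point is that lastmod comes straight from updated_at in whatever database the task is connected to.

```ruby
# config/sitemap.rb: loaded by the sitemap_generator rake tasks inside a Rails app.
# www.example.com, Post and post_path are hypothetical placeholders.
SitemapGenerator::Sitemap.default_host = "https://www.example.com"

SitemapGenerator::Sitemap.create do
  # One entry per record. lastmod is read from the connected database,
  # so development data produces development timestamps.
  Post.find_each do |post|
    add post_path(post), lastmod: post.updated_at
  end
end
```

Run against a development database, every lastmod in the generated file reflects when the development copy of the record was last touched, not the production one.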

Run Sitemap Locally Using Production Data

The next step can be to run your code locally but connect to the production database to get the correct data. In Rails 4.1 and above, the database.yml file can include a URL to the database server, which makes it easy to configure your production database locally so this task can run.

production:
  url: postgres://….
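To keep the production credentials out of the repository, the URL can come from an environment variable, since Rails runs database.yml through ERB before parsing it. The snippet below is a standalone sketch of that mechanism using only the Ruby standard library. It is not Rails' own loading code, and the DATABASE_URL value, user, and host name are made up for illustration.

```ruby
require "erb"
require "yaml"
require "uri"

# Hypothetical connection string; in practice this is set in the shell, not in code.
ENV["DATABASE_URL"] ||= "postgres://app_user:secret@db.example.com:5432/app_production"

# The same shape a database.yml entry could take.
DATABASE_YML = <<~YAML
  production:
    url: <%= ENV["DATABASE_URL"] %>
YAML

# Render the ERB tags first, then parse the resulting YAML.
def load_database_config(template)
  YAML.safe_load(ERB.new(template).result)
end

config = load_database_config(DATABASE_YML)
uri = URI.parse(config["production"]["url"])
puts "host=#{uri.host} database=#{uri.path.delete_prefix('/')}"
```

With an entry like this, running the sitemap rake task locally with RAILS_ENV=production and DATABASE_URL exported in the shell would pick up the production connection without the URL ever being committed.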

Run Sitemap Generation on Production

For actual production use, you really should be generating the Sitemap on the production servers, without needing to deploy the site to get updates.

If you have write access to your single web server, you can generate your sitemap directly on it. My app runs on Heroku, where the file system is ephemeral: anything written to it can disappear whenever the dyno restarts, which can happen at any time. Storing the sitemap on Amazon S3 provides a solution here.

The sitemap_generator gem provides good documentation on how to get this done. With a liberal dose of the Heroku Scheduler to run the rake sitemap:refresh task once a day, my sitemap is now regenerated to S3 every night.
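As a rough sketch of what that configuration can look like, the fragment below would go in config/sitemap.rb and uses the gem's AwsSdkAdapter, which depends on the aws-sdk-s3 gem. The SITEMAP_BUCKET variable and the URLs are hypothetical placeholders, and the sketch assumes the AWS credentials and region are supplied through the standard AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION environment variables. Adapter names and options have changed between gem versions, so treat the README for your version as the authority.

```ruby
# config/sitemap.rb: one possible S3 setup, not necessarily the exact one used here.
# SITEMAP_BUCKET and the example.com / bucket URLs are hypothetical.
bucket = ENV.fetch("SITEMAP_BUCKET")

# Host used for the page URLs listed inside the sitemap.
SitemapGenerator::Sitemap.default_host = "https://www.example.com"
# Public URL under which the uploaded sitemap files are served.
SitemapGenerator::Sitemap.sitemaps_host = "https://#{bucket}.s3.amazonaws.com/"
# Write the files to tmp/ first, under a sitemaps/ prefix...
SitemapGenerator::Sitemap.public_path = "tmp/"
SitemapGenerator::Sitemap.sitemaps_path = "sitemaps/"
# ...then let the adapter upload them to the bucket.
# Credentials and region are assumed to come from the AWS_* environment variables.
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(bucket)

SitemapGenerator::Sitemap.create do
  # add calls for your pages go here
end
```

Writing to tmp/ before uploading means nothing depends on the dyno's file system surviving a restart.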

Make sure you then update your robots.txt file and tell Google Webmaster Tools about the new location of the sitemap.