
Sidekiq upgrade: How we failed it


At Uscreen, our Rails monolith will soon be nine years old. We have been using Sidekiq for a long time, and since 2020 we have been on Sidekiq Enterprise. I want to thank Mike Perham and every contributor for this beautiful product.

As with every old Rails app, gems need to be updated once in a while. In Sidekiq’s case, we definitely fell into the “in a while” group.

# Gemfile

gem 'sidekiq', '~> 5.2.1'
source 'https://enterprise.contribsys.com/' do
  gem 'sidekiq-pro', '= 4.0.5'
  gem 'sidekiq-ent', '1.8.1'
end

Theory

We decided to go with a step-by-step approach and upgrade from 5.X to 6.X first. We carefully read each gem’s upgrade guide and implemented the changes in our codebase.

Version 6.X removed support for job_hash_context introduced in 5.1 and completely redesigned logging. We love to write logs with job arguments, so we added these changes to our setup:

# config/initializers/sidekiq.rb

module LoggingWithArguments
  def call(item, queue)
    Sidekiq::Context.add('queue', queue)
    Sidekiq::Context.add('args', item['args'])
    super
  end
end

Sidekiq.configure_server do |config|
  Sidekiq::JobLogger.prepend(LoggingWithArguments)
  # 💙💙💙 JSON formatted logs look perfect in the Google Cloud Logs Explorer
  config.logger.formatter = Sidekiq::Logger::Formatters::JSON.new
  # other config changes...
end
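The prepend here is doing all the work: our module’s `call` runs first, records the context, and then hands off to Sidekiq’s original implementation via `super`. Here is a standalone sketch of that pattern with no Sidekiq involved (`ContextRecorder`, `JobLogger`, and the job names are made up for illustration):

```ruby
# Standalone sketch of the Module#prepend pattern used above.
# ContextRecorder plays the role of LoggingWithArguments, and
# JobLogger stands in for Sidekiq::JobLogger (both hypothetical).
module ContextRecorder
  def call(item, queue)
    # Capture context before the original logger runs.
    recorded << { "queue" => queue, "args" => item["args"] }
    super # fall through to the original implementation
  end
end

class JobLogger
  def recorded
    @recorded ||= []
  end

  # The "original" behavior we want to keep intact.
  def call(item, queue)
    "processed #{item['class']} on #{queue}"
  end
end

# prepend puts ContextRecorder ahead of JobLogger in the lookup chain.
JobLogger.prepend(ContextRecorder)

logger = JobLogger.new
result = logger.call({ "class" => "WelcomeEmailJob", "args" => [42] }, "default")
```

The original return value is preserved, and the context is recorded as a side effect before delegation.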

We tested our changes and decided to move on with production deployment.

Practice

Once we deployed our changes, we saw the “Enqueued” job count grow rapidly: from an average of ~10-100 jobs to 350,000. The number of Sidekiq processes grew to keep up with the workload. It was clear that something was wrong, big time wrong.

We quickly rolled back our changes and manually cleaned up the job queue. The incident investigation revealed that, after the deployment, we had started sending emails that were supposed to go out 30-60 days earlier.

So what the hell happened? It turned out that we fixed the bug. 😅

A few months ago, our creators noticed that their email/push notifications occasionally weren’t delivered. As it wasn’t a critical piece of functionality, most of them simply resent the notifications and didn’t even notify us. In our logs we could see that the jobs were running, so we kept trying to figure out where exactly they got lost.

To understand what went wrong, you need to know that we use super_fetch:

  # config/initializers/sidekiq.rb
  config.super_fetch!
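In short, super_fetch atomically moves each job from the shared queue into a process-private Redis queue before running it, so a crash leaves the job sitting there instead of vanishing. Here is a toy, Redis-free model of that idea (class and method names are mine, and plain arrays stand in for Redis lists):

```ruby
# Pure-Ruby sketch of the reliable-fetch idea behind super_fetch.
# Real super_fetch does this atomically in Redis; arrays are only
# a stand-in to make the mechanics visible.
class ReliableQueue
  attr_reader :public_queue, :private_queue

  def initialize(jobs)
    @public_queue  = jobs.dup # shared queue, visible to every process
    @private_queue = []       # this process's in-flight working set
  end

  # Move a job into the private queue *before* working on it,
  # so a crash mid-job cannot lose it.
  def fetch
    job = @public_queue.shift
    @private_queue << job if job
    job
  end

  # Job finished successfully: drop it from the working set.
  def ack(job)
    @private_queue.delete(job)
  end

  # On restart, push anything still in flight back onto the shared
  # queue — the recovery step that silently failed for us on 5.X.
  def recover_orphans
    orphans, @private_queue = @private_queue, []
    @public_queue.concat(orphans)
    orphans
  end
end
```

A job that was fetched but never acked (a crashed process) stays in the private queue until `recover_orphans` runs; a job that is never recovered is exactly an orphan.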

We had introduced a bug with environment variables in our new GCP infrastructure that pushed Sidekiq processes toward non-graceful shutdowns. The unfinished jobs would remain in the private queue; these are called orphaned jobs. And in the pre-upgrade (5.X) version, for some reason, orphaned jobs were not recovered on restart.
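For context, a graceful shutdown hinges on the process actually receiving and handling SIGTERM before the supervisor escalates to SIGKILL. A tiny sketch of that signal contract (not Sidekiq’s code, just the mechanism):

```ruby
# Demonstrates the signal contract a graceful shutdown relies on:
# the supervisor sends TERM, the process notices and starts winding
# down. If the process is SIGKILLed instead (e.g. because of a broken
# env-var/timeout setup like ours), this handler never runs and
# in-flight jobs stay behind in the private queue as orphans.
received = nil
Signal.trap("TERM") { received = :term }

Process.kill("TERM", Process.pid) # the supervisor's polite request
sleep 0.2 until received          # handler ran; begin draining jobs
```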

If you have guesses as to why this could happen, feel free to comment or DM me on Twitter (X).

But once we deployed the 6.X version, super_fetch began to work flawlessly.

Ooops

Conclusion

This story will not only explain why you received a notification about your favorite content creator’s live stream after it had already finished, but will also remind you that all software has bugs, and each of them is an opportunity for new learnings and a fun story.

We all learn by doing things, and there is no shame in sharing stories like this. Learn to celebrate failures and be antifragile.

P.S. I didn’t expect so many people to like my Tweet, thank you all 🙌.

Reading recommendations