We had some performance issues last week. Entirely of our own making but not in the usual way.
We nearly DDoSed ourselves by sending out emails.
We do a lot of analysis in Energy Sparks and, to be honest, some of it needs optimising. Tickets are in the backlog and we are exploring solutions.
Anyway, one morning last week we had a sudden spike in memory and CPU usage. This was during the period when we send out our weekly emails to users. The emails signpost them to the latest analysis, highlighting any key issues in their energy consumption. For example, that their baseload has crept up, so maybe devices are being left on. Or that their heating is running during a holiday period.
Alarms started going off, so I started investigating. Remedial action was taken and we stabilised things while digging for the root cause.
I found an issue in the email sending code pretty quickly and worked around it.
But as I normally do when this happens, I did a full run through of our monitoring and logs to identify other contributing factors. Sometimes it’s not that the software has changed. It might be the context in which it is running. So I try to avoid looking at just the obvious fixes.
While reviewing the server logs I noticed we were getting a lot of HEAD requests during the period of slow performance. Like a LOT.
Then I noticed that all these were for URLs originating in the emails that we were sending out that morning.
While we get good engagement with these emails, it wasn’t normal user traffic. It was something else. Checking the User Agents in the requests, I realised what was happening.
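The kind of check described here, counting HEAD requests per User-Agent, can be sketched in a few lines of Ruby. This is a minimal illustration rather than the script actually used: it assumes the access log is in the common "combined" format, and the method name, sample paths and scanner names are all made up.

```ruby
# Tally HEAD requests per User-Agent from access log lines.
# Assumption: the log uses the "combined" format, i.e.
#   host ident user [date] "METHOD path protocol" status bytes "referer" "user-agent"
# Adjust the pattern if your server logs something different.
COMBINED_LOG = /"(?<method>[A-Z]+) (?<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?<agent>[^"]*)"/

# Hypothetical helper: returns a Hash of User-Agent => number of HEAD requests.
def head_requests_by_agent(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    match = COMBINED_LOG.match(line)
    # Skip lines that do not parse, and anything that is not a HEAD request.
    next unless match && match[:method] == "HEAD"

    counts[match[:agent]] += 1
  end
end

# Invented sample lines, for illustration only.
sample = [
  '203.0.113.5 - - [01/Jan/2024:07:00:01 +0000] "HEAD /example/analysis HTTP/1.1" 200 0 "-" "ExampleScanner/1.0"',
  '203.0.113.6 - - [01/Jan/2024:07:00:02 +0000] "HEAD /example/heating HTTP/1.1" 200 0 "-" "ExampleScanner/1.0"',
  '198.51.100.7 - - [01/Jan/2024:07:00:03 +0000] "GET /example/analysis HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
]

# Print the busiest agents first.
head_requests_by_agent(sample)
  .sort_by { |_agent, count| -count }
  .each { |agent, count| puts "#{count}\t#{agent}" }
```

In practice you would feed it the real log with something like `File.foreach(path)` and compare the counts against a normal morning.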
At least some of the schools to which we are sending email are using security software that scans incoming emails. I assume this is common in a lot of organisations these days.
That software was reading the emails then doing HEAD requests on all the links.
I assume this is to check SSL certificates and look for dodgy redirects that might be associated with phishing.
The more emails we sent, the more HEAD requests we got. And the majority of those requests were hitting the pages that I said needed optimising.
We were already struggling to send out the emails, and the more we sent, the more we were hit with waves of HEAD requests to the very pages that were causing the performance issues. We were basically DDoSing ourselves with the help of some email security software.
Cue those alarms.
Now the extra fun thing is that we hadn’t implemented a HEAD handler for these URLs. But Rails silently converts a HEAD request to a GET, before throwing away the response body and just serving the headers. So the application was struggling to produce analysis that was immediately thrown away.
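One way to avoid doing that wasted work is to answer HEAD requests before they reach the expensive code. The sketch below is a small Rack middleware that does this for a configurable set of path prefixes. It is not the fix we shipped: the class name `CheapHead` and the `/example` prefix are hypothetical, and it assumes you are happy to return a bare 200 for those paths without checking that the page exists or that the user could see it.

```ruby
# Hypothetical Rack middleware: answer HEAD requests for selected paths
# without invoking the (expensive) application underneath.
class CheapHead
  # app           - the next Rack application in the stack
  # path_prefixes - Array of path prefixes to short-circuit (an assumption;
  #                 choose whatever matches your slow pages)
  def initialize(app, path_prefixes)
    @app = app
    @path_prefixes = path_prefixes
  end

  def call(env)
    if env["REQUEST_METHOD"] == "HEAD" && short_circuit?(env["PATH_INFO"])
      # Headers only, empty body: no analysis is generated.
      return [200, { "content-type" => "text/html; charset=utf-8" }, []]
    end

    # Everything else goes through to the application as normal.
    @app.call(env)
  end

  private

  def short_circuit?(path)
    @path_prefixes.any? { |prefix| path.to_s.start_with?(prefix) }
  end
end

# Stand-in for the real application, for illustration only.
slow_app = ->(_env) { [200, { "content-type" => "text/html" }, ["expensive analysis"]] }
stack = CheapHead.new(slow_app, ["/example"])

status, _headers, body = stack.call("REQUEST_METHOD" => "HEAD", "PATH_INFO" => "/example/analysis")
puts "HEAD -> #{status}, body: #{body.inspect}"
```

In a Rails application a middleware like this would be registered in the middleware stack via `config.middleware`. The trade-off is that a scanner no longer sees the real status of the page, which may or may not matter to you.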
I have no idea how long this has been happening. Perhaps for a while, and we’ve just grown to the point where it’s a problem. Or maybe more of the recently joined schools are using different software. I don’t know and haven’t dug further.
But this type of email scanning behaviour was new to me. I don’t know whether these email scanners routinely follow all links, or just a sample. But there didn’t seem to be any throttling. Or much use of User-Agent headers to help identify the source. This seems a bit unfriendly.
Clearly we had issues to fix in the application. And those performance optimisations are getting a bump up the backlog. But this was an amusing little incident and a nice example of unexpected interactions between systems.
Lessons have been learned.