Around 4% of incoming requests failed during a window of roughly 20 minutes. We believe this is now fixed. The issue was transient and should have no permanent impact.
A production deploy at 1210Z changed our database connection pooling behaviour. This caused us to hit our database's connection limit, leaving some web workers without a connection while they were still receiving traffic. As a result, c. 3.5% of requests returned a 500 error.
At 1227Z we reduced the number of workers, which lowered the number of open connections. This brought us below the limit, and normal service resumed.
We have significant spare worker capacity, so we can still comfortably handle production load with the reduced worker count while we investigate better solutions.
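The connection arithmetic behind this kind of failure can be sketched as follows. This is a minimal illustration, assuming each worker holds its own fixed-size pool; the function names and all numbers are hypothetical and are not our actual production values.

```python
def connections_needed(workers: int, pool_size: int) -> int:
    """Worst-case connections if every worker fills its own pool."""
    return workers * pool_size


def max_workers(connection_limit: int, pool_size: int, reserved: int = 0) -> int:
    """Largest worker count that stays within the database connection limit.

    `reserved` is a hypothetical allowance for connections held by
    other clients (admin sessions, background jobs, and so on).
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return max(0, (connection_limit - reserved) // pool_size)


# Hypothetical figures, for illustration only.
LIMIT = 100      # database connection limit
POOL_SIZE = 5    # connections per worker after the pooling change
WORKERS = 24     # worker count before mitigation

over_limit = connections_needed(WORKERS, POOL_SIZE) > LIMIT
safe_workers = max_workers(LIMIT, POOL_SIZE, reserved=5)
```

With these made-up figures, 24 workers could demand up to 120 connections against a limit of 100, so some workers would be left without one. Capping the worker count at `safe_workers` keeps the worst case under the limit, which is the same shape of mitigation described above.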
Posted Jun 28, 2016 - 12:45 UTC