From Google SRE Book, chapter 5 – Eliminating Toil:
So what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Not every task deemed toil has all these attributes, but the more closely work matches one or more of the following descriptions, the more likely it is to be toil:
- This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.
- If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
- If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.21
- Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil.We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.
- No enduring value
- If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.
- O(n) with service growth
- If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.
Is Toil Always Bad?
Toil doesn’t make everyone unhappy all the time, especially in small amounts. Predictable and repetitive tasks can be quite calming. They produce a sense of accomplishment and quick wins. They can be low-risk and low-stress activities. Some people gravitate toward tasks involving toil and may even enjoy that type of work.
Toil isn’t always and invariably bad, and everyone needs to be absolutely clear that some amount of toil is unavoidable in the SRE role, and indeed in almost any engineering role. It’s fine in small doses, and if you’re happy with those small doses, toil is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly. Among the many reasons why too much toil is bad, consider the following:
- Career stagnation
- Your career progress will slow down or grind to a halt if you spend too little time on projects. Google rewards grungy work when it’s inevitable and has a big positive impact, but you can’t make a career out of grunge.
- Low morale
- People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent.
Additionally, spending too much time on toil at the expense of time spent engineering hurts an SRE organization in the following ways:
- Creates confusion
- We work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage in too much toil undermine the clarity of that communication and confuse people about our role.
- Slows progress
- Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual work and firefighting to roll out new features promptly.
- Sets precedent
- If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Other teams may also start expecting SREs to take on such work, which is bad for obvious reasons.
- Promotes attrition
- Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.
- Causes breach of faith
- New hires or transfers who joined SRE with the promise of project work will feel cheated, which is bad for morale.