
Logging is a critical part of every serious project; if logging doesn't feel important in your project, you're probably doing it wrong. Here are a few lessons I've learned over years of running multiple projects.
1 – Reserve ERROR for bugs in your code
Every log line has a log level. The most important distinction is between ERROR and everything below it. Let this logic guide you: an ERROR log line should indicate a bug in your code. If an event generates an ERROR log line but is not an indication of a bug in your code, it should not be logged as an error.
Furthermore, you should spend effort making sure that every recognizable error is logged as such. So, most handlers should be generically wrapped by an appropriate logger, and your 500 handler (or equivalent) should naturally emit ERROR logs.
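One minimal way to sketch this generic wrapping, using only the standard library (the handler name and payload here are illustrative, not from the original post):

```python
import logging

logger = logging.getLogger(__name__)

def log_unhandled_errors(handler):
    """Wrap a handler so any unhandled exception is logged as ERROR."""
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception:
            # An exception escaping the handler is, by definition, a bug
            # in our code -- so this is the place ERROR logs come from.
            logger.exception("Unhandled exception in %s", handler.__name__)
            raise
    return wrapper

@log_unhandled_errors
def create_order(payload):  # hypothetical request handler
    return payload["item"]  # a KeyError here is a bug, and gets an ERROR log
```

In a web framework you would attach the same logic to the framework's error hook instead of a decorator, but the principle is identical: one wrapper, applied generically, so no bug slips through unlogged.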
2 – User input validation failures should be warnings
As a natural result of our first tip, user input validation failures shouldn't be logged as errors: they are your code doing exactly what it should be doing. However, they still merit more than an INFO log line, so use WARNING here. Other events also deserve WARNINGs: resources running low, a fallback path being taken, and so on.
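A small sketch of the distinction (the validation rule here is deliberately simplistic and just for illustration):

```python
import logging

logger = logging.getLogger(__name__)

def validate_email(value):
    """Reject obviously malformed input.

    Bad user input is expected behavior, not a bug in our code,
    so it gets a WARNING rather than an ERROR.
    """
    if "@" not in value:
        logger.warning("Invalid email rejected: %r", value)
        return False
    return True
```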
As a natural outcome of the first two tips, we come to tip no. 3:
3 – Alert on errors, or on multiple warnings
So, your log levels are now correct. The next step is getting notified whenever an error happens, since that indicates a bug in your code. But you don't want the same error, happening many times, to flood your inbox (or whatever other reporting mechanism you use).
You can de-dup your errors yourself, for example by hashing the call stack. Alternatively, use a service such as sentry.io to do that for you. You can then send notifications such as emails and text messages when new errors appear.
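The call-stack-hashing idea can be sketched like this (the fingerprint scheme below is one plausible choice, assuming you want repeats of the same error from the same code path to collapse to one key):

```python
import hashlib
import traceback

def error_fingerprint(exc: BaseException) -> str:
    """Hash the exception type plus its traceback frames.

    Only file and function names go into the hash, so the same error
    raised repeatedly maps to the same short key.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    signature = [type(exc).__name__] + [f"{f.filename}:{f.name}" for f in frames]
    return hashlib.sha256("|".join(signature).encode()).hexdigest()[:12]

seen = set()

def should_notify(exc: BaseException) -> bool:
    """Notify only on the first occurrence of each distinct fingerprint."""
    key = error_fingerprint(exc)
    if key in seen:
        return False
    seen.add(key)
    return True
```

In production you would persist `seen` (or keep per-key counts, so a spike in a known error can still alert), which is essentially what services like sentry.io do for you.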
Once you have that, you can also consider getting alerts for warnings that happen too many times. For example – if a particular user input validation fails often then perhaps your UX is broken. If a fallback happens too many times then perhaps your main flow is not robust enough.
4 – Make your logs informative
Be liberal with adding INFO logs. At a minimum, all cross-service calls and all requests to your API should be logged. Other major events and decisions should probably be logged too. Personally, I'd keep it to O(1) log lines per API call (i.e., don't INFO-log inside a loop).
Independently of that, take care to include all the useful information you can in your logs: file, line, and perhaps all or part of the stack trace. The text logged should also be informative. If a particular value is incorrect, log both it and the expected value (but be careful of privacy concerns!).
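With Python's standard `logging` module, file and line come for free via the standard LogRecord attributes, so a format string is all it takes (the status values below are made up for illustration):

```python
import logging

# %(filename)s and %(lineno)d are standard LogRecord attributes,
# so every log line automatically says where it came from.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(filename)s:%(lineno)d %(message)s",
    level=logging.INFO,
)

logger = logging.getLogger(__name__)

expected, actual = 200, 503
# Log both the incorrect value and the expected one.
logger.warning("Upstream returned status %d, expected %d", actual, expected)
```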
5 – Aggregate all logs into a single searchable database
Having a single, searchable log interface, instead of separate ones is critical. Being able to understand the complete flow of an issue is in many cases dependent on you seeing all the relevant information together. Having it searchable will greatly speed up your ability to find issues and fix them. Today at Flytrex we are using logz.io, but there are quite a few other effective solutions.
Bonus section
- If your project involves two or more people, decide on a logging policy explicitly.
- There’s a big difference between logging in libraries, tools that run once, or long-running programs. Each one has different needs.
- For cases when your logs are not perfect (and they never are), a tool such as rookout is very useful. It lets you set a "logging breakpoint" anywhere in your code without redeploying it. This has already saved me hours of debugging.
On the StarCraft team we had log-overload issues. Two tips I would add from that experience:
1 – Always include a trace id, so you can follow a single chain of events through a distributed system.
2 – Don't use logs as health checks; define self-updating conditions instead. When an error log occurred, no one knew whether the problem was still around a minute, a day, or a month later. To address this we defined "conditions", e.g. "10% of the recent transactions failed", which keep the failure status active for as long as the problem persists.
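The trace-id tip can be sketched in Python with `contextvars` plus a logging filter; the names and the per-request id generation below are illustrative (in a real system you would usually propagate an id from an incoming request header):

```python
import contextvars
import logging
import uuid

# One trace id per request/task; contextvars keeps it isolated per context.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("traced")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # Set (or propagate) one id per request; every log line in this
    # context will then carry it, making the full chain searchable.
    trace_id_var.set(uuid.uuid4().hex)
    logger.info("request started")
```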