So, why this post? Well…
I'm (almost) done with my exams, so I'll finally have much more free time. What are some technical books you would recommend to me? 📚 I'm interested in Kubernetes, containers, distributed systems, security (although I'm a noob here), and more.
— Marko Mudrinić (@xmudrii) July 5, 2019
For those of you who don’t know Marko, he is a former GSoC student at the CNCF who worked on Kubernetes, and he is a contributor to the Kubernetes Cluster API, so I’m going to take the idea that Marko’s a n00b with a handful-sized pinch of salt.
Anyway, it’s a common enough request that it’s probably worth documenting my 2p here. What follows is mostly the things and authors that have interested me. Other opinions are also available.
Books
On Kubernetes
- Hightower, Kelsey, Brendan Burns, and Joe Beda. Kubernetes: Up and Running: Dive into the Future of Infrastructure, 2nd Edition. O'Reilly, 2019.
- Burns, Brendan and Craig Tracey. Managing Kubernetes. O'Reilly, 2018.
- Garrison, Justin, and Kris Nova. Cloud Native Infrastructure: Patterns for Scalable Infrastructure and Applications in a Dynamic Environment. O'Reilly, 2017.
On distributed systems
- Burns, Brendan. Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services. O'Reilly, 2018.
- Sridharan, Cindy. Distributed Systems Observability. O'Reilly, 2018. http://bit.ly/2FXnS3f.
On organizational practice
- Beyer, Betsy, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne. The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly, 2018. https://landing.google.com/sre/workbook/toc/.
- Forsgren, Nicole, and Jez Humble. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. 2018.
- Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. https://landing.google.com/sre/sre-book/toc/index.html.
Papers
On Cluster Orchestration
- Choudhury, Diptanu Gon, and Timothy Perrett. Designing cluster schedulers for internet-scale services. Communications of the ACM 61 no. 6 (2018): 34-40. https://doi.org/10.1145/3190564
- Leung, Andrew, Andrew Spyker, and Tim Bozarth. Titus: introducing containers to the Netflix cloud. Communications of the ACM 61 no. 2 (2018): 38-45. https://doi.org/10.1145/3152529
- Burns, Brendan, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. Borg, Omega, and Kubernetes. Communications of the ACM 59 no. 5 (2016): 50-57. https://doi.org/10.1145/2890784
- Verma, Abhishek, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15). ACM, 2015. https://doi.org/10.1145/2741948.2741964.
Distributed Systems
- Bailis, Peter and Kyle Kingsbury. The Network is Reliable: An informal survey of real-world communications failures. ACM Queue 12 no. 7 (2014): 1-13. https://doi.org/10.1145/2643130. http://bit.ly/2JfqCuO.
- DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review 41 no. 6 (2007): 205-220. https://doi.org/10.1145/1323293.1294281
- Lamport, Leslie. Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32 no. 4 (2001): 51-58
Security
- Frazelle, Jessie. Research for practice: security for the modern age. Communications of the ACM 62 no. 1 (2019): 43-45. https://doi.org/10.1145/3287295. http://bit.ly/2JqTfUB.
System Architecture
- Saltzer, Jerome H., David P. Reed, and David D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems 2 no. 4 (1984): 277-288