Site Reliability Engineering How Google Runs Production Systems Github

Site Reliability Engineering

Author: Niall Richard Murphy
language: en
Publisher: "O'Reilly Media, Inc."
Release Date: 2016-03-23
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization. This book is divided into four sections:
Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices.
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE).
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems.
Management: Explore Google's best practices for training, communication, and meetings that your organization can use.
Becoming a Rockstar SRE

Author: Jeremy Proffitt
language: en
Publisher: Packt Publishing Ltd
Release Date: 2023-04-28
Excel in site reliability engineering by learning from field-driven lessons on observability and reliability in code, architecture, process, systems management, costs, and people to minimize downtime and enhance developers' output. Purchase of the print or Kindle book includes a free eBook in PDF format.
Key Features: Understand the goals of an SRE in terms of reliability, efficiency, and constant improvement; master highly resilient architecture in server, serverless, and containerized workloads; learn the why and when of employing Kubernetes, GitHub, Prometheus, Grafana, Terraform, Python, Argo CD, and GitOps.
Book Description: Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You'll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you'll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you'll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
What you will learn: Get insights into the SRE role and its evolution, starting from Google's original vision; understand the key terms, such as golden signals, SLO, SLI, MTBF, MTTR, and MTTD; overcome the challenges in adopting site reliability engineering; employ reliable architecture and deployments with serverless, containerization, and release strategies; identify monitoring targets and determine observability strategy; reduce toil and leverage root cause analysis to enhance efficiency and reliability; realize how business decisions can impact quality and reliability.
Who this book is for: This book is for IT professionals, including developers looking to advance into an SRE role, system administrators mastering technologies, and executives experiencing repeated downtime in their organizations. Anyone interested in bringing reliability and automation to their organization to drive down customer impact and revenue loss while increasing development throughput will find this book useful. A basic understanding of API and web architecture and some experience with cloud computing and services will assist with understanding the concepts covered.
Mastering Distributed Tracing

Understand how to apply distributed tracing to microservices-based architectures.
Key Features: A thorough conceptual introduction to distributed tracing; an exploration of the most important open standards in the space; a how-to guide for code instrumentation and operating a tracing infrastructure.
Book Description: Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.
What you will learn: How to get started with using a distributed tracing system; how to get the most value out of end-to-end tracing; learn about open standards in the space; learn about code instrumentation and operating a tracing infrastructure; learn where distributed tracing fits into microservices as a core function.
Who this book is for: Any developer interested in testing large systems will find this book very revealing and, in places, surprising. Every microservice architect and developer should have an insight into distributed tracing, and the book will help them on their way. System administrators with some development skills will also benefit. No particular programming language skills are required, although an ability to read Java, while non-essential, will help with the core chapters.