Site Reliability Engineering Crashcasts

Episodes

How Experienced SREs Make High-Stakes Decisions in Uncertain Situations

Sep 29 2024
Join us on Site Reliability Engineering Crashcasts as we delve into the critical art of decision-making under uncertainty with expert Victor.

In this episode, we explore:

The unique challenges of decision-making in SRE roles
How the OODA loop framework can enhance quick and effective decisions
The "fail fast, fail safe" approach to managing limited information
Innovative techniques like pre-mortem analysis and blameless postmortems
The impact of chaos engineering on improving team decision-making skills

Tune in to gain valuable insights on mastering high-stakes decisions in SRE!

8 mins

Effective Strategies and Resources for Continuous Learning in SRE

Sep 29 2024
Ready to supercharge your Site Reliability Engineering skills? In this episode, Sheila and Victor delve into the best strategies and resources for continuous learning in SRE.

In this episode, we explore:

The importance of continuous learning in SRE — Discover why staying updated is crucial in this rapidly evolving field.
Effective learning strategies — Learn about online courses, technical blogs, conferences, open-source contributions, and personal projects.
Overcoming learning challenges — Get tips on managing time constraints and information overload.
Advanced learning techniques — Find out how concepts like "learning in public" and the Feynman Technique can enhance your learning process.

Tune in to gain insights and tips to stay ahead in your SRE journey!

8 mins

The Evolution of Containerization: Insights on Docker and Kubernetes

Sep 29 2024
Curious about how containerization has revolutionized application deployment and management? Welcome to Site Reliability Engineering Crashcasts!

In this episode, we explore:

The basics of containerization and how it differs from traditional virtualization.
The crucial role Docker played in popularizing container technology.
Kubernetes' functionality and its real-world applications.
Common pitfalls in adopting containerization and expert tips to avoid them.
Valuable insights from early adopters and industry thought leaders.

Tune in to gain a comprehensive understanding and practical insights on navigating the Docker and Kubernetes ecosystem.

6 mins

Designing Highly Available Systems: Insights from Leading Companies

Sep 29 2024
Ever wondered how leading tech companies achieve near-perfect uptime? Tune in to this episode of Site Reliability Engineering Crashcasts as Sheila and Victor break down the marvels of designing highly available systems.

In this episode, we explore:

The critical importance of highly available systems and their impact on businesses.
Fundamental strategies like redundancy and load balancing that keep systems running smoothly.
Advanced concepts such as fault tolerance and disaster recovery.
Real-world implementations, featuring Google’s impressively resilient infrastructure.

Discover the secrets behind the systems that never sleep and why striving for "three nines" or "five nines" of uptime is essential. Don't miss out on these invaluable insights!

6 mins

Comparing Prometheus, Grafana, ELK Stack & Emerging Trends in Observability

Sep 29 2024
Dive into the essentials of monitoring and logging in this episode of Site Reliability Engineering Crashcasts with Sheila and Victor!

In this episode, we explore:

The difference between monitoring and logging, explained through a clever medical analogy.
A detailed comparison of Prometheus, Grafana, and the ELK stack, including their strengths and weaknesses.
An introduction to the three pillars of observability – metrics, logs, and traces.
Emerging trends in observability such as unified platforms and OpenTelemetry.
Best practices for implementing an effective observability strategy from the outset.

Don’t miss out on these insights that are crucial for anyone in DevOps or site reliability engineering. Tune in to gain valuable knowledge on how to effectively monitor and log your systems!

7 mins

Techniques for Performance Troubleshooting and Latency Diagnosis in SRE

Sep 29 2024
Ready to unravel the mysteries of performance troubleshooting and latency diagnosis in SRE? Join host Sheila and expert Victor as they dive deep into essential techniques and best practices.

In this episode, we explore:

Profiling, Tracing, Logging, and Monitoring: Discover how these key tools can help you understand and improve system performance.
The USE Method: Learn how Utilization, Saturation, and Errors can systematically uncover performance issues.
The RED Method: Grasp the significance of Rate, Errors, and Duration in monitoring service health.
Common Pitfalls and Best Practices: Hear expert tips on avoiding data overload and focusing on percentiles rather than averages.
Quiz Insight: Find out what seemingly innocuous component can cause unexpected latency spikes of up to 100 milliseconds!

Tune in for a comprehensive guide to performance troubleshooting, a discipline that often feels like detective work!

7 mins

Maximizing SRE Efficiency: Harnessing Automation for Self-Healing Systems

Sep 29 2024
Unlock the potential of automation in Site Reliability Engineering in this episode of Site Reliability Engineering Crashcasts!

In this episode, we explore:

What automation means for SRE and how it can transform your workflows.
Common tasks that can be automated, freeing up engineers to focus on strategic initiatives.
The concept of self-healing systems and their role in maintaining uptime and reliability.
Best practices for implementing automation, along with pitfalls to avoid.
A real-world example from Netflix on using automation for system resilience.

Join us as we dive deep into practical insights and strategies with Victor, our expert guest. Don't miss out on learning how to enhance your SRE practices with automation!

6 mins

DevOps vs. SRE: Exploring Their Similarities, Differences, and Professional Perspectives

Sep 29 2024
Dive deep into the world of DevOps and Site Reliability Engineering (SRE) with us in this enlightening episode of Site Reliability Engineering Crashcasts!

In this episode, we explore:

Definitions and foundational principles of DevOps and SRE.
The historical origins of both practices, including a surprising fact about Google’s pioneering role in SRE.
Key similarities, such as the emphasis on automation and CI/CD, and critical differences like the focus on reliability vs. speed of delivery.
An engaging analogy that compares DevOps and SRE to master chefs with distinct priorities in the kitchen.
Insights into how professionals perceive the relationship between DevOps and SRE, including common misunderstandings and pitfalls.

Tune in to gain a clearer understanding of these essential IT frameworks and hear a fun fact about Google's unique SRE practices!

8 mins
