Huey
October 23, 2024

engineering on Modern monitoring & security

How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems

Writing a postmortem is an essential learning process after an incident is resolved. But documenting important details comprehensively can be cumbersome, especially when responders have already moved on to the next urgent issue. To make this process easier, we implemented a feature in Bits AI that uses large language models (LLMs) to ease the writing process, aiming to retain the engineers’ control without compromising the primary goal of learning while recapping the details of the incident. To implement this solution, we...

about 1 month ago

engineering on Modern monitoring & security

Timeseries Indexing at Scale

Datadog collects billions of events from millions of hosts every minute, and that number keeps growing fast. Our data volumes grew 30x between 2017 and 2022. On top of that, the kinds of queries we receive from our users have changed significantly. Why? Because our customers have grown in sophistication: they run more complex stacks, want to monitor more data, and run more complex analyses. That, in turn, puts pressure on our timeseries data store. Data stores have a number...

4 months ago

engineering on Modern monitoring & security

How We Migrated Our Static Analyzer From Java To Rust

When Codiga joined Datadog last year, we integrated our static analyzer product with Datadog’s infrastructure to release Datadog Code Analysis. Codiga’s static analyzer was written in Java and supported code analysis for Python, JavaScript, and TypeScript. It initially used ANother Tool for Language Recognition (ANTLR) for generating an abstract syntax tree (AST) of each language supported by the product. Shortly after the acquisition, we worked to expand support to additional languages, but ran into two major issues: With our existing tools and...

5 months ago

engineering on Modern monitoring & security

Engineering VP Spotlight: Ivo Dimitrov

In this edition of the Datadog Engineering Spotlight, Tom from the Community team sat down with Ivo Dimitrov, one of our Engineering VPs. Tom and Ivo spoke about Ivo’s career as an engineering manager for several top organizations, his transition from individual contributor to manager, and what excites him today about the distributed data systems his teams are building. This interview has been edited for clarity and length. How did you originally become interested in software engineering? I’ve been in the industry for...

5 months ago

engineering on Modern monitoring & security

.NET Continuous Profiler: Memory Usage

In Part 1 of this series, I presented a high-level overview of the architecture, implementation, and initialization of Datadog’s .NET profiler, which consists of several individual profilers that collect data for particular resources. I went on to discuss profiling CPU and wall time in Part 2 and exceptions and lock contention in Part 3, alongside detailed explanations of stack walking across different platforms and upscaling sampled data. This fourth and final part covers memory usage profiling—why it’s useful, how our profiler...

5 months ago

engineering on Modern monitoring & security

How We Built the Datadog Heatmap to Visualize Distributions Over Time at Arbitrary Scale

How do we surface the rich stories hidden within our users’ observability data? We can use percentiles to communicate performance for a specific percentage of cases—but for the full shape of performance, we use distribution metrics. These metrics, powered by DDSketch, aggregate data from multiple hosts during a flush interval, enabling users to analyze statistical distributions across their entire infrastructure. To visually represent this high-resolution data, we use heatmap visualizations—which provide a means to effectively convey high-cardinality point distributions. In this blog...

6 months ago

engineering on Modern monitoring & security

How We Brought Datadog's Data Visualization to iOS: A Focus on Performance

At Datadog, we’ve been using SwiftUI since day one. We went from initially using it for prototyping and building internal tools, to adopting it in small features, then to building full products! In 2022, we introduced APM Services with its rich data visualization experience to the Datadog mobile application. And for that, we started implementing DogGraphs, an internal graphing library, to bring Datadog’s data visualization to iOS using native technologies like Swift and SwiftUI, as no public library met our needs...

6 months ago

engineering on Modern monitoring & security

.NET Continuous Profiler: Exception and Lock Contention

Previously in this series, we presented a high-level overview of Datadog’s .NET continuous profiler and discussed CPU and wall time profiling. This third part covers exceptions and lock contention profiling, with related sampling strategies. When an application throws too many exceptions, or its threads wait too long on the same locks, the impact on performance is not as straightforward as CPU consumption but can still be noticeable. Throwing too many exceptions usually increases CPU consumption. For example, consider when...

7 months ago

engineering on Modern monitoring & security

Engineering Spotlight: Marie-Laure Bardonnet

In this edition of the Datadog Engineering Spotlight, Austin from the Community team sat down (virtually) with Marie-Laure Bardonnet. She’s a Senior Engineering Manager leading engineering for Datadog’s Log Management team, and was once an intern in the New York office. We talk about her growth as an individual contributor and as a people manager, her passion for mentorship and development, and her engineering-driven approach to problem solving and decision making. This interview has been edited for clarity and length. How has...

7 months ago

engineering on Modern monitoring & security

How We Use Vale to Improve Our Documentation Editing Process

“I notice that you use plain, simple language, short words and brief sentences. That is the way to write English—it is the modern way and the best way.” - Mark Twain
Just as development teams adopt linters such as Prettier in their workflows to flag errors and style issues in their source code, documentation teams use linters such as Vale to enforce style guidelines and maintain a standard for clear and concise prose. Crafting prose can be tricky—for example, consider this update...

8 months ago

engineering on Modern monitoring & security

.NET Continuous Profiler: CPU and Wall Time Profiling

The first part of this series introduced the high-level architecture of the Datadog .NET continuous profiler. I discussed its initialization and the impact of the .NET runtime (CLR) version to figure out which CLR services to use. The goal of the profiler is to collect different kinds of profiling samples: CPU, wall time, exceptions, lock contention, etc. Each sample contains a call stack, a list of labels for the context (e.g. current thread, span ID) and a vector of values...

8 months ago

engineering on Modern monitoring & security

.NET Continuous Profiler: Under the Hood

The Profiling Engineering team at Datadog develops profiling tools for various runtimes, including Microsoft .NET. This blog post is the first in a series explaining the technical architecture and implementation choices behind our .NET profiler. Along the way, we’ll discuss profiling for CPU, wall time, exceptions, lock contention, and allocations. What is a profiler? Before digging into the details, let’s define what a profiler is: a profiler is a tool that allows you to analyze application performance and method call stacks. While...

10 months ago

engineering on Modern monitoring & security

Scaling Self-Serve Analytics: The Tools Empowering 5,000 Employees

In October at the Crunch Conference in Budapest, Jean-Mathieu Saponaro, Data & Analytics Senior Engineering Manager at Datadog, delivered a presentation about how he and his teams scaled self-serve analytics within Datadog up to the 5,000 employees we have today. Self-serve analytics is the dream of any Data & Analytics team: being able to reach that point where everyone in your company is leveraging data to answer day-to-day questions and make decisions without requiring your team’s help. But how do you...

10 months ago

engineering on Modern monitoring & security

Engineering Spotlight: Jeromy Carriere

In this edition of the Datadog Engineering Spotlight, Rosa from the Community team sat down (virtually) with Jeromy Carriere. He’s SVP, Product Engineering, leading engineering for all of Datadog’s product areas including Infrastructure Monitoring, Log Management, Application Performance Monitoring, Security, Service Management, and User Experience Monitoring. We talk about his relationship with academia, learning from his mistakes, driving the first cloud monitoring offerings at Google, and the future of observability. This interview has been edited for clarity and length. What’s your day-to-day...

12 months ago

engineering on Modern monitoring & security

Leverage user context to debug mobile performance issues with the Instabug Datadog Marketplace offering

As user expectations for mobile apps increase, effective bug remediation involves not only addressing critical incidents as they occur but also proactively handling smaller performance issues in order to ensure a smooth user experience (UX). Instabug helps you understand how users experience your app with crucial mobile performance metrics—such as launch metrics, loading times, and UI hangs—viewable alongside your bug reports. Additionally, with granular release metrics, Instabug expedites incident detection and remediation throughout your entire development lifecycle. This enables you...

about 1 year ago

engineering on Modern monitoring & security

Monitor Google Cloud Vertex AI with Datadog

Vertex AI is Google’s platform offering AI and machine learning computing as a service—enabling users to train and deploy machine learning (ML) models and AI applications in the cloud. In June 2023, Google added generative AI support to Vertex AI, so users can test, tune, and deploy Google’s large language models (LLMs) for use in their applications. We’re pleased to announce that Datadog now integrates with Vertex AI, helping you track the health and performance of your LLM-powered services in production....

about 1 year ago

engineering on Modern monitoring & security

Datadog for US Government Solution Brief

Learn how to use Datadog for real-time insights to move your mission forward. Datadog is a unified observability platform that collects, processes, and analyzes performance data from your entire IT environment, providing the visibility US Government agencies need to advance technology initiatives quickly and securely. In this solution brief, you’ll learn how US government agencies can achieve four key outcomes of unified observability: faster time to remediation, improved security posture, increased operational efficiency, and optimized citizen customer experience (Citizen CX). Complete the form to receive the solution brief.

about 1 year ago

engineering on Modern monitoring & security

Automate incident response and security workflows with Blink in the Datadog Marketplace

Security and DevOps engineers often spend a lot of time and effort creating and managing complex, repetitive workflows, such as incident response, honeypotting, recovery and remediation, and more. Blink is a no-code security platform that enables users to create workflow automations, triggers, and self-service apps to streamline processes, better enforce guardrails, and eliminate operational bottlenecks. These capabilities help reduce the time and effort required for your security and DevOps teams to create a more dependable, secure system. Datadog now offers an...

about 1 year ago

engineering on Modern monitoring & security

Monitoring Solutions for Financial Services


about 1 year ago

engineering on Modern monitoring & security

Retail E-Commerce Monitoring


about 1 year ago

engineering on Modern monitoring & security

Digital Media Entertainment Monitoring


about 1 year ago

engineering on Modern monitoring & security

Manufacturing Logistics Monitoring


about 1 year ago

engineering on Modern monitoring & security

Education Monitoring Solutions


about 1 year ago

engineering on Modern monitoring & security

Healthcare Life Sciences Monitoring


about 1 year ago

engineering on Modern monitoring & security

Government Monitoring Solutions


about 1 year ago

engineering on Modern monitoring & security

Monitoring Solutions for Gaming Industry


about 1 year ago

engineering on Modern monitoring & security

Monitoring Solutions for Technology Companies


about 1 year ago

engineering on Modern monitoring & security

Datadog Report Finds the Serverless Ecosystem Growing Across All Major Clouds

NEW YORK — Datadog, Inc. (NASDAQ: DDOG), the monitoring and security platform for cloud applications, today announced the results of its annual State of Serverless report. The 2023 report—which analyzes telemetry across Datadog’s global customer base—found that the serverless ecosystem continues to grow and evolve, particularly as organizations extend their use of container-based applications hosted in serverless environments. As of this year, more than 70% of Datadog’s AWS customers, 60% of Google Cloud customers and almost 50% of Azure customers are using...

about 1 year ago

engineering on Modern monitoring & security

The State of Serverless

Population: For this report, we compiled usage data from thousands of companies in Datadog's customer base. But while Datadog customers cover the spectrum of company size and industry, they do share some common traits. First, they tend to be serious about software infrastructure and application performance. They also skew toward adoption of cloud platforms and services more than the general population. All the results in this article are biased by the fact that the data comes from our customer base, a...

about 1 year ago

engineering on Modern monitoring & security

Visualize service ownership and application boundaries in the Service Map

The complexity of microservice architectures can make it hard to determine where an application’s dependencies begin and end and who manages which ones. This can pose a variety of challenges both in the course of day-to-day operations and during incidents. Lacking a clear picture of the ownership and interplay of your services can impede accountability and cause application development, incident investigations, and onboarding processes to become prolonged and haphazard. To prevent these difficulties, the Datadog Service Map now allows you to...

about 1 year ago

engineering on Modern monitoring & security

How we use Datadog CSM to improve security posture in our cloud infrastructure

In complex cloud environments where the speed of development is accelerated, managing infrastructure and resource configurations can be an overwhelming task—particularly when certifications and compliance frameworks like PCI, HIPAA, and SOC 2 present a lengthy list of requirements. DevOps and engineering teams need to ship code updates at a rapid pace, making it easy for them to accidentally overlook misconfigurations. Meanwhile, security teams often don’t have the context they need to understand resource behavior and ownership, identify the highest-priority vulnerabilities...

about 1 year ago

engineering on Modern monitoring & security

Run Atomic Red Team detection tests in container environments with Datadog’s Workload Security Evaluator

Ensuring your threat detection rules work as intended and provide sufficient coverage for major threats is a critical component of a security program. Red Canary’s Atomic Red Team—an open source library of detection tests that help teams validate the effectiveness of their security measures—has historically been the tool of choice for detection testing. But while the methods in Atomic Red Team are tried and true for traditional Windows and Linux hosts, evaluating detection coverage for containers and cloud environments can...

about 1 year ago

engineering on Modern monitoring & security

Integrate Sigma detection rules with Datadog Cloud SIEM

As organizations grow, they naturally need to analyze logs from more data sources. But as these data sources expand in number and type, it becomes more difficult for teams to scale their security detection rules to keep up with the ever-changing threat landscape. Sigma is an open source project that aims to address this challenge. By leveraging the expertise of the open source community, Sigma enables security teams to implement out-of-the-box rules that cover a wide range of threat scenarios. We’re...

about 1 year ago

engineering on Modern monitoring & security

Changes to Datadog Cloud Security Management

In order to better meet organizations’ specific requirements for securing their environments, we are making changes to our Cloud Security Management product. On August 1, Datadog introduced new offerings in Cloud Security Management: CSM Pro and CSM Enterprise. Alongside Datadog Cloud Workload Security, these distinct packages provide customers with security capabilities tailored to their particular use cases and needs. CSM Pro provides continuous security scanning of your cloud and container environments for misconfigurations and resource vulnerabilities. With CSM Pro, teams across...

about 1 year ago

engineering on Modern monitoring & security

Easily install the Datadog Agent using AWS Systems Manager

AWS Systems Manager (SSM), an end-to-end management solution for AWS resources, provides a marketplace of pre-packaged software scripts for SSM-managed Windows and Linux instances, enabling AWS users to automatically install custom software on large groups of instances. Datadog now offers documents that enable easy, one-click installation of the latest version of our Agent for both Linux and Windows through the AWS SSM marketplace, allowing joint Datadog and AWS users to install the Agent without having to configure the Agent YAML file....

about 1 year ago

engineering on Modern monitoring & security

Key questions to ask when setting SLOs

Many organizations rely on service level objectives (SLOs) to help them gauge the reliability of their products. By setting SLOs that define clear and measurable reliability targets, businesses can ensure they are delivering positive end-user experiences to their customers. Clearly defined SLOs also make it much easier for businesses to understand what tradeoffs they may have to make in order to deliver those specific experiences. For instance, meeting certain SLOs might require more resources, which could drive up costs or...

about 1 year ago

engineering on Modern monitoring & security

Key metrics for CoreDNS monitoring

CoreDNS is an open source DNS server that can resolve requests for internet domain names and provide service discovery within a Kubernetes cluster. CoreDNS is the default DNS provider in Kubernetes as of v1.13. Though it can be used independently of Kubernetes, this series will focus on its role in providing Kubernetes service discovery, which simplifies cluster networking by enabling clients to access services using DNS names rather than IP addresses. It’s important to monitor CoreDNS to ensure that elevated...

about 1 year ago

engineering on Modern monitoring & security

Tools for collecting metrics and logs from CoreDNS

In Part 1 of this series, we looked at key metrics you should monitor to understand the performance of your CoreDNS servers. In this post, we’ll show you how to collect and visualize these metrics. We’ll also explore how CoreDNS logging works and show you how to collect CoreDNS logs to get even deeper visibility into your Deployment. Collect and visualize CoreDNS metrics: The CoreDNS prometheus plugin exposes metrics in the OpenMetrics format, a text-based standard that evolved from the Prometheus format....

about 1 year ago

engineering on Modern monitoring & security

How to monitor CoreDNS with Datadog

In Part 1 of this series, we introduced you to the key metrics you should be monitoring to ensure that you get optimal performance from CoreDNS running in your Kubernetes clusters. In Part 2, we showed you some tools you can use to monitor CoreDNS. In this post, we’ll show you how you can use Datadog to monitor metrics, logs, and traces from CoreDNS alongside telemetry from the rest of your cluster, including the infrastructure it runs on. We’ll show you...

about 1 year ago

engineering on Modern monitoring & security

Retail E-commerce Solution Brief

Optimize system performance and customer experience with Datadog. Learn how Datadog helps you deliver the best experience to your customers with end-to-end visibility and analytics in one platform. In this solution brief, you’ll learn how Datadog allows retailers and e-commerce vendors to be data-driven and customer-obsessed, understand the digital buying experience, ensure website reliability for seasonal promotions, and monitor in-store point-of-sale devices. Complete the form to receive the solution brief.

about 1 year ago

engineering on Modern monitoring & security

OpenAI Solution Brief

Learn how Datadog can monitor, assess, and optimize your organization’s usage of OpenAI’s API. OpenAI is an AI research and development company whose products include the GPT family of large language models. In this solution brief, you’ll learn how Datadog provides critical insights into OpenAI usage patterns and enables you to monitor and allocate costs based on token usage, analyze API response times to troubleshoot and optimize performance, and get insights across multiple AI models. Complete the form to receive the solution brief.

about 1 year ago

engineering on Modern monitoring & security

Security Analytics Solution Brief

Learn how Datadog's security analytics solutions allow Dev, Sec, and Ops teams to catch potential threats earlier and improve security posture. Traditional security analytics tools are unable to deliver effective threat detection and investigation for public cloud environments. In this solution brief, you’ll learn how Datadog Cloud Security Platform protects an organization’s production environment and enables you to: detect security threats with the monitoring data you already collect; cost-effectively ingest and analyze your logs; get real-time, out-of-the-box security analytics; unify visibility for faster triage and...

about 1 year ago

engineering on Modern monitoring & security

Pivotal Solution Brief

Learn how Datadog can be used to monitor applications running on Pivotal Platform. Pivotal Platform, now known as VMware Tanzu Application Service, is a multi-cloud platform for the deployment, management, and continuous delivery of applications, containers, and functions. In this solution brief, you’ll learn how Datadog enables you to get visibility into your Pivotal Platform cluster and enables operators and developers to: monitor Pivotal Platform deployments in any cloud; rapidly detect and resolve performance issues; scale efficiently and meet compliance goals; analyze the business impact of...

about 1 year ago

engineering on Modern monitoring & security

Toyota deploys at scale faster and more securely by monitoring AWS with Datadog

TMNA has saved $10 million over two years using Chofer. Part of that savings can be attributed to using Datadog to monitor its underlying infrastructure, supporting services, applications, and security data in a single observability platform. Datadog helps TMNA teams free up time they’d typically spend managing infrastructure or observability so they can spend more time on feature delivery. With these time savings, teams now ship projects in weeks instead of quarters. In addition, since new hires can easily make sense...

about 1 year ago

engineering on Modern monitoring & security

Digital Experience Monitoring Solution Brief

Learn how Datadog's Digital Experience Monitoring suite can help teams troubleshoot frontend issues, collaborate more efficiently, and deliver a strong user experience. Companies have become increasingly reliant on web and mobile applications to meet their customers’ needs. In this solution brief, you’ll learn how Datadog provides a single source of truth for frontend monitoring data and enables teams to: run synthetic tests against different environments, devices, browsers, and locations; eliminate data silos for improved UX collaboration; prioritize issues based on severity, frequency, and scope; reduce MTTD...

about 1 year ago

engineering on Modern monitoring & security

Technology Partner Onboarding Guide

Discover the benefits of joining the Datadog Partner Network as a Technology Partner and how to maximize your growth in the Datadog Marketplace. Companies looking to extend their reach and customer base by selling on the Datadog Marketplace may struggle to seamlessly integrate their offerings to meet users’ needs. This involves ensuring value alignment, adhering to integration standards, and effectively distinguishing their solutions in a robust marketplace. Read this Technology Partner Onboarding Guide to learn how to join the DPN as a...

about 1 year ago

engineering on Modern monitoring & security

Datadog Announces Bits, an AI Assistant to Help Engineers Quickly Resolve Application Issues

SAN FRANCISCO, Aug. 3, 2023 /PRNewswire/ — Datadog, Inc. (NASDAQ: DDOG), the monitoring and security platform for cloud applications, today at DASH announced the launch of Bits, a new generative AI-based assistant that learns from customers’ observability data and helps engineers resolve application issues in real time. This capability adds to Datadog’s extensive set of AI features available today. Solving complex performance issues during an incident is challenging and time consuming with the massive volumes of data, documentation, conversations and other information developers...

about 1 year ago

engineering on Modern monitoring & security

Mobile Real User Monitoring


about 1 year ago

engineering on Modern monitoring & security

How We Migrated Our Acceptance Tests to Use Synthetic Monitoring

The Frontend Developer Experience team strives to improve the lives of 300 frontend engineers at Datadog. We cover build systems, tests, deployments, code health, internal tools, and more—we’re here to remove any friction and pain points from our engineers’ workflows. One such pain point was difficult-to-maintain acceptance tests. This is the story of how we migrated a codebase from flaky, unmanageable acceptance testing with Puppeteer (Chromium Headless Browser) to more robust and maintainable Synthetic tests. Identifying pain points—with data: Since our team is...

over 1 year ago

engineering on Modern monitoring & security

2023-03-08 Incident: A Deep Dive into the Platform-level Recovery

On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a previous post we described how we faced the unexpected. We left off with the realization that we had lost 60 percent of our compute capacity. Armed with this knowledge, our teams knew the first step they needed to take: restore our platform in all affected regions in order to provide applications with enough compute capacity to recover. To get there, teams needed to work...

over 1 year ago

engineering on Modern monitoring & security

2023-03-08 Incident: A Deep Dive into Our Incident Response

In March, Datadog experienced a global outage. It was the first of its kind and called for a massive response that involved several hundred engineers working in shifts over the course of the outage, in addition to many concurrent video calls, chats, workstreams, and customer interactions. Within days, we had compiled hundreds of pages of internal deep dives and after-action reports. Like many of you, as custodians of large-scale, complex systems, we have always lived with the realistic expectation that a...

over 1 year ago

engineering on Modern monitoring & security

Not Just Another Network Latency Issue: How We Unraveled a Series of Hidden Bottlenecks

Getting paged to investigate high-urgency issues is a normal aspect of being an engineer. But none of us expect (or want) to get paged about every single deployment. That was what started happening with an application in our usage estimation service. Regardless of the size of the fix or the complexity of the feature we pushed out, we would get paged about high startup latency—an issue that was never related to our changes. These alerts often resolved on their own,...

over 1 year ago

engineering on Modern monitoring & security

2023-03-08 Incident: A Deep Dive into the Platform-level Impact

On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a separate blog post we describe how our products were impacted. This event was highly unusual from the start. An outage that affects services across multiple regions is rare because our regions don’t have direct connections with one another, and because we roll out changes in one region at a time. An outage that affects services across multiple regions running on distinct cloud providers at...

over 1 year ago

engineering on Modern monitoring & security

Making Fetch Happen - Building a General-purpose Query Render Scheduler

Users expect web applications to be fast and responsive, with smooth scrolling and almost instantaneous rendering. Combining complex UI interactions with frequent data fetching, as many Datadog products do, makes optimizing for good runtime performance a challenge. Dashboards in particular are difficult because, unlike other products, users have complete control over the size of their boards and the complexity of their queries. This post describes how we developed a new query and render scheduler, conceived initially to optimize Dashboard performance but...

over 1 year ago

engineering on Modern monitoring & security

Husky: Exactly-Once Ingestion and Multi-Tenancy at Scale

We introduced Husky in a previous blog post (Introducing Husky, Datadog’s Third-Generation Event Store) as Datadog’s third-generation event store. To recap, Husky is a distributed, time-series oriented, columnar store optimized for streaming ingestion and hybrid analytical and search queries. Husky’s architecture decouples storage and compute so that they can be scaled independently. (Figure: Datadog's third-gen storage system, Husky.) Due to the nature of Datadog’s product, Husky’s storage engine is almost completely optimized around serving large scan and aggregation...

over 1 year ago

engineering on Modern monitoring & security

Performance Improvements in the Datadog Agent Metrics Pipeline

Our aspiration for the Datadog Agent is for it to process the maximum amount of data, very quickly, with as little CPU as possible. Striking this balance between performance and efficiency is an ongoing challenge for us. We are constantly searching for ways to optimize processing while using the same or less CPU. Recently, we shipped a change improving the part of the Datadog Agent that computes a unique key for every metric it receives, making it possible...

over 1 year ago

engineering on Modern monitoring & security

DRUIDS, the Design System that Powers Datadog

In 2018, Datadog was in the midst of rapid expansion. We had recently released APM to complement our core infrastructure monitoring product and, with the addition of Log Management and the development of Synthetic and Real User Monitoring, we were becoming a unified data platform. For an enterprise software platform to be successful, the whole has to be greater than the sum of its parts. In Datadog’s case, it means users must be able to connect different types of data, pivot...

about 2 years ago

engineering on Modern monitoring & security

Engineering Spotlight: Tay Nishimura

In this edition of the Datadog Engineering Spotlight, Cecilia from the Community team sat down (virtually) with Tay Nishimura, infrastructure engineer on the Distributed Caching team, to discuss her journey through the tech industry. The past few years have seen an upturn in programs dedicated to guiding women towards careers in tech. But once you’ve begun a career in tech, where do you go? The tech industry is a vast and varied place, and there’s often little guidance on how to...

over 2 years ago

engineering on Modern monitoring & security

Introducing Husky, Datadog's Third-Generation Event Store

This is the story of “Husky”, a new event storage system we built at Datadog. Building a new storage system is a fun and exciting undertaking, and one that shouldn’t be taken lightly. Most importantly, it does not happen in a vacuum. To understand the trade-offs that a new system makes, you need to understand the context: what came before it, and why we decided to build something new. From Metrics to Logs: A few years ago, Datadog announced the general availability of its...

over 2 years ago

engineering on Modern monitoring & security

How Datadog's IT Team Automated Account Inactivity and SaaS Spend Management

Employees at companies of all types use dozens and even hundreds of types of commercial software to do their jobs and to develop their product. Many of these services have a monthly recurring cost per user model (Datadog does not follow a monthly recurring cost per user model; for more information about our pricing, see the Datadog Pricing page). While the plethora of tools out there is great for employees, the management and cost of all these different solutions can...

over 2 years ago

engineering on Modern monitoring & security

It's always DNS . . . except when it's not: A deep dive through gRPC, Kubernetes, and AWS networking

This story began when a routine update to one of our critical services caused a rise in errors. It looked like a simple issue—logs pointed to DNS and our metrics indicated that the impact to users was very low. But weeks later, our engineers were still puzzling over dropped packets, looking for clues in kernel code, and exploring the complexities of Kubernetes networking and gRPC client reconnect algorithms. However, no single team was able to fully understand the issue from...

over 2 years ago

engineering on Modern monitoring & security

Using the Dirty Pipe Vulnerability to Break Out from Containers

The Dirty Pipe vulnerability is a flaw in the Linux kernel that allows an unprivileged process to write to any file it can read, even if it does not have write permissions on this file. This primitive allows for privilege escalation, for instance by overwriting the /etc/passwd file with a new admin user. (Figure: Exploiting Dirty Pipe to add a privileged user to the system by writing to the /etc/passwd file.) This vulnerability could be used for breaking out from unprivileged containers, including...

over 2 years ago

engineering on Modern monitoring & security

Profiling Improvements in Go 1.18

Without a doubt, Go 1.18 is shaping up to be one of the most exciting releases since Go 1.0. You’ve probably heard about major features such as generics and fuzzing, but in this post, I’ll focus on profiling and highlight a few noteworthy improvements to look forward to. A point of personal joy for me is that, as part of my job at Datadog, I was able to contribute to several of the improvements in Go 1.18, which will significantly improve...

over 2 years ago