engineering on Modern monitoring & security
I’ve been working full-time remote or partially remote for the last 10 years. I currently work full-time remote at Datadog, which is remote-capable (another term being “remote-friendly”) but predominately office-based. As a remote employee with an office-based employer, you must understand your company’s stance on the subject to know how to approach being remote, ensuring you remain connected, visible, and effective. This is my short guide to help you excel as a remote worker in this type of setting. While I...
22 days ago
engineering on Modern monitoring & security
It is a trusted premise of software engineering that we build large systems from smaller components, each of which can be designed and tested with a high degree of confidence. Some design problems, though, only become evident at the system level, and the absence of reliable methods for testing system-level issues can sometimes take us by surprise. Decisions that are legitimate at the component level can have unexpected and sometimes dramatic consequences at the system level. Our global outage on...
about 2 months ago
engineering on Modern monitoring & security
Do you know that feeling when the coding is done and the pull request is approved, and you only need a green pipeline for the merge to be complete? Then the dreaded sequence of events occurs: running your test suite takes 20+ minutes and then fails because of some flaky test that has no connection to your changes. You have to run the tests again, wasting a lot of time and energy for nothing. Lengthy, unstable CI pipelines are a common...
2 months ago
engineering on Modern monitoring & security
Writing a postmortem is an essential learning process after an incident is resolved. But documenting important details comprehensively can be cumbersome, especially when responders have already moved on to the next urgent issue. To make this process easier, we implemented a feature in Bits AI that uses large language models (LLMs) to ease the writing process, aiming to retain the engineers’ control without compromising the primary goal of learning while recapping the details of the incident. To implement this solution, we...
4 months ago
engineering on Modern monitoring & security
Datadog collects billions of events from millions of hosts every minute, and that number keeps growing, and fast. Our data volumes grew 30x between 2017 and 2022. On top of that, the kind of queries we receive from our users has changed significantly. Why? Because our customers have grown in sophistication: they run more complex stacks, want to monitor more data, and run more complex analyses. That, in turn, puts pressure on our timeseries data store. Data stores have a number...
7 months ago
engineering on Modern monitoring & security
When Codiga joined Datadog last year, we integrated our static analyzer product with Datadog’s infrastructure to release Datadog Code Analysis. Codiga’s static analyzer was written in Java and supported code analysis for Python, JavaScript, and TypeScript. It initially used ANother Tool for Language Recognition (ANTLR) for generating an abstract syntax tree (AST) of each language supported by the product. Shortly after the acquisition, we worked to expand support to additional languages, but ran into two major issues: With our existing tools and...
8 months ago
engineering on Modern monitoring & security
In this edition of the Datadog Engineering Spotlight, Tom from the Community team sat down with Ivo Dimitrov, one of our Engineering VPs. Tom and Ivo spoke about Ivo’s career as an engineering manager for several top organizations, his transition from individual contributor to manager, and what excites him today about the distributed data systems his teams are building. This interview has been edited for clarity and length. How did you originally become interested in software engineering? I’ve been in the industry for...
8 months ago
engineering on Modern monitoring & security
In Part 1 of this series, I presented a high-level overview of the architecture, implementation, and initialization of Datadog’s .NET profiler, which consists of several individual profilers that collect data for particular resources. I went on to discuss profiling CPU and wall time in Part 2 and exceptions and lock contention in Part 3, alongside detailed explanations of stack walking across different platforms and upscaling sampled data. This fourth and final part covers memory usage profiling—why it’s useful, how our profiler...
8 months ago
engineering on Modern monitoring & security
How do we surface the rich stories hidden within our users’ observability data? We can use percentiles to communicate performance for a specific percentage of cases—but for the full shape of performance, we use distribution metrics. These metrics, powered by DDSketch, aggregate data from multiple hosts during a flush interval, enabling users to analyze statistical distributions across their entire infrastructure. To visually represent this high-resolution data, we use heatmap visualizations—which provide a means to effectively convey high-cardinality point distributions. In this blog...
9 months ago
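To make the percentile idea in the excerpt above concrete, here is a minimal sketch of an exact nearest-rank percentile over a small in-memory sample. It is not DDSketch and not how Datadog computes distribution metrics; the function name `percentile` and the latency values are invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]
summary = {p: percentile(latencies_ms, p) for p in (50, 90, 99)}
```

A single percentile describes only one point of the distribution: here the median hides both outliers, which is the gap a full distribution view is meant to close.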
engineering on Modern monitoring & security
At Datadog, we’ve been using SwiftUI since day one. We went from initially using it for prototyping and building internal tools, to adopting it in small features, then to building full products! In 2022, we introduced APM Services with its rich data visualization experience to the Datadog mobile application. And for that, we started implementing DogGraphs, an internal graphing library, to bring Datadog’s data visualization to iOS using native technologies like Swift and SwiftUI, as no public library met our needs...
9 months ago
engineering on Modern monitoring & security
Previously in this series, we presented a high-level overview of Datadog’s .NET continuous profiler and discussed CPU and wall time profiling. This third part covers exceptions and lock contention profiling, with related sampling strategies. When an application throws too many exceptions, or its threads are waiting too long on the same locks, the impact on the performance is not as straightforward as CPU consumption but could be noticeable. Throwing too many exceptions usually increases the CPU consumption. For example, consider when...
10 months ago
engineering on Modern monitoring & security
In this edition of the Datadog Engineering Spotlight, Austin from the Community team sat down (virtually) with Marie-Laure Bardonnet. She’s a Senior Engineering Manager leading engineering for Datadog’s Log Management team, and was once an intern in the New York office. We talk about her growth as an individual contributor and as a people manager, her passion for mentorship and development, and her engineering-driven approach to problem solving and decision making. This interview has been edited for clarity and length. How has...
10 months ago
engineering on Modern monitoring & security
“I notice that you use plain, simple language, short words and brief sentences. That is the way to write English—it is the modern way and the best way.” - Mark Twain Just as development teams adopt linters such as Prettier in their workflows to flag errors and style issues in their source code, documentation teams use linters such as Vale to enforce style guidelines and maintain a standard for clear and concise prose. Crafting prose can be tricky—for example, consider this update...
11 months ago
engineering on Modern monitoring & security
The first part of this series introduced the high-level architecture of the Datadog .NET continuous profiler. I discussed its initialization and the impact of the .NET runtime (CLR) version to figure out which CLR services to use. The goal of the profiler is to collect different kinds of profiling samples: CPU, wall time, exceptions, lock contention, etc. Each sample contains a call stack, a list of labels for the context (e.g. current thread, span ID) and a vector of values...
11 months ago
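The sample shape described in the excerpt above (a call stack, context labels, and a vector of values) can be pictured with a small data structure. This is an illustrative sketch only, not the profiler's actual types: the `Sample` class, the `aggregate` helper, and the assumption that values from identical stacks and labels are summed are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One profiling sample: where it was taken, in what context, and what was measured."""
    call_stack: list[str]                                  # frames, outermost first (assumed order)
    labels: dict[str, str] = field(default_factory=dict)   # e.g. thread or span ID
    values: list[int] = field(default_factory=list)        # one slot per measured quantity (assumed)


def aggregate(samples):
    """Sum the value vectors of samples that share the same call stack and labels."""
    totals = {}
    for sample in samples:
        key = (tuple(sample.call_stack), tuple(sorted(sample.labels.items())))
        if key not in totals:
            totals[key] = list(sample.values)
        else:
            totals[key] = [a + b for a, b in zip(totals[key], sample.values)]
    return totals


# Two made-up samples from the same stack and thread, one from another stack.
samples = [
    Sample(["Main", "HandleRequest"], {"thread": "7"}, [10, 1]),
    Sample(["Main", "HandleRequest"], {"thread": "7"}, [5, 1]),
    Sample(["Main", "Flush"], {"thread": "7"}, [2, 1]),
]
merged = aggregate(samples)
```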
engineering on Modern monitoring & security
The Profiling Engineering team at Datadog develops profiling tools for various runtimes, including Microsoft .NET. This blog post is the first in a series explaining the technical architecture and implementation choices behind our .NET profiler. Along the way, we’ll discuss profiling for CPU, wall time, exceptions, lock contention, and allocations. What is a profiler? Before digging into the details, let’s define what a profiler is: a profiler is a tool that allows you to analyze application performance and method call stacks. While...
about 1 year ago
engineering on Modern monitoring & security
In October at the Crunch Conference in Budapest, Jean-Mathieu Saponaro, Data & Analytics Senior Engineering Manager at Datadog, delivered a presentation about how he and his teams scaled self-serve analytics within Datadog up to the 5,000 employees we have today. Self-serve analytics is the dream of any Data & Analytics team: being able to reach that point where everyone in your company is leveraging data to answer day-to-day questions and make decisions without requiring your team’s help. But how do you...
about 1 year ago
engineering on Modern monitoring & security
In this edition of the Datadog Engineering Spotlight, Rosa from the Community team sat down (virtually) with Jeromy Carriere. He’s SVP, Product Engineering, leading engineering for all of Datadog’s product areas including Infrastructure Monitoring, Log Management, Application Performance Monitoring, Security, Service Management, and User Experience Monitoring. We talk about his relationship with academia, learning from his mistakes, driving the first cloud monitoring offerings at Google, and the future of observability. This interview has been edited for clarity and length. What’s your day-to-day...
about 1 year ago
engineering on Modern monitoring & security
As user expectations for mobile apps increase, effective bug remediation involves not only addressing critical incidents as they occur but also proactively handling smaller performance issues in order to ensure a smooth user experience (UX). Instabug helps you understand how users experience your app with crucial mobile performance metrics—such as launch metrics, loading times, and UI hangs—viewable alongside your bug reports. Additionally, with granular release metrics, Instabug expedites incident detection and remediation throughout your entire development lifecycle. This enables you...
over 1 year ago
engineering on Modern monitoring & security
Vertex AI is Google’s platform offering AI and machine learning computing as a service—enabling users to train and deploy machine learning (ML) models and AI applications in the cloud. In June 2023, Google added generative AI support to Vertex AI, so users can test, tune, and deploy Google’s large language models (LLMs) for use in their applications. We’re pleased to announce that Datadog now integrates with Vertex AI, helping you track the health and performance of your LLM-powered services in production....
over 1 year ago
engineering on Modern monitoring & security
Learn how to use Datadog for real-time insights to move your mission forward. Datadog is a unified observability platform that collects, processes, and analyzes performance data from your entire IT environment, providing the visibility US Government agencies need to advance technology initiatives quickly and securely. In this solution brief, you’ll learn how US government agencies can achieve four key outcomes of unified observability: faster time to remediation, improved security posture, increased operational efficiency, and optimized citizen customer experience (Citizen CX). Complete the form to receive the solution brief.
over 1 year ago
engineering on Modern monitoring & security
Security and DevOps engineers often spend a lot of time and effort creating and managing complex, repetitive workflows, such as incident response, honeypotting, recovery and remediation, and more. Blink is a no-code security platform that enables users to create workflow automations, triggers, and self-service apps to streamline processes, better enforce guardrails, and eliminate operational bottlenecks. These capabilities help reduce the time and effort required for your security and DevOps teams to create a more dependable, secure system. Datadog now offers an...
over 1 year ago
engineering on Modern monitoring & security
NEW YORK — Datadog, Inc. (NASDAQ: DDOG), the monitoring and security platform for cloud applications, today announced the results of its annual State of Serverless report. The 2023 report—which analyzes telemetry across Datadog’s global customer base—found that the serverless ecosystem continues to grow and evolve, particularly as organizations extend their use of container-based applications hosted in serverless environments. As of this year, more than 70% of Datadog’s AWS customers, 60% of Google Cloud customers and almost 50% of Azure customers are using...
over 1 year ago
engineering on Modern monitoring & security
Population: For this report, we compiled usage data from thousands of companies in Datadog's customer base. But while Datadog customers cover the spectrum of company size and industry, they do share some common traits. First, they tend to be serious about software infrastructure and application performance. They also skew toward adoption of cloud platforms and services more than the general population. All the results in this article are biased by the fact that the data comes from our customer base, a...
over 1 year ago
engineering on Modern monitoring & security
The complexity of microservice architectures can make it hard to determine where an application’s dependencies begin and end and who manages which ones. This can pose a variety of challenges both in the course of day-to-day operations and during incidents. Lacking a clear picture of the ownership and interplay of your services can impede accountability and cause application development, incident investigations, and onboarding processes to become prolonged and haphazard. To prevent these difficulties, the Datadog Service Map now allows you to...
over 1 year ago
engineering on Modern monitoring & security
In complex cloud environments where the speed of development is accelerated, managing infrastructure and resource configurations can be an overwhelming task—particularly when certifications and compliance frameworks like PCI, HIPAA, and SOC 2 present a lengthy list of requirements. DevOps and engineering teams need to ship code updates at a rapid pace, making it easy for them to accidentally overlook misconfigurations. Meanwhile, security teams often don’t have the context they need to understand resource behavior and ownership, identify the highest-priority vulnerabilities...
over 1 year ago
engineering on Modern monitoring & security
Ensuring your threat detection rules work as intended and provide sufficient coverage for major threats is a critical component of a security program. Red Canary’s Atomic Red Team—an open source library of detection tests that help teams validate the effectiveness of their security measures—has historically been the tool of choice for detection testing. But while the methods in Atomic Red Team are tried and true for traditional Windows and Linux hosts, evaluating detection coverage for containers and cloud environments can...
over 1 year ago
engineering on Modern monitoring & security
As organizations grow, they naturally need to analyze logs from more data sources. But as these data sources expand in number and type, it becomes more difficult for teams to scale their security detection rules to keep up with the ever-changing threat landscape. Sigma is an open source project that aims to address this challenge. By leveraging the expertise of the open source community, Sigma enables security teams to implement out-of-the-box rules that cover a wide range of threat scenarios. We’re...
over 1 year ago
engineering on Modern monitoring & security
In order to better meet organizations’ specific requirements for securing their environments, we are making changes to our Cloud Security Management product. On August 1, Datadog introduced new offerings in Cloud Security Management: CSM Pro and CSM Enterprise. Alongside Datadog Cloud Workload Security, these distinct packages provide customers with security capabilities tailored to their particular use cases and needs. CSM Pro provides continuous security scanning of your cloud and container environments for misconfigurations and resource vulnerabilities. With CSM Pro, teams across...
over 1 year ago
engineering on Modern monitoring & security
AWS Systems Manager (SSM), an end-to-end management solution for AWS resources, provides a marketplace of pre-packaged software scripts for SSM-managed Windows and Linux instances, enabling AWS users to automatically install custom software on large groups of instances. Datadog now offers documents that enable easy, one-click installation of the latest version of our Agent for both Linux and Windows through the AWS SSM marketplace, allowing joint Datadog and AWS users to install the Agent without having to configure the Agent YAML file....
over 1 year ago
engineering on Modern monitoring & security
Many organizations rely on service level objectives (SLOs) to help them gauge the reliability of their products. By setting SLOs that define clear and measurable reliability targets, businesses can ensure they are delivering positive end-user experiences to their customers. Clearly defined SLOs also make it much easier for businesses to understand what tradeoffs they may have to make in order to deliver those specific experiences. For instance, meeting certain SLOs might require more resources, which could drive up costs or...
over 1 year ago
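As a rough illustration of what a "clear and measurable reliability target" from the excerpt above can look like in numbers, the sketch below checks a request-based SLO and reports how much of the allowed failure count is left. The function `slo_status`, the 99.9% target, and the request counts are assumptions made up for this example, not Datadog's SLO implementation.

```python
def slo_status(good_events, total_events, target):
    """Compare the achieved success ratio with the target and compute the remaining error budget."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be a fraction between 0 and 1")
    achieved = good_events / total_events
    allowed_bad = (1 - target) * total_events   # failures the target tolerates
    actual_bad = total_events - good_events
    return {
        "achieved": achieved,
        "met": achieved >= target,
        "budget_remaining": allowed_bad - actual_bad,  # negative means the budget is overspent
    }


# Hypothetical window: 10,000 requests, 5 failures, 99.9% target.
status = slo_status(good_events=9_995, total_events=10_000, target=0.999)
```

Raising the target shrinks the budget, which is one way to see the tradeoff the excerpt mentions: a stricter objective leaves less room for failures and usually costs more to sustain.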
engineering on Modern monitoring & security
CoreDNS is an open source DNS server that can resolve requests for internet domain names and provide service discovery within a Kubernetes cluster. CoreDNS is the default DNS provider in Kubernetes as of v1.13. Though it can be used independently of Kubernetes, this series will focus on its role in providing Kubernetes service discovery, which simplifies cluster networking by enabling clients to access services using DNS names rather than IP addresses. It’s important to monitor CoreDNS to ensure that elevated...
over 1 year ago
engineering on Modern monitoring & security
In Part 1 of this series, we looked at key metrics you should monitor to understand the performance of your CoreDNS servers. In this post, we’ll show you how to collect and visualize these metrics. We’ll also explore how CoreDNS logging works and show you how to collect CoreDNS logs to get even deeper visibility into your Deployment. Collect and visualize CoreDNS metrics: The CoreDNS prometheus plugin exposes metrics in the OpenMetrics format, a text-based standard that evolved from the Prometheus format....
over 1 year ago
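Because the excerpt above describes the exposed metrics as a text-based format, a short sketch can show what reading such text involves. This is a simplified parser written for illustration, not the official OpenMetrics grammar: it skips comment lines, ignores timestamps and exemplars, and does not unescape label values. The sample counts and label values are made up.

```python
import re

# metric_name{label="value",...} value   (labels are optional)
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples for every sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative scrape output; the numbers are invented.
SCRAPE = """
# TYPE coredns_dns_requests_total counter
coredns_dns_requests_total{server="dns://:53",zone=".",type="A"} 1027
coredns_dns_requests_total{server="dns://:53",zone=".",type="AAAA"} 311
"""
parsed = parse_metrics(SCRAPE)
```

In practice a maintained client library or an agent integration would handle the full format; the point here is only that each sample line carries a name, a set of labels, and a numeric value.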
engineering on Modern monitoring & security
In Part 1 of this series, we introduced you to the key metrics you should be monitoring to ensure that you get optimal performance from CoreDNS running in your Kubernetes clusters. In Part 2, we showed you some tools you can use to monitor CoreDNS. In this post, we’ll show you how you can use Datadog to monitor metrics, logs, and traces from CoreDNS alongside telemetry from the rest of your cluster, including the infrastructure it runs on. We’ll show you...
over 1 year ago
engineering on Modern monitoring & security
Optimize system performance and customer experience with Datadog. Learn how Datadog helps you deliver the best experience to your customers with end-to-end visibility and analytics in one platform. In this solution brief, you’ll learn how Datadog allows retailers and e-commerce vendors to be data-driven and customer-obsessed, understand the digital buying experience, ensure website reliability for seasonal promotions, and monitor in-store point-of-sale devices. Complete the form to receive the solution brief.
over 1 year ago
engineering on Modern monitoring & security
Learn how Datadog's Digital Experience Monitoring suite can help teams troubleshoot frontend issues, collaborate more efficiently, and deliver a strong user experience. Companies have become increasingly reliant on web and mobile applications to meet their customers’ needs. In this solution brief, you’ll learn how Datadog provides a single source of truth for frontend monitoring data and enables teams to: run synthetic tests against different environments, devices, browsers, and locations; eliminate data silos for improved UX collaboration; prioritize issues based on severity, frequency, and scope; reduce MTTD...
over 1 year ago
engineering on Modern monitoring & security
Learn how Datadog can monitor, assess, and optimize your organization’s usage of OpenAI’s API. OpenAI is an AI research and development company whose products include the GPT family of large language models. In this solution brief, you’ll learn how Datadog provides critical insights into OpenAI usage patterns and enables you to monitor and allocate costs based on token usage, analyze API response times to troubleshoot and optimize performance, and get insights across multiple AI models. Complete the form to receive the solution brief.
over 1 year ago
engineering on Modern monitoring & security
Learn how Datadog can be used to monitor applications running on Pivotal Platform. Pivotal Platform, now known as VMware Tanzu Application Service, is a multi-cloud platform for the deployment, management, and continuous delivery of applications, containers, and functions. In this solution brief, you’ll learn how Datadog enables you to get visibility into your Pivotal Platform cluster and enables operators and developers to: monitor Pivotal Platform deployments in any cloud; rapidly detect and resolve performance issues; scale efficiently and meet compliance goals; analyze the business impact of...
over 1 year ago
engineering on Modern monitoring & security
Learn how Datadog's security analytics solutions allow Dev, Sec, and Ops teams to catch potential threats earlier and improve security posture. Traditional security analytics tools are unable to deliver effective threat detection and investigation for public cloud environments. In this solution brief, you’ll learn how Datadog Cloud Security Platform protects an organization’s production environment and enables you to: detect security threats with the monitoring data you already collect; cost-effectively ingest and analyze your logs; get real-time, out-of-the-box security analytics; unify visibility for faster triage and...
over 1 year ago
engineering on Modern monitoring & security
TMNA has saved $10 million over two years using Chofer. Part of that savings can be attributed to using Datadog to monitor its underlying infrastructure, supporting services, applications, and security data in a single observability platform. Datadog helps TMNA teams free up time they’d typically spend managing infrastructure or observability so they can spend more time on feature delivery. With these time savings, teams now ship projects in weeks instead of quarterly. In addition, since new hires can easily make sense...
over 1 year ago
engineering on Modern monitoring & security
Discover the benefits of joining the Datadog Partner Network as a Technology Partner and how to maximize your growth in the Datadog Marketplace. Companies looking to extend their reach and customer base by selling on the Datadog Marketplace may struggle to seamlessly integrate their offerings to meet users’ needs. This involves ensuring value alignment, adhering to integration standards, and effectively distinguishing their solutions in a robust marketplace. Read this Technology Partner Onboarding Guide to learn how to join the DPN as a...
over 1 year ago
engineering on Modern monitoring & security
SAN FRANCISCO, Aug. 3, 2023 /PRNewswire/ — Datadog, Inc. (NASDAQ: DDOG), the monitoring and security platform for cloud applications, today at DASH announced the launch of Bits, a new generative AI-based assistant that learns from customers’ observability data and helps engineers resolve application issues in real time. This capability adds to Datadog’s extensive set of AI features available today. Solving complex performance issues during an incident is challenging and time consuming with the massive volumes of data, documentation, conversations and other information developers...
over 1 year ago
engineering on Modern monitoring & security
The Frontend Developer Experience team strives to improve the lives of 300 frontend engineers at Datadog. We cover build systems, tests, deployments, code health, internal tools, and more—we’re here to remove any friction and pain points from our engineers’ workflows. One such pain point was difficult-to-maintain acceptance tests. This is the story of how we migrated a codebase from flaky, unmanageable acceptance testing with Puppeteer (Chromium Headless Browser) to more robust and maintainable Synthetic tests. Identifying pain points—with data: Since our team is...
over 1 year ago
engineering on Modern monitoring & security
On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a previous post we described how we faced the unexpected. We left off with the realization that we had lost 60 percent of our compute capacity. Armed with this knowledge, our teams knew the first step they needed to take: restore our platform in all affected regions in order to provide applications with enough compute capacity to recover. To get there, teams needed to work...
over 1 year ago
engineering on Modern monitoring & security
In March, Datadog experienced a global outage. It was the first of its kind and called for a massive response that involved several hundred engineers working in shifts over the course of the outage, in addition to many concurrent video calls, chats, workstreams, and customer interactions. Within days, we had compiled hundreds of pages of internal deep dives and after-action reports. Like many of you, as custodians of large-scale, complex systems, we have always lived with the realistic expectation that a...
over 1 year ago
engineering on Modern monitoring & security
Getting paged to investigate high-urgency issues is a normal aspect of being an engineer. But none of us expect (or want) to get paged about every single deployment. That was what started happening with an application in our usage estimation service. Regardless of the size of the fix or the complexity of the feature we pushed out, we would get paged about high startup latency—an issue that was never related to our changes. These alerts often resolved on their own,...
over 1 year ago
engineering on Modern monitoring & security
On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a separate blog post we describe how our products were impacted. This event was highly unusual from the start. An outage that affects services across multiple regions is rare because our regions don’t have direct connections with one another, and because we roll out changes in one region at a time. An outage that affects services across multiple regions running on distinct cloud providers at...
over 1 year ago
engineering on Modern monitoring & security
Users expect web applications to be fast and responsive, with smooth scrolling and almost instantaneous rendering. Combining complex UI interactions with frequent data fetching, as many Datadog products do, makes optimizing for good runtime performance a challenge. Dashboards are particularly difficult because, unlike in other products, users have complete control over the size of their boards and the complexity of their queries. This post describes how we developed a new query and render scheduler, conceived initially to optimize Dashboard performance but...
over 1 year ago
engineering on Modern monitoring & security
We introduced Husky in a previous blog post (Introducing Husky, Datadog’s Third-Generation Event Store) as Datadog’s third-generation event store. To recap, Husky is a distributed, time-series oriented, columnar store optimized for streaming ingestion and hybrid analytical and search queries. Husky’s architecture decouples storage and compute so that they can be scaled independently. (Figure: Datadog's third-gen storage system, Husky.) Due to the nature of Datadog’s product, Husky’s storage engine is almost completely optimized around serving large scan and aggregation...
almost 2 years ago
engineering on Modern monitoring & security
Our aspiration for the Datadog Agent is for it to process the maximum amount of data, very quickly, with as little CPU as possible. Striking this balance between performance and efficiency is an ongoing challenge for us. We are constantly searching for ways to optimize processing while using the same or less CPU. Recently, we shipped a change improving the part of the Datadog Agent that computes a unique key for every metric it receives, making it possible...
almost 2 years ago
engineering on Modern monitoring & security
In 2018, Datadog was in the midst of rapid expansion. We had recently released APM to complement our core infrastructure monitoring product and, with the addition of Log Management and the development of Synthetic and Real User Monitoring, we were becoming a unified data platform. For an enterprise software platform to be successful, the whole has to be greater than the sum of its parts. In Datadog’s case, it means users must be able to connect different types of data, pivot...
over 2 years ago
engineering on Modern monitoring & security
In this edition of the Datadog Engineering Spotlight, Cecilia from the Community team sat down (virtually) with Tay Nishimura, infrastructure engineer on the Distributed Caching team, to discuss her journey through the tech industry. The past few years have seen an upturn in programs dedicated to guiding women towards careers in tech. But once you’ve begun a career in tech, where do you go? The tech industry is a vast and varied place, and there’s often little guidance on how to...
over 2 years ago
engineering on Modern monitoring & security
This is the story of “Husky”, a new event storage system we built at Datadog. Building a new storage system is a fun and exciting undertaking, and one that shouldn’t be taken lightly. Most importantly, it does not happen in a vacuum. To understand the trade-offs that a new system makes, you need to understand the context: what came before it, and why we decided to build something new. From Metrics to Logs: A few years ago, Datadog announced the general availability of its...
over 2 years ago
engineering on Modern monitoring & security
Employees at companies of all types use dozens and even hundreds of types of commercial software to do their jobs and to develop their product. Many of these services have a monthly recurring cost per user model (Datadog does not follow a monthly recurring cost per user model; for more information about our pricing, see the Datadog Pricing page). While the plethora of tools out there is great for employees, the management and cost of all these different solutions can...
over 2 years ago
engineering on Modern monitoring & security
This story began when a routine update to one of our critical services caused a rise in errors. It looked like a simple issue—logs pointed to DNS and our metrics indicated that the impact to users was very low. But weeks later, our engineers were still puzzling over dropped packets, looking for clues in kernel code, and exploring the complexities of Kubernetes networking and gRPC client reconnect algorithms. However, no single team was able to fully understand the issue from...
almost 3 years ago
engineering on Modern monitoring & security
The Dirty Pipe vulnerability is a flaw in the Linux kernel that allows an unprivileged process to write to any file it can read, even if it does not have write permissions on this file. This primitive allows for privilege escalation, for instance by overwriting the /etc/passwd file with a new admin user. (Figure: Exploiting Dirty Pipe to add a privileged user to the system by writing to the /etc/passwd file.) This vulnerability could be used for breaking out from unprivileged containers, including...
almost 3 years ago
engineering on Modern monitoring & security
Without a doubt, Go 1.18 is shaping up to be one of the most exciting releases since Go 1.0. You’ve probably heard about major features such as generics and fuzzing, but in this post, I’ll focus on profiling and highlight a few noteworthy improvements to look forward to. A point of personal joy for me is that, as part of my job at Datadog, I was able to contribute to several of the improvements in Go 1.18, which will significantly improve...
almost 3 years ago