Huey
September 18, 2024

Slack Engineering

Advancing Our Chef Infrastructure

At Slack, we manage tens of thousands of EC2 instances that host a variety of services, including our Vitess databases, Kubernetes workers, and various components of the Slack application. The majority of these instances run on some version of Ubuntu, while a portion operates on Amazon Linux. With such a vast infrastructure, the critical question arises: how do we efficiently provision these instances and deploy changes across them? The solution lies in a combination of internally-developed services, with Chef playing...

1 day ago

Slack Engineering

Engineering with Empathy: My Journey to Understanding the User Experience

“What are your goals for this quarter?” It’s the question every manager asks, and one that often prompts a flurry of technical objectives and project milestones. Jumping into this internship, I knew my answer. I wanted to practice making informed decisions on my project, since that was one of the challenges I faced last summer. As an intern, I struggled to form a strong opinion without as much context as my team members, and I thought that this decision-making prowess...

19 days ago

Slack Engineering

Unified Grid: How We Re-Architected Slack for Our Largest Customers

All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions—which usually entails a lot of work—or trying to support new behavior atop the existing architecture. The latter approach is usually advised, to save time and reduce risk. However, there are times when it’s worth revising the core architecture of a large software...

about 2 months ago

Slack Engineering

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently underwent data workload migration from AWS EMR 5 (Spark 2/Hive 2 processing engine) to EMR 6 (Spark 3 processing engine). In this blog, we will share our migration journey, challenges, and the performance gains we observed in the process. This blog aims to assist Data Engineers, Data Infrastructure Engineers, and Product Managers who may be considering migrating to EMR 6/Spark 3. In Data Engineering, our primary objective is to support internal teams—such as Product Engineering, Machine...

3 months ago

Slack Engineering

Proactive Measures Against Password Breaches and Cookie Hijacking

At Slack, we’re committed to security that goes beyond the ordinary. We continuously strive to earn and maintain user trust by safeguarding critical components integral to every user’s experience. From passwords to session cookies, and tokens to webhooks, we prioritize protecting everything essential to how users log into the platform and remain authenticated. Through proactive measures and innovative automations that leverage cutting-edge threat intelligence, we’re dedicated to shielding users from potential breaches, cookie hijacking malware, and inadvertent exposure of sensitive...

3 months ago

Slack Engineering

Catching Compromised Cookies

Slack uses cookies to track session states for users on slack.com and the Slack Desktop app. The ever-present cookie banners have made cookies mainstream, but as a quick refresher, cookies are a little piece of client-side state associated with a website that is sent up to the web server on every request. Websites use this piece of information to inject state into the inherently stateless protocol of HTTP. At Slack, that means every time you sign into a workspace, your...

3 months ago

Slack Engineering

Advanced Rollout Techniques: Custom Strategies for Stateful Apps in Kubernetes

In a previous blog post—A Simple Kubernetes Admission Webhook—I discussed the process of creating a Kubernetes webhook without relying on Kubebuilder. At Slack, we use this webhook for various tasks, like helping us support long-lived Pods (see Supporting Long-Lived Pods), and today, I delve once more into the topic of long-lived Pods, focusing on our approach to deploying stateful applications through custom resources managed by Kubebuilder. Lack of control Many of our teams at Slack use StatefulSets to run their...

3 months ago

Slack Engineering

Balancing Old Tricks with New Feats: AI-Powered Conversion From Enzyme to React Testing Library at Slack

In the world of frontend development, one thing remains certain: change is the only constant. New frameworks emerge, and libraries can become obsolete without warning. Keeping up with the ever-changing ecosystem involves handling code conversions, both big and small. One significant shift for us was the transition from Enzyme to React Testing Library (RTL), prompting many engineers to convert their test code to a more user-focused RTL approach. While both Enzyme and RTL have their own strengths and weaknesses, the...

4 months ago

Slack Engineering

How Women Lead Data Engineering at Slack

The Data Engineering team is responsible for Slack’s data lake, analytics dashboards, and other data services. The team’s mission is to empower users to leverage data to make decisions quickly, accurately, and easily. Slack’s data lake grew in size from sub-petabyte to over 100 petabytes in recent years and it now spans millions of tables. As the complexity of managing this data grew, so did a diverse team of Slack engineers dedicated to supporting the ecosystem.  We have strong female...

4 months ago

Slack Engineering

How We Built Slack AI To Be Secure and Private

At Slack, we’ve long been conservative technologists. In other words, when we invest in leveraging a new category of infrastructure, we do it rigorously. We’ve done this since we debuted machine learning-powered features in 2016, and we’ve developed a robust process and skilled team in the space. Despite that, over the past year we’ve been blown away by the increase in capability of commercially available large language models (LLMs) — and more importantly, the difference they could make for our...

5 months ago

Slack Engineering

The Scary Thing About Automating Deploys

Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week. Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices and 2-pizza teams (~8 people). But what does continuous deployment mean when you’re looking at 150 changes on a normal day? That’s a lot of...

8 months ago

Slack Engineering

Our Journey Migrating to AWS IMDSv2

We are heavy users of Amazon Compute Compute Cloud (EC2) at Slack — we run approximately 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A multitude of teams own and manage our various instances. The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain an insight to the instance’s current state. Since it first launched over 10 years ago, AWS customers used this service to gather useful information about their...

9 months ago

Slack Engineering

Building Custom Animations in the Workflow Builder

Slack users have more power than ever to automate routine tasks and processes, saving themselves time each day. Workflow Builder, a task automation tool built into Slack, has continued to improve since its launch back in 2019. Along with various new steps and triggers, we built a new sidebar section for all available workflow steps. These steps are now accessible to users without having to open a modal. Before After The enhancement of the Slack Platform, coupled with smart and...

10 months ago

Slack Engineering

Managing Slack Connect

Slack Connect, AKA shared channels, allows communication between different Slack workspaces, via channels shared by participating organizations. Slack Connect has existed for a few years now, and the sheer volume of channels and external connections has increased significantly since the launch. The increased volume introduced scaling problems, but also highlighted that not all external connections are the same, and that our customers have different relationships with their partners. We needed a system that allowed us to customize each connection, while...

10 months ago

Slack Engineering

The Query Strikes Again

On Thursday, 12 Oct. 2022, the EMEA part of the Datastores team — the team responsible for Slack’s database clusters — was having an onsite day in Amsterdam, the Netherlands. We’re sitting together for the first time after new engineers had joined the team, when suddenly a few of us were paged: There was an increase in the number of failed database queries. We stopped what we were doing and staged-in to solve the problem. After investigating the issue with...

10 months ago

Slack Engineering

Unleashing Impact at Slack’s Data Engineering Internship

Introduction Ever wondered what it’s like to intern as a software engineer at Slack? Picture yourself on the famous Ohana floor—the 61st floor of the Salesforce Tower in San Francisco— it is one of many privileges we had as interns. Not only did our experience with Slack’s Data Engineering team let us step onto the tech frontier, but it also gave us the chance to gain invaluable engineering experience and forge relationships across teams. Most importantly, we got to contribute...

12 months ago

Slack Engineering

Executing Cron Scripts Reliably At Scale

Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues lead us to design and...

12 months ago

Slack Engineering

My Summer Return Internship @ Slack: A Guide on Building on Past Experiences

Embarking on a journey  Stepping out of SFO with the familiarity of the fogginess of the city, my story at Slack unfolds once again. As a return intern, I found myself prepped for another exciting summer, and this opportunity encompassed a renewed sense of anticipation — a mix between known pathways and new adventures. Returning to an internship can often feel like slipping back into a familiar routine, much like riding an old bike. However, this time around, the gears...

about 1 year ago

Slack Engineering

Traffic 101: Packets Mostly Flow

  Slack handles billions of inbound network requests per day, all of which traverse through our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows — from a Slack’s user perspective — across the vast ether of the network to reach AWS and then Slack’s internal services. Let’s dive in! How packets flow Our edge network consists of a set of globally-distributed edge regions or AWS datacenters that we call edge...

about 1 year ago

Slack Engineering

Slack’s Migration to a Cellular Architecture

Summary In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a cell-based architecture over the last 1.5 years. In this series of blog posts, we’ll discuss our reasons for embarking on this massive migration, illustrate the design of our cellular topology along with...

about 1 year ago

Slack Engineering

Service Delivery Index: A Driver for Reliability

Customer-first: Moving from Hero Engineering to Reliability Engineering From the beginning, Slack has always had a strong focus on the customer experience, and customer love is one of our core values. Slack has grown from a small team to thousands of employees over the years and this customer love has always included a focus on service reliability. In a small startup, it’s manageable to have a reactive reliability focus. For example, one engineer can troubleshoot and solve a systemic issue...

about 1 year ago

Slack Engineering

Real-time Messaging

Did you know that ground stations transmit signals to satellites 22,236 miles above the equator in geostationary orbits, and that those signals are then beamed down to the entire North American subcontinent? Satellite radios today serve hundreds of channels across 9,540,000 square miles. Unless you’re working at a secret military facility, deep underground, you can enjoy satellite radio everywhere.  Just like the satellites, Slack sends millions of messages every day across millions of channels in real time all around the...

over 1 year ago

Slack Engineering

Tracing Notifications

Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to keep on top of important information. Poor notification completeness erodes the trust of all Slack users.  Notifications flow through almost all the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS,...

over 1 year ago

Slack Engineering

Technology Lifecycle

This blog post discusses the strategies that Slack uses to manage the lifecycle (development, support, and eventual retirement) of infrastructure projects, through the lens of the migration through three successive internal “platform” offerings. Our challenges Circa 2020, our Cloud Engineering team (now evolved into multiple teams responsible for narrower aspects) was responsible for managing our Infrastructure-as-a-Service providers, compute environments like Chef and Kubernetes, and a large variety of related systems. This team had a broad remit, often with multiple systems...

over 1 year ago

Slack Engineering

Hakana: Taking Hack Seriously

TL; DR: We’re announcing a new open source type checker for Hack, called Hakana. Slack launched in 2014, built with a lot of love and also a lot of PHP code. We started migrating to a different language called Hack in 2016. Hack was created by Facebook after they had struggled to scale their operations with PHP. It offered more type-safety than PHP, and it came with an interpreter (called HHVM) that could run PHP code faster than PHP’s own...

over 1 year ago

Slack Engineering

It’s a non-transitive R class world

After Duplo modularization, we noticed that the task producing a transitive R class was taking a significant amount of time to execute. To eliminate this task altogether, and since the non-transitive R class is advertised to have up to 40% incremental build time improvement, we decided to migrate our codebase to use it. If you’re not familiar with nonTransitiveRClass, previously known as namespacedRclass, it’s an Android Gradle Plugin flag that enables namespacing R classes so that every module’s R class only...

over 1 year ago

Slack Engineering

What We Learned from Building GovSlack

Slack launched GovSlack in July 2022. With GovSlack, government agencies, and those they work with, can enable their teams to seamlessly collaborate in their digital headquarters, while keeping security and compliance at the forefront. Using GovSlack includes the following benefits: Supports key government security standards, such as FedRAMP High, DoD IL4, and ITAR Runs in AWS GovCloud data centers Enables external collaboration with other GovSlack-using organizations through Slack Connect Provides access to your own set of encryption keys for advanced...

over 1 year ago