At Slack, we manage tens of thousands of EC2 instances that host a variety of services, including our Vitess databases, Kubernetes workers, and various components of the Slack application. The majority of these instances run on some version of Ubuntu, while a portion operates on Amazon Linux. With such a vast infrastructure, the critical question arises: how do we efficiently provision these instances and deploy changes across them? The solution lies in a combination of internally-developed services, with Chef playing...
1 day ago
“What are your goals for this quarter?” It’s the question every manager asks, and one that often prompts a flurry of technical objectives and project milestones. Jumping into this internship, I knew my answer. I wanted to practice making informed decisions on my project, since that was one of the challenges I faced last summer. As an intern, I struggled to form a strong opinion without as much context as my team members, and I thought that this decision-making prowess...
19 days ago
All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions (which usually entails a lot of work) and trying to support new behavior atop the existing architecture. The latter approach is usually advised, to save time and reduce risk. However, there are times when it’s worth revising the core architecture of a large software...
about 2 months ago
Slack Data Engineering recently completed a data workload migration from AWS EMR 5 (Spark 2/Hive 2 processing engine) to EMR 6 (Spark 3 processing engine). In this blog post, we will share our migration journey, challenges, and the performance gains we observed in the process. This post aims to assist Data Engineers, Data Infrastructure Engineers, and Product Managers who may be considering migrating to EMR 6/Spark 3. In Data Engineering, our primary objective is to support internal teams, such as Product Engineering, Machine...
3 months ago
At Slack, we’re committed to security that goes beyond the ordinary. We continuously strive to earn and maintain user trust by safeguarding critical components integral to every user’s experience. From passwords to session cookies, and tokens to webhooks, we prioritize protecting everything essential to how users log into the platform and remain authenticated. Through proactive measures and innovative automations that leverage cutting-edge threat intelligence, we’re dedicated to shielding users from potential breaches, cookie hijacking malware, and inadvertent exposure of sensitive...
3 months ago
Slack uses cookies to track session states for users on slack.com and the Slack Desktop app. The ever-present cookie banners have made cookies mainstream, but as a quick refresher, cookies are a little piece of client-side state associated with a website that is sent up to the web server on every request. Websites use this piece of information to inject state into the inherently stateless protocol of HTTP. At Slack, that means every time you sign into a workspace, your...
3 months ago
In a previous blog post, A Simple Kubernetes Admission Webhook, I discussed the process of creating a Kubernetes webhook without relying on Kubebuilder. At Slack, we use this webhook for various tasks, like helping us support long-lived Pods (see Supporting Long-Lived Pods). Today, I delve once more into the topic of long-lived Pods, focusing on our approach to deploying stateful applications through custom resources managed by Kubebuilder.
Lack of control
Many of our teams at Slack use StatefulSets to run their...
3 months ago
In the world of frontend development, one thing remains certain: change is the only constant. New frameworks emerge, and libraries can become obsolete without warning. Keeping up with the ever-changing ecosystem involves handling code conversions, both big and small. One significant shift for us was the transition from Enzyme to React Testing Library (RTL), prompting many engineers to convert their test code to a more user-focused RTL approach. While both Enzyme and RTL have their own strengths and weaknesses, the...
4 months ago
The Data Engineering team is responsible for Slack’s data lake, analytics dashboards, and other data services. The team’s mission is to empower users to leverage data to make decisions quickly, accurately, and easily. Slack’s data lake grew in size from sub-petabyte to over 100 petabytes in recent years and it now spans millions of tables. As the complexity of managing this data grew, so did a diverse team of Slack engineers dedicated to supporting the ecosystem. We have strong female...
4 months ago
At Slack, we’ve long been conservative technologists. In other words, when we invest in leveraging a new category of infrastructure, we do it rigorously. We’ve done this since we debuted machine learning-powered features in 2016, and we’ve developed a robust process and skilled team in the space. Despite that, over the past year we’ve been blown away by the increase in capability of commercially available large language models (LLMs) — and more importantly, the difference they could make for our...
5 months ago
Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week. Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices and 2-pizza teams (~8 people). But what does continuous deployment mean when you’re looking at 150 changes on a normal day? That’s a lot of...
8 months ago
We are heavy users of Amazon Elastic Compute Cloud (EC2) at Slack: we run approximately 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A multitude of teams own and manage our various instances. The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain insight into the instance’s current state. Since it first launched over 10 years ago, AWS customers have used this service to gather useful information about their...
9 months ago
Slack users have more power than ever to automate routine tasks and processes, saving themselves time each day. Workflow Builder, a task automation tool built into Slack, has continued to improve since its launch back in 2019. Along with various new steps and triggers, we built a new sidebar section for all available workflow steps. These steps are now accessible to users without having to open a modal. The enhancement of the Slack Platform, coupled with smart and...
10 months ago
Slack Connect, AKA shared channels, allows communication between different Slack workspaces, via channels shared by participating organizations. Slack Connect has existed for a few years now, and the sheer volume of channels and external connections has increased significantly since the launch. The increased volume introduced scaling problems, but also highlighted that not all external connections are the same, and that our customers have different relationships with their partners. We needed a system that allowed us to customize each connection, while...
10 months ago
On Thursday, 12 Oct. 2022, the EMEA part of the Datastores team (the team responsible for Slack’s database clusters) was having an onsite day in Amsterdam, the Netherlands. We were sitting together for the first time after new engineers had joined the team, when suddenly a few of us were paged: there was an increase in the number of failed database queries. We stopped what we were doing and stepped in to solve the problem. After investigating the issue with...
10 months ago
Introduction
Ever wondered what it’s like to intern as a software engineer at Slack? Picture yourself on the famous Ohana floor (the 61st floor of the Salesforce Tower in San Francisco); access to it was one of many privileges we had as interns. Not only did our experience with Slack’s Data Engineering team let us step onto the tech frontier, but it also gave us the chance to gain invaluable engineering experience and forge relationships across teams. Most importantly, we got to contribute...
12 months ago
Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While these cron scripts generally executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and...
12 months ago
Embarking on a journey
Stepping out of SFO into the city’s familiar fog, my story at Slack unfolds once again. As a returning intern, I found myself prepped for another exciting summer, and this opportunity brought a renewed sense of anticipation: a mix of known pathways and new adventures. Returning to an internship can often feel like slipping back into a familiar routine, much like riding an old bike. However, this time around, the gears...
about 1 year ago
Slack handles billions of inbound network requests per day, all of which traverse our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows, from a Slack user’s perspective, across the vast ether of the network to reach AWS and then Slack’s internal services. Let’s dive in!
How packets flow
Our edge network consists of a set of globally-distributed edge regions or AWS datacenters that we call edge...
about 1 year ago
Summary
In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a cell-based architecture over the last 1.5 years. In this series of blog posts, we’ll discuss our reasons for embarking on this massive migration, illustrate the design of our cellular topology along with...
about 1 year ago
Customer-first: Moving from Hero Engineering to Reliability Engineering
From the beginning, Slack has always had a strong focus on the customer experience, and customer love is one of our core values. Slack has grown from a small team to thousands of employees over the years and this customer love has always included a focus on service reliability. In a small startup, it’s manageable to have a reactive reliability focus. For example, one engineer can troubleshoot and solve a systemic issue...
about 1 year ago
Did you know that ground stations transmit signals to satellites 22,236 miles above the equator in geostationary orbits, and that those signals are then beamed down to the entire North American subcontinent? Satellite radios today serve hundreds of channels across 9,540,000 square miles. Unless you’re working at a secret military facility, deep underground, you can enjoy satellite radio everywhere. Just like the satellites, Slack sends millions of messages every day across millions of channels in real time all around the...
over 1 year ago
Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to keep on top of important information. Poor notification completeness erodes the trust of all Slack users. Notifications flow through almost all the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS,...
over 1 year ago
This blog post discusses the strategies that Slack uses to manage the lifecycle (development, support, and eventual retirement) of infrastructure projects, through the lens of our migration across three successive internal “platform” offerings.
Our challenges
Circa 2020, our Cloud Engineering team (now evolved into multiple teams responsible for narrower aspects) was responsible for managing our Infrastructure-as-a-Service providers, compute environments like Chef and Kubernetes, and a large variety of related systems. This team had a broad remit, often with multiple systems...
over 1 year ago
TL;DR: We’re announcing a new open source type checker for Hack, called Hakana. Slack launched in 2014, built with a lot of love and also a lot of PHP code. We started migrating to a different language called Hack in 2016. Hack was created by Facebook after they had struggled to scale their operations with PHP. It offered more type-safety than PHP, and it came with an interpreter (called HHVM) that could run PHP code faster than PHP’s own...
over 1 year ago
After Duplo modularization, we noticed that the task producing a transitive R class was taking a significant amount of time to execute. To eliminate this task altogether, and since the non-transitive R class is advertised to have up to 40% incremental build time improvement, we decided to migrate our codebase to use it. If you’re not familiar with nonTransitiveRClass, previously known as namespacedRclass, it’s an Android Gradle Plugin flag that enables namespacing R classes so that every module’s R class only...
over 1 year ago
Slack launched GovSlack in July 2022. With GovSlack, government agencies, and those they work with, can enable their teams to seamlessly collaborate in their digital headquarters, while keeping security and compliance at the forefront. Using GovSlack includes the following benefits: supports key government security standards, such as FedRAMP High, DoD IL4, and ITAR; runs in AWS GovCloud data centers; enables external collaboration with other GovSlack-using organizations through Slack Connect; provides access to your own set of encryption keys for advanced...
over 1 year ago