Innovation
How We Built: An Early-Stage Recommendations Team
Shayak Banerjee, Vijay Pappu
At Peloton, our journey to personalize the user experience started in late 2019 and recently crossed the three-year mark. In our earlier posts, we spoke about building an early-stage recommender system from available cloud computing solutions, as well as an early-stage recommendation model to quickly deliver value to our users. However, neither the systems nor the models would have been possible without our early-stage team to put it all together. In this post, we reflect on some of our learnings about building a new recommendations team and what that entails in terms of personnel, processes and culture.
The Unique Nature of ML Systems
To understand why a machine learning team needs specialized skill sets and processes beyond software engineering, it helps to first look at some of the fundamental challenges posed by machine learning (ML) based systems.
Machine learning systems typically have more moving parts than most back-end software systems. They tend to have batch pipelines of various data processing / modeling / inference tasks strung together by orchestration tools and a variety of offline data storage solutions. They also have an online serving component which is built around a scalable compute solution, and usually involves online storage via databases or caches. This variety extends into monitoring solutions with systems, data and model monitoring in both online and offline worlds.
Machine learning systems have two modes of failure – the application and the data. Most back-end applications enforce schemas at the application layer to prevent data failures, but ML systems pull in data from disparate sources and are typically not in charge of data quality for any of them. This means they can fail while pulling together these data sources, or while generating new data of their own.
Machine learning systems deal with a black box, which is the model. Even when things are apparently working, it can often be difficult to explain why certain results occur, or how we can modify this black box to improve outcomes.
In addition to standard service guarantees such as availability, scalability and uptime, ML systems also have to provide guarantees on the quality of their output (e.g. a computer vision system might have to guarantee 96% classification accuracy).
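To make the first and last of these points concrete, here is a minimal sketch of the kind of orchestrated batch pipeline described above, with an explicit quality gate before results are published online. It uses Airflow, but the task names, placeholder logic and the 0.96 threshold are illustrative assumptions, not a description of our production pipelines.

```python
# Minimal sketch (illustrative only): a batch pipeline strung together by an
# orchestrator, with a quality gate so a degraded model never reaches serving.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model(**_):
    # Placeholder: fit the model on the latest training data.
    ...


def evaluate_model(**_):
    # Placeholder: compute offline metrics and fail loudly on regressions.
    accuracy = 0.97  # would come from an evaluation job in practice
    if accuracy < 0.96:
        raise ValueError(f"Model accuracy {accuracy:.3f} is below the 0.96 guarantee")


def publish_recommendations(**_):
    # Placeholder: write batch inference output to the online store / cache.
    ...


with DAG(
    dag_id="recs_batch_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    publish = PythonOperator(task_id="publish_recommendations", python_callable=publish_recommendations)

    # If evaluation fails, publishing never runs.
    train >> evaluate >> publish
```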
In order to build and maintain such complex systems, we not only needed a team that had the right blend of skill sets, but we also had to establish the right processes to empower such a team to march forward and deliver on our ML initiatives.
The Personalization Team Skillset
The following diagram summarizes the five core competencies that we saw as necessary for building our team at this early stage.

Figure 1: The core competencies necessary for Personalization
Software Engineering – The foundational competency and the must-have skill. Our guiding principles are coding for understandability, safety, performance and observability. Additionally, everyone needed to be comfortable with committing, linting, testing, reviewing and deploying code.
Data - Whether it is building scalable data pipelines to train our models or answering ad-hoc questions about recommendations, the ability to work with data at scale is an essential skill for our team. Answering questions with data often involves some engineering to obtain and process that data in the first place. Given our small team size, a single Data competency covering both science and engineering was a necessity.
Infrastructure - While we had the foundation of a compute platform and a nascent data platform, there are also various ML platform-specific components that we had to build out early. Examples include Airflow, Spark tooling, GPU tooling and streaming pipelines. This required a certain amount of infrastructure expertise.
Product & Analytics - The efficacy of an ML system is determined by its measurable impact on users, and is typically estimated through experimentation. The team needed to have not only product thinkers who could identify the right problems to tackle with recommendations, but also experiment thinkers who could design and analyze those experiments (a minimal analysis sketch follows this list of competencies).
Machine Learning - Of course, ML systems do contain models, which require skills in training and inference. This meant building some competency around neural network frameworks, model training and evaluation in order to iterate on such models.
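To illustrate the analysis side of the Product & Analytics competency, here is a minimal sketch of how the conversion impact of a recommendations experiment might be checked for significance, using made-up numbers and a two-proportion z-test from statsmodels. This shows the underlying statistics only; it is not our experimentation tooling.

```python
# Minimal sketch (numbers are hypothetical): significance test for an A/B test
# on recommendation conversion, using a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

# Users who converted / users exposed, per variant: [control, personalized].
conversions = [4_200, 4_650]
exposures = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
lift = conversions[1] / exposures[1] - conversions[0] / exposures[0]

print(f"absolute lift: {lift:.4f}, z = {z_stat:.2f}, p = {p_value:.4f}")
```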
Learnings about Team Processes & Culture
Just like machine learning algorithms, our team processes have been iterative in nature and we have learned along the way about things that work, and things that don’t. Every company and its platform is unique, but below we list some learnings that we hope generalize to most early ML teams.
Generalists > Specialists. In the early days, we had to deliver value quickly with a small team. In such a scenario, we found generalists who could cut across most of the above five competencies to be valuable. For example, analysis work would be blocked by permission issues (an infrastructure problem), or infrastructure work would be blocked by CUDA errors (a modeling problem). Generalists were good at unblocking themselves when such hurdles arose.
As an example, we reached a point where we needed production-sized datasets for development work. However, we needed this data to be anonymized so that user details were obfuscated. Rather than getting blocked on the Platform team's bandwidth to facilitate this, we wrote our own Spark jobs which performed a simple one-way hash of user details to produce these datasets. It took less than a week, but gave every engineer production-scale data they could work with confidently outside production. By delivering value, generalists improved the team's credibility and maturity, which enabled supporting more specialization down the line.
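A minimal sketch of that idea is below; the column names and storage paths are hypothetical, and the real jobs handled more than a single table.

```python
# Minimal sketch: a Spark job that replaces user identifiers with a one-way hash
# so production-sized datasets can be used safely for development.
# Column names and paths are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("anonymize_users").getOrCreate()

events = spark.read.parquet("s3://example-bucket/raw/workout_events/")

anonymized = (
    events
    # SHA-256 is one-way: the same user maps to the same opaque token,
    # but the token cannot be reversed back to the original id.
    .withColumn("user_id", F.sha2(F.col("user_id").cast("string"), 256))
    # Drop anything personally identifying that downstream work does not need.
    .drop("email", "username")
)

anonymized.write.mode("overwrite").parquet("s3://example-bucket/dev/workout_events_anonymized/")
```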
Setting software engineering standards. As much as we love notebooks for analysis and research, they were not fit for our production systems. We adopted standard DevOps practices in our everyday workflows. Every team member had to be rigorous enough to add unit tests for their code changes and test those changes prior to deployment, in addition to knowing how to deploy their artifacts, monitor for failures and roll back when something goes wrong. We also invested in automation for testing and deployment to reduce manual toil. Our team set a high bar on code review, requiring code to be readable and performant, with built-in observability. This rigor helped us move forward in a more stable manner and kept the codebase understandable, so that everyone could contribute to it.
Encouraging communication. The triumvirate of code + data + model comes with a cost – a large amount of information needs to be communicated about each of them. Also, at an early stage, with a small team trying to provide value to multiple parts of the product, it is very likely that a single person will work on an entire project end-to-end, with some review help from peers. This has the potential to create knowledge silos, which are detrimental, especially if that single person leaves.
To deal with this, we deliberately did not restrict our standups to one-minute updates, and instead encouraged discussions, as long as they could be timeboxed. We also started a dedicated knowledge share forum (which continues to this day) to document and disseminate information about our systems, models and data.
Focusing on the foundations. Building out the first iteration of a product experience based on machine learning takes an inordinately large amount of time. The building blocks put in place in this first phase typically persist for a long time. These include the data pipelines, the model, the infrastructure and the experimentation tooling. Making some long-term choices here helps the team run with stability for longer.
As an example, we built one of our first microservices to serve recommendations to new users on the app in late 2020. It took us a few months longer than anticipated to build, but this microservice has scaled well to 10X the traffic, has caused us minimal operational overhead and has allowed us to experiment with ease. As an added bonus, we built our first impressions logging pipeline as part of this project, which has served as a foundational infrastructure element and is now used by all our services.
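To give a rough idea of what impression logging involves, here is an illustrative shape for an impression event. The field names are assumptions for the sake of example, not our actual schema.

```python
# Illustrative only: the rough shape of an impression event a recommendations
# service might emit for downstream training and analysis. Field names are
# assumptions, not an actual schema.
from dataclasses import dataclass, asdict
import json
import time
import uuid


@dataclass
class ImpressionEvent:
    impression_id: str    # unique id, useful for de-duplication
    user_id: str          # which member saw the recommendations
    surface: str          # e.g. "bike_homescreen"
    class_ids: list[str]  # the recommended classes, in ranked order
    model_version: str    # which model produced the ranking
    served_at_ms: int     # server-side timestamp


event = ImpressionEvent(
    impression_id=str(uuid.uuid4()),
    user_id="hashed-user-123",
    surface="bike_homescreen",
    class_ids=["class-a", "class-b", "class-c"],
    model_version="cf-v1",
    served_at_ms=int(time.time() * 1000),
)

print(json.dumps(asdict(event)))
```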
Being practical about choice of technologies. The world of machine learning is evolving fast, and every year sees the introduction of new models and new technologies. While it is tempting to try them, at an early stage it is difficult to support researching them. It is typically a multi-year journey to build an ML Platform which allows easy iteration, so in the meantime making judicious choices is important. We have found it most prudent to start with proven models and technologies, for the simple reason that these are more productionizable. Battle-tested tools like Airflow and Spark have served us well, while we ran into several hurdles when adopting brand-new technologies like HugeCTR.
Keeping up with state-of-the-art. Just because we cannot try new models and algorithms does not mean we should not be aware of them. On the contrary, keeping up with the recent changes in recommender systems – from neural collaborative filtering to DLRM and contextual bandits – has helped us gain a better understanding of where the current limitations in our models are, and where we see opportunities for improvement in the future. We organize a paper-reading session monthly, and every member of the team is encouraged to take courses, attend conferences and subscribe to publications, as well as contribute back to the world via writing papers or articles.
Striking the balance between the short-term and long-term. An early-stage team lives between a rock and a hard place. On the one hand, we are tasked with shipping value to users quickly and early; on the other hand, we require a large platform effort to provide strong foundations. It usually falls to team leadership to help the team find the right balance between building for the long-term and delivering value in the short-term. As the team's external conduits, they have to place the team's efforts in context and advocate for time to build for the future.
As an example, we chose our first model to be collaborative filtering-based, because we recognized we would need to build a feature store to do any form of content-based filtering. This allowed us to start recommending content to users and get feedback, which led to a second iteration of the model with richer features, along with investment in building a feature store.
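To give a flavor of what a first collaborative filtering model can look like, here is a minimal sketch using Spark ML's ALS on implicit feedback. ALS itself, the column names and the hyperparameters are illustrative assumptions, not a description of our actual model.

```python
# Minimal sketch (illustrative assumptions throughout): a collaborative-filtering
# model trained on implicit feedback with Spark ML's ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf_recs").getOrCreate()

# One row per (user, class) interaction; user_idx and class_idx are integer
# indices, since ALS requires numeric ids.
interactions = spark.read.parquet("s3://example-bucket/features/interactions/")

als = ALS(
    userCol="user_idx",
    itemCol="class_idx",
    ratingCol="num_completions",
    implicitPrefs=True,   # treat counts as implicit feedback, not explicit ratings
    rank=64,
    regParam=0.1,
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# Top-20 classes per user, to be post-filtered by business rules and written
# to the online store for serving.
top_k = model.recommendForAllUsers(20)
```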
Building an operational mindset. Our team's responsibility travels all the way from the design phase to the maintenance of the system. Investing in operationalizing systems, as early as when they are being launched, has always paid dividends. For most of our systems, we have had production readiness reviews, both internal to the team and external with our SREs as consultants. As part of readiness, we ensure that services have been load tested, have adequate visibility built in, have fallbacks in case of failure and are supported with documentation such as runbooks and rollback plans. At times we have had to ship fast and cut corners on this process, and it has invariably caused on-call heartache.
Below is an example from a service that we modified to add some personalization. We had done some simple estimation to predict that the service latency would not degrade by more than 25-30 ms. In practice, we observed a 20 ms degradation in the median latency, but a ~180 ms degradation in the 95th percentile, which caused several alerts to trigger for the team that owned this service. A proper load test would have surfaced this issue prior to launch. Part of the operational mindset is also to set aside dedicated time to pay down this incurred tech debt. We try to pick up one operational task per sprint of work to ensure ongoing maintenance, instead of one-time big efforts.

Figure 2: Service latency before (left) and after (right) changes shows the degradation at p95
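A load test of the kind mentioned above does not need to be elaborate. Below is a minimal sketch using Locust against a hypothetical recommendations endpoint; the path and parameters are assumptions, not our actual API.

```python
# Minimal sketch (endpoint and parameters are hypothetical): a Locust load test
# that exercises a recommendations endpoint. Locust reports latency percentiles,
# which is where tail regressions like the p95 degradation above show up.
import random

from locust import HttpUser, task, between


class RecommendationsUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    @task
    def fetch_recommendations(self):
        user_id = random.randint(1, 1_000_000)
        # Group all requests under one name so percentiles are aggregated.
        self.client.get(f"/api/recommendations?user_id={user_id}", name="/api/recommendations")
```

Running something like this against a staging host before and after a change, and comparing the reported p50/p95/p99, makes a tail regression like the one in Figure 2 visible before launch.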
Using on-call rotations effectively. In keeping with the operational mindset above, we recognize that ML systems have several failure modes – the code, the data and the model. Every engineer on our team goes on-call for a week at a time in a rotation. We have found this to be effective not only for triaging and fixing failures when they occur, but also for learning purposes: the on-call engineer learns in a hands-on manner while fixing issues. Post-mortems after incidents typically help surface systemic issues and long-term trends, which helps the team prioritize the components of our stack that are most stressed. For example, in late 2021, we were running into several issues with provisioning GPUs on AWS to train our models. For nearly two weeks, our on-call engineers were responding to almost daily failures. After a quick post-mortem with our SRE team, we decided to reserve GPU instances, which over time has led to more stable pipelines.
Budgeting time for investigations. Data can be messy and confounding, but ML systems absolutely rely on it. We often receive questions from team members, such as "why did I receive this recommendation?". Similarly, while working on a particular task, a teammate will often notice a quirk in the data. In these moments it would be easy to create a ticket for investigation and put it in the backlog, but we have found it prudent to budget some time for these investigations up front. Two examples from our early days proved the value of this:
While iterating on a new model, we realized that walking classes were popular on the Tread, but we were not recommending them. A simple rule change to include these classes increased our homescreen recommendation conversion by 10%.
We received a request from a user asking why previously taken classes were being recommended on the Bike homescreen. Investigation traced this to a change in upstream datasets that had not been communicated to us. Here too, once we fixed the issue, we observed an uptick in conversion from recommendations. A sketch of this kind of lightweight rule layer follows below.
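The sketch below illustrates the kind of simple rule layer that sits on top of model output in both examples above, promoting walking classes for Tread users and dropping already-taken classes. The field names and the exact rules are assumptions for illustration only.

```python
# Illustrative sketch (field names and rules are assumptions): a lightweight
# post-filter applied to a model's ranked output before serving.
def apply_business_rules(ranked_classes, user):
    # Drop classes the member has already taken.
    filtered = [c for c in ranked_classes if c["class_id"] not in user["taken_class_ids"]]

    # Promote up to two walking classes to the top of the list for Tread users.
    if user["platform"] == "tread":
        walking = [c for c in filtered if c["discipline"] == "walking"]
        rest = [c for c in filtered if c["discipline"] != "walking"]
        filtered = walking[:2] + rest + walking[2:]

    return filtered
```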
Being flexible about task tracking. One of the central principles of DevOps is to make tasks visible and estimate the amount of time it takes to deliver a particular task. This facilitates planning – if everything has a reasonable time estimate and all tasks for a project are listed, we can gauge the end-date of a project. For ML engineering, this is surprisingly hard. A seemingly trivial data-related task can uncover issues with the data that become a large body of work. A model improvement task may require several iterations of training or model tuning until it delivers on the goal. Here are a few examples:
When productionizing our context-aware recommendations model in 2021, we realized the model training and inference pipelines would require persistent volumes. It took us an extra few weeks to get this infrastructure in place.
A training pipeline was failing due to Spark errors. We had to dig deep to realize that this was caused by data skew from certain users, which likely represented shared accounts, e.g. one used by multiple people in the same apartment building. Diagnosing and fixing this issue caused the project to slip by several weeks (one common mitigation for this kind of skew is sketched below).
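For context, here is a sketch of one common mitigation for skew like this: "salting" the hot join key so that a single very heavy user no longer lands on one executor. The column names and salt factor are assumptions, and this is not necessarily the fix we shipped.

```python
# Sketch of key salting for a skewed join (illustrative assumptions throughout).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted_join").getOrCreate()

events = spark.read.parquet("s3://example-bucket/raw/workout_events/")     # skewed on user_id
user_features = spark.read.parquet("s3://example-bucket/features/users/")  # one row per user_id

SALT_BUCKETS = 32

# Add a random salt to the large, skewed side so a hot user's rows are spread
# across many partitions...
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per salt value so every salted key matches.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_features = user_features.crossJoin(salts)

joined = salted_events.join(salted_features, on=["user_id", "salt"], how="left")
```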
Even though we use Agile methodologies for our team, it remains an open question whether this is the right fit for a machine learning team, or whether we need a different methodology that allows for more uncertainty in task delivery.

Figure 3: A typical task commitment vs. completion chart for (a) an adjacent software engineering team and (b) our recommendations team shows that task estimation is much more difficult for recommendations, with many tasks taking longer than expected.
In Conclusion
Our goal with this article was to share back some of our learnings from building an early-stage ML team. It has been a challenging but rewarding journey. Over time, we have established trust with our product partners by delivering product value. We have also built trust with our platform engineering partners by doing so with stability and consistency. It has not always been smooth sailing; we have failed plenty along the way, but we have always learned from those failures. We have also tried to make decisions as a collective, with input and feedback from everyone, and several of the choices above were made after healthy discussion amongst the team.
As we write this article, our team has grown from the initial 4 engineers to 16 in the span of two years. We have started to specialize roles on the team – in data, infrastructure and science; for example, our Applied Scientist has brought a higher level of mathematical rigor to the team. Our initial value addition to the product has resulted in expanded scope, with a complete digital product – the Peloton App – being built with Personalization at its core. We are excited to continue growing the team and evolving the personalization of our product so that it keeps positively impacting the lifestyles of our members.
We would love to hear your feedback and your own experiences building your ML team. Together, with collective knowledge, we go far.