Innovation
Working Towards a More Accessible Member Experience

Peloton is committed to providing an immersive and accessible experience for all our Members. In the last few years, we've released features like the Screen Reader, which gives Members who are blind or have low vision audio feedback when they interact with the Peloton Bike touchscreen.
Providing accessible content for our Deaf and hard of hearing Members is also a priority, and subtitles were one of the most requested features from our community. We started working on subtitles a few years ago, and live subtitles are something we're really proud to have rolled out and to continue improving.
How it began
Our streaming engineering team was excited by the challenge in front of them. Once the team began evaluating what was required to add subtitles to our content, they realized that changes were needed in our recording studios. The instructor audio feed was intermixed with the in-class music, which made transcribing subtitles extremely difficult. The studio team separated the instructor audio and music into three audio streams: original mix, more music, and instructor only.
Sounds easy on paper, but the change involved upgrading hardware and software in the Peloton studios. These studio changes took some time to implement but were key for the project. Separating the instructor and music audio into different streams not only helped our subtitles effort, but also surfaced as a new feature for our Members: the option to listen to more music, more instructor, or the default mix.

Then, using the instructor-only audio feed, the team was able to partner with a transcription vendor to produce subtitles. Subtitle support for our on-demand video library was released to the general public in the fall of 2018!

This was a great accomplishment, but Peloton was still missing an important component: subtitles for live classes. Members told us how important this feature is; in fact, one Member shared that when her husband takes a live class, she helps by signing the entire class for him.
Subtitles are helpful for people who aren't deaf or hard of hearing, too. We heard from Members that many of them find subtitles helpful for following instructor cues on resistance and cadence, or just for keeping up with the class in general.
Live Subtitles
The development phase and engineering challenges
The live subtitles project started in the fall of 2020. We assembled a small cross-functional team of software engineers, product analysts, and research analysts. This team broke down the project into three main phases: (1) transcribing subtitles in real time with automated speech recognition software, (2) integrating the resulting subtitles into our video streams, and (3) collecting feedback via an internal beta group.
Automated Speech Recognition (ASR)
Over the past few years, our streaming engineering team has evaluated ASR software periodically and didn't feel the technology was mature enough for our needs until now. The accuracy of third-party ASR services has continued to trend upward. Automated speech recognition software can never be as accurate as a human editor, but it's getting closer and closer. Microsoft provides a foundation for building custom ASR on top of base models, and we used this capability to customize for Peloton-specific and workout-related speech. We are also using Microsoft APIs and tools to continually train and update the models used in speech recognition.
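To give a concrete sense of what this looks like, here is a minimal sketch of continuous recognition against a custom speech model using the Azure Speech SDK for Python. The key, region, endpoint ID, and file name are placeholders, and a live pipeline would feed audio from a stream rather than a file.

```python
# Minimal sketch: continuous recognition against a custom speech model with the
# Azure Speech SDK for Python. Key, region, endpoint ID, and file are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",   # placeholder credentials
    region="eastus",                  # pick a region close to your servers
)
# Point the recognizer at a custom model trained on workout-specific vocabulary.
speech_config.endpoint_id = "YOUR_CUSTOM_MODEL_ENDPOINT_ID"

# Audio comes from a file here; a live pipeline would use a push/pull audio stream.
audio_config = speechsdk.audio.AudioConfig(filename="instructor_only.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

def on_recognized(evt):
    # Each final recognition result becomes a candidate subtitle cue.
    print(evt.result.text)

recognizer.recognized.connect(on_recognized)
recognizer.start_continuous_recognition()
# ... keep the process alive while audio is flowing, then:
# recognizer.stop_continuous_recognition()
```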
Even with a third-party ASR service, there was a lot of customization involved since Peloton classes have a language of their own!
The challenges we faced can be categorized into two groups:
Fine-tuning Language Models
Workout lingo is different from the standard vernacular, or everyday spoken language. On top of that, each fitness discipline has its own unique vocabulary. Given these challenges, no general-purpose ASR was going to be much help on its own. Conveniently for us, the data needed to fine-tune these language models was already available through our on-demand subtitle feature, which includes human-reviewed transcriptions.
Inverse Text Normalization (ITN)
While it's great for an ASR to learn our lingo, it's even more important to catch target metrics and countdowns, the core of our Peloton Bike and Peloton Tread classes. If you don't know where and when to set your resistance and cadence, you're not getting as much out of the class! Unfortunately, this doesn't work so well with out-of-the-box technology. It's pretty straightforward for any ASR to recognize phrases like "cadence to fifty sixty", "incline to five oh six oh", or "in five, four, three, two, one", but the conversion from lexical text to numerical text (e.g. "five" to "5") is where we run into some issues.
Let's say we're given a five-digit number such as "five four three two one". While to us it's obviously a countdown, most ASR use cases treat five-digit numbers as zip codes, joining the phrase into a single number by default. Another example is "resistance to twenty thirty". Are we talking about 2030 the year or 20-30 the range? This ambiguity is why we can't rely on out-of-the-box solutions without some personalization of the model; a sketch of the kind of post-processing involved follows below.
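To illustrate the kind of fix this requires (a hypothetical sketch, not our actual model customization), here is what splitting run-together numbers back into countdowns and ranges might look like:

```python
# Hypothetical post-processing sketch: split run-together numbers back into
# countdowns and ranges. This illustrates the ITN problem, not our production models.
import re

def split_countdown(text: str) -> str:
    """Turn a zip-code-style '54321' after 'in' back into '5, 4, 3, 2, 1'."""
    return re.sub(
        r"\bin (\d)(\d)(\d)(\d)(\d)\b",
        r"in \1, \2, \3, \4, \5",
        text,
    )

def split_range(text: str) -> str:
    """Read 'resistance to 2030' as the range 20-30, not the year 2030."""
    return re.sub(
        r"\b(resistance|cadence|incline) to (\d{2})(\d{2})\b",
        r"\1 to \2-\3",
        text,
    )

print(split_range(split_countdown("resistance to 2030 in 54321")))
# -> "resistance to 20-30 in 5, 4, 3, 2, 1"
```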
To ensure each of these issues was addressed, tests were developed for each problem based on the word error rate (WER) metric. This set a baseline for accuracy and allowed us to measure and quantify improvements as we tweaked the various customized speech models. As the project progressed, the voice team has continued to improve on our original custom speech models.
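For reference, WER is essentially the word-level edit distance between the ASR output and a trusted transcript, divided by the length of the transcript. A minimal sketch of the calculation:

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("cadence to 50 60", "cadence to 50 16"))  # 0.25: one substitution in four words
```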
API Latency & Security
Not only did we have to think about accuracy, but we also needed to think about latency and security. Accurate subtitles would not be of much use if they appeared long after the audio cue. We decided to work with Microsoft Azure Cognitive Services for their implementation of automated speech recognition, which meant calling their APIs and transferring our audio data over the cloud. We were concerned about the latency of these calls, as audio data can be quite large. However, the Azure APIs are deployed in many data centers globally, allowing us to point to locations that are geographically close to our own servers. On the Peloton end, we ran latency testing across dozens of live classes to confirm that our subtitles service, combined with calls to the Azure APIs, did not add latency to our live streams.
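As a simplified illustration of that kind of latency testing, here is a sketch that times a single blocking recognition call against a few candidate Azure regions. The regions, keys, and audio file are placeholders, and in practice each region needs its own Speech resource and key.

```python
# Simplified sketch: compare round-trip recognition latency to candidate Azure regions.
# Regions, keys, and the audio file are placeholders; each region needs its own
# Speech resource and key in practice.
import time
import azure.cognitiveservices.speech as speechsdk

CANDIDATES = {
    "eastus": "KEY_FOR_EASTUS_RESOURCE",
    "westus2": "KEY_FOR_WESTUS2_RESOURCE",
    "westeurope": "KEY_FOR_WESTEUROPE_RESOURCE",
}

def time_recognition(region: str, key: str, wav_path: str) -> float:
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config, audio_config=audio)
    start = time.perf_counter()
    recognizer.recognize_once()          # one short utterance, blocking call
    return time.perf_counter() - start

for region, key in CANDIDATES.items():
    elapsed = time_recognition(region, key, "sample_cue.wav")
    print(f"{region}: {elapsed:.2f}s")
```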
The security of our class data was also a major consideration when building this feature. Because the data would be on a non-Peloton platform, we needed to be sure it would be transferred and stored securely. The APIs are secured in transit with TLS and sit behind authentication, and Azure does not store the audio data permanently; it is held only ephemerally for the duration of the API call, so we were reassured that our data was secure.
Video Streams
So now we had subtitles generated for live classes, but we needed to integrate them into our live video streams. There were two options: open a socket on each device and have continuous back-and-forth communication with Peloton servers delivering subtitles in real time (similar to how you get subtitles during virtual meetings), or integrate the subtitles into our HTTP Live Streaming (HLS) video and deliver them as part of our streaming payload to clients. Both strategies have pros and cons, but we felt the idea of opening a socket on clients was not practically feasible: we would have had to make changes across all our clients and any future client. Utilizing HLS's built-in support for subtitles seemed easier to scale and the better option for Peloton, as it is supported natively by HLS players.
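The piece that lets players pick up subtitles natively is the master playlist, which declares the subtitle track as its own rendition group alongside video and audio. A minimal, illustrative example (the URIs, bandwidth, and group names below are made up, not our production configuration) might be generated like this:

```python
# Illustrative sketch: a master HLS playlist declaring a subtitles rendition group.
# URIs, bandwidth, and group names are hypothetical.
MASTER_PLAYLIST = """#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",AUTOSELECT=YES,URI="subtitles/en/playlist.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,SUBTITLES="subs"
video/720p/playlist.m3u8
"""

# The SUBTITLES="subs" attribute on the variant stream ties it to the rendition
# group above, so HLS players fetch and render the subtitle playlist natively.
with open("master.m3u8", "w") as f:
    f.write(MASTER_PLAYLIST)
```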
This strategy had one obvious stumbling block: would the subtitle generation be fast enough to keep up with a live stream? In order to have a WebVTT file (the standard format for subtitles in an HLS stream) for a segment of audio, that audio segment needs to be generated, shipped out over the network, processed by the ASR, and finally the resulting WebVTT file uploaded to our Content Delivery Network. For an ultra-low-latency live stream, this strategy could not work. However, our live streams operate with a slight delay, which leaves a small window where the processing has a chance to complete. Our early testing showed that the delay in our video streams was large enough to transcribe the speech and add the subtitles back to the stream.
We came across another problem: almost no online documentation exists on delivering subtitles via an HLS stream for live events. There is plenty of documentation about video on demand, but almost nothing about live streams. Our live HLS stream is a sliding window of ten video and audio segments: client video players download and buffer these segments, and as the stream plays, the player periodically checks for updates on the CDN, downloads new segments as they become available, and discards older ones no longer referenced in the m3u8 HLS playlist. We thought we only had to match this behavior when uploading our subtitle segments and keep the subtitle playlist correctly in sync with the audio and video playlists.
We matched the behavior and could see our subtitle segments being downloaded by players, but ran into another problem: nothing was showing on screen. After researching online and finding no hints, we spoke with some colleagues in the streaming industry. Based on that discussion, we realized that our subtitle files were missing the timing information needed to line them up with the video stream. We needed to add a timestamp metadata header to the subtitle files so that video players would know when to display the text. Once the header was added, subtitles started appearing, and it was time to celebrate.
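Here is a minimal sketch of writing one such WebVTT segment. The X-TIMESTAMP-MAP header maps cue times to the stream's MPEG-TS clock so players can line the text up with video; the file name, timestamps, and cue text are all illustrative.

```python
# Minimal sketch: one WebVTT subtitle segment for HLS. The X-TIMESTAMP-MAP header
# maps cue times to the stream's MPEG-TS clock so players can align text with video.
# File name, offsets, and cue text below are illustrative.
def write_vtt_segment(path, mpegts_offset, cues):
    lines = [
        "WEBVTT",
        f"X-TIMESTAMP-MAP=MPEGTS:{mpegts_offset},LOCAL:00:00:00.000",
        "",
    ]
    for start, end, text in cues:
        lines += [f"{start} --> {end}", text, ""]
    with open(path, "w") as f:
        f.write("\n".join(lines))

write_vtt_segment(
    "segment_00042.vtt",
    mpegts_offset=900000,  # 90 kHz MPEG-TS clock: 900000 ticks = 10 seconds
    cues=[("00:00:00.500", "00:00:03.000", "Resistance to 40 to 50.")],
)
```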

Getting it out to production
Internal Beta
The next phase of this project was gathering feedback from beta users. We knew we had subtitles working for live streams and that the accuracy of those subtitles was pretty good, but not perfect. We wanted to get opinions from others on where improvements had to be made and in what order. Even though testing and evaluation had already shown us which areas of workout-related terminology were having issues, we gave every issue the same priority to ensure the best Member experience. We first asked our Peloton team members to help us test the live subtitles feature. Everyone at the company is also a Member, and they provided us with honest feedback that was invaluable and helped us prioritize improvements. We worked with Microsoft and the Peloton voice team to improve our custom speech models and post-processing.
Current State
After working on this project for a few months, we accomplished our main objectives and the feature is now live! The team is working on improving the accuracy of our custom models, resolving some scalability issues, and adding metrics and monitoring to ensure consistent performance. And of course, we're always gathering feedback from our community, especially Members who are deaf and hard of hearing, to make sure the feature is offering the best experience.

Reflections
This project was a great example of the agile process. A disparate group of engineers and analysts worked as a team to accomplish a specific goal. Most of us didn't know each other that well before this project, and we weren't familiar with some of the internal details of current subtitle technologies, but together we overcame the technical hurdles #togetherwegofar. We worked around the crazy calendars of various team members. We had fun. Time for the next challenge!