From a small eCommerce model to a Fortune 500 SaaS company: A Comprehensive Case Study on How Netflix Leveraged Digital Transformation
by Abimbola Bismarck | Jan 16, 2023 | Business
Get an in-depth look at the strategies, technologies, and best practices used to grow Netflix into a streaming giant. Discover what you can do to replicate their success and stay ahead of the competition.
In 1997, a small company was founded by two men named Reed Hastings and Marc Randolph in a small town in California named Scotts Valley.
The story really began when Marc realised that the DVD, a new product invented in Japan, would send the long-established VHS packing.
So after brainstorming on what they could do with this new DVD product and successfully pitching the idea to Reed Hastings, both Marc and Reed walked into a record store in Santa Cruz, California, and purchased a CD. Then they proceeded to mail the CD to Reed’s house across town.
When the CD arrived intact, Marc and Reed knew they had struck gold with this idea; and on August 29, 1997, Marc Randolph and Reed Hastings launched a subscription-based business model that afforded customers unlimited content for $19.95 (£16.31) with no due dates or late fees.
They would call it Netflix, and it would go on to epitomise the true essence of digital transformation over the years while changing the course of TV culture forever.
But to understand how Netflix went from a small eCommerce business model to a giant Software-as-a-Service company, the question you should ask is:
What Is Digital Transformation, and How Did It Work for Netflix?
To answer both questions, let’s consider the “what?” before the “how?”
What Exactly Is Digital Transformation?
To put it succinctly, “digital transformation” is the process of shifting away from a traditional business model and toward one that uses more cutting-edge digital technology and processes.
Digital transformation is a strategy: it uses new and advanced technologies to change the way business is conducted, improve the customer experience, and scale a business model.
How Did It Work For Netflix?
Netflix is a great example of how digital transformation has worked for a company. It has proven that digital transformation isn’t a sprint but a marathon: a journey of digital evolution over the course of time.
When Netflix was just a DVD rental service, it had to rely on physical distribution and mailing DVDs to serve its customers.
However, when it shifted its focus to streaming services, it was able to reach a far wider audience and expand globally.
The success of Netflix can be attributed to its early adoption of digital transformation methods that allowed it to make the most of new technologies and maximise profits.
To fully understand how Netflix leverages digital transformation, we must first ask:
What Steps Has Netflix Taken in Their Digital Transformation Journey?
Netflix developed a recommendation algorithm.
Before Netflix introduced streaming services, they were using a recommendation system called Cinematch for their DVD rental service.
The transition to video-on-demand meant that they could carry that system over to their streaming service and even improve on it.
To do this, they:
Used the customer and user data retained from their former system to build a proper recommendation system.
Used metadata to categorise movies under similar or the same genres to make recommendations easier for users.
Used A/B testing to test and improve their customer experience.
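The metadata step above can be illustrated with a small sketch. It is not Cinematch or Netflix's production recommender: the titles, genre tags, and the `recommend` helper are hypothetical, and Jaccard similarity is just one simple way to compare genre sets.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical catalogue: title -> set of genre tags (the metadata).
CATALOGUE: Dict[str, Set[str]] = {
    "Space Heist": {"sci-fi", "thriller"},
    "Moon Colony": {"sci-fi", "drama"},
    "Laugh Track": {"comedy"},
    "Cold Case": {"thriller", "crime"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(
    watched: str, catalogue: Dict[str, Set[str]], k: int = 2
) -> List[Tuple[str, float]]:
    """Rank every other title by genre overlap with the watched title."""
    target = catalogue[watched]
    scored = [
        (title, jaccard(target, tags))
        for title, tags in catalogue.items()
        if title != watched
    ]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

Calling `recommend("Space Heist", CATALOGUE)` ranks the two titles that share a genre with it ahead of the comedy, which shares none.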
Netflix used cloud computing to improve their storage processes.
One of the perks of going digital for any business is storage that scales with your data. For a company like Netflix, with an enormous amount of data to store, cloud computing was the smart way to proceed.
Netflix integrated with Amazon Web Services to manage this issue.
Netflix’s Pioneering Approach to Monetizing Content and Generating Revenue
Before Netflix, subscription-based content wasn’t a popular model. The emergence of Netflix disrupted the industry and sent Blockbuster, which was using the traditional method of distributing content and generating revenue, into bankruptcy.
It was a classic illustration of how businesses can be left in the dust when they fail to evolve with the times. It also shows how businesses can scale through effective digital transformation.
Netflix pioneered a subscription-based model for DVD rentals and eventually revamped the idea into a streaming service business model.
Today, Netflix’s pioneering approach to monetizing content and generating revenue has continued to evolve.
Based on the desired video quality, Netflix offers three different pricing tiers: basic, standard, and premium. Typically, the first month of the subscription is free.
Basic, with standard resolution, is $7.99 a month, but only one device can stream at a time.
At the other end, paying $13.99 a month gets you access to Ultra HD streaming on four devices in addition to HD content on two devices.
The Use of Data Analytics by Netflix to Improve Decision-Making
With the help of data analytics, Netflix has been able to gain new insights into their customers’ likes and dislikes, helping them make more informed decisions when it comes to content development and marketing strategies.
Data analytics has played a pivotal role in Netflix’s digital transformation, giving it an edge over its competitors.
This is evident from the fact that Netflix uses advanced data analytics techniques such as predictive modelling, sentiment analysis, customer segmentation, and natural language processing to gain deeper insights into customer behaviour.
By understanding their customers better, they have been able to create personalised recommendations and optimise content for each customer segment.
Monumental Netflix Milestones To Illustrate Their Digital Transformation Journey
The 1990s–2000s: At the time that Netflix launched, Blockbuster was already the leading giant in the movie rental business.
A subscription-based DVD rental service was a genius business model in the 90s and maybe in the early 2000s. But this didn’t stop Netflix from facing a financial crisis due to the high cost of shipping DVDs through the US postal service.
Netflix solved this logistics problem by cutting ties with postal services and developing a web-based chain with warehouses for DVD distribution.
This move catapulted Netflix subscribers from around 300,000 to 6.3 million in 2006 and generated $80 million in the process.
2007: What makes Netflix’s story admirable and worthy of imitation is how well it adapts to changing times, which is what digital transformation is all about.
Netflix, seeing how the world of technology keeps on advancing and how much speed the 21st century brought to internet users around the world, launched the next facet of their business model—the video-on-demand (VOD) model.
And for the first time ever, Netflix reimagined their identity and introduced streaming services into its business model.
2012: Five years later, after introducing and focusing on streaming services, Netflix starts creating original content for their subscribers.
2013: In August, Netflix launches user profiles and makes this service available to all users.
2015-2016: Netflix reaches the milestone of 50 countries in 2015 and 130 countries in 2016.
2021: 24 years after it was founded in a small town in California, Netflix hits 209 million subscribers in 190 countries while generating annual revenue of over $25 billion.
How Can Digital Socius Help You With Your Business’s Digital Transformation Journey?
Digital transformation is not a goal; it is a journey.
So it is no surprise that businesses have been struggling to keep up with the rapidly changing digital landscape, often feeling overwhelmed and confused by the many options that are available.
As a business owner, it can be hard to stay on top of the latest trends, know what digital solutions are best for your company, and figure out how to get started with digital transformation.
Digital Socius is here to help! Our team of experts can guide you through the entire journey of digital transformation, from assessing your current situation to setting goals and creating a plan of action to reach them.
We’ll provide you with tailored strategies that are specific to your business, so you can move forward confidently and quickly.
Join our many satisfied and returning clients, and let us walk you through the journey step by step and show you how to get it right the first time. Click here to book a free discovery call to get started.
Serverless Case Study – Netflix
A couple of days ago we published a case study on how Coca-Cola North America handles their vending machines’ systems with serverless. Today we’re going to talk about another titan that turned to serverless. As you may have guessed from the title, we are going to be talking about Netflix.
Netflix was founded in 1997 and, believe it or not, started out as a Blockbuster alternative for renting and selling DVDs through the mail. Yeah, it was such a long time ago. And while they still rent out about 3 million DVDs a year, they are also the number one video streaming platform for TV shows and movies.
Netflix delivers 10 billion hours of video to 125 million customers every quarter, and to serve that kind of audience they use a wide range of highly complex infrastructure that relies mostly on AWS. Imagine what the servers that run Netflix look like: petabytes of data in hundreds of thousands of files that change daily, served to millions of customers in 55 countries.
Netflix has now moved completely to AWS cloud infrastructure. A full seven years to move out of their own data center might seem like a long time, but they wanted to make sure that the problems they faced in the self-managed data center would not be imported into the cloud, so they ended up rewriting virtually every aspect of their service to make Netflix a true cloud-native application. You can read more about the journey to the cloud in an article written by Yury Izrailevsky, vice president of cloud platform engineering.
So how does Netflix make use of Serverless?
Publishers upload thousands of files to Netflix on a daily basis, and every bit of those files needs to be encoded and sorted before it ends up being streamed to the user. Once a file is uploaded to S3, Amazon triggers an event that calls an AWS Lambda function, which splits the video into 5-minute chunks that get encoded into the 60 different parallel streams that Netflix needs. Once the last part of the video has been processed, the parts are aggregated and deployed using a series of rules and events.
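The split-and-fan-out step can be sketched as follows. This is an illustrative model only, not Netflix's pipeline: the `plan_encoding_jobs` helper, the profile names, and the `durationSeconds` field are assumptions; only the 5-minute chunk length and the 60 output streams come from the description above.

```python
import math
from typing import Dict, List

CHUNK_SECONDS = 5 * 60  # 5-minute chunks, as described above
NUM_STREAMS = 60        # 60 parallel encodes per chunk


def plan_encoding_jobs(key: str, duration_seconds: float) -> List[Dict]:
    """Return one job description per (chunk, stream) pair for one uploaded file."""
    num_chunks = math.ceil(duration_seconds / CHUNK_SECONDS)
    jobs = []
    for chunk in range(num_chunks):
        start = chunk * CHUNK_SECONDS
        end = min(start + CHUNK_SECONDS, duration_seconds)
        for stream in range(NUM_STREAMS):
            jobs.append({
                "source": key,
                "chunk": chunk,
                "start": start,
                "end": end,
                "profile": f"profile-{stream:02d}",  # hypothetical profile name
            })
    return jobs


def handler(event: dict, context=None) -> int:
    """Lambda-style entry point for an S3 ObjectCreated notification.

    Returns the number of encoding jobs planned across all records.
    """
    total = 0
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        # Assumption: the duration is attached to the record; a real pipeline
        # would probe the media file to find it.
        duration = record.get("durationSeconds", 0)
        total += len(plan_encoding_jobs(key, duration))
    return total
```

Under these assumptions a 12-minute upload becomes 3 chunks and therefore 180 independent jobs, which is what makes the work easy to run in parallel.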
Another way that Netflix uses AWS Lambda is for their backup system. As thousands of files get changed and modified on a daily basis, Lambda functions check whether the files need to be backed up and verify the validity and integrity of the files. If anything fails, they can backtrack to the source of the problem and restart the process.
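A minimal sketch of the integrity check described above, assuming the backup is validated by comparing SHA-256 digests of the source and the copy. The function names and the retry behaviour are hypothetical, not taken from Netflix's system.

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the file's fingerprint."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(source: bytes, backup: bytes) -> bool:
    """A backup is valid when its digest matches the source's digest."""
    return sha256_hex(source) == sha256_hex(backup)


def backup_with_retry(
    source: bytes, copy: Callable[[bytes], bytes], max_attempts: int = 3
) -> int:
    """Copy `source`, verify the result, and restart the copy on failure.

    Returns the number of attempts used; raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if verify_backup(source, copy(source)):
            return attempt
    raise RuntimeError(f"backup failed verification after {max_attempts} attempts")
```

Here `copy` stands in for whatever actually moves the bytes, so a corrupted transfer is detected and retried instead of being silently kept.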
In the space of security, Netflix has thousands of processes that stop and start instances all the time, and they use Lambda to validate that each new instance is constructed and configured in accordance with the system’s rules and regulations. They also use Lambda to create alerts and shut instances down in the event of unauthorized access.
Next came efficiency improvements using better production monitoring and dashboards. This information was based on the events system that Netflix built for Lambda, through which events trigger validations to ensure that the configuration fits real-world needs.
The last step was to take over responsibility for the servers that manage all of Netflix’s media. When Lambda is responsible for server deployment, compliance, and configuration, Netflix can be confident that provisioning processes and responses to new business needs are fully handled.
Amazon Kinesis Streams processes multiple terabytes of log data each day, yet events show up in our analytics in seconds. We can discover and respond to issues in real-time, ensuring high availability and a great customer experience. — John Bennett Senior Software Engineer, Netflix
To start reaping the true serverless benefits like Netflix today, sign up to Dashbird’s serverless observability platform .
- Clean and easy-to-understand user interface
- No latency added to the function execution time
- Great support staff
- Support for Java, Node.js, Python
- Start working with your data immediately
- Pre-configured error and threat alarms and custom alarms
- Aggregated real-time observability for AWS services
- Well-Architected insights and actionable suggestions for improving users’ architecture
Arun’s Substack
AWS Case Study: Netflix’s Automated Tagging Strategy for Cost Optimization
A demonstrated AWS series on tagging resources.
Introduction
In the dynamic world of cloud computing, managing costs and resource utilization efficiently is a critical requirement for companies operating at scale. Netflix, a global leader in streaming services, operates one of the largest and most complex cloud infrastructures in the world. As Netflix’s infrastructure grew in size and complexity, so did the challenges associated with cost management and visibility into resource consumption.
To address this, Netflix developed an automated tagging strategy to control costs and gain better visibility into their cloud resource usage. This case study explores how Netflix implemented this strategy and the key benefits it provided to the organization.
Challenges Faced by Netflix
Netflix operates entirely on Amazon Web Services (AWS), which enables the company to scale its services dynamically based on demand. However, with such an extensive cloud infrastructure, Netflix faced several challenges related to cost management and resource visibility:
Complex Infrastructure: With thousands of EC2 instances, numerous services, and regions, managing costs became increasingly challenging.
Lack of Visibility: Engineering teams struggled to track costs and resource consumption for each department and project.
Inefficient Resource Allocation: The lack of a standardized resource tagging strategy led to difficulties in identifying underutilized resources.
To overcome these challenges, Netflix aimed to improve the visibility, accountability, and efficiency of its cloud infrastructure using an automated tagging strategy.
Why Tagging Matters in AWS
Tagging is a key feature in AWS that allows users to assign metadata to resources. Tags consist of key-value pairs and can be used to identify, categorize, and organize resources based on user-defined attributes. For example, Netflix’s tags include keys such as:
Environment: Identifies the environment (e.g., development, staging, production).
Application: Indicates the application or service that owns the resource (e.g., app=streaming).
Owner: Specifies the team or individual responsible for the resource.
Cost Center: Maps the resource to a specific cost center for billing and budgeting purposes.
Effective tagging helps organizations allocate costs, enforce security policies, and manage resources efficiently. For a large-scale organization like Netflix, automated tagging became essential to achieve the desired level of granularity in resource management.
Netflix’s Approach to Automated Tagging
Netflix adopted an automated tagging strategy to streamline cost allocation and resource management. The company utilized several AWS services and custom-built automation tools to achieve this:
Tagging Policy and Standardization
Netflix established a standardized tagging policy that all engineering teams were required to follow. This policy outlined the following rules:
Every resource created must have specific tags, such as “Environment”, “Application”, “Owner”, and “Cost Center”.
Each tag key and value had a defined naming convention to ensure consistency across the organization.
By enforcing this policy, Netflix ensured that every resource could be easily identified, categorized, and tracked.
Automation with AWS Lambda and Custom Scripts
Netflix used AWS Lambda functions to automate the tagging of newly created resources. When a new EC2 instance, RDS database, or other AWS service was created, a Lambda function would automatically apply the required tags based on predefined policies.
Here’s an example of a Lambda function for automated tagging:
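The sketch below is illustrative rather than Netflix's actual code. It assumes the function is invoked by an EventBridge rule matching the CloudTrail `RunInstances` call, and the default tag values are hypothetical placeholders. The `ec2_client` parameter is an addition for offline testing; when it is omitted the handler creates a real `boto3` EC2 client and calls its `create_tags` method.

```python
from typing import Dict, List

# Hypothetical default values; a real policy would derive them per team.
DEFAULT_TAGS: Dict[str, str] = {
    "Environment": "development",
    "Application": "streaming",
    "Owner": "platform-team",
    "Cost Center": "cc-0000",
}


def extract_instance_ids(event: dict) -> List[str]:
    """Pull the new instance IDs out of a CloudTrail RunInstances event."""
    detail = event.get("detail") or {}
    response = detail.get("responseElements") or {}  # None when the call failed
    items = (response.get("instancesSet") or {}).get("items") or []
    return [item["instanceId"] for item in items]


def lambda_handler(event: dict, context=None, ec2_client=None) -> dict:
    """Apply the standard tags to every instance launched by the event."""
    if ec2_client is None:
        import boto3  # imported lazily so the helpers can be exercised offline

        ec2_client = boto3.client("ec2")

    instance_ids = extract_instance_ids(event)
    if instance_ids:
        ec2_client.create_tags(
            Resources=instance_ids,
            Tags=[{"Key": key, "Value": value} for key, value in DEFAULT_TAGS.items()],
        )
    return {"tagged": instance_ids}
```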
This function automatically assigns standard tags to every EC2 instance based on Netflix’s organizational policy.
Enforcement with AWS Config and Policies
To ensure that resources remained compliant with Netflix’s tagging policy, the company leveraged AWS Config. AWS Config is a service that monitors the configuration of AWS resources and checks for compliance with predefined rules.
Netflix used AWS Config rules to enforce the following:
Every resource must have the required tags.
Resource tags must follow the company’s standardized naming conventions.
In case a non-compliant resource was detected, AWS Config triggered a Lambda function to either notify the engineering team or automatically apply the missing tags.
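The two rules above can be expressed as a small check of the kind a custom AWS Config rule might run. This is a sketch under assumptions: the lowercase-with-hyphens naming convention and the `evaluate_tags` helper are hypothetical, while `COMPLIANT` and `NON_COMPLIANT` are the compliance types AWS Config itself reports.

```python
import re
from typing import Dict, List, Tuple

REQUIRED_KEYS = ("Environment", "Application", "Owner", "Cost Center")

# Hypothetical convention: lowercase letters and digits, hyphen-separated.
VALUE_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def evaluate_tags(tags: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return a compliance type plus a list of human-readable problems."""
    problems: List[str] = []
    for key in REQUIRED_KEYS:
        if key not in tags:
            problems.append(f"missing tag: {key}")
        elif not VALUE_PATTERN.match(tags[key]):
            problems.append(f"bad value for {key}: {tags[key]!r}")
    status = "COMPLIANT" if not problems else "NON_COMPLIANT"
    return status, problems
```

The list of problems is what a follow-up function would use either to notify the owning team or to decide which tags to apply automatically.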
Cost Allocation and Reporting
Netflix used the detailed tagging information to create cost allocation reports. By categorizing resources based on applications, departments, and environments, the finance team gained granular visibility into cloud spending. This enabled better budgeting and cost forecasting.
AWS Cost Explorer was used to visualize costs across different teams, environments, and services. Netflix also built custom dashboards to provide stakeholders with insights into resource consumption and spending trends.
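The idea behind a tag-based cost allocation report can be sketched in a few lines. The record shape and the `allocate_costs` helper are hypothetical and stand in for whatever billing export is available; this does not call AWS Cost Explorer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def allocate_costs(line_items: Iterable[Mapping], tag_key: str) -> Dict[str, float]:
    """Sum cost per value of `tag_key`; resources without the tag go to 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        group = item.get("tags", {}).get(tag_key, "untagged")
        totals[group] += item["cost"]
    return dict(totals)


# Hypothetical billing records for illustration.
LINE_ITEMS = [
    {"resource": "i-001", "cost": 10.0, "tags": {"Application": "streaming", "Environment": "production"}},
    {"resource": "i-002", "cost": 5.0, "tags": {"Application": "encoding", "Environment": "production"}},
    {"resource": "i-003", "cost": 2.5, "tags": {"Application": "streaming", "Environment": "staging"}},
    {"resource": "i-004", "cost": 1.0, "tags": {}},
]
```

Grouping the same records by `Application` and then by `Environment` gives two different views of the same spend, and the `untagged` bucket shows how much cost cannot yet be attributed to anyone.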
Benefits of Netflix’s Automated Tagging Strategy
By implementing an automated tagging strategy, Netflix achieved several key benefits:
Improved Cost Visibility: The standardized tags allowed Netflix to track costs at a granular level. Engineering teams could identify which resources were driving costs and allocate budgets accordingly.
Better Accountability: By tagging resources with owner information, Netflix ensured that every resource had an accountable owner. This encouraged teams to review and manage their resources efficiently.
Efficient Resource Allocation: The tagging strategy enabled Netflix to identify underutilized resources and terminate them, leading to significant cost savings.
Streamlined Auditing and Compliance: AWS Config and custom scripts helped Netflix enforce tagging policies and monitor compliance in real-time. This reduced the manual effort required for audits.
Real-World Impact and Example
One notable example of the impact of Netflix’s tagging strategy was during a peak traffic period. Netflix engineers identified that certain EC2 instances were running at full capacity, while others were underutilized. By using tags to filter resources based on application and environment, the team was able to reassign workloads efficiently, leading to a 30% reduction in EC2 spending during that period.
Additionally, Netflix’s tagging strategy enabled the finance team to prepare accurate cost forecasts for upcoming product launches and feature releases. This level of insight into cloud spending allowed Netflix to optimize its infrastructure proactively.
Key Takeaways for Implementing Tagging Strategies
The success of Netflix’s automated tagging strategy offers valuable lessons for organizations looking to implement similar solutions:
Define a Clear Tagging Policy: Establish a clear and comprehensive tagging policy that outlines the required tags, naming conventions, and rules for resource tagging.
Automate Tagging Wherever Possible: Use AWS Lambda functions or custom scripts to automate the tagging process and ensure compliance with organizational policies.
Leverage AWS Config for Enforcement: Utilize AWS Config to enforce tagging policies and detect non-compliant resources in real-time.
Use Tagging for Cost Allocation: Categorize resources based on tags to gain better visibility into costs and create cost allocation reports.
Continuously Monitor and Refine: Regularly review and update the tagging strategy to ensure it aligns with evolving business needs and cloud environments.
Netflix’s automated tagging strategy serves as a powerful example of how organizations can improve cost visibility and resource management in the cloud. By standardizing tags, automating the tagging process, and leveraging AWS tools for enforcement, Netflix was able to allocate costs accurately, hold teams accountable, and optimize resource utilization.
If you’re looking to implement a similar strategy in your organization, consider establishing a clear tagging policy, automating the tagging process, and continuously monitoring compliance. By doing so, you can gain better visibility into cloud spending, improve resource efficiency, and achieve cost savings in the long run.
Feedback & Comments are Welcome
Feel free to leave your comments and questions below! I would greatly appreciate your thoughts and feedback on this case study. If you’re interested in applying similar strategies in your organization, or if you just want to say hi, connect with me on LinkedIn, Twitter, Reddit, or via email at [email protected].
I am currently seeking opportunities as an SRE, DevOps, Platform Engineering, Infrastructure Engineering, Performance Engineering, Cloud Economics, and Architecture projects, as well as Freelance gigs! Please contact me if you are interested in collaborating on projects or working together.✨
Let’s have an impact together!
The Evolution of Container Usage at Netflix
Netflix Technology Blog
Netflix TechBlog
Containers are already adding value to our proven globally available cloud platform based on Amazon EC2 virtual machines. We’ve shared pieces of Netflix’s container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth. As part of this story, we will cover Titus: Netflix’s infrastructural foundation for container based applications. Titus provides Netflix scale cluster and resource management as well as container execution with deep Amazon EC2 integration and common Netflix infrastructure enablement.
This month marks two major milestones for containers at Netflix. First, we have achieved a new level of scale, crossing one million containers launched per week. Second, Titus now supports services that are part of our streaming service customer experience. We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.
History of Container Growth
Amazon’s virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix. In addition to virtual machines, we’ve also chosen to invest in container-based workloads for a few unique values they provide. The benefits, excitement and explosive usage growth of containers among our developers have surprised even us.
While EC2 supported advanced scheduling for services, this didn’t help our batch users. At Netflix there is a significant set of users that run jobs on a time or event based trigger that need to analyze data, perform computations and then emit results to Netflix services, users and reports. We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day. We wanted to provide a common resource scheduler for container based applications independent of workload type that could be controlled by higher level workflow schedulers. Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system. The introduction of Titus has helped Netflix expand to support the growing batch use cases.
With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify resource requirements. Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don’t always perfectly fit their workload. Users trust Titus to pack larger instances efficiently across many workloads. Batch users develop code locally and then immediately schedule it for scaled execution on Titus. Using containers, Titus runs any batch application, letting the user specify exactly what application code and dependencies are needed. For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.
Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience for other workloads. In working with our Edge, UI and device engineering teams, we realized that service users were the next audience. Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier leveraging single core optimized NodeJS servers. Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.
In addition to a consistent environment, with containers developers can push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines ready for container deployments. Deployments using Titus now can be done in one to two minutes versus the tens of minutes we grew accustomed to with virtual machines.
The theme that underlies all these improvements is developer innovation velocity.
Both batch and service users can now experiment locally and test more quickly. They can also deploy to production with greater confidence than before. This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.
Titus Details
We have already covered what led us to build Titus. Now, let’s dig into the details of how Titus provides these values. We will provide a brief overview of how Titus scheduling and container execution supports the service and batch job requirements as shown in the below diagram.
Titus handles the scheduling of applications by matching required resources and available compute resources.
Titus supports both service jobs that run “forever” and batch jobs that run “until done”.
Service jobs restart failed instances and are autoscaled to maintain a changing level of load. Batch jobs are retried according to policy and run to completion.
Titus offers multiple SLAs for resource scheduling. Titus offers on-demand capacity for ad hoc batch and non-critical internal services by autoscaling capacity in EC2 based on current needs. Titus also offers pre-provisioned guaranteed capacity for user facing workloads and more critical batch. The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability spanning virtual machines and availability zones. The foundation of this scheduling is a Netflix open source library called Fenzo.
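To make the bin packing idea concrete, here is a minimal sketch of a first-fit-decreasing heuristic that matches required resources against available capacity. It is not Fenzo's API or algorithm, and it ignores anti-affinity; the `VM`, `Task`, and `pack` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    cpus: float
    memory_gb: float


@dataclass
class VM:
    name: str
    cpus: float        # remaining CPU capacity
    memory_gb: float   # remaining memory capacity

    def fits(self, task: Task) -> bool:
        return task.cpus <= self.cpus and task.memory_gb <= self.memory_gb

    def reserve(self, task: Task) -> None:
        self.cpus -= task.cpus
        self.memory_gb -= task.memory_gb


def pack(tasks: List[Task], vms: List[VM]) -> Dict[str, Optional[str]]:
    """Place the largest tasks first, each on the first VM with enough room.

    Returns task name -> VM name, or None when no VM can host the task.
    Note: the VMs' remaining capacity is updated in place.
    """
    placement: Dict[str, Optional[str]] = {}
    ordered = sorted(tasks, key=lambda t: (t.cpus, t.memory_gb), reverse=True)
    for task in ordered:
        target = next((vm for vm in vms if vm.fits(task)), None)
        if target is not None:
            target.reserve(task)
        placement[task.name] = target.name if target else None
    return placement
```

Placing large tasks first tends to leave fewer unusable gaps on each virtual machine; a production scheduler would also weigh constraints such as spreading replicas across machines and availability zones.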
Titus’s container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure. We expect users to use both virtual machines and containers for a long time to come so we decided that we wanted the cloud platform and operational experiences to be as similar as possible. In using AWS we chose to deeply leverage existing EC2 services. We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay. We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application specific security groups. Titus provides a metadata proxy that enables containers to get a container specific view of their environment as well as IAM credentials. Containers do not see the host’s metadata (e.g., IP, hostname, instance-id). We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.
For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure. For example, Netflix already had a solution for continuous delivery, Spinnaker. While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed for our users to have a consistent deployment tool across both virtual machines and containers. Another example is service to service communication. We avoided reimplementing service discovery and service load balancing. Instead we provided a full IP stack enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing. In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.
Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines. Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access. An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.
Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability. It also uncovers challenges in all layers of the system. We’ve dealt with scalability and reliability issues in the Titus specific software as well as the open source we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux). We design for failure at all levels of our system including reconciliation to drive consistency between distributed state that exists between our resource management layer and the container runtime. By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability) we have learned to balance our investment between reliability and functionality.
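The reconciliation idea mentioned above can be illustrated with a small sketch. It assumes each layer exposes its view as a dictionary mapping a task id to a state string; the `reconcile` function, the state names and the action names are hypothetical and are not taken from Titus.

```python
def reconcile(desired, observed):
    """Compare the scheduler's view with what the runtime reports.

    desired:  {task_id: state} according to the resource management layer
    observed: {task_id: state} reported by the container runtime
    Returns a list of (action, task_id) tuples that would converge the two.
    """
    actions = []
    for task_id, state in sorted(desired.items()):
        seen = observed.get(task_id)
        if state == "running" and seen != "running":
            actions.append(("launch", task_id))   # lost or crashed task
        elif state == "stopped" and seen == "running":
            actions.append(("kill", task_id))     # should no longer be running
    for task_id in sorted(observed.keys() - desired.keys()):
        if observed[task_id] == "running":
            actions.append(("kill", task_id))     # orphan unknown to the scheduler
    return actions


desired = {"t1": "running", "t2": "running", "t3": "stopped"}
observed = {"t1": "running", "t3": "running", "t9": "running"}
plan = reconcile(desired, observed)
print(plan)
```

Running a comparison like this periodically, rather than trusting that every state change message arrived, is one common way to design for failure between two distributed components.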
A key part of how containers help engineers become more productive is through developer tools. The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit). Newt helps simplify container development, both for local iteration and for onboarding onto Titus. Having a consistent container environment between Newt and Titus helps developers deploy with confidence.
Current Titus Usage
We run several Titus stacks across multiple test and production accounts across the three Amazon regions that power the Netflix service.
When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads. Last week, we launched over one million containers. These containers represented hundreds of workloads. This 1000X increase in container usage happened over a one-year timeframe, and growth doesn’t look to be slowing down.
We run a peak of 500 r3.8xl instances in support of our batch users. That represents 16,000 cores of compute with 120 TB of memory. We also added support for GPUs as a resource type using p2.8xl instances to power deep learning with neural nets and mini-batch.
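The capacity figures above can be checked with a quick calculation. The per-instance numbers below are an assumption taken from AWS's published r3.8xlarge specification (32 vCPUs and 244 GiB of memory), not from the post itself, and the post's "cores" figure lines up with the vCPU count.

```python
INSTANCES = 500
VCPUS_PER_INSTANCE = 32       # assumed: AWS's published r3.8xlarge spec
MEM_GIB_PER_INSTANCE = 244    # assumed: AWS's published r3.8xlarge spec

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE
total_mem_gib = INSTANCES * MEM_GIB_PER_INSTANCE
total_mem_tib = total_mem_gib / 1024

print(total_vcpus)               # 16000
print(round(total_mem_tib, 1))   # 119.1
```

Under those assumptions the fleet comes to 16,000 vCPUs and roughly 119 TiB of memory, which agrees with the rounded 120 TB quoted above.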
In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink based system. This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed. These and other services use thousands of m4.4xl instances.
While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately. That has changed as Titus containers recently started running services that satisfy Netflix customer requests.
Supporting customer facing services is not a challenge to be taken lightly. We’ve spent the last six months duplicating live traffic between virtual machines and containers. We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists. This diligence gave us the confidence to move forward making such a large change in our infrastructure.
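The post does not describe how the traffic duplication was implemented. As a rough illustration of the general technique, the sketch below serves every request from the existing path and replays it against the new one purely for comparison. The `shadow_call` function and both stand-in services are hypothetical, and a production setup would most likely mirror traffic asynchronously rather than inline.

```python
def shadow_call(request, primary, shadow, mismatches):
    """Serve `request` from `primary`; replay it against `shadow` for comparison.

    The caller always gets the primary's answer. Shadow failures or differing
    answers are only recorded, never surfaced to the customer.
    """
    response = primary(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:  # a broken shadow must not affect live traffic
        mismatches.append((request, response, repr(exc)))
    return response


def vm_service(req):
    """Stand-in for the existing VM-based service."""
    return req.upper()


def container_service(req):
    """Stand-in for the same service running on containers."""
    if req == "boom":
        raise RuntimeError("container failed")
    return req.upper()


mismatches = []
results = [shadow_call(r, vm_service, container_service, mismatches)
           for r in ["play", "boom"]]
print(results, len(mismatches))
```

The recorded mismatches are what an operations team would review before trusting the new path with real customer requests.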
The Titus Team
One of the key aspects of success of Titus at Netflix has been the experience and growth of the Titus development team. Our container users trust the team to keep Titus operational and innovating with their needs.
We are not done growing the team yet. We are looking to expand the container runtime as well as our developer experience. If working on container-focused infrastructure excites you and you’d like to be part of the future of Titus, check out our jobs page.
Andrew Spyker, Andrew Leung and Tim Bozarth
On behalf of the entire Titus development team
Published in Netflix TechBlog
Written by Netflix Technology Blog
COMMENTS
Netflix wanted to remove any single point of failure from its system. AWS offered highly reliable databases, storage, and redundant data centers. Netflix wanted cloud computing, so it wouldn't have to build big unreliable monoliths anymore. Netflix wanted to become a global service without building its own datacenters.
More than 36 million Netflix members worldwide view streamed content and access Netflix features delivered via cloud technology that the company has been developing since 2009. Netflix operates on a cloud platform based on Amazon Web Services (AWS). Over the years, Netflix engineers have developed numerous cloud tools and technologies, which ...
Netflix used cloud computing to improve its storage processes. One of the perks of moving to the cloud for any business is storage capacity that can grow on demand. For a company like Netflix, with an enormous amount of data to store, cloud computing was the sensible way forward. Netflix integrated with Amazon Web Services to manage this issue.
Benefits of the cloud. It took Netflix seven years to complete the migration to the cloud. In 2016, the last remaining data centres used by the streaming service were shut down. In their place was a new cloud infrastructure running all of Netflix's computing and storage needs, from customer information to recommendation algorithms.
At the moment Netflix has moved completely to the AWS cloud infrastructure and while a full seven years to make the move from their own data center might seem a long time for most people, they wanted to make sure that the problems they were facing while using the self-managed data center would not get imported into the cloud so they ended up ...
In the dynamic world of cloud computing, managing costs and resource utilization efficiently is a critical requirement for companies operating at scale. Netflix, a global leader in streaming services, operates one of the largest and most complex cloud infrastructures in the world. ... This case study explores how Netflix implemented this ...
Containers are already adding value to our proven globally available cloud platform based on Amazon EC2 virtual machines. We've shared pieces of Netflix's container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth. As part of this story, we will cover Titus: Netflix's infrastructural foundation for container-based applications.
Cloud computing: Netflix adopted cloud computing early on, which allowed it to scale its streaming service quickly and cost-effectively. This was a key enabler of the company's growth. Data analytics: Netflix uses data analytics to personalize its content offerings and build a highly engaged customer base.
Most cloud migration efforts seem to lie somewhere in the middle of the two extremes and strike a balance between reducing costs and improving management while keeping risks to a minimum. Prominent examples of cloud migration success stories include companies such as Netflix, Capital One, Hearst, Unilever, Airbnb, Condé Nast and Johnson & Johnson.