Good Code Design From Linux/Kernel

Learn how parts of the Linux and FFmpeg C codebases are organized to be extensible, almost as if they were meant to have “polymorphism”. Specifically, we’re going to briefly explore how Linux’s concept of everything is a file works at the source code level, as well as how FFmpeg can add support for new formats and codecs quickly and easily.


Good software design – Introduction

To write useful, long-term maintainable software, we tend to look for patterns and group them into abstractions, and it seems that’s the case for the devs behind Linux and FFmpeg too.

Software design

When we’re creating software, we’re building data structures and defining their behaviors and dependencies. The way we create and link them can be seen as the design/architecture of the software.

Let’s say we’re building a media framework that encodes/decodes video and audio. The codecs AV1, H264, HEVC, and AAC all perform some common operations, and if we can provide a generic abstraction that holds these common operations and data, we can rely on this concept instead of the concrete idea of what a specific codec does.

Through the years, many developers have noticed that good software design pays off as software grows in complexity.

This is one of the ideas behind good software design: to rely on components that are weakly linked, each with clear boundaries around what it should do.

Ruby

Maybe it’s easier to see all these concepts in practice. Let’s code a quick pseudo media stream framework that provides encoding and decoding for several codecs.
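
Here’s a minimal sketch of what such pseudo-code could look like (the class and method names are illustrative, not the original gist):

# anything that responds to encode/decode can act as a codec (duck typing)
class AV1Codec
  def encode(frame); "av1 encoded #{frame}"; end
  def decode(data);  "av1 decoded #{data}";  end
end

class H264Codec
  def encode(frame); "h264 encoded #{frame}"; end
  def decode(data);  "h264 decoded #{data}";  end
end

class MediaFramework
  def initialize
    # adding a new codec is just a matter of adding it to this list
    @codecs = {"av1" => AV1Codec.new, "h264" => H264Codec.new}
  end

  def encode(codec_name, frame)
    @codecs[codec_name].encode(frame)
  end
end

MediaFramework.new.encode("av1", "raw frame")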

This pseudo-code in Ruby tries to recreate what we’re discussing above: there is an implicit concept here of what operations a codec must have; in this case, the operations are encode and decode. Since Ruby is a dynamically typed language, any class can provide these two operations and act as a codec for us.

Developers sometimes may use the words: contract, API, interface, behavior and operations as synonyms.

This design might be considered good because if we want to add a new codec we just need to provide an implementation and add it to the list (the list could even be built dynamically). The idea is that this code seems easy to extend and maintain because it tries to keep the links between components weak (low coupling) while each component does only what it should do (high cohesion).

The Rails framework even enforces a way to organize the code: it adopts the model-view-controller (MVC) architecture.

Golang

When we go (no pun intended) to a statically typed language like golang, we need to be more formal and describe the required types, but it’s still doable.

The interface type in golang is much more powerful than Java’s similar construct because its definition is totally disconnected from the implementation and vice versa. We could even make each codec a ReadWriter and use it all around.
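
A rough sketch of that idea in Go could look like this (again, illustrative names only):

package main

import "fmt"

// Codec is the contract: anything implementing Encode and Decode
// can be used by the rest of the framework.
type Codec interface {
	Encode(frame string) string
	Decode(data string) string
}

type AV1 struct{}

func (AV1) Encode(frame string) string { return "av1 encoded " + frame }
func (AV1) Decode(data string) string  { return "av1 decoded " + data }

type H264 struct{}

func (H264) Encode(frame string) string { return "h264 encoded " + frame }
func (H264) Decode(data string) string  { return "h264 decoded " + data }

func main() {
	// the integration code only knows about the Codec abstraction
	codecs := map[string]Codec{"av1": AV1{}, "h264": H264{}}
	fmt.Println(codecs["av1"].Encode("raw frame"))
}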

C

In the C language we can still create the same behavior, but it’s a little bit different.

Code inspired by https://www.bottomupcs.com/abstration.xhtml

We first define the abstract operations (functions in this case) in a generic struct and then we fill it with the concrete code, like the av1 decoder and encoder real code.
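
A minimal sketch of this technique with function pointers (illustrative names, not the real FFmpeg or kernel code):

#include <stdio.h>

/* the "abstract" codec: a struct of function pointers that any
   concrete codec must fill in */
struct codec_operations {
    void (*encode)(const char *frame);
    void (*decode)(const char *data);
};

/* a concrete av1 implementation filling the generic struct */
static void av1_encode(const char *frame) { printf("av1 encoding %s\n", frame); }
static void av1_decode(const char *data)  { printf("av1 decoding %s\n", data); }

static struct codec_operations av1_ops = {
    .encode = av1_encode,
    .decode = av1_decode,
};

int main(void) {
    /* the integration code only deals with codec_operations */
    struct codec_operations *codec = &av1_ops;
    codec->encode("raw frame");
    return 0;
}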

Many other languages have somewhat similar mechanisms to dispatch methods or functions as if they were part of an agreed protocol, and then the system integration code can deal only with these high-level abstractions.

Linux Kernel – Everything is a file

Have you ever heard the expression everything is a file in Linux? The idea is to have a common interface for all kinds of resources in Linux; for instance, Linux handles network sockets, special files (like /proc/cpuinfo), and even USB devices as files.

This is a powerful idea that makes it easy to write or use programs for Linux, since we can rely on a set of well-known operations from this abstraction called file. Let’s see this in action:
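
As a small illustration (not the original example), here’s a tiny C program that reads the special file /proc/cpuinfo with exactly the same open/read/close calls it would use for any regular file:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* /proc/cpuinfo is not a regular file on disk, yet we read it
       with the very same system calls used for ordinary files */
    int fd = open("/proc/cpuinfo", O_RDONLY);
    if (fd < 0)
        return 1;

    char buffer[4096];
    ssize_t bytes;
    while ((bytes = read(fd, buffer, sizeof(buffer))) > 0)
        fwrite(buffer, 1, (size_t)bytes, stdout);

    close(fd);
    return 0;
}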

This is only possible because the concept of a file (data structure and operations) was designed to be one of the main ways to communicate among sub-systems. Here’s a gist of the file_operations API.
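
An abridged excerpt of that struct, from include/linux/fs.h (the full list of operations is much longer and varies across kernel versions):

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*release) (struct inode *, struct file *);
    /* ... many more operations ... */
};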

The struct file_operations defines what one should expect from the concept of what a file can do.

In the kernel source we can see how the ext4 file system implements these operations for directories (fs/ext4/dir.c fills in its own file_operations).

And even the cpuinfo proc file is implemented over this abstraction. When you’re operating on files under Linux, you’re actually dealing with the VFS (virtual file system), which delegates to the proper file implementation.
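
For instance, in older kernels the cpuinfo proc entry filled the struct roughly like this (newer kernels moved proc files to a dedicated struct proc_ops, but the idea is the same):

static const struct file_operations proc_cpuinfo_operations = {
    .open    = cpuinfo_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = seq_release,
};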

[Diagram: the VFS delegating file operations to the concrete implementations]

Source: https://ops.tips/blog/what-is-that-proc-thing/

FFmpeg – Formats

Here’s an overview of FFmpeg’s flow/architecture, showing that the internal components are linked mostly to abstract concepts such as AVCodec, and not directly to their implementations (H264, AV1, etc.).

[Diagram: FFmpeg architecture viewed from the transmuxing flow]

 

For the input files, FFmpeg defines a struct called AVInputFormat that is implemented by any format (video container) that wants to be usable as an input. The MKV format fills this structure with its implementation, and so does the MP4 format.
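
For example, the Matroska demuxer fills it roughly like this (abridged from libavformat/matroskadec.c; field and function names change across FFmpeg versions):

AVInputFormat ff_matroska_demuxer = {
    .name           = "matroska,webm",
    .long_name      = NULL_IF_CONFIG_SMALL("Matroska / WebM"),
    .priv_data_size = sizeof(MatroskaDemuxContext),
    .read_probe     = matroska_probe,
    .read_header    = matroska_read_header,
    .read_packet    = matroska_read_packet,
    .read_close     = matroska_read_close,
    /* ... */
};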

 

This design allows new codecs, formats, and protocols to be integrated and released more easily. dav1d (an open-source AV1 implementation) was integrated into FFmpeg in May this year, and you can follow along the commit diff to see how easy it was. In the end, it needs to register itself as an available codec and implement the expected operations.

No matter the language we use, we can (or at least try to) build software with low coupling and high cohesion in mind; these two basic properties help you build software that is easier to maintain and extend.


How to build a distributed throttling system with Nginx + Lua + Redis


At Globo.com’s last hackathon, Lucas Costa and I built a simple Lua library that provides a distributed rate measurement system; it depends on Redis and runs embedded in Nginx. But before we explain what we did, let’s start by understanding the problem that a throttling system tries to solve and some possible solutions.

Suppose we just built an API, but some users are doing too many requests, abusing their quota. How can we deal with them? Nginx has a rate-limiting feature that is easy to use:
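
A minimal config along these lines (the port and the response body are just for the test):

events {}

http {
  # one request per minute per client IP, tracked in the "mylimit" zone
  limit_req_zone $binary_remote_addr zone=mylimit:10m rate=1r/m;

  server {
    listen 8080;

    location / {
      limit_req zone=mylimit;
      return 200 "hello\n";
    }
  }
}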

This nginx configuration creates a zone called mylimit that limits each user, based on their IP, to a single request per minute. To test it, save the config file as nginx.conf and run the command:
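
Something like this should do, assuming the official nginx Docker image:

docker run --rm -p 8080:8080 -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro" nginx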

We can use curl to test its effectiveness:
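
For instance, firing a few requests in a row (only the first within a minute should succeed; the others should be rejected by limit_req):

for i in 1 2 3; do curl -i "http://localhost:8080/"; done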

[Screenshot: curl output showing the rate limit in action]

As you can see, our first request was just fine, right at the start of minute 50, but then our next two requests failed because we were restricted by the nginx limit_req directive that we set up to accept only 1 request per minute. In the next minute we received a successful response again.

This approach has a problem, though: a user could use multiple cloud VMs and then bypass the limit by IP. Let’s instead use the user token argument:
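
The only change is the key used by the zone; a sketch:

events {}

http {
  # now the rate limit key is the "token" query string argument
  limit_req_zone $arg_token zone=mylimit:10m rate=1r/m;

  server {
    listen 8080;

    location / {
      limit_req zone=mylimit;
      return 200 "hello\n";
    }
  }
}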

There is another good reason to avoid the limit-by-IP approach: many of your users can be behind a single IP, and by rate limiting them based on their IP you might be blocking legitimate uses.

Now a user can’t bypass the limit by using multiple IPs; their token is used as the key for the rate-limit counter.

[Screenshot: curl output showing requests limited per token]

You can even notice that once a different user, the one with token=0xCAFEE, requests the same API, the server replies with success.

Since our API is so useful, more and more users are becoming paid members and now we need to scale it out. What we can do is put a load balancer in front of two instances of our API. To act as the LB we can still use nginx; here’s a simple (workable) version of the required config.
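
A sketch of that LB config, assuming the two API instances are reachable as api1 and api2 (the service names used in the docker-compose file below):

events {}

http {
  upstream api {
    # round-robin (the default) between the two API instances
    server api1:8080;
    server api2:8080;
  }

  server {
    listen 8080;

    location / {
      proxy_pass http://api;
    }
  }
}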

Now, to simulate our scenario we need to use multiple containers, so let’s use docker-compose for this task; the config file just declares three services: two acting as our API and one as the LB.
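
A sketch of that docker-compose file, assuming the rate-limiting nginx config from before is saved as api.conf and the load balancer config as lb.conf (the file names are arbitrary):

version: "3"
services:
  api1:
    image: nginx
    volumes:
      - ./api.conf:/etc/nginx/nginx.conf:ro
  api2:
    image: nginx
    volumes:
      - ./api.conf:/etc/nginx/nginx.conf:ro
  lb:
    image: nginx
    ports:
      - "8080:8080"
    volumes:
      - ./lb.conf:/etc/nginx/nginx.conf:ro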

Run the command docker-compose up and then in another terminal tab simulate multiple requests.

When we request http://localhost:8080 we’re hitting the lb instance.

[Screenshot: curl output against the load balancer, where consecutive requests all return 200]

Weird, right?! Now our limit system is not working, or at least not properly. The first request was a 200, as expected, but so was the next one.

It turns out that the LB needs a way to forward requests to one of the two API instances. The default algorithm our LB uses is round-robin, which distributes the requests across the list of servers in turn, like a clock.

Nginx’s limit_req stores its counters in each node’s local memory; that’s why our first two requests were both successful: each API instance answered one of them.

What if we saved our counters in a shared data store? We could use Redis: it’s in-memory and pretty fast.


But how are we going to build this counting/rating system? This can be solved using a histogram to get the average, a leaky bucket algorithm, or a simplified sliding window as proposed by Cloudflare.

Implementing the sliding window algorithm is actually very easy: you keep two counters, one for the last minute and one for the current minute, and then you calculate the current rate by weighting the two counters as if the requests arrived at a perfectly constant rate.

To make things easier, let’s walk through an example of this algorithm in action. Say our throttling system allows 10 requests per minute, the past-minute counter for a token is 6, the current-minute counter is 1, and we are at second 10 of the current minute.

rate = last_minute_counter * ((60 - current_second) / 60) + current_minute_counter
6 * ((60 - 10) / 60) + 1 = 6 # the current rate is 6, which is under 10 req/m

To store the counters we used three simple (O(1)) redis operations:

  • GET to retrieve the last minute’s counter
  • INCR to increment the current minute’s counter and retrieve its new value
  • EXPIRE to set an expiration on the current counter, since it won’t be useful after two minutes
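
A minimal sketch of how these calls can be combined, using the lua-resty-redis client (the module, key, and host names here are illustrative, not the actual library code):

-- rate_limit.lua: sliding window rate per token
local redis = require "resty.redis"

local _M = {}

function _M.current_rate(token)
  local red = redis:new()
  red:set_timeout(100) -- milliseconds
  red:connect("redis", 6379) -- error handling omitted for brevity

  local now = ngx.time()
  local current_second = tonumber(os.date("%S", now))
  -- the {token} hash tag keeps both keys in the same Redis Cluster slot
  local current_key = "rate:{" .. token .. "}:" .. os.date("%Y%m%d%H%M", now)
  local last_key    = "rate:{" .. token .. "}:" .. os.date("%Y%m%d%H%M", now - 60)

  local last_counter    = tonumber(red:get(last_key)) or 0 -- GET
  local current_counter = red:incr(current_key)            -- INCR
  red:expire(current_key, 2 * 60)                          -- EXPIRE

  return last_counter * ((60 - current_second) / 60) + current_counter
end

return _M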

We decided not to use the MULTI operation, so in theory a really small percentage of users can be wrongly allowed; one of the reasons to dismiss MULTI was that the Lua Redis Cluster driver we use doesn’t support it, but we do use pipelining and hash tags to save two extra round trips.

Now it’s time to integrate the Lua sliding window rate algorithm into nginx.

You probably want to use the access_by_lua phase instead of the content_by_lua phase of the nginx request cycle.
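
A sketch of that integration, assuming OpenResty (or nginx built with lua-nginx-module) and that the sliding window code above lives in a module named rate_limit reachable through lua_package_path:

events {}

http {
  lua_package_path "/etc/nginx/lua/?.lua;;";

  server {
    listen 8080;

    location / {
      access_by_lua_block {
        local token = ngx.var.arg_token or ngx.var.remote_addr
        local rate = require("rate_limit").current_rate(token)
        if rate > 10 then
          return ngx.exit(ngx.HTTP_FORBIDDEN)
        end
      }

      content_by_lua_block {
        ngx.say("hello")
      }
    }
  }
}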

The nginx configuration is uncomplicated to understand: it uses the token argument as the key, and if the rate is above 10 req/m we just reply with 403. Simple solutions are usually elegant, and they can be scalable and good enough.

The Lua library and this complete example are on GitHub, and you can run and test them locally without much effort.

Use URL.createObjectURL to make your videos start faster

[Animated GIF: playback start-up]

During our last hackathon, we wanted to make our playback start faster. Before our playback starts to show something to the final user we do around 5 to 6 requests (counting some manifests), and our goal was to cut as many of them as we could.


The first step was very easy: we just moved that logic from the client side to the server side and then injected the prepared player into the page.

Pseudo Ruby server side code:

some_api = get("http://some.api/v/#{@id}/playlist")
other_api = get("http://other.api/v/#{some_api.id}/playlist")
# ...
@final_uri = "#{protocol}://#{domain}/#{path}/#{manifest}"

Pseudo JS client side code:

new Our.Player({source: {{ @final_uri }} });


Okay, that’s nice, but can we go further? Yes, how about embedding our manifests into the page?! It turns out that we can do that with the power of URL.createObjectURL; this API gives us a URL for a JS blob/object/file.

// URL.createObjectURL is pretty trivial
// to use and powerful as well
var blob = new Blob(["#M3U8...."], {type: "application/x-mpegurl"});
var url = URL.createObjectURL(blob);

Pseudo Ruby server side code:

some_api = get("http://some.api/v/#{@id}/playlist")
other_api = get("http://other.api/v/#{some_api.id}/playlist")
# ...
@final_uri = "#{protocol}://#{domain}/#{path}/#{manifest}"
@main_manifest = get(@final_uri)
@sub_manifests = @main_manifest
                 .split_by_uri
                 .map {|uri| get(uri)}

Pseudo JS client side code:

  var mime = "application/x-mpegurl";
  var manifest = {{ @main_manifest }};
  var subManifests = {{ @sub_manifests }};
  var subManifestsBlobURL = subManifests
                            .splitByURL()
                            .map((content) => objectURLFor(content, mime));
  // pseudo: swap each sub-manifest URI in the main manifest
  // for its generated blob URL
  var finalMainManifest = manifest
                          .splitByLine()
                          .map((line, id) => line.replace(id, subManifestsBlobURL[id]))
                          .joinWithLines();

  function objectURLFor(content, mime) {
    var blob = new Blob([content], {type: mime});
    return URL.createObjectURL(blob);
  }

  new Our.Player({
    src: objectURLFor(finalMainManifest, mime)
  });


We thought we were done, but then we came up with the idea of doing the same process for the first video segment. The page will now weigh more, but the player would start playing almost instantaneously.

// for a regular text manifest we can use regular Blob objects
// but for binary data we can rely on Uint8Array
var segment = new Uint8Array({{ segments.first }});

By the way, our player is based on Clappr and this particular test was done with the hls.js playback, which uses the fetch API to get the video segments; fetching the created URL works just fine.

The animated gif you see at the start of the post was done without the segment-on-the-page optimization. And we just ignored the possible side effects on the player’s ABR algorithm (which could think it has high bandwidth due to the fast manifest fetch).

Finally, we could make it even faster by using MPEG-DASH and its template timeline format, using shorter segment sizes, and tuning the ABR algorithm to be faster at start-up.

Behind the scenes of live streaming the FIFA World Cup 2018


Globo.com, the digital branch of Globo Group, had the rights to do the online live streaming of the 2018 FIFA World Cup for the entire Brazilian territory.

We have done this in the past, and I think sharing the experience may be useful for curious minds that want to learn more about the digital live streaming ecosystem, as well as for people interested in how Brazil’s infrastructure and users’ demand behave in an event of this scale.

Before the event – Road to the world cup

On average, we usually ingest and process about 1TB of video and users fetch around 1PB, every single day. Even before the World Cup started, the live stream of a single soccer match had a peak of more than 500K simultaneous users and more than 400K requests per second.

When comparing these numbers to previous events such as the Olympic Games or the FIFA World Cup 2014 we can see an exponential evolution in demand.

[Chart: audience growth compared to previous events]

Back in 2014, Globo.com’s CDN was equipped with 20Gbps network interfaces. Now the nodes were upgraded with 40Gbps, 50Gbps, and 100Gbps NICs. Processors were also upgraded, enabling us to deliver 84Gbps from a single machine as part of the preparation for the World Cup.

I’m glad to say that the Linux kernel fine-tuning required was minimal, since newer kernel versions are very well tuned by default.


We broke the simultaneous users record set by the 2014 FIFA World Cup well before the first matches of the 2018 World Cup. We also noticed an increase in the overall bitrate, which likely indicates that Internet infrastructure in Brazil has improved significantly in the past four years.

Platform overview – The 1:1:1 strategy

Let’s not focus on the workflow before the video arrives at our ingest encoders; just assume that it comes from the stadiums in Russia and reaches our ingest encoders directly. With this simplification in place, we can assume that there are basically two kinds of users interacting with the video platform: the ones producing the video and the ones consuming it on the other end.


Consumers of the video are the visitors of our internet properties, and they watch the live content through Globo.com’s video player, which is responsible for requesting video content from Globo.com’s CDN or one of our CDN partners.

Globo.com player is based on Clappr, an open source HTML5 player that uses hls.js and shaka as its core playback engines.

Globo.com’s CDN nodes are mostly built on top of OSS projects such as Linux, Nginx (nginx-lua), the Lua programming language, and Redis. Our origin is made of multiple ingest points and a mix of solutions such as FFmpeg, Elemental, and OBS. A Cassandra cluster is also deployed with the responsibility of storing and manipulating video segments.

OSS projects play a key role in all the initiatives we have within our technology and engineering teams. We also rely a lot on dozens of open source libraries and we try as much as we can to give stuff back to the community.

If you want to know how this architecture works you can learn from the awesome post: Globo.com’s live video platform for the 2014 FIFA World Cup

Constrained by bandwidth – Control the ball

The truth is: the Internet is physically limited. It doesn’t matter if you get more servers; in the end, if a group of users has a 10Gb/s link to us, that’s all we can stream to them.

Or we can explore external CDNs and more PoPs, but I hope you get the idea! 🙂

In a big event such as the World Cup there will be some congestion on the links between our CDN and the final users. How we tackle this problem (of limited bandwidth) can be divided into three levels:

  1. OS :: TCP congestion control – the lowest level of control; when the connection is saturated, this control is applied to each user.
  2. Player :: ABR algorithm – it watches metrics such as network speed, CPU load, and frame drops to decide whether it should switch to a better or worse bitrate quality.
  3. Server :: group bitrate control – when we identify that a group of users sharing the same link is about to saturate it, we can try to steer the players to a lower bitrate and accommodate more users.

During the event – Goals

Even before the knockout stage, we were able to beat all of our previous records, serving about 1.2M simultaneous users during a single match. Our live CDN delivered, at its peak, about 700K requests/s, and our worst response time was half a second for a 4-second video segment.

Some of our servers were able to reach a peak of 37Gb/s in bandwidth. We also delivered the 4K live streaming using HEVC with a delay of around 25 seconds.

We are constantly evolving the platform and looking at bleeding-edge technologies such as AV1. With the help of the open source community and the growing number of talented people on our technology teams, we hope to keep beating records and delivering the best experience to our users.
