Ateme has consistently won awards for its ground-breaking solutions for improving video-viewing experiences. But what are the technologies that power the future of video, and who are the people inventing them? In this series of blog articles, we meet some of the brightest brains at Ateme. These are our PhDs who invent new ways of compressing and delivering video. And we find out what they are working on now. One field Ateme is investing in is artificial intelligence (AI). In this first article of the series, Sébastien Pelurson, PhD — a Research Engineer at Ateme — shares what he is working on, what the challenges are, and what all this means for viewers.
What is your role at Ateme?
I’m working as a data scientist in Ateme’s research and innovation team. My role is to use artificial intelligence technologies to improve video coding and services.
What in particular have you been working on?
I’m mainly working on improving video coding using AI. I’ve been working on multiple applications, ranging from saliency prediction and semantic segmentation to image denoising and video frame interpolation. In each of these fields, my job is to design, train, and evaluate models that can meet the various constraints of the use cases we target, and then be used in a production environment. As a researcher, I also aim to improve the state of the art by designing more effective solutions.
What is AI and how does it apply to the video industry?
Artificial intelligence refers to technologies that allow computers to perform tasks that have so far required human intelligence. While this field has existed for several decades, for example through expert systems, it has come to prominence with the development of deep learning. With this kind of technology, the machine learns to accomplish tasks from data, without following explicit instructions. It gradually learns to improve its accuracy by identifying useful patterns in data.
Unlike traditional machine learning, which relies on handcrafted features, deep learning takes advantage of artificial neural networks. This makes it possible to ingest unstructured data such as text or images. Thus, useful features are no longer designed manually, but are learned by the models. The field of deep learning has grown fast thanks to powerful hardware such as Graphics Processing Units (GPUs), large-scale datasets, and efficient model architectures. Today, it enables performance that would be impossible to reach with traditional solutions. This is the case for tasks related to computer vision and natural-language processing.
AI has had an impact on many fields in our daily life, and the video industry is no exception. Think about recommendation systems or the creation of highlights on VOD platforms. More specifically, AI is also having an impact on the video coding field. It can be used to improve existing traditional codecs, helping encoders in their decision processes, or during the pre/post processing steps, for example to remove artifacts. More recently, there has also been focus on the creation of new codecs entirely based on deep-learning technologies.
What are the challenges of applying artificial intelligence to the video industry?
The main challenge is to design efficient AI models that meet industrial constraints. AI research has been very active for the past ten years. So many solutions exist for a wide range of tasks, but they can rarely be used as-is in an industrial environment.
This is due to a number of reasons. The first one is the complexity of the models. Deep-learning models become more and more accurate from year to year, thanks specifically to new architecture designs. This accuracy comes at the cost of ever-increasing complexity. For instance, for the same image-classification task, one of the first deep-learning models, presented in 2012, had 60 million parameters. Recent architectures such as Transformers can have more than 2 billion. So, well-performing, state-of-the-art solutions cannot be used in very constrained environments such as live video coding.
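To give a sense of scale, here is a rough back-of-the-envelope sketch of the memory needed just to store the weights of the two model sizes quoted above. The function name and the choice of 4 bytes per parameter (32-bit floats) are assumptions made for illustration; real deployments also need memory for activations and may use smaller number formats.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Memory needed to store the weights alone, in megabytes (10**6 bytes).

    bytes_per_param=4 assumes 32-bit floating-point weights (an assumption).
    """
    return num_params * bytes_per_param / 1e6


# Parameter counts taken from the figures quoted in the interview.
models = {
    "2012 image classifier (60 million parameters)": 60_000_000,
    "recent Transformer (2 billion parameters)": 2_000_000_000,
}

for name, num_params in models.items():
    print(f"{name}: {weight_memory_mb(num_params):,.0f} MB of float32 weights")
```

Under these assumptions the weights alone grow from roughly 240 MB to roughly 8,000 MB, which helps explain why the largest models are hard to fit into a live encoding pipeline.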
Also, most state-of-the-art models are trained and evaluated on public datasets so that results are comparable and reproducible. But in order to perform well, models need to be trained on data similar to the data they will see at inference time. This is known as “data distribution similarity.” If the distributions are not similar, one cannot predict how the model will perform in a production environment. So, the training dataset has a huge impact on model performance. Building a specific dataset is very time-consuming, especially for supervised learning, which requires the annotation of every dataset sample.
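One simple way to get a feel for how far apart two data distributions are is to compare histograms of a single feature. The sketch below uses the Jensen-Shannon divergence for that purpose. The feature (a normalized per-frame brightness value), the sample values, and the function names are all hypothetical, and this is only one of many possible checks, not a description of how Ateme measures similarity.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 whenever x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical normalized brightness values for training and production frames.
train_values = [0.30, 0.35, 0.40, 0.45, 0.50]
prod_values = [0.10, 0.12, 0.15, 0.45, 0.50]

p_train = histogram(train_values, bins=4, lo=0.0, hi=1.0)
p_prod = histogram(prod_values, bins=4, lo=0.0, hi=1.0)
shift = js_divergence(p_train, p_prod)
print(f"Jensen-Shannon divergence: {shift:.3f} bits")
```

A value close to 0 suggests the production data looks like the training data for this feature; a value close to 1 suggests the model is being asked to work on content it has rarely seen.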
What have you achieved in this field at Ateme?
My main focus has been on understanding scenes and adapting the visual quality to the image’s regions of interest. The idea is to detect the areas viewers will focus on in a video sequence, and to preserve or improve their quality. This work is based on the foveation mechanism of the human visual system, which captures only visually important regions in high resolution. In return, it pays little attention to peripheral regions, which are perceived in low resolution.
We recently showed that, by using a saliency prediction model, we can reduce the bitrate by between 6% and almost 30% while maintaining the same visual quality. This holds both when the model drives a preprocessing filter that simplifies the input sequence and when it drives a rate-control module. In order to run at an acceptable speed, the model we used is based on a light encoder/decoder architecture. It has also been trained on a public dataset. Our evaluations on specific content have highlighted some limitations due to this choice.
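As an illustration of how a saliency map could steer an encoder, the sketch below averages saliency over blocks and turns each average into a quantization parameter (QP) offset: salient blocks get a negative offset (finer quantization, more bits) and non-salient blocks a positive one. The linear mapping, the block size, the offset range, and the function names are assumptions chosen for illustration; the interview does not describe the actual filter or rate-control logic.

```python
def block_means(saliency, block):
    """Average a 2D saliency map (values in [0, 1]) over block x block tiles."""
    rows, cols = len(saliency), len(saliency[0])
    means = []
    for by in range(0, rows, block):
        row = []
        for bx in range(0, cols, block):
            vals = [
                saliency[y][x]
                for y in range(by, min(by + block, rows))
                for x in range(bx, min(bx + block, cols))
            ]
            row.append(sum(vals) / len(vals))
        means.append(row)
    return means


def qp_offsets(saliency, block=2, max_offset=4):
    """Map block saliency to QP offsets in [-max_offset, +max_offset].

    Saliency 1.0 -> -max_offset (higher quality),
    saliency 0.0 -> +max_offset (lower quality).
    This linear mapping is a hypothetical choice.
    """
    return [
        [round(max_offset * (1 - 2 * s)) for s in row]
        for row in block_means(saliency, block)
    ]


# Hypothetical 4x4 saliency map: the top-left corner is the region of interest.
saliency_map = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.25, 0.25],
    [0.5, 0.5, 0.25, 0.25],
]
print(qp_offsets(saliency_map))  # [[-4, 4], [0, 2]]
```

The design intent is simply to spend bits where viewers look; a production system would also have to keep the overall bitrate on target.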
Creating a new saliency dataset is very complex. It requires the use of an eye-tracking system on many subjects, and the dataset must cover all the kinds of content that could be used in video coding. For this reason, we decided to improve our solution by designing a different model architecture. Our latest solution uses a multitask learning strategy, in which a single model solves multiple tasks at the same time, using a different dataset for each one. This has the advantage of limiting the impact of the biases present in each specific dataset: the model extracts patterns from every dataset, which allows it to generalize better. Moreover, using this approach on similar tasks, such as saliency prediction and semantic segmentation, allows both outputs to be merged to further improve predictions.
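The interview does not say how the two outputs are merged. As one possible illustration, the sketch below blends a saliency map with a per-class importance prior taken from a segmentation map. The class names, the prior values, the blending weight `alpha`, and the function name are all hypothetical.

```python
def merge_maps(saliency, labels, class_prior, alpha=0.5):
    """Blend per-pixel saliency with a per-class importance prior.

    saliency:    2D list of values in [0, 1]
    labels:      2D list of class names from a segmentation model
    class_prior: dict mapping class name -> importance in [0, 1] (assumed)
    alpha:       weight given to the segmentation prior (assumed)
    Unknown classes fall back to an importance of 0.0.
    """
    return [
        [
            (1 - alpha) * s + alpha * class_prior.get(label, 0.0)
            for s, label in zip(sal_row, label_row)
        ]
        for sal_row, label_row in zip(saliency, labels)
    ]


# Hypothetical 2x2 outputs of the saliency and segmentation heads.
saliency = [[0.8, 0.2], [0.4, 0.0]]
labels = [["face", "sky"], ["text", "sky"]]
prior = {"face": 1.0, "text": 0.8, "sky": 0.0}

merged = merge_maps(saliency, labels, prior)
for row in merged:
    print([f"{v:.2f}" for v in row])
```

In this toy example the text region, which the saliency head rated only moderately, is pulled up by the segmentation prior, while the sky stays low in both.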
This approach leads to models that are a bit more complex. We then worked with model optimization techniques such as quantization, operator fusion, and pruning to speed them up for a specific target platform.
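To make two of these techniques concrete, here is a minimal sketch of symmetric 8-bit weight quantization and magnitude-based pruning on a plain list of weights. The function names, the per-tensor scale, and the sample weights are assumptions for illustration; real toolchains apply these steps per layer, with calibration and retraining, and operator fusion is not shown.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


# Hypothetical weights of one small layer.
weights = [0.5, -1.27, 0.01, 0.9]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = prune_by_magnitude(weights, sparsity=0.5)

print("int8 values:", q)
print("max rounding error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("pruned weights:", pruned)
```

Quantization shrinks each weight from 32 bits to 8 at the cost of a small rounding error, while pruning removes the weights that contribute least; both trade a little accuracy for speed and memory.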
What does artificial intelligence change for viewers?
Even without AI, the main goal is always to improve the quality of experience for viewers. It’s critical to offer new services or improve the bitrate/distortion tradeoff.
Better coding efficiency leads to either better perceived quality or lower bitrate consumption. New codecs aim to improve coding efficiency. However, the installed base of receivers is not always compatible with them, so adoption can take a long time. That is why existing codecs need to be improved. Today, some AI solutions have brought improvements to coding tools that had not been achieved with traditional algorithms. This kind of approach therefore seems to be a promising path for improving existing codecs.
The most visible result for viewers is in the services. The most popular ones are recommendation systems. These allow streaming platforms to suggest personalized content to users. But AI can also be used by these platforms to choose new content, for example by predicting whether a particular video will be appreciated by users. Other tasks that can be addressed with AI include creating highlights and generating subtitles.
AI technologies have been evolving very quickly over the last decade. Adoption of new solutions in specific fields such as the video industry may take time. This is due to the challenges mentioned above. But AI technologies have already demonstrated their relevance by reaching performance levels never achieved before on certain tasks. An example is end-to-end deep-learning video coding. It is already competing with state-of-the-art still-image coders, after only a few years of research.