KEYWORDS: Video coding, Video, Power consumption, Energy efficiency, Video compression, Open source software, Scalable video coding, Video processing, Clocks
In this paper, we present a methodology for benchmarking the coding efficiency and energy efficiency of software and hardware video transcoding implementations. This study builds upon our previous work, which focused on software encoders such as x264, x265, libvpx, vvenc, and SVT-AV1. We have since added a closed-source software encoder implementation, EVE-VP9, as well as Meta's MSVP VP9 encoder as a hardware representative, and expanded the test set to include a wider variety of content in our analysis. To ensure a fair comparison between software and hardware encoders, we normalize video encoding efficiency to energy used in watt-hours. Our proposed test methodology includes a detailed description of the process for measuring compression efficiency and energy consumption. We summarize the limitations of our methodology and identify future opportunities for improvement.
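As a hedged illustration of the watt-hour normalization mentioned above, the sketch below integrates sampled power readings into watt-hours and expresses encoding throughput per watt-hour; the sampling interface, the one-second interval, and the figures are assumptions, not the measurement setup used in the paper.

```python
# Minimal sketch (not the paper's harness): convert power samples taken during
# an encode into watt-hours and normalize throughput by the energy consumed.
def energy_wh(power_samples_w, interval_s=1.0):
    """Integrate power samples (watts) taken every `interval_s` seconds."""
    joules = sum(power_samples_w) * interval_s          # rectangle rule
    return joules / 3600.0                              # 1 Wh = 3600 J

def megapixels_per_wh(width, height, frames, power_samples_w, interval_s=1.0):
    """Energy-normalized throughput for one encode run."""
    megapixels = width * height * frames / 1e6
    return megapixels / energy_wh(power_samples_w, interval_s)

# Example: a 1080p, 600-frame encode drawing ~65 W for 120 one-second samples.
print(megapixels_per_wh(1920, 1080, 600, [65.0] * 120))
```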
Videos uploaded to Meta's Family of Apps are transcoded into multiple bitstreams of various codec formats, resolutions and quality levels to provide the best video quality across a wide variety of devices and connection bandwidth constraints. On Facebook alone, there are more than 4 billion video views per day, and to handle video processing at this scale, we needed a solution that can deliver the best possible video quality in the shortest encoding time, while being energy efficient, programmable, and scalable. In this paper, we present the Meta Scalable Video Processor (MSVP), which performs video processing at quality on par with SW solutions but at a small fraction of the compute time and energy. Each MSVP ASIC can offer a peak SIMO (single input, multiple output) transcoding performance of 4K at 15 fps at the highest quality configuration and can scale up to 4K at 60 fps at the standard quality configuration. This performance is achieved at ~10 W of PCIe module power. We achieved a throughput gain of ~9x for H.264 when compared against libx264 SW encoding. For VP9, we achieved a throughput gain of ~50x when compared with the libvpx speed 2 preset. Key components of MSVP transcoding include video decoding, scaling, encoding and quality metric computation. In this paper, we go over the ASIC architecture of MSVP and the design of its individual components, and compare its perf/W versus quality against SW encoders in standard industry use.
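To picture the single-input-multiple-output (SIMO) flow named above (decode, scale, encode, quality metrics), here is a schematic sketch of decoding once and producing several renditions per frame; the helper callables and the ABR ladder are hypothetical placeholders, not MSVP or driver interfaces.

```python
# Schematic SIMO transcode loop: decode the source once, then produce several
# ABR renditions from each decoded frame. All helpers are hypothetical stubs.
RENDITIONS = [(1920, 1080), (1280, 720), (854, 480)]    # assumed ABR ladder

def transcode_simo(bitstream, decode_frame, scale, encode, compute_ssim):
    results = {res: [] for res in RENDITIONS}
    for frame in decode_frame(bitstream):               # decode once
        for res in RENDITIONS:                          # multiple outputs
            scaled = scale(frame, res)
            packet = encode(scaled, res)
            quality = compute_ssim(scaled, packet)      # in-line quality metric
            results[res].append((packet, quality))
    return results
```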
Image compression has experienced a new revolution with the success of deep learning, which yields superior rate-distortion (RD) performance against traditional codecs. Yet, high computational complexity and energy consumption are the main bottlenecks that hinder the practical applicability of deep-learning-based (DL-based) codecs. Inspired by the neural network's hierarchical structure yet with lower complexity, we propose a new lightweight image coding framework, named the "Green Image Codec" (GIC), in this work. First, GIC down-samples an input image into several spatial resolutions from fine-to-coarse grids and computes image residuals between adjacent grids. Then, it encodes the coarsest content, interpolates content from coarse-to-fine grids, encodes residuals, and adds residuals to interpolated images for reconstruction. All coding steps are implemented by vector quantization (VQ), while all interpolation steps are conducted with Lanczos interpolation. To facilitate VQ codebook training, the Saab transform is applied for energy compaction and, thus, dimension reduction. A simple rate-distortion optimization (RDO) procedure is developed to help select the coding parameters. GIC yields an RD performance comparable with BPG at significantly lower complexity.
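The coarse-to-fine structure described above can be sketched as follows; this is a minimal single-channel illustration that assumes Pillow's Lanczos resampling and uses an identity stub where the Saab-transform/VQ coding of the coarsest content and the residuals would go.

```python
# Sketch of the GIC-style multi-resolution structure: downsample fine-to-coarse,
# code the coarsest grid, then interpolate coarse-to-fine and code residuals.
# `code()` is a placeholder for the Saab-transform + vector-quantization steps.
import numpy as np
from PIL import Image

def lanczos_resize(arr, size):
    img = Image.fromarray(arr.astype(np.float32), mode="F")   # grayscale float
    return np.asarray(img.resize(size, Image.LANCZOS), dtype=np.float64)

def code(x):
    return x   # placeholder: a VQ encode/decode round trip would go here

def gic_reconstruct(image, levels=3):
    pyramid = [np.asarray(image, dtype=np.float64)]
    for _ in range(levels):                      # fine-to-coarse downsampling
        h, w = pyramid[-1].shape
        pyramid.append(lanczos_resize(pyramid[-1], (w // 2, h // 2)))
    recon = code(pyramid[-1])                    # code the coarsest content
    for fine in reversed(pyramid[:-1]):          # coarse-to-fine interpolation
        h, w = fine.shape
        up = lanczos_resize(recon, (w, h))
        residual = code(fine - up)               # code the inter-grid residual
        recon = up + residual
    return recon
```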
AV1 is the first generation of royalty-free video coding standards developed by the Alliance for Open Media (AOM). Since its release in 2018, it has gained wide adoption in the industry. Major service providers, such as YouTube and Netflix, have started streaming AV1-encoded content. Even though more and more vendors have started to implement HW AV1 decoders in their products, software decoders with very good performance are still critical to enable AV1 playback on a broader range of devices, especially mobile devices. For this purpose, VideoLAN created dav1d, a portable and highly optimized AV1 software decoder. The decoder implements all AV1 bitstream features. Dataflow is organized to allow the various decoding stages (bitstream parsing, pixel reconstruction and in-loop postfilters) to be executed directly after each other for the same superblock row, allowing memory to stay in cache for the most common frame resolutions. The project includes more than 200k lines of platform-specific assembly optimizations, including Neon optimizations for arm32/aarch64, as well as SSE, AVX2 (Haswell) and AVX-512 (Icelake/Tigerlake) for x86 [3], to achieve optimal performance on the most popular devices. For multi-threading, dav1d uses a generic task-pool design, which splits decoding stages into mini-tasks. This design allows multiple decoding stages to execute in parallel for adjacent tiles, superblock rows and frames, and keeps common thread counts (2-16) efficiently occupied on multiple architectures with minimal memory or processing overhead. To test the performance of dav1d on real devices, a set of low-end to high-end Android mobile devices was selected for benchmarking. To simulate real-time playback with display, the VLC video player application with dav1d integration is used. Extensive testing is done using a wide range of video test vectors at various resolutions, bitrates, and framerates. The benchmarking and analysis are conducted to gain insights into single- and multi-threaded performance, the impact of video coding tools, CPU utilization and battery drain. Overall, with the dav1d decoder, real-time AV1 playback of 720p 30 fps @ 2 Mbps is feasible for low-end devices with 4 threads, and 1080p 30 fps @ 4 Mbps is feasible for high-end and mid-range devices with 4 threads.
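The task-pool idea can be illustrated schematically; this is a Python toy, not dav1d's C implementation. Decoding stages become small tasks with explicit dependencies, so independent superblock rows, tiles and frames can run on any worker as soon as their inputs are ready.

```python
# Toy dependency-driven task pool (illustration only, not dav1d's actual design).
from concurrent.futures import ThreadPoolExecutor
import threading

class TaskPool:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.cv = threading.Condition()
        self.funcs = {}          # task name -> callable
        self.pending = {}        # task name -> names of unfinished dependencies
        self.dependents = {}     # task name -> tasks waiting on it
        self.done = set()

    def add(self, name, func, deps=()):
        with self.cv:
            self.funcs[name] = func
            remaining = {d for d in deps if d not in self.done}
            self.pending[name] = remaining
            for d in remaining:
                self.dependents.setdefault(d, []).append(name)
            ready = not remaining
        if ready:
            self.pool.submit(self._run, name)

    def _run(self, name):
        self.funcs[name]()                       # execute the mini-task
        ready = []
        with self.cv:
            self.done.add(name)
            for child in self.dependents.pop(name, []):
                self.pending[child].discard(name)
                if not self.pending[child]:
                    ready.append(child)          # all inputs ready: schedule it
            self.cv.notify_all()
        for child in ready:
            self.pool.submit(self._run, child)

    def wait_all(self):
        with self.cv:
            self.cv.wait_for(lambda: len(self.done) == len(self.funcs))

# Example: parse -> reconstruct -> loop-filter chains for two superblock rows.
pool = TaskPool(workers=2)
for row in range(2):
    pool.add(f"parse{row}",  lambda r=row: print("parse row", r))
    pool.add(f"recon{row}",  lambda r=row: print("recon row", r),  deps=[f"parse{row}"])
    pool.add(f"filter{row}", lambda r=row: print("filter row", r), deps=[f"recon{row}"])
pool.wait_all()
```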
KEYWORDS: Computer programming, Video, Video coding, Data modeling, Statistical modeling, Quantization, Machine learning, Statistical analysis, Principal component analysis, Video compression, Video processing, Low bit rate video
In the era of the COVID-19 pandemic, videos are very important to the billions of people staying and working at home. Two-pass video encoding allows for a refinement of parameters based on statistics obtained from the first pass. Given the variety of characteristics in user-generated content, there is an opportunity to make this refinement optimal for this type of content. We show how we can replace the traditional models used for rate control in video coding with better prediction models using linear and nonlinear model functions. Moreover, we can utilize these first-pass statistics to further refine the traditional encoding recipes that are typically applied to all input video sequences. Our work can provide much-needed bitrate savings for many different encoders, and we highlight this by testing on typical Facebook video content.
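As a hedged sketch of fitting a rate-control model to first-pass statistics, the snippet below fits a log-linear rate-quantizer model to (QP, bits) pairs and inverts it to pick a QP for a target bitrate; the model form and the numbers are illustrative assumptions, not the models proposed in the paper.

```python
# Sketch: fit a per-title rate model, log(bits) ~ a + b*log(qstep), from
# first-pass statistics, then invert it to hit a target bit budget in pass two.
import numpy as np

first_pass = [(20, 9.1e6), (28, 4.2e6), (36, 1.9e6), (44, 0.8e6)]  # (QP, bits), illustrative
qp = np.array([q for q, _ in first_pass], dtype=float)
bits = np.array([b for _, b in first_pass], dtype=float)
qstep = 2.0 ** ((qp - 4.0) / 6.0)             # H.264/HEVC-style QP-to-step mapping

b_coef, a_coef = np.polyfit(np.log(qstep), np.log(bits), 1)   # log-linear fit

def qp_for_target(target_bits):
    """Invert the fitted model: choose the QP predicted to hit target_bits."""
    log_qstep = (np.log(target_bits) - a_coef) / b_coef
    return 4.0 + 6.0 * log_qstep / np.log(2.0)

print(round(qp_for_target(3.0e6)))            # e.g. pick a QP for a 3 Mbit budget
```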
Block-based discrete cosine transform (DCT) and quantization matrices on YCbCr color channels play key roles in JPEG and have been widely used in other standards over the last three decades. In this work, we propose a new image coding method, called DCST. It adopts a data-driven color transform and spatial transform based on the statistical properties of pixels and machine learning. To match the data-driven forward transform, we propose a method to design the quantization table based on the human visual system (HVS). Furthermore, to efficiently compensate for the quantization error, a machine-learning-based optimal inverse transform is proposed. The performance of our new design is verified on the Kodak image dataset using libjpeg. Our pipeline outperforms JPEG with a gain of 0.5738 dB in BD-PSNR (or a reduction of 9.5713% in BD-rate) over the range 0.2 to 3 bpp.
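A minimal sketch of the "data-driven color transform" idea, using a per-image PCA/KLT derived from pixel statistics; this stands in for, and is not, the learned transforms described in the paper.

```python
# Minimal illustration of a data-driven color transform: derive a KLT/PCA basis
# from the image's own RGB statistics instead of using the fixed YCbCr matrix.
import numpy as np

def data_driven_color_transform(rgb):                 # rgb: (H, W, 3) array
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels - mean, rowvar=False)         # 3x3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    basis = eigvecs[:, ::-1]                          # most energetic channel first
    coeffs = (pixels - mean) @ basis                  # decorrelated channels
    return coeffs.reshape(rgb.shape), basis, mean

# Inverse: rgb = coeffs @ basis.T + mean (the basis is orthonormal).
rng = np.random.default_rng(0)
coeffs, basis, mean = data_driven_color_transform(rng.integers(0, 256, (64, 64, 3)))
```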
Software video encoders that have been developed based on the AVC, HEVC, VP9, and AV1 video coding standards have provided improved compression efficiency, but at the cost of large increases in encoding complexity. As a result, there is currently no software video encoder that provides competitive quality-cycles tradeoffs extending from the AV1 high-quality range to the AVC low-complexity range. This paper describes methods based on the dynamic optimizer (DO) approach to further improve the overall quality-cycles tradeoffs of SVT-AV1 for high-latency Video on Demand (VOD) applications. First, the performance of the SVT-AV1 encoder is evaluated using the conventional DO approach, and then using the combined DO approach, which accounts for all the encodings being considered in the selection of the encoding parameters. A fast parameter selection approach is then discussed. The latter allows for up to a 10x reduction in the complexity of the combined DO approach with minimal BD-rate loss.
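The shot-level selection at the heart of a dynamic-optimizer-style framework can be written down compactly: for a fixed Lagrange multiplier, each shot independently picks the candidate encoding minimizing D + λR, and sweeping λ traces out the sequence-level rate-distortion convex hull. The data layout and numbers below are illustrative assumptions.

```python
# Sketch of dynamic-optimizer-style shot selection: for a Lagrange multiplier
# lmbda, each shot picks the candidate encode minimizing D + lmbda * R, which
# yields one point on the sequence-level rate-distortion convex hull.
def select_encodings(shots, lmbda):
    """shots: list of candidate lists; each candidate is (rate_bits, distortion)."""
    total_rate, total_dist, picks = 0.0, 0.0, []
    for candidates in shots:
        rate, dist = min(candidates, key=lambda c: c[1] + lmbda * c[0])
        picks.append((rate, dist))
        total_rate += rate
        total_dist += dist
    return picks, total_rate, total_dist

# Sweeping lmbda from large (low rate) to small (high rate) and keeping the runs
# that meet the bitrate budget traces the per-title convex hull.
shots = [[(8e5, 40.0), (4e5, 55.0), (2e5, 80.0)],
         [(6e5, 35.0), (3e5, 50.0), (1e5, 90.0)]]
for lmbda in (1e-3, 1e-4, 1e-5):
    _, rate, dist = select_encodings(shots, lmbda)
    print(f"lambda={lmbda:g}  rate={rate:.0f} bits  distortion={dist:.1f}")
```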
This paper describes the FB-MOS metric, which measures video quality at scale in the Facebook ecosystem. As the quality of the uploaded UGC source itself varies widely, FB-MOS consists of both a no-reference component to assess input (upload) quality and a full-reference component, based on SSIM, to assess the quality preserved in the transcoding and delivery pipeline. Note that the same video may be watched on a variety of devices (mobile/laptop/TV) under varying network conditions that cause quality fluctuations; moreover, the viewer can switch between in-line view and full-screen view during the same viewing session. We show how the FB-MOS metric accounts for all this variation in viewing conditions while minimizing the computation overhead. Validation of this metric on FB content has shown an SROCC of 0.9147 using internally selected videos. The paper also discusses some of the optimizations to reduce metric computation complexity and to scale the complexity in proportion to video popularity.
With the recent development of video codecs, compression efficiency is expected to improve by at least 30% over their predecessors. Such impressive improvements come with a significant, typically orders-of-magnitude, increase in computational complexity. At the same time, objective video quality metrics that correlate increasingly well with human perception, such as SSIM and VMAF, have been gaining popularity. In this work, we build on the per-shot encoding optimization framework that can produce the optimal encoding parameters for each shot in a video, albeit itself carrying another significant computational overhead. We demonstrate that, with this framework, a faster encoder can be used to predict encoding parameters that can be directly applied to a slower encoder. Experimental results show that we can approximate the optimal convex hull to within 1%, with a significant reduction in complexity. This can significantly reduce the energy spent on optimal video transcoding.
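As a hedged sketch of the parameter-transfer idea above: compute the (rate, quality) convex hull from fast-preset encodes, then re-encode only the hull points' (resolution, QP) parameters with the slow preset. The data values and parameter tuples are illustrative assumptions.

```python
# Sketch of the fast-encoder proxy idea: build the per-shot convex hull from
# fast-preset encodes, then re-run only the winning (resolution, QP) points
# with the slow preset. Points are (rate_kbps, quality, params) tuples.
def convex_hull_points(points):
    """Upper convex hull of (rate, quality) points, rate ascending."""
    pts = sorted(points)                        # by rate, then quality
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (r1, q1, _), (r2, q2, _) = hull[-2], hull[-1]
            r3, q3, _ = p
            # drop hull[-1] if it lies below the chord hull[-2] -> p
            if (q2 - q1) * (r3 - r1) <= (q3 - q1) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

fast_encodes = [(800, 72.1, ("720p", 40)), (1500, 80.3, ("1080p", 38)),
                (1400, 76.0, ("720p", 32)), (3000, 88.5, ("1080p", 30))]
for rate, quality, params in convex_hull_points(fast_encodes):
    print("re-encode with slow preset at", params)
```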
Video quality assessment (VQA) technology has attracted a lot of attention in recent years due to the increasing demand for video streaming services. Existing VQA methods are designed to predict video quality in terms of the mean opinion score (MOS) calibrated by humans in subjective experiments. However, they cannot predict the satisfied user ratio (SUR) of an aggregated viewer group. Furthermore, they provide little guidance for video coding parameter selection, e.g., the quantization parameter (QP) of a set of consecutive frames, in practical video streaming services. To overcome these shortcomings, the just-noticeable-difference (JND) based VQA methodology has been proposed as an alternative. It has been observed experimentally that the JND location is a normally distributed random variable. In this work, we explain this distribution by proposing a user model that takes both subject variability and content variability into account. This model is built upon the user's capability to discern the quality difference between video clips encoded with different QPs. Moreover, it analyzes video content characteristics to account for inter-content variability. The proposed user model is validated on the data collected in the VideoSet. It is demonstrated that the model is flexible enough to predict the SUR distribution of a specific user group.
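Because the JND location is modeled as normally distributed, the SUR at a given QP follows directly as the survival function of that distribution. A minimal numeric sketch with illustrative (mu, sigma):

```python
# If the first JND location (in QP) is ~ Normal(mu, sigma), the satisfied user
# ratio at QP q is the probability that a viewer's JND has not yet been reached:
#     SUR(q) = P(JND > q) = 1 - Phi((q - mu) / sigma)
from scipy.stats import norm

mu, sigma = 33.0, 4.0                        # illustrative per-content JND parameters

def sur(q):
    return norm.sf(q, loc=mu, scale=sigma)   # survival function = 1 - CDF

print(f"SUR at QP 30: {sur(30):.3f}, at QP 38: {sur(38):.3f}")
```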
We present a new methodology that allows for a more objective comparison of video codecs, using the recently published Dynamic Optimizer framework. We show how this methodology is relevant primarily to non-real-time encoding for adaptive streaming applications and can be applied to any existing and future video codecs. By using VMAF, Netflix's open-source perceptual video quality metric, in the dynamic optimizer, we offer the possibility of performing visual perceptual optimization of any video codec and thus producing optimal results in terms of both PSNR and VMAF. We focus our testing on full-length titles from the Netflix catalog. We include results from practical encoder implementations of AVC, HEVC and VP9. Our results show the advantages and disadvantages of different encoders for different bitrate/quality ranges and for a variety of content.
The visual Just-Noticeable-Difference (JND) metric is characterized by the minimum difference between two visual stimuli that can be detected. Conducting a subjective JND test is a labor-intensive task. In this work, we present a novel interactive method for performing the visual JND test on compressed images/video. JND has been used to enhance perceptual visual quality in the context of image/video compression. Given a set of coding parameters, a JND test is designed to determine the distinguishable quality level against a reference image/video, which is called the anchor. The JND metric can be used to save coding bitrate by exploiting special characteristics of the human visual system. The proposed JND test is conducted using a binary forced choice, which is often adopted to discriminate differences in perception in psychophysical experiments. The assessors are asked to compare coded image/video pairs and determine whether they are of the same quality or not. A bisection procedure is designed to find the JND locations so as to reduce the required number of comparisons over a wide range of bitrates. We demonstrate the efficiency of the proposed JND test and report experimental results on image and video JND tests.
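A minimal sketch of the bisection procedure, with a stand-in forced-choice oracle in place of a human assessor; the QP range and the simulated response are illustrative assumptions.

```python
# Sketch of the bisection search for a JND point: starting from a known-
# transparent level and a known-distinguishable level, repeatedly ask the
# forced-choice question at the midpoint. `same_quality(anchor_qp, test_qp)`
# stands in for the assessor's binary forced-choice response.
def find_jnd(anchor_qp, lo_qp, hi_qp, same_quality):
    """Return the smallest QP judged distinguishable from the anchor."""
    while hi_qp - lo_qp > 1:
        mid = (lo_qp + hi_qp) // 2
        if same_quality(anchor_qp, mid):    # not yet noticeable
            lo_qp = mid
        else:                               # difference noticed
            hi_qp = mid
    return hi_qp

# Each comparison halves the interval, so a QP range of width N needs only
# about log2(N) forced-choice comparisons instead of N.
jnd = find_jnd(anchor_qp=20, lo_qp=20, hi_qp=51,
               same_quality=lambda a, q: q < 34)   # simulated assessor
print(jnd)   # -> 34
```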
We propose a coding scheme that performs vector quantization (VQ) of the wavelet transform coefficients of an image. The proposed scheme uses different vector dimensions for different wavelet subbands and also different codebook sizes so that more bits are assigned to those subbands that have more energy. Another element of the proposed method is that the vector codebooks used are obtained recursively from the image that is to be compressed. By ordering the bit-stream properly, we can maintain the embedding property, since the wavelet coefficients are ordered according to their energy. Preliminary numerical experiments are presented to demonstrate the performance of the proposed method.
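A simplified sketch of subband-adaptive VQ in the spirit of the scheme above, using PyWavelets and SciPy k-means; the per-subband block sizes and codebook sizes are illustrative, and the recursive codebook construction from the paper is not reproduced.

```python
# Simplified illustration of subband-adaptive VQ: decompose with a biorthogonal
# wavelet, then quantize each subband with its own vector dimension and codebook
# size (more bits for the more energetic, coarser subbands). Codebooks here are
# trained on the image itself with plain k-means.
import numpy as np
import pywt
from scipy.cluster.vq import kmeans2

def vq_subband(band, block=2, codewords=64):
    h = band.shape[0] - band.shape[0] % block
    w = band.shape[1] - band.shape[1] % block
    vecs = (band[:h, :w].reshape(h // block, block, w // block, block)
                        .swapaxes(1, 2).reshape(-1, block * block))
    codebook, labels = kmeans2(vecs, codewords, minit="points", seed=0)
    out = band.copy()
    out[:h, :w] = (codebook[labels].reshape(h // block, w // block, block, block)
                                   .swapaxes(1, 2).reshape(h, w))
    return out

image = np.random.rand(256, 256) * 255                    # stand-in for a real image
coeffs = pywt.wavedec2(image, "bior4.4", level=3)
approx, details = coeffs[0], coeffs[1:]
approx = vq_subband(approx, block=1, codewords=256)        # most energy: finest VQ
details = [tuple(vq_subband(b, block=2, codewords=64 >> i) for b in level)
           for i, level in enumerate(details)]             # coarser levels get more codewords
reconstruction = pywt.waverec2([approx] + details, "bior4.4")
```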
In this research, we improve Shapiro's EZW algorithm by performing vector quantization (VQ) of the wavelet transform coefficients. The proposed VQ scheme uses different vector dimensions for different wavelet subbands and also different codebook sizes, so that more bits are assigned to those subbands that have more energy. Another feature is that the vector codebooks used are tree-structured to maintain the embedding property. Finally, the energy of these vectors is used as a prediction parameter between different scales to improve performance. We investigate the performance of the proposed method together with the 7/9-tap biorthogonal wavelet basis and look into ways to incorporate lossless compression techniques.
The generalized Lloyd algorithm (GLA) plays an important role in the design of vector quantizers (VQ) for lossy data compression and in feature clustering for pattern recognition. In the VQ context, this algorithm provides a procedure to iteratively improve a codebook, resulting in a local minimum of the average distortion function. We present a set of ideas that provide the basis for accelerating the GLA, some of which are equivalent to exhaustive nearest-neighbor search and some of which may trade off performance for execution speed. More specifically, we use the maximum-distance initialization technique in conjunction with either the partial distortion method, fast tree-structured nearest-neighbor encoding, or the candidate-based constrained nearest-neighbor search. As shown by the numerical experiments, all these methods provide a significant improvement in the execution time of the GLA, in most cases together with an improvement in its performance. The improvement is on the order of 0.4 dB in MSE, 15% in entropy, and more than a factor of 100 in execution time for some of the results presented.
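A compact sketch of the GLA with two of the accelerations named above, maximum-distance initialization and the partial distortion method; the tree-structured and candidate-based searches are omitted, and the training data is synthetic.

```python
# Sketch of the generalized Lloyd algorithm with maximum-distance codebook
# initialization and partial-distortion early exit in the nearest-neighbor search.
import numpy as np

def max_distance_init(data, k):
    """Start from the mean, then repeatedly add the point farthest from the codebook."""
    codebook = [data.mean(axis=0)]
    for _ in range(k - 1):
        d = np.min(((data[:, None, :] - np.array(codebook)[None]) ** 2).sum(-1), axis=1)
        codebook.append(data[np.argmax(d)])
    return np.array(codebook)

def nearest_partial(vec, codebook):
    """Nearest-neighbor search with partial-distortion early termination."""
    best, best_dist = 0, float("inf")
    for j, cw in enumerate(codebook):
        dist = 0.0
        for a, b in zip(vec, cw):
            dist += (a - b) ** 2
            if dist >= best_dist:            # early exit: cannot beat current best
                break
        else:
            best, best_dist = j, dist
    return best

def gla(data, k, iters=10):
    codebook = max_distance_init(data, k)
    for _ in range(iters):
        labels = np.array([nearest_partial(v, codebook) for v in data])
        for j in range(k):
            members = data[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)   # centroid (Lloyd) update
    return codebook

rng = np.random.default_rng(0)
training = rng.normal(size=(2000, 4))                # 4-dimensional training vectors
codebook = gla(training, k=16)
```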
The technique of mapping an array of gray levels to some arrangement of dots such that it renders the desired gray levels is called halftoning. In this research, we present a refinement of our previously proposed digital halftoning algorithm, which achieves this goal using an approach called recursive multiscale error diffusion. Our main assumption is that the intensity resulting from a raster of dots is proportional to the number of dots on that raster. By analogy, the intensity of the corresponding region of the input image is simply the integral of the (normalized) gray level over the region. The two intensities should be matched as closely as possible. Since the area of integration plays an important role in how successfully the two intensities can be matched, and since the area of integration corresponds to different resolutions (and therefore to different viewing distances), we address the problem of matching the intensities as closely as possible at every resolution. We propose a new quality criterion for the evaluation of halftoned images, called the local intensity distribution, that stems from the same principle, i.e., how closely the average intensities of the input and output images match at different resolutions. Advantages of our method include very good performance, both in terms of visual quality and as measured by the proposed quality criterion, versatility, and ease of hardware implementation.
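The proposed criterion can be sketched as a multi-resolution comparison of block-averaged intensities between the continuous-tone input and the halftone; the block sizes below are illustrative assumptions.

```python
# Sketch of the "local intensity distribution" criterion: at several block sizes
# (i.e., viewing resolutions), compare the average intensity of the continuous-
# tone input with the dot density of the halftone over the same blocks.
import numpy as np

def block_means(img, block):
    h, w = (img.shape[0] // block) * block, (img.shape[1] // block) * block
    return img[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def local_intensity_error(gray, halftone, blocks=(2, 4, 8, 16)):
    """Mean absolute mismatch of local average intensity across resolutions."""
    gray = gray.astype(np.float64) / 255.0            # normalized gray level
    halftone = halftone.astype(np.float64)            # dots in {0, 1}
    return {b: float(np.abs(block_means(gray, b) - block_means(halftone, b)).mean())
            for b in blocks}
```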
The technique of mapping a given gray level to some arrangement of dots such that it renders the desired gray level is called halftoning. In this research, we propose a new digital halftoning algorithm that achieves this goal using an approach called recursive multiscale error diffusion. Our main assumption is that the intensity resulting from a raster of dots is proportional to the number of dots on that raster. By analogy, the intensity of the corresponding region of the input image is simply the integral of the (normalized) gray level over the region. The two intensities should be matched as closely as possible. It is shown that the area of integration plays an important role in how successfully the two intensities can be matched, and since the area of integration corresponds to different resolutions (and therefore to different viewing distances), we address the problem of matching the intensities as closely as possible at every resolution. Advantages of our method include very good performance, versatility, and ease of hardware implementation.
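A much-simplified sketch of the multiscale idea: a region's dot budget is the rounded integral of its normalized gray level, split among its quadrants in proportion to their intensities and recursed down to individual pixels. The error-diffusion refinements of the actual algorithm are omitted.

```python
# Simplified multiscale dot allocation (illustration only): the dot budget of a
# region is matched to its integrated gray level at every level of recursion.
import numpy as np

def halftone_region(gray, out, r0, r1, c0, c1, dots):
    if dots <= 0:
        return
    if r1 - r0 == 1 and c1 - c0 == 1:
        out[r0, c0] = 1                               # place a dot at pixel level
        return
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [(r0, rm, c0, cm), (r0, rm, cm, c1), (rm, r1, c0, cm), (rm, r1, cm, c1)]
    quads = [q for q in quads if q[0] < q[1] and q[2] < q[3]]
    weights = np.array([gray[a:b, c:d].sum() for a, b, c, d in quads])
    alloc = np.floor(dots * weights / max(weights.sum(), 1e-12)).astype(int)
    while alloc.sum() < dots:                         # hand leftover dots to the
        alloc[np.argmax(weights - alloc)] += 1        # most short-changed quadrant
    for q, n in zip(quads, alloc):
        n = min(n, (q[1] - q[0]) * (q[3] - q[2]))     # a pixel holds at most one dot
        halftone_region(gray, out, *q, n)

def multiscale_halftone(gray):                        # gray in [0, 1]
    out = np.zeros_like(gray, dtype=np.uint8)
    halftone_region(gray, out, 0, gray.shape[0], 0, gray.shape[1], int(round(gray.sum())))
    return out

print(multiscale_halftone(np.full((8, 8), 0.5)).mean())   # ~0.5 dot density
```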