1. INTRODUCTION

The traditional approach to comparing video codecs has focused on compression efficiency as measured with reference software encoders, without considering the complexity of the implementations. However, as the number of video coding standards and implementations grows, it has become more complicated to evaluate all of them on a level playing field, given the vastly different complexity across them. Notable examples include popular open-source software encoder implementations of the AVC,1 HEVC,2 VP9,3 and AV14 video coding standards, namely x264,5 x265,6 libvpx,7 libaom8 and SVT-AV1.9 On top of that, modern video encoder implementations typically have different features that can be turned on and off, often offered through a set of presets consisting of curated feature configurations, which provide different trade-offs between compression efficiency and computational complexity. Thus, there is a growing need to include complexity as an additional dimension in codec comparisons. This is becoming especially critical as practical implementations of video codecs are now available as open-source offerings and are more representative of production-level deployments. In 2018, the Dynamic Optimizer framework10, 11 was introduced to compare video codecs fairly across encoding resolutions and quality levels, which better reflects the most popular adaptive streaming use case. In 2019, a compression efficiency and computational complexity analysis was presented,12 comparing several well-known open-source encoder implementations, including x264, libvpx, and libaom.
The Dynamic Optimizer and software encoding cycle cost were then utilized to compare the performance of software video encoding implementations.13–15 However, as video codecs mature and hardware-based implementations become more prevalent, the existing software video codec evaluation methodology, which relies heavily on the number of CPU cycles consumed during encoding, cannot be directly applied to hardware encoders. Therefore, this work proposes a new methodology that combines the Dynamic Optimizer framework with an energy measure (watt-hours) to properly compare hardware and software codecs. In Sec. 2, we first describe the proposed methodology and how its axes are measured in principle. We then demonstrate how the methodology can be executed in practice, with detailed configurations, testing, and measurement setups, in Sec. 3. Challenges, limitations, and considerations for evaluating the large-scale adaptive streaming use case are discussed in Sec. 4. In Sec. 5 we present results on selected software encoder implementations as well as the Meta Scalable Video Processor (MSVP),16, 17 the dedicated video ASIC developed by Meta. We conclude in Sec. 6.

2. EXPLAINING THE METHODOLOGY

This methodology is focused on an offline framework that represents codec performance as seen in Video On Demand (VOD) use cases at Meta, as described in Ref. 18. A simplified view of the transcoding pipeline can be found in Fig. 1. A very similar pipeline has been reported by other VOD streaming providers in the past. More details are discussed in the upcoming section. The methodology proposed in this work aims at comparing the compression efficiency of different encoder implementations while taking into consideration the normalized power consumption of the compute component being utilized.
As such, the comparison is done across two axes: compression efficiency and energy usage, which we describe separately in the following. We first compare the energy efficiency of the core encoding component, and then expand the energy efficiency analysis to the rest of the transcoding components (decoding, scaling, quality measurements).

2.1 Compression efficiency axis

Several approaches for aggregating compression efficiency were presented in our previous work:14 the conventional convex hull approach, the combined convex hull approach, and the restricted discrete combined convex hull approach. We opt to use the third method, the restricted discrete combined convex hull approach, and create an adaptive bitrate (ABR) ladder of 8 lanes targeting specific quality levels as measured by VMAF19 and SSIM.20 A quick refresher on the construction of these lanes can be found in Fig. 2.

2.2 Energy usage axis

In recent codec comparisons, we have employed the Linux utility “/usr/bin/time -v” to record the user and system time of the encoding pipeline, while ensuring the system is fully loaded with encoding jobs. This measure has been found to be strongly correlated with energy consumption, as demonstrated previously.14 Additionally, this metric is neutral and does not rely on implementation-specific reporting. The time utility calculates the number of cycles used by a process, which essentially tracks the CPU usage of the process. This is quite applicable to software-based implementations. However, for hardware encoding, the processing is performed on specialized hardware components rather than the general-purpose CPU, so any cycles captured would not be comparable across hardware and software implementations. Note that even if we could capture the hardware cycle count, it could not be compared directly with CPUs or with other hardware, since they represent different architectures, system clock speeds, and power consumption per cycle.
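As a concrete illustration of this measurement, the user and system time reported by “/usr/bin/time -v” can be summed with a small parser such as the following (a minimal sketch; the wrapper scripts actually used in our testing are not reproduced here, and the sample values are made up):

```python
import re

def cpu_seconds(time_v_output: str) -> float:
    """Sum the user and system CPU time reported by `/usr/bin/time -v`."""
    user = re.search(r"User time \(seconds\): ([\d.]+)", time_v_output)
    system = re.search(r"System time \(seconds\): ([\d.]+)", time_v_output)
    return float(user.group(1)) + float(system.group(1))

# Excerpt of `/usr/bin/time -v` output (illustrative values only):
report = """
    User time (seconds): 1234.56
    System time (seconds): 78.90
"""
print(round(cpu_seconds(report), 2))  # 1313.46
```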
To address this, we propose to use the power consumption of each component executing the software or hardware implementation. To normalize the comparison, we multiply the power consumption by the time it takes to run a fixed set of encodings, which effectively measures the energy used. This energy measure is also commonly used for electricity bills, electronics consumption ratings, and electric vehicle electricity usage. Most importantly, energy consumption is the dominant factor in the operational-expense (OpEx) cost of a service running in Meta’s data centers.

3. EXECUTING THE METHODOLOGY

To demonstrate how the methodology can be adopted and executed, we have set up a series of tests comparing the compression and energy efficiencies of a variety of publicly available software and hardware implementations. This section outlines the testing setup used to construct the 8-lane bitrate ladder, along with how the BD-rate and power utilization are measured.

3.1 Content Selection

For this test, we start with the following pool of clips: We encode all 322 clips with 8 CRF values ranging from 23 to 63 and 5 resolutions (1080p down to 288p) using the SVT-AV1 encoder13 in preset 8, and compute the VMAF score at the original resolution. K-means clustering is performed on the produced RD (rate-distortion) points, and the clip closest to the center of each cluster is selected. We thus select a diverse set of contents without having to use the full list. As a result, we ended up selecting a total of 200 clips split across the following buckets: Looking at the resulting test set, we also notice that 30 fps content represents more than two thirds, with the rest split across 60, 25, and 24 fps. The full distribution by duration is shown in Table 1.

Table 1. Framerate Distribution of the Selected Video Test Set
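The k-means selection described in Sec. 3.1 can be sketched as follows. This is a toy k-means over per-clip feature vectors (e.g., flattened RD points); the actual feature construction, clip count, and clustering configuration used in our pipeline are not reproduced here, and the farthest-point initialization is simply a way to keep the sketch deterministic:

```python
import numpy as np

def pick_representatives(features, k, iters=20):
    """Toy k-means: cluster per-clip feature vectors, then return the
    index of the clip nearest each cluster center."""
    x = np.asarray(features, dtype=float)
    # Deterministic farthest-point initialization of the k centers.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    # Standard Lloyd iterations: assign, then recompute centers.
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    # Pick the member clip closest to each final center.
    dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    return sorted(int(np.where(labels == j)[0][dist[labels == j, j].argmin()])
                  for j in range(k) if np.any(labels == j))

# Two obvious synthetic clusters -> one representative clip from each.
demo = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
print(pick_representatives(demo, k=2))
```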
3.2 The Transcoding Pipeline

To perform transcoding on the selected video clips, we have followed our previous work.14 A quick refresher on the pipeline can be found in Fig. 3. The steps are described briefly as follows.
3.3 Selecting the Resolutions

We have reduced the resolution selection because certain hardware encoders do not support some resolutions used previously. As such, the following resolution set has been selected: We have opted to pre-downscale the content using the FFmpeg25 Lanczos filter with the alpha parameter set to 5 for all encoders, and not to utilize the scalers available in the hardware System-on-Chips (SoCs), to make sure we restrict the comparison to the encoders’ compression efficiency. A sample FFmpeg command line to perform such downscaling is shown here:

3.4 Selecting the Encoders

We have chosen a wide selection of software encoders representing all codecs, as shown in the following: Example encoding commands used with each encoder can be found in Appendix A. For hardware encoders, we have selected the Meta Scalable Video Processor (MSVP) VP9 encoder. A picture of the cards used can be seen in Fig. 4.

3.5 Selecting the Encoding Parameters

For each encoder we choose a range of CRFs that aligns the achieved quality with that of SVT-AV1. The CRFs used are shown as follows for each encoder implementation:
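As an illustration of the downscaling step in Sec. 3.3, a representative FFmpeg invocation can be constructed as sketched below. The use of the scale filter’s `param0` option to carry the Lanczos alpha parameter is our assumption about FFmpeg, not a command taken from the paper, and the file names are placeholders:

```python
def lanczos_downscale_cmd(src, dst, width, height, alpha=5):
    """Build an FFmpeg command that downscales with the Lanczos filter,
    passing the alpha parameter via the scale filter's param0 option
    (assumed flag spelling; verify against the FFmpeg docs)."""
    vf = f"scale={width}:{height}:flags=lanczos:param0={alpha}"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-an", dst]

print(" ".join(lanczos_downscale_cmd("source_1080p.y4m", "out_720p.y4m", 1280, 720)))
```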
3.6 Measuring the Compression Efficiency

To assess the compression efficiency, we employed the combined restricted dynamic optimizer method and utilized SVT-AV1 M0 as the quality reference. In this instance, we had a mix of frame rates and a slightly different set of resolutions (1920x1080, 1080x1080, and 1080x1920). We chose to run the dynamic optimizer per pixel-count bucket, combining 1920x1080 and 1080x1920 into one bucket and considering 1080x1080 separately. This is also done for each frame-rate class. We then combined the final results by adding up the file size per video across all buckets and using a weighted average for the quality to obtain the final number. For example, calculating the M0 preset results of SVT-AV1 in VMAF would involve the following steps:
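The aggregation rule described above (adding up file sizes across buckets and weighted-averaging the quality) can be sketched as follows; the per-bucket record layout and the frame-count weighting are our illustrative assumptions:

```python
def combine_buckets(buckets):
    """Each bucket record: (total_file_size_bytes, mean_quality, frame_count).
    Returns the combined file size and the frame-weighted mean quality."""
    total_size = sum(size for size, _, _ in buckets)
    total_frames = sum(frames for _, _, frames in buckets)
    avg_quality = sum(q * frames for _, q, frames in buckets) / total_frames
    return total_size, avg_quality

# Illustrative numbers: a large 16:9 bucket and a smaller 1:1 bucket.
size, vmaf = combine_buckets([(900_000_000, 92.0, 54_000),
                              (100_000_000, 88.0, 6_000)])
print(size, round(vmaf, 2))  # 1000000000 91.6
```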
The above quality vs. file size points are used as a reference against which any other encoder preset’s points are compared, measuring the BD-rate difference between the two sets of points. We then perform the same operation for each encoder implementation and each preset. An example of the VMAF RD-curves comparing the ABR ladders constructed for SVT-AV1 presets M10 to M0 can be found in Fig. 5.

3.7 Measuring and Estimating the Energy Consumption

We first make sure that the platform (either the CPU or the hardware card) is utilized as much as possible. Doing so in software pipelines is relatively straightforward using utilities such as “parallel”,28 which dispatches a desired number of parallel tasks and monitors their completion in order to launch new ones as others finish. For hardware pipelines this is more complicated, since each hardware vendor has its own mechanism to interface with the card and dispatch tasks to it, but we in general followed the best practices recommended by the vendors. Even so, there is no practical way to ensure full, 100% utilization of these hardware video transcoding cards, and more work is needed to provide a unified framework that allows users to achieve maximum resource utilization. The maximum power usage of each platform is used, and the time it takes to finish all encoding jobs is measured. We then multiply the power (e.g. 12 W for the MSVP hardware encoder) by the wall-clock time to finish the encodings (e.g. 1 hour) to get the watt-hour measure per preset (e.g. 12 Wh).

3.7.1 Power numbers utilized for energy usage

In this case, the following power measurements have been used for the different platforms:
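The BD-rate measurement referenced in Sec. 3.6 follows Bjontegaard’s method.24 A textbook sketch is shown below (cubic fit of log-rate as a function of quality, integrated over the overlapping quality interval); this is the standard formulation, not the exact script used to produce the results in this paper:

```python
import numpy as np

def bd_rate_percent(ref_rates, ref_quals, test_rates, test_quals):
    """Bjontegaard-delta rate: fit log-rate as a cubic in quality for both
    RD curves, integrate over the shared quality range, and report the
    average bitrate difference as a percentage (positive = test is worse)."""
    p_ref = np.polyfit(ref_quals, np.log(ref_rates), 3)
    p_test = np.polyfit(test_quals, np.log(test_rates), 3)
    lo = max(min(ref_quals), min(test_quals))
    hi = min(max(ref_quals), max(test_quals))
    P_ref, P_test = np.polyint(p_ref), np.polyint(p_test)
    avg = (np.polyval(P_test, hi) - np.polyval(P_test, lo)
           - np.polyval(P_ref, hi) + np.polyval(P_ref, lo)) / (hi - lo)
    return 100.0 * (np.exp(avg) - 1.0)

quals = [30.0, 35.0, 40.0, 45.0]
rates = [100.0, 180.0, 350.0, 800.0]
# A curve needing twice the bitrate at every quality -> ~+100% BD-rate.
print(round(bd_rate_percent(rates, quals, [2 * r for r in rates], quals), 1))
```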
The wall-clock measurement initially considered only the encoding time and no other part of the pipeline. We have then added the other parts of the pipeline (decoding, scaling, quality metrics) in a separate set of results.

4. CHALLENGES

Several challenges were encountered when we applied the methodology to other publicly available hardware encoders. We have opted not to present results for them, given these challenges, in order not to misrepresent the performance of these offerings. We describe the challenges in three categories: loading the system, accurate and representative power measurement, and variety of use cases.

4.1 Loading the Card

Unlike the live broadcasting use case, in a video-on-demand use case it is common to schedule a wide variety of encoding jobs with different durations and resolutions. There can be shorter-duration as well as lower-resolution encodings, both from the source and from the derived encodings. In our benchmarking of hardware encoders, it appears that some can run into bottlenecks in scheduling and pushing encoding jobs through the card. Specifically, we have seen performance discrepancies when running exclusively a longer 1080p sequence vs. several shorter sequences with mixed resolutions. Frequently starting up and shutting down encoding sessions for different video sequences can introduce non-trivial overhead, which prevents us from fully utilizing the hardware engines. Helper utilities are indeed available to monitor the utilization of different hardware blocks. However, we often need to find the optimal concurrency on each system by trial and error, trying different numbers of concurrent sessions and closely monitoring the utilization as well as the success rate manually. Different resolutions can also need different numbers of concurrent sessions to maximize utilization, and encoding failures can sometimes be observed when concurrency is pushed too high.
4.2 Accurate and Representative Power Measurement

It has been a challenge to fairly compare the power consumption of CPU and hardware platforms for video encoding, for several reasons. Most commercially available hardware encoders have components other than video encoding, and each system may contain a different mix of components with a variety of different capabilities. These components can consume power even when sitting idle. In the proposed methodology, we load the system as much as possible and take the maximum power usage of each platform as its consumption, which does still include the idle components, but there is no easy way to separate them out. Furthermore, system configuration can change the performance quite significantly. Each hardware encoder in the real world has to be hosted by a system with CPU, memory, and other components. The choice of these components, as well as the density of hardware video encoders (i.e., how many accelerator cards are connected to a single host), which can vary by platform, can affect the normalized power consumption quite noticeably. Even for a CPU platform, the choice of the rest of the system components also changes the consumption. In this benchmarking, we opted to count the entire system for the CPU platform, considering that a hardware encoder card does contain certain general-purpose processors as well as on-board memory, while on the other hand we only count the maximum power usage of the card itself. We believe that this gives a fairly good representation, given the limitations mentioned above.

4.3 Covering Use Cases

The major use cases for video encoders are primarily live and video-on-demand (VOD).
The VOD use case at Meta involves full transcoding, high-quality scaling, and quality metrics evaluated at specified viewports.23 Besides decoding and encoding, such a pipeline includes a high-quality downscaler, a high-quality upscaler, and a diverse set of quality metrics. We have found that many existing solutions provide different quality levels in the downscaler, do not necessarily include a high-quality upscaler, and often offer only a limited selection of quality metrics. Furthermore, any one of the available hardware blocks can become the bottleneck, depending on the use case. With that in mind, we opted to present the execution of the methodology with a set of encoder-only experiments, to eliminate other factors such as different filter-tap lengths in the downscaler. However, we believe that to accurately capture and evaluate large-scale adaptive streaming use cases, the full pipeline should be taken into account.

5. RESULTS

The executed results are presented in this section with graphs. Each color on a graph represents a particular codec implementation, and each point corresponds to a particular preset of that implementation. The y-axis shows the average VMAF/SSIM BD-rate of each preset compared to SVT-AV1 M0. The x-axis (in logarithmic scale) is generated by multiplying the wall-clock time required to complete all tests by the power consumption of the system: 235 W for the CPU workloads and 12 W for the MSVP card. The results in Fig. 6 show that there is indeed an advantage of hardware video encoders over state-of-the-art software implementations and systems. That is, at similar compression efficiency, hardware video encoders consume much less energy, even when comparing VP9, an earlier-generation codec, to the latest AV1. We do see, though, that the energy-savings advantage shrinks as we go to the higher-quality presets of the MSVP VP9 encoder.
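Concretely, each x-axis value is simply the measured wall-clock time multiplied by the platform power. The 235 W and 12 W figures below are the ones used in this section, while the wall-clock hours are made-up placeholders for illustration:

```python
def energy_wh(power_watts: float, wall_clock_hours: float) -> float:
    """Energy consumed to finish the fixed set of encodings (watt-hours)."""
    return power_watts * wall_clock_hours

cpu_energy = energy_wh(235, 8.0)   # hypothetical CPU run: 8 h at 235 W
msvp_energy = energy_wh(12, 8.0)   # hypothetical MSVP run: 8 h at 12 W
print(cpu_energy, msvp_energy)     # 1880.0 96.0
```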
Furthermore, as stated in the previous section, in real-world large-scale deployment the full pipeline must be accounted for. This includes the cost of the full transcoding and quality metrics pipeline: decoding, downscaling, encoding, upscaling, and quality metrics calculation. In this case, the transcoding pipeline, including upscaling and VMAF calculations computed every 16 frames, can be run on MSVP without much impact on its energy usage.17 In the software pipeline, on the other hand, the components other than encoding can consume a larger portion of the pipeline cost, especially when operating at a high-speed encoding preset. Following the same methodology described above and assuming full utilization of the CPU, we have measured the wall clock of each of the transcoding components other than encoding by taking the following steps:
Considering the full cost of the pipeline, we now see energy savings for the hardware implementation ranging from ~2x to ~5x. The results are shown in Fig. 7.

6. CONCLUSION

In this work, we present a methodology for evaluating and comparing software and hardware video encoders on different platforms on a level playing field. This extends our previous work on comparing software implementations fairly and in a way representative of large-scale production use cases. We demonstrate how this methodology can be executed in practice by employing it on various software encoders as well as a dedicated video ASIC. We also present what the results look like when the full transcoding pipeline is taken into consideration, to better match large-scale deployment use cases. The results further show that, with an optimized hardware design, a previous-generation video codec can still outperform software implementations of the latest-generation codecs on the energy vs. coding efficiency plane. We highlight how the integration of high-quality downscaling, upscaling, and state-of-the-art video quality metrics (such as SSIM and VMAF) in a hardware video transcoding card can bring significant savings over software solutions, where those additional components often account for a large portion of the complexity/power requirements. Several challenges in evaluating across different platforms have also been presented. With the exception of Meta’s MSVP transcoder, which was designed with the video-on-demand use case in mind, other hardware solutions lack a proper (and ideally unified) interface that allows for maximum loading of hardware encoders in a reliable way. We believe that the proposed methodology provides a framework to evaluate, as well as aid in developing, the next generation of video codec implementations, whether in software or hardware.

APPENDIX A. SAMPLE ENCODING COMMAND
ACKNOWLEDGMENTS

The authors would like to thank P’sao Pollard-Flamand from the Intel team for his help in developing and updating the transcoding scripts and contributing them to the open source under the AOM testing resources repository.29

REFERENCES

ITU-T Rec. H.264 and ISO/IEC 14496-10 (AVC), “Advanced video coding for generic audiovisual services,” (2003).
ITU-T Rec. H.265 and ISO/IEC 23008-2 (HEVC), “High efficiency video coding,” (2013).
Grange, A., de Rivaz, P., and Hunt, J., “VP9 bitstream and decoding process specification,” (2016).
de Rivaz, P. and Haughton, J., “AV1 bitstream & decoding process specification,” (2019).
“x264, code repository - open-source AVC encoder software,” https://github.com/mirror/x264
“x265, code repository - open-source HEVC encoder software,” https://bitbucket.org/multicoreware/x265_git
“libvpx, code repository - open-source VP9 encoder/decoder software,” https://github.com/webmproject/libvpx
“libaom, code repository - open-source AV1 encoder/decoder software,” https://aomedia.googlesource.com/aom
“SVT-AV1, code repository - Scalable Video Technology for AV1 (SVT-AV1 encoder),” https://gitlab.com/AOMediaCodec/SVT-AV1
Katsavounidis, I., “The Netflix tech blog: Dynamic optimizer — a perceptual video encoding optimization framework,” (2018). https://netflixtechblog.com/dynamic-optimizer-a-perceptual-video-encoding-optimization-framework-e19f1e3a277f
Katsavounidis, I. and Guo, L., “Video codec comparison using the dynamic optimizer framework,” in Applications of Digital Image Processing XLI (Tescher, A. G., ed.), 10752, 266–281, International Society for Optics and Photonics, SPIE (2018).
Katsavounidis, I., “Encoding vs. compute efficiency in video coding,” Video @Scale 2019, (2019). https://atscaleconference.com/events/video-scale-2019/
Kossentini, F., Guermazi, H., Mahdi, N., Nouira, C., Naghdinezhad, A., Tmar, H., Khlif, O., Worth, P., and Amara, F. B., “The SVT-AV1 encoder: overview, features and speed-quality tradeoffs,” in Applications of Digital Image Processing XLIII (Tescher, A. G. and Ebrahimi, T., eds.), 11510, 1151021, International Society for Optics and Photonics, SPIE (2020).
Wu, P.-H., Katsavounidis, I., Lei, Z., Ronca, D., Tmar, H., Abdelkafi, O., Cheung, C., Amara, F. B., and Kossentini, F., “Towards much better SVT-AV1 quality-cycles tradeoffs for VOD applications,” in Applications of Digital Image Processing XLIV (Tescher, A. G. and Ebrahimi, T., eds.), 11842, 118420T, International Society for Optics and Photonics, SPIE (2021).
Cobianchi, G., Meardi, G., Poularakis, S., Walisiewicz, A., Abdelkafi, O., Amara, F. B., Kossentini, F., Stejerean, C., and Tmar, H., “Enhancing SVT-AV1 with LCEVC to improve quality-cycles trade-offs and enhance sustainability of VOD transcoding,” in Applications of Digital Image Processing XLV (Tescher, A. G. and Ebrahimi, T., eds.), 12226, 122260S, International Society for Optics and Photonics, SPIE (2022).
Katsavounidis, I., Lalgudi, H., Lan, J., Reddy, H., and Chen, Y., “MSVP: Meta’s first ASIC for video transcoding,” https://ai.meta.com/blog/meta-scalable-video-processor-MSVP/
Reddy, H. M., Chen, Y., Lan, J., Katsavounidis, I., Anandharengan, B., Lalgudi, H. G., Alaparthi, S., Hua, G., Chuang, H.-C., Wu, P.-H., Lei, Z., Mastro, A., Petersen, C., Chaudhari, G., Prakash, P., Regunathan, S., Reddy, S., Venkatapuram, P., Rao, V., Noru, K., Bjorlin, A., Zeile, M., Lewis, A., Singh, A., Sunil, A., Chen, C.-C., Lin, C.-F., Chen, C., Sundar, D. P., Jayaraman, D., Ucar, H., Li, H., Singh, J., Liu, J. C. C., Rachamreddy, K. R., Sriadibhatla, K., Datla, K., Berg, L. V. D., Feng, L., Jampani, P., Moola, R., Mallya, R., Jha, S., Pan, S., Srinivasan, S., Vaduganathan, V., Zha, X., Wang, Z., Sengottuvel, A. K., Alluri, B., Oshin, B., Kanumetta, C., Sahin, E., Athaide, J. M., Wu, J., Kurapati, K. C., Manthati, K., Thottempudi, K., Chennamsetti, R. R., Jagannath, K. R., Arvapalli, S., Kala, T., Wang, T., Chopda, P., Gandhi, K., Ramesh, A., Gupta, R., Fadnavis, S., Qassoud, A., Friedt, C., Li, F., Gao, H., Lee, J., Dixit, M., Ugaji, S., Karuturi, T., Xie, X., Narasimha, A., Jakka, B., Dodds, B., Yang, J., Skandakumaran, K., Modi, M., Modi, P., Stejerean, C., Ronca, D., Wang, H., Pham, N., Lu, L., Shen, H., Ning, J., Narayanan, K., Chen, L., Avidan, N., Arnold, W., Xu, F., Patil, G., Balan, V., and Grandhi, S. D., Tescher, A. G. and Ebrahimi, T.,
“Efficient video processing at scale using MSVP,” Applications of Digital Image Processing XLVI, 12674, 1267414, International Society for Optics and Photonics, SPIE (2023).
“How Meta brought AV1 to Reels,” https://engineering.fb.com/2023/02/21/video-engineering/av1-codec-facebook-instagram-reels/
Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., and Manohara, M., “The Netflix tech blog: Toward a practical perceptual video quality metric,” (2016). https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing 13(4), 600–612 (2004). https://doi.org/10.1109/TIP.2003.819861
“The Consumer Digital Video Library,” https://www.cdvl.org/
“YouTube-UGC Dataset,” https://media.withyoutube.com/
Regunathan, S. L., Wang, H., Zhang, Y., Liu, Y. R., Wolstencroft, D., Reddy, S., Stejerean, C., Gandhi, S., Chen, M., Sethi, P., Puntambekar, A., Coward, M., and Katsavounidis, I., “Efficient measurement of quality at scale in Facebook video ecosystem,” in Applications of Digital Image Processing XLIII (Tescher, A. G. and Ebrahimi, T., eds.), 11510, 69–80, International Society for Optics and Photonics, SPIE (2020).
Bjontegaard, G., “Calculation of average PSNR differences between RD-curves,” ITU-T Q.6/SG16 VCEG, 13th meeting (2001).
“FFmpeg,” https://ffmpeg.org/
“EVE-VP9 encoder,” https://www.twoorioles.com/eve-vp9
“VVenC, the Fraunhofer Versatile Video Encoder - code repository,” https://github.com/fraunhoferhhi/vvenc
Tange, O., “GNU Parallel 20230822 (’Chandrayaan’),” (2023).
“Testing resources for the AOMedia codec projects,” https://gitlab.com/AOMediaCodec/aom-testing/-/tree/master/scripts/content-adaptive-streaming-pipeline-scripts