Light field video is a promising technology for delivering the six degrees of freedom required for natural content in virtual reality. Existing multi-view coding (MVC) and multi-view plus depth (MVD) formats, such as MV-HEVC and 3D-HEVC, are the most conventional light field video coding solutions since they can compress video sequences captured simultaneously from multiple camera angles. 3D-HEVC treats a single view as a video sequence and the other sub-aperture views as gray-scale disparity (depth) maps. On the other hand, MV-HEVC treats each view as a separate video sequence, which allows the use of motion-compensated algorithms similar to HEVC. While MV-HEVC and 3D-HEVC provide similar results, MV-HEVC does not require any disparity maps to be readily available, and it has a more straightforward implementation since it only uses syntax elements rather than additional prediction tools for inter-view prediction. However, there are many degrees of freedom in choosing an appropriate prediction structure, and it is still unknown which one is optimal for a given set of application requirements. In this work, various prediction structures for MV-HEVC are implemented and tested. The findings reveal the trade-off between compression gain, distortion, and random access capabilities in MV-HEVC light field video coding. The results give an overview of the best-performing solutions developed in the context of this work and of prediction structure algorithms proposed in state-of-the-art literature. This overview provides a useful benchmark for future development of light field video coding solutions.
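The temporal backbone of such prediction structures can be sketched in a few lines (a generic hierarchical-B scheme; this is our illustration only, not one of the specific MV-HEVC configurations tested in the work): each picture between two anchors predicts from the nearest coded pictures on either side, which is exactly what trades compression gain against random access depth.

```python
def hierarchical_b(lo, hi):
    """Pictures between two anchor frames lo and hi, in decoding
    order, each with its two reference pictures. Deeper recursion
    means better prediction but longer dependency chains, i.e.
    worse random access."""
    if hi - lo < 2:
        return []
    mid = (lo + hi) // 2
    return ([(mid, [lo, hi])]
            + hierarchical_b(lo, mid)
            + hierarchical_b(mid, hi))

# One GOP of size 4: picture 2 is coded first, then 1 and 3.
print(hierarchical_b(0, 4))
# [(2, [0, 4]), (1, [0, 2]), (3, [2, 4])]
```

The same recursive idea can be applied along the angular (view) dimension as well, which is where the many degrees of freedom mentioned above come from.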
Steered Mixture-of-Experts (SMoE) is a novel framework for representing multidimensional image modalities. In this paper, we propose a coding methodology for SMoE models that is readily extendable to SMoE models of any dimension, and thus to any image modality. We evaluate the coding performance for SMoE models of light field video, a 5D image modality with one temporal, two angular, and two spatial dimensions. The coding exploits the redundancy between the parameters of SMoE models, i.e., a set of multivariate Gaussian distributions. We compare the performance against three multi-view HEVC (MV-HEVC) configurations that differ in terms of random access, where each sub-aperture view of the light field video is interpreted as a single view in MV-HEVC. Experiments validate excellent coding performance compared to MV-HEVC for low to mid-range bitrates in terms of PSNR and SSIM, with bitrate savings of up to 75%.
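As a rough illustration of the representation being coded, the sketch below evaluates a tiny SMoE-style mixture at one coordinate: Gaussian kernels act as soft gates over a set of experts (constant experts here for brevity; function and variable names are ours, and real models use far more kernels and higher-dimensional coordinates):

```python
import numpy as np

def gauss(x, mu, cov):
    # Unnormalized Gaussian kernel response at coordinate x.
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def smoe_reconstruct(x, mus, covs, experts):
    """Gated mixture: normalize the kernel responses into soft
    weights, then blend the per-kernel expert predictions."""
    resp = np.array([gauss(x, m, c) for m, c in zip(mus, covs)])
    resp = resp / resp.sum()                       # soft gating
    preds = np.array([w @ x + b for (w, b) in experts])
    return float(resp @ preds)

# Toy 2D example: two kernels with constant (flat) experts.
mus = [np.array([0.25, 0.5]), np.array([0.75, 0.5])]
covs = [0.05 * np.eye(2)] * 2
experts = [(np.zeros(2), 0.2), (np.zeros(2), 0.8)]
val = smoe_reconstruct(np.array([0.25, 0.5]), mus, covs, experts)
print(val)  # close to 0.2: the nearer kernel dominates
```

Coding such a model then amounts to compressing the kernel centers, covariances, and expert parameters, which is where the inter-parameter redundancy mentioned above is exploited.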
Distributed video coding (DVC) has attracted a lot of attention during the past decade as a new solution for video compression where the computationally most intensive operations are performed by the decoder instead of by the encoder. One very important issue in many current DVC solutions is the use of a feedback channel from the decoder to the encoder for the purpose of determining the rate of the coded stream. The use of such a feedback channel is not only impractical in storage applications but even in streaming scenarios feedback-channel usage may result in intolerable delays due to the typically large number of requests for decoding one frame. Instead of reverting to a feedback-free solution by adding complexity to the encoder for performing encoder-side rate estimation, as an alternative, in previous work we proposed to incorporate constraints on feedback channel usage. To cope better with rate fluctuations caused by changing motion characteristics, in this paper we propose a refined approach exploiting information available from already decoded frames at other temporal layers. The results indicate significant improvements for all test sequences (using a GOP of length four).
Disparity estimation is a highly complex and time consuming process in multiview video encoders. Since multiple views taken from a two-dimensional camera array need to be coded at every time instance, the complexity of the encoder plays an important role besides its rate-distortion performance. In previous papers we have introduced a new frame type called the D (derived) frame that exploits the strong geometrical correspondence between views, thereby reducing the complexity of the encoder. By employing the D frames instead of some of the P frames in the prediction structure, significant complexity gain can be achieved if the threshold value, which is a keystone element to adjust the complexity at the cost of quality and/or bit-rate, is selected wisely. A new adaptive method to calculate the threshold value automatically from existing information during the encoding process is presented. In this method, the threshold values are generated for each block of each D frame to increase the accuracy. The algorithm is applied to several image sets and 20.6% complexity gain is achieved using the automatically generated threshold values without compromising quality or bit-rate.
To achieve the high coding efficiency the H.264/AVC standard offers, the encoding process quickly becomes computationally demanding. One of the most intensive encoding phases is motion estimation. Even modern CPUs struggle to process high-definition video sequences in real-time. While personal computers are typically equipped with powerful Graphics Processing Units (GPUs) to accelerate graphics operations, these GPUs lie dormant when encoding a video sequence. Furthermore, recent developments show more and more computer configurations come with multiple GPUs. However, no existing GPU-enabled motion estimation architectures target multiple GPUs. In addition, these architectures provide no early-out behavior nor can they enforce a specific processing order. We developed a motion search architecture, capable of executing motion estimation and partitioning for an H.264/AVC sequence entirely on the GPU using the NVIDIA CUDA (Compute Unified Device Architecture) platform. This paper describes our architecture and presents a novel job scheduling system we designed, making it possible to control the GPU in a flexible way. This job scheduling system can enforce real-time demands of the video encoder by prioritizing calculations and providing an early-out mode. Furthermore, the job scheduling system allows the use of multiple GPUs in one computer system and efficient load balancing of the motion search over these GPUs. This paper focuses on the execution speed of the novel job scheduling system on both single and multi-GPU systems. Initial results show that real-time full motion search of 720p high-definition content is possible with a 32 by 32 search window running on a system with four GPUs.
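The scheduling idea can be sketched independently of CUDA (plain Python below; the class and field names are ours, and the real architecture operates on GPU command queues rather than a heap):

```python
import heapq

class JobScheduler:
    """Sketch of prioritized motion-search jobs dispatched to the
    least-loaded GPU, with an early-out work budget."""

    def __init__(self, num_gpus):
        self.queue = []               # (priority, job_id) min-heap
        self.load = [0] * num_gpus    # pending work units per GPU

    def submit(self, job_id, priority):
        heapq.heappush(self.queue, (priority, job_id))

    def dispatch(self, budget):
        """Run jobs in priority order until the budget is spent
        (early-out); return (job_id, gpu) assignments."""
        done = []
        while self.queue and budget > 0:
            _, job = heapq.heappop(self.queue)
            gpu = self.load.index(min(self.load))  # least loaded
            self.load[gpu] += 1
            done.append((job, gpu))
            budget -= 1
        return done

sched = JobScheduler(num_gpus=4)
for i, prio in enumerate([3, 1, 2, 0]):
    sched.submit(f"mb_job_{i}", prio)
print(sched.dispatch(budget=3))  # three highest-priority jobs run
```

The early-out budget is what lets the encoder meet real-time deadlines: low-priority searches are simply dropped when time runs out.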
Background subtraction is a method commonly used to segment objects of interest in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, a mixture of several models has been proposed in the literature. We propose a novel background subtraction technique derived from the popular mixture of Gaussian models technique (MGM). We discard the Gaussian assumptions and use models consisting of an average and an upper and lower threshold. Additionally, we include a maximum difference with the previous value and an intensity allowance to cope with gradual lighting changes and photon noise, respectively. Moreover, edge-based image segmentation is introduced to improve the results of the proposed technique. This combination of temporal and spatial information results in a robust object detection technique that deals with several difficult situations. Experimental analysis shows that our system is more robust than MGM and more recent techniques, resulting in fewer false positives and false negatives. Finally, a comparison of processing speed shows that our system can process frames up to 50% faster.
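A per-pixel version of the model can be sketched as follows (the threshold values and the exact way the checks combine are illustrative assumptions, not the paper's tuned parameters):

```python
def classify_pixel(value, model, prev_value, max_diff=20, allowance=5):
    """Non-Gaussian background model: a running average with upper
    and lower thresholds, widened by an intensity allowance for
    gradual lighting changes and photon noise, plus a cap on the
    frame-to-frame change."""
    lo = model["avg"] - model["thr_low"] - allowance
    hi = model["avg"] + model["thr_high"] + allowance
    in_range = lo <= value <= hi
    small_change = abs(value - prev_value) <= max_diff
    return "background" if (in_range and small_change) else "foreground"

model = {"avg": 100.0, "thr_low": 10.0, "thr_high": 10.0}
print(classify_pixel(103, model, prev_value=101))  # background
print(classify_pixel(160, model, prev_value=101))  # foreground
```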
Nowadays, most video material is coded using a non-scalable format. When transmitting these single-layer video bitstreams, there may be a problem for connection links with limited capacity. In order to solve this problem, requantization transcoding is often used. The requantization transcoder applies coarser quantization in order
to reduce the amount of residual information in the compressed video bitstream. In this paper, we extend a requantization transcoder for H.264/AVC video bitstreams with a rate-control algorithm. A simple algorithm is proposed which limits the computational complexity. The bit allocation is based on the bit distribution in the original video bitstream. Using the bit budget and a linear model between rate and quantizer, the new quantizer is calculated. The target bit rate is attained with an average deviation lower than 6%, while the rate-distortion performance shows small improvements over transcoding without rate control.
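The quantizer computation can be sketched under a simple rate model (the exact linear model used in the paper may differ; the constant-product form below is a common assumption, and the function name is ours):

```python
def new_quantizer(orig_bits, orig_qstep, target_bits,
                  q_min=0.625, q_max=224.0):
    """Assume rate x quantizer step is roughly constant, solve for
    the step that meets the bit budget, and clip the result to a
    valid quantizer-step range."""
    model_const = orig_bits * orig_qstep       # X = R * Qstep
    qstep = model_const / target_bits
    return min(max(qstep, q_min), q_max)

# Frame originally coded with 120 kbit at Qstep 10, budget 80 kbit:
print(new_quantizer(120_000, 10.0, 80_000))    # 15.0
```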
Distributed video coding is a new video coding paradigm that shifts the computationally intensive motion estimation
from encoder to decoder. This results in a lightweight encoder and a complex decoder, as opposed
to the predictive video coding scheme (e.g., MPEG-X and H.26X) with a complex encoder and a lightweight
decoder. Neither scheme, however, can adapt to varying complexity constraints imposed by
encoder and decoder, which is an essential ability for applications targeting a wide range of devices with different
complexity constraints or applications with temporary variable complexity constraints. Moreover, the effect of
complexity adaptation on the overall compression performance is of great importance and has not yet been investigated.
To address this need, we have developed a video coding system with the possibility to adapt itself to
complexity constraints by dynamically sharing the motion estimation computations between both components.
On this system we have studied the effect of the complexity distribution on the compression performance.
This paper describes how motion estimation can be shared using a heuristic for dynamic complexity distribution and how
distribution of complexity affects the overall compression performance of the system. The results show that the
complexity can indeed be shared between encoder and decoder in an efficient way at acceptable rate-distortion performance.
Detection and segmentation of objects of interest in image sequences is the first major processing step in visual
surveillance applications. The outcome is used for further processing, such as object tracking, interpretation,
and classification of objects and their trajectories. To speed up the algorithms for moving object detection,
many applications use techniques such as frame rate reduction. However, temporal consistency is an important
feature in the analysis of surveillance video, especially for tracking objects. Another technique is the downscaling
of the images before analysis, after which the images are up-sampled to regain the original size. This method,
however, increases the effect of false detections. We propose a different pre-processing step in which we use a
checkerboard-like mask to decide which pixels to process. For each frame the mask is inverted, ensuring that no
pixel position is permanently excluded from analysis. In a post-processing step we use spatial interpolation to predict the
detection results for the pixels which were not analyzed. To evaluate our system we have combined it with a
background subtraction technique based on a mixture of Gaussian models. Results show that the models do not
get corrupted by using our mask and we can reduce the processing time by over 45% while achieving similar
detection results as the conventional technique.
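The pre- and post-processing steps can be sketched as follows (the 4-neighbour majority vote stands in for the spatial interpolation; function names are ours):

```python
import numpy as np

def checkerboard_mask(h, w, frame_idx):
    """Process only half the pixels each frame; inverting the
    checkerboard every frame ensures no position is skipped twice
    in a row."""
    yy, xx = np.mgrid[0:h, 0:w]
    return (yy + xx + frame_idx) % 2 == 0

def fill_skipped(detection, mask):
    """Predict skipped pixels from their 4 analyzed neighbours
    (simple majority vote as a stand-in for interpolation)."""
    padded = np.pad(detection.astype(int), 1)
    votes = (padded[:-2, 1:-1] + padded[2:, 1:-1]
             + padded[1:-1, :-2] + padded[1:-1, 2:])
    out = detection.copy()
    out[~mask] = votes[~mask] >= 2
    return out

m0 = checkerboard_mask(4, 4, frame_idx=0)
m1 = checkerboard_mask(4, 4, frame_idx=1)
print(np.logical_xor(m0, m1).all())  # True: the masks alternate
```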
In this paper, a novel compressed-domain motion detection technique, operating on MPEG-2-encoded video, is
combined with H.264 flexible macroblock ordering (FMO) to achieve efficient, error-resilient MPEG-2 to H.264
transcoding. The proposed motion detection technique first extracts the motion information from the MPEG-2-encoded
bit-stream. Starting from this information, moving regions are detected using a region growing approach. The
macroblocks in these moving regions are subsequently encoded separately from those in background regions using FMO.
This can be used to increase error resilience and/or to realize additional bit-rate savings compared to traditional transcoding.
With all the hype created around multimedia in the last few years, consumers expect to be able to access multimedia content in a real-time manner, anywhere and anytime. One problem with the real-time requirement is that transport networks, such as the Internet, are still prone to errors. Due to real-time constraints, retransmission of lost data is usually not an option. Therefore, the study of error resilience and error concealment techniques is of the utmost importance, since they can seriously limit the impact of a transmission error. In this paper, an evaluation of flexible macroblock ordering (FMO), one of the new error resilience techniques in H.264/AVC, is made by analyzing its costs and gains in an error-prone environment. More specifically, a study of scattered slices, FMO type 1, is made. Our analysis shows that FMO type 1 is a good tool to introduce error robustness into an H.264/AVC bitstream as long as the QP is higher than 30. When the QP of the bitstream is below 30, the cost of FMO type 1 becomes a serious burden.
H.264/AVC is the newest block-based video coding standard from MPEG and VCEG. It not only provides superior and efficient video coding at various bit rates, it also has a "network-friendly" representation thanks to a series of new techniques that provide error robustness. Flexible macroblock ordering (FMO) is one of the new error resilience tools included in H.264/AVC. Here, we present an alternative use of FMO, building on its idea of combining non-neighboring macroblocks in one slice. Instead of creating a scattered pattern, which is useful when transmitting data over an error-prone network, we divide the picture into a number of regions of interest and one remaining region of disinterest. It is assumed that viewers pay much more attention to the regions of interest than to the remainder of the video, so we compress the regions of interest at a higher bit rate than the region of disinterest, thus lowering the overall bit rate. Simulations show that the overhead introduced by using rectangular regions of interest is minimal, while the bit rate can be reduced by 30% or more in most cases. Even at those reductions the video remains pleasant to watch. Transcoders can use this information as well, by reducing only the quality of the region of disinterest instead of the quality of the entire picture when applying SNR scalability. In extreme cases the region of disinterest can even be dropped entirely, reducing the overall bit rate even further.
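The slice-group assignment behind this idea can be sketched as a macroblock map (in the spirit of FMO type 2, which uses rectangular foreground groups plus one leftover group; the helper below is our illustration, not encoder code):

```python
def fmo_type2_map(mb_cols, mb_rows, roi_rects):
    """Each rectangle (left, top, right, bottom, in macroblock
    units) becomes its own slice group; all remaining macroblocks
    fall into the final 'region of disinterest' group."""
    leftover = len(roi_rects)                  # background group id
    mb_map = [[leftover] * mb_cols for _ in range(mb_rows)]
    for group, (l, t, r, b) in enumerate(roi_rects):
        for y in range(t, b + 1):
            for x in range(l, r + 1):
                mb_map[y][x] = group
    return mb_map

# One 2x2-macroblock ROI in a 5x4-macroblock picture:
for row in fmo_type2_map(5, 4, [(1, 1, 2, 2)]):
    print(row)  # group 0 = ROI, group 1 = region of disinterest
```

An encoder can then code slice group 0 with a lower QP than the leftover group, which is exactly the bit-rate trade-off described above.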
H.264/AVC is a new specification for digital video coding that aims at deployment in many multimedia applications, such as video conferencing, digital television broadcasting, and internet streaming. This is reflected, for instance, in the design goals of the standard: the provision of an efficient compression scheme and a network-friendly representation of the compressed data. Those requirements have resulted in a very flexible syntax and architecture that is fundamentally different from previous standards for video compression. In this paper, a detailed discussion will be provided on how to apply an extended version of the MPEG-21 Bitstream Syntax Description Language (MPEG-21 BSDL) to the Annex B syntax of the H.264/AVC specification. This XML-based language facilitates the high-level manipulation of an H.264/AVC bitstream in
order to take into account the constraints and requirements of a particular usage environment. Our performance measurements and optimizations show that it is possible to make use of MPEG-21 BSDL in the context of the current H.264/AVC standard with a feasible computational complexity when exploiting temporal scalability.
H.264/AVC is a video codec developed by the Joint Video Team (JVT), a cooperation between the ITU-T VCEG (Video Coding Experts Group) and ISO/IEC MPEG (Moving Picture Experts Group). This new video coding standard offers features that enable significant improvements in coding efficiency. This improved coding efficiency comes at the cost of an overall more complex algorithm with high demands on memory usage and processing power. Complexity, however, is an abstract concept and cannot be measured in a simple manner.
In this paper we present a method to obtain an accurate and more in-depth view of the internals of the JVT/AVC decoder. By decoding several bit streams encoded with different parameters, various program characteristics were measured. Principal component analysis was then performed on these measurements to obtain a different perspective on the data. Our results show that the various encoding parameters have a clear impact on the low-level behavior of the decoder. Moreover, our methodology allows us to explain the observed dissimilarities.
Video coding is used under the hood of a lot of multimedia applications, such as video conferencing, digital storage media, television broadcasting, and internet streaming. Recently, new standards-based and proprietary technologies have emerged. An interesting problem is how to evaluate these different video coding solutions in terms of delivered quality.
In this paper, a PSNR-based approach is applied in order to compare the coding potential of H.264/AVC AHM 2.0 with the compression efficiency of XviD 0.9.1, DivX 5.05, Windows Media Video 9, and MC-EZBC. Our results show that MPEG-4-based tools, and in particular H.264/AVC, can keep pace with proprietary solutions. The rate-distortion performance of MC-EZBC, a wavelet-based video codec, looks very promising too.
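For reference, the PSNR metric used in such comparisons is straightforward to compute per frame (luma-only sketch, 8-bit peak value assumed):

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame
    and its decoded counterpart."""
    mse = np.mean((ref.astype(float) - dist.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
dist = ref.copy()
dist[0, 0] = 110                     # one pixel off by 10
print(round(psnr(ref, dist), 2))     # roughly 46 dB
```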
The increasing diversity of the characteristics of the terminals and networks that are used to access multimedia content through the internet introduces new challenges for the distribution of multimedia data. Scalable video coding will be one of the elementary solutions in this domain. This type of coding makes it possible to adapt an encoded
video sequence to the limitations of the network or the receiving device by means of very basic operations. Algorithms for creating fully scalable video streams, in which multiple types of scalability are offered at the same time, are becoming mature. On the other hand, research on applications that use such bitstreams is only recently
emerging. In this paper, we introduce a mathematical model for describing such bitstreams. In addition, we show how applications that use scalable bitstreams can be modeled by means of definitions built on top of this model. In particular, we chose to describe a multicast protocol targeted at scalable bitstreams. This way, we demonstrate that it is possible to define an abstract model for scalable bitstreams that can be used as a tool for reasoning about such bitstreams and related applications.
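A toy version of such a model can make the idea concrete (the attribute names and the three scalability axes below are our simplification; the actual model in the paper is more general):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    """One addressable unit of a scalable bitstream, indexed by
    its temporal, spatial, and quality coordinates."""
    temporal: int
    spatial: int
    quality: int

def extract(bitstream, max_t, max_s, max_q):
    """Bitstream adaptation as a pure set operation: keep only the
    atoms within the requested scalability limits."""
    return {a for a in bitstream
            if a.temporal <= max_t and a.spatial <= max_s
            and a.quality <= max_q}

full = {Atom(t, s, q) for t in range(3)
        for s in range(2) for q in range(2)}
sub = extract(full, max_t=1, max_s=0, max_q=1)
print(len(full), len(sub))           # 12 4
```

Reasoning about applications such as multicast then reduces to reasoning about which of these subsets each receiver requests.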