We present a dual-mode imaging system operating at visible and long-wave infrared wavelengths for noncontact, nonobtrusive measurement of breathing rate and pattern, regardless of whether subjects breathe through the nose and mouth simultaneously, alternately, or individually. The improved classifiers, combined with biological characteristics, outperformed custom cascade classifiers based on the Viola–Jones algorithm for cross-spectrum detection of the face, nose, and mouth. In terms of breathing rate estimation, the results obtained by this system were consistent with those measured by the reference method, as verified by a Bland–Altman plot with 95% limits of agreement from −2.998 to 2.391 and by linear correlation analysis with a correlation coefficient of 0.971, indicating that the method is acceptable for quantitative analysis of breathing. In addition, the breathing waveforms extracted by the dual-mode imaging system closely matched the corresponding standard breathing sequences. Since the validation experiments were conducted under challenging conditions, such as significant positional changes and abrupt physiological variations, we conclude that this dual-mode imaging system, which exploits the respective advantages of RGB and thermal cameras, is a promising breathing measurement tool for residential care and clinical applications.
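As a rough illustration of the validation statistics reported above, the sketch below computes Bland–Altman 95% limits of agreement and the Pearson correlation coefficient for paired breathing-rate measurements; the function and array names are hypothetical and no study data are reproduced.

```python
import numpy as np

def bland_altman_limits(estimates, reference):
    """Bland-Altman 95% limits of agreement and Pearson correlation for
    paired breathing rates (breaths/min) from the imaging system and the
    reference method. Inputs are hypothetical paired 1-D arrays."""
    estimates = np.asarray(estimates, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = estimates - reference
    bias = diff.mean()
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # 95% limits of agreement
    r = np.corrcoef(estimates, reference)[0, 1]  # linear correlation coefficient
    return bias, loa, r
```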
KEYWORDS: Video, Denoising, Neural networks, Associative arrays, Motion models, Video processing, Video acceleration, Visualization, Visual process modeling, 3D modeling
Video denoising can be described as the problem of mapping a fixed-length sequence of noisy frames to a clean frame. We propose a deep architecture based on recurrent neural networks (RNNs) for video denoising. The model learns a patch-based end-to-end mapping between noisy and clean video sequences: it takes corrupted video sequences as input and outputs clean ones. Our deep network, which we refer to as a deep recurrent neural network (deep RNN or DRNN), stacks RNN layers such that each layer receives the hidden state of the previous layer as input. Experiments show that (i) the recurrent architecture exploits the temporal domain to extract motion information, which benefits video denoising; (ii) the deep architecture has sufficient capacity to express the mapping from corrupted input videos to clean output videos; and (iii) the model generalizes to learn different mappings from videos corrupted by different types of noise (e.g., Poisson–Gaussian noise). By training on large video databases, we are able to compete with some existing video denoising methods.
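A minimal sketch of such a stacked (deep) RNN in PyTorch is shown below; the patch size, hidden width, and layer count are illustrative assumptions rather than the paper's settings.

```python
import torch.nn as nn

class DeepRNNDenoiser(nn.Module):
    """Stacked RNN that maps a sequence of vectorized noisy patches to
    denoised patches; each RNN layer feeds its hidden states to the next."""

    def __init__(self, patch_dim=8 * 8, hidden_dim=256, num_layers=3):
        super().__init__()
        self.rnn = nn.RNN(patch_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, nonlinearity='tanh')
        self.readout = nn.Linear(hidden_dim, patch_dim)

    def forward(self, noisy_patches):
        # noisy_patches: (batch, num_frames, patch_dim)
        hidden_seq, _ = self.rnn(noisy_patches)
        return self.readout(hidden_seq)  # one denoised patch per frame
```

Training would minimize, e.g., a mean-squared error between the output and the corresponding clean patches.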
Event detection for video surveillance is a difficult task due to many challenges: cluttered background, illumination variations, scale variations, occlusions among people, etc. We propose an effective and efficient event detection scheme for such complex situations. Moving shadows due to illumination are tackled with a segmentation method incorporating shadow detection, and scale variations are handled by the CamShift-guided particle filter tracking algorithm. For event modeling, hidden Markov models are employed. The proposed scheme also reduces the overall computational cost by combining two human detection algorithms and using tracking information to aid human detection. Experimental results on the TRECVid event detection evaluation demonstrate the efficacy of the proposed scheme. It is robust, especially to moving shadows and scale variations. Employing the scheme, we achieved the best run results for two events in the TRECVid benchmarking evaluation.
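For the event-modeling step, a hedged sketch based on hidden Markov models is given below; the hmmlearn library, the per-track feature layout, and the event names are assumptions for illustration, not the scheme's actual implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed library choice, not the paper's

# Hypothetical per-frame feature vectors for one tracked person
# (e.g., position, velocity, bounding-box scale).
track_features = np.random.randn(120, 4)

# One HMM per event class; the class with the highest log-likelihood wins.
event_models = {"PersonRuns": GaussianHMM(n_components=3, n_iter=50),
                "Pointing":   GaussianHMM(n_components=3, n_iter=50)}
for model in event_models.values():
    model.fit(track_features)  # in practice, fit each model on labelled training tracks

scores = {name: m.score(track_features) for name, m in event_models.items()}
detected_event = max(scores, key=scores.get)
```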
In this paper, we propose a wavelet-leader-pyramid-based visual information fidelity method for image quality assessment. Motivated by the observations that the human visual system (HVS) is more sensitive to edge and contour regions and that human visual sensitivity varies with spatial frequency, we first introduce two-dimensional wavelet leader pyramids to robustly extract multiscale edge information. Based on the wavelet leader pyramids, we further propose a visual information fidelity metric that evaluates image quality by quantifying the information loss between the original and distorted images. Experimental results show that our method outperforms many state-of-the-art image quality metrics.
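A rough sketch of computing scale-wise leaders of 2-D wavelet detail coefficients with PyWavelets is shown below; the 3x3 neighbourhood, wavelet choice, and the omission of cross-scale maximisation are simplifying assumptions, not the paper's construction.

```python
import numpy as np
import pywt
from scipy.ndimage import maximum_filter

def wavelet_leaders(image, levels=3, wavelet='db2'):
    """At each level, take the per-position maximum of the absolute detail
    coefficients (over the three orientations) within a 3x3 neighbourhood.
    A full wavelet leader also maximises over all finer scales, which this
    simplified sketch omits."""
    coeffs = pywt.wavedec2(np.asarray(image, dtype=float), wavelet, level=levels)
    leaders = []
    for details in coeffs[1:]:                       # (cH, cV, cD) per level
        stacked = np.max(np.abs(np.stack(details)), axis=0)
        leaders.append(maximum_filter(stacked, size=3))
    return leaders
```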
In this paper, we propose a framework for image classification. An image is represented by multiple feature channels, which are computed by the bag-of-words model and organized in a spatial pyramid. The main difference among feature channels lies in the type of base descriptor extracted for the bag-of-words model. The overall features achieve different levels of the trade-off between discriminative power and invariance. Support vector machines with kernels based on the histogram intersection distance and the χ2 distance are used to obtain a posteriori probabilities of the image in each feature channel. Then, four data fusion strategies are proposed to combine intermediate results from multiple feature channels. Experimental results show that almost all the proposed strategies can significantly improve the classification accuracy compared with single-cue methods and that, in particular, prod-max performs best in all experiments. The framework appears to be general and capable of handling diverse classification problems due to the multiple-feature-channel-based representation. It is also demonstrated that the proposed method achieves higher, or comparable, classification accuracies with less computational cost compared with other multiple-cue methods on challenging benchmark datasets.
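The sketch below illustrates one channel's χ2-kernel SVM and a product-then-argmax ("prod-max") fusion of per-channel posteriors using scikit-learn; function names and data layout are assumptions, and the paper's exact fusion rules may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def fit_channel_svm(train_hists, labels):
    """Precomputed chi-square kernel SVM for one feature channel.
    `train_hists` are hypothetical non-negative bag-of-words histograms."""
    K = chi2_kernel(train_hists, train_hists)
    return SVC(kernel='precomputed', probability=True).fit(K, labels)

def prod_max_fusion(channel_probs):
    """Fuse per-channel posterior probabilities by taking their product per
    class, then predict the class with the largest product."""
    fused = np.prod(np.stack(channel_probs, axis=0), axis=0)  # (n_samples, n_classes)
    return fused.argmax(axis=1)
```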
Image interpolation addresses the problem of obtaining high-resolution (HR) images from their low-resolution (LR) counterparts. For observed LR images with aliasing artifacts caused by undersampling, commonly used interpolation methods cannot recover HR images well and may introduce additional artifacts. In this paper, based on the observation that natural images normally contain redundant similar patches, a new patch-synthesis-based interpolation method is proposed. In the proposed method, an inference method based on Markov chains is adopted to select the best patches from the input LR image and synthesize them into the undersampled areas of the desired HR image. To improve the efficiency of the algorithm, we also introduce fields of experts to model sparse prior knowledge and use it to measure the compatibilities among neighboring patches. Experimental results compared with traditional interpolation methods demonstrate that our method can not only alleviate aliasing artifacts but also produce better results in terms of quantitative evaluation and subjective visual quality.
Video transmission over error-prone channels often suffers from inevitable transmission errors, which necessitates proper error concealment (EC) for acceptable image quality. Furthermore, the region of interest (ROI) in images usually draws much attention, so the EC of the ROI receives special treatment during encoding and decoding. We explore a data-hiding-based scheme to effectively improve the EC of the ROI in the case of erasures of large contiguous regions, for which conventional EC methods are impractical. At the encoder side, motion vectors of the ROI are adaptively embedded in the background based on the original quantized coefficients of background macroblocks. Considering the limited embedding capacity of the background, we further propose to assign a priority to each ROI macroblock based on a predefined metric of error propagation. Our scheme is applied to the state-of-the-art H.264/AVC standard in a packet-loss scenario, and better video quality can be obtained. Experimental results show that the scheme can significantly improve the EC of the ROI without much loss of coding efficiency.
Colorization is the process of adding colors to monochrome images. In this paper we present a scribble-based colorization method that uses nearest-neighborhood propagation and mixed weighting. The major assumption of our method is that chrominance is similar where luminance is similar in a natural image. We introduce the definitions of the nearest neighborhood and of mixed weighting; a non-local-means-based patch weight and a point weight are used in the mixed weighting. Experimental results show the benefits of our method.
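One plausible form of such a mixed weight is sketched below as a convex combination of a non-local-means patch term and a centre-pixel (point) term; the combination rule and the parameters alpha, h_patch, h_point are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def mixed_weight(patch_p, patch_q, alpha=0.5, h_patch=10.0, h_point=10.0):
    """Mixed weight between pixels p and q, combining a non-local-means
    patch similarity with a point (centre-pixel luminance) similarity."""
    patch_term = np.exp(-np.sum((patch_p - patch_q) ** 2) / (h_patch ** 2))
    centre_p = patch_p[patch_p.shape[0] // 2, patch_p.shape[1] // 2]
    centre_q = patch_q[patch_q.shape[0] // 2, patch_q.shape[1] // 2]
    point_term = np.exp(-(centre_p - centre_q) ** 2 / (h_point ** 2))
    return alpha * patch_term + (1.0 - alpha) * point_term
```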
KEYWORDS: Magnesium, Scalable video coding, Signal to noise ratio, Video, Binary data, Distortion, Forward error correction, Chemical elements, Standards development, Algorithm development
We address the problem of unequal error protection (UEP) for SNR enhancement layer network abstraction layer (NAL) units of the scalable video coding extension of the H.264/AVC standard over a wireless packet-erasure channel. We develop a UEP scheme by jointly selecting SNR NAL units and allocating unequal amounts of protection to the selected NAL units for every group of pictures in the sequence. A simple heuristic algorithm is proposed to quickly derive the protection pattern. Experimental results demonstrate that the proposed UEP scheme provides significant error resilience.
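Purely as an illustration of greedy protection allocation, the sketch below repeatedly grants one more parity packet to the NAL unit with the best importance-to-protection ratio; the `importance` field and the selection rule are hypothetical, and the paper's actual heuristic may differ.

```python
def allocate_parity(nal_units, parity_budget):
    """Greedy allocation sketch: `nal_units` is a list of dicts with keys
    'id' and 'importance' (a hypothetical distortion-reduction value)."""
    parity = {u['id']: 0 for u in nal_units}
    for _ in range(parity_budget):
        best = max(nal_units, key=lambda u: u['importance'] / (parity[u['id']] + 1))
        parity[best['id']] += 1
    return parity
```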
In general, inevitable side effects of block-based transform coding include grid noise in monotone areas, staircase noise along edges, ringing around strong edges, corner outliers, and edge corruption near block boundaries. We propose a comprehensive postprocessing method for removing all the blocking-related artifacts in block-based discrete cosine transform compressed images in the framework of the overcomplete wavelet expansion (OWE) proposed by Mallat and Zhong [IEEE Trans. Pattern Anal. Mach. Intell. 14(7), 710–732 (1992)], which is translationally invariant and can efficiently characterize signal singularities. We propose to use the wavelet transform modulus maxima extension (WTMME) to represent the image. The WTMME is extracted from the wavelet coefficients of a three-level OWE of the blocky image. The artifacts related to blockiness are modeled and detected through multiscale edge analysis of the image using the information of both modulus and angle. Both the WTMME and the angle image are reconstructed accordingly using inter-/intraband correlation to suppress the influence of the distortions. Finally, the inverse OWE transform is performed to obtain the processed image. Because the algorithm makes no assumption that the blockiness occurs at block boundaries, it is also applicable to video, where, due to motion estimation and compensation, the grid noise may propagate into blocks. Extensive simulations and a comparative study with 21 existing relevant algorithms have demonstrated the effectiveness of the proposed algorithm in terms of the subjective and objective quality of the resultant images.
An encoded video bitstream is composed of two main components: the coefficient bits representing the discrete cosine transform coefficients, and the header bits representing the header information (e.g., motion vectors, prediction modes, etc.). Compared with previous video standards, the H.264 Advanced Video Coding (AVC) standard has some unique features: (1) the header bits take up a considerable portion of the encoded bitstream; (2) the header bits vary significantly across macroblocks (MBs); and (3) a large number of MBs are quantized to zero and produce zero coefficient bits (zero-coefficient MBs). These unique features make most existing rate estimators inaccurate for decision-making processes related to rate-distortion calculation for rate control. This paper analyzes the characteristics of the H.264/AVC bitstream and reveals that both the header bits and the occurrence of zero-coefficient MBs are strongly related to the motion-compensated residues obtained by INTER16×16. Therefore, two statistical models are proposed for estimating the header bits and separating the zero-coefficient MBs. Based on the proposed models, a rate control scheme is developed for buffer-constrained constant-bit-rate video coding. Experimental results show that the resultant scheme achieves an average of 0.53 dB peak signal-to-noise ratio (PSNR) gain over the original JM6.1e and less than 2% bit-rate inaccuracy.
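As a simple illustration of relating header bits to the INTER16×16 residue, the sketch below fits an affine model by least squares; the data are made up and the paper's actual statistical models may take a different form.

```python
import numpy as np

# Hypothetical per-MB data: SAD of the INTER16x16 motion-compensated residue
# and the header bits actually produced by the encoder for that MB.
sad = np.array([120., 340., 80., 510., 60., 450., 200.])
header_bits = np.array([14., 28., 11., 37., 9., 33., 19.])

# Simple affine model header_bits ~ a*sad + b, fitted by least squares.
a, b = np.polyfit(sad, header_bits, 1)
predicted_header_bits = a * sad + b
```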
Almost all existing color demosaicking algorithms for digital cameras are designed on the assumption of high correlation between the red, green, and blue (or other primary color) bands. They exploit spectral correlations between the primary color bands to interpolate the missing color samples. The interpolation errors increase in areas of no or weak spectral correlation. Consequently, objectionable artifacts tend to occur on highly saturated colors and in the presence of large sensor noise, whenever the assumption of high spectral correlation does not hold. This paper proposes a remedy to the above problem, which has long been overlooked in the literature. The main contribution of this work is a technique for correcting the interpolation errors of any existing color demosaicking algorithm by piecewise autoregressive modeling.
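The following sketch shows one way a piecewise (windowed) autoregressive model could be fitted and used to re-predict a demosaicked sample; the window size, 4-neighbour model, and least-squares fit are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def local_ar_correct(window):
    """Fit a 4-neighbour AR model by least squares over a local window of an
    already-demosaicked channel, then re-predict the centre sample."""
    h, w = window.shape
    design, targets = [], []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            design.append([window[i - 1, j], window[i + 1, j],
                           window[i, j - 1], window[i, j + 1]])
            targets.append(window[i, j])
    coeffs, *_ = np.linalg.lstsq(np.array(design), np.array(targets), rcond=None)
    ci, cj = h // 2, w // 2
    neighbours = np.array([window[ci - 1, cj], window[ci + 1, cj],
                           window[ci, cj - 1], window[ci, cj + 1]])
    return float(neighbours @ coeffs)  # AR-corrected estimate of the centre pixel
```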
The Gaussian mixture model (GMM) is an important method for moving object segmentation and is well suited to handling gradual illumination changes and the repetitive motion of scene elements. However, the performance of the GMM may be plagued by the complex motion of dynamic backgrounds, such as waving trees and fluttering flags. A spatiotemporal Gaussian mixture model (STGMM) is proposed to handle the complex motion of the background by considering every background pixel to fluctuate both in intensity and within its neighboring region. A new matching rule is defined to incorporate the spatial information. Experimental results on typical scenes show that STGMM can segment moving objects correctly in complex scenes. Quantitative evaluations demonstrate that the proposed STGMM performs better than the GMM.
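For reference, a plain per-pixel GMM baseline can be run with OpenCV as sketched below; STGMM, as described above, additionally matches each pixel against the mixtures of its spatial neighbours, which this baseline does not do. The video filename is hypothetical.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
cap = cv2.VideoCapture('scene.avi')  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    foreground_mask = subtractor.apply(frame)  # 255 = moving object, 127 = shadow
cap.release()
```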
Recent studies show that wavelet-based image fusion methods provide high spectral quality in fused satellite images. However, images fused by most wavelet-based methods have lower spatial resolution because critical downsampling is involved in the wavelet transform. We propose a useful fusion method based on the contourlet transform and the local average gradient (LAG) for multispectral and panchromatic satellite images. The contourlet transform represents edges and texture better than the wavelet transform. Because edges and texture are fundamental to image representation, enhancing them is an effective means of enhancing spatial resolution. Based on the LAG, the proposed fusion method further reduces the spectral distortion of the fused image. Experimental results show that the proposed fusion method is able to increase the spatial resolution and reduce the spectral distortion of the fused image at the same time.
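A minimal sketch of a local average gradient computation is given below; the Sobel operator and window size are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def local_average_gradient(img, win=3):
    """Mean gradient magnitude over a small window around each pixel."""
    img = np.asarray(img, dtype=float)
    gx = sobel(img, axis=1)
    gy = sobel(img, axis=0)
    grad_mag = np.hypot(gx, gy)
    return uniform_filter(grad_mag, size=win)
```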
A critical problem of maximum a posteriori (MAP) super-resolution (SR) image reconstruction algorithms is the choice of an appropriate prior model. Instead of modeling the original image directly, this work proposes an edge-image-based approach, built on the Lorentzian distribution, for stable SR reconstruction. Through analyzing the convexity and derivative properties of the Lorentzian distribution, we demonstrate the validity and stability of the proposed method for MAP SR reconstruction. The Lorentzian width parameter is calculated iteratively to control the overall sharpness of the image during the SR reconstruction process. Experiments confirm the effectiveness and robustness of the proposed method and show that both the objective and subjective quality of the reconstructed SR images is significantly better than that of conventional methods.
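For reference, the Lorentzian robust penalty and its derivative (influence function), as commonly defined in robust statistics, are sketched below; `width` plays the role of the width parameter that is updated iteratively, though the paper's exact parameterization may differ.

```python
import numpy as np

def lorentzian_penalty(x, width):
    """rho(x) = log(1 + x^2 / (2*width^2)) and its derivative psi(x)."""
    rho = np.log1p(0.5 * (x / width) ** 2)
    psi = x / (width ** 2 + 0.5 * x ** 2)   # d(rho)/dx
    return rho, psi
```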
We present a novel postprocessing algorithm for blocking artifact removal in the discrete Hadamard transform (DHT) domain, which does not require prior knowledge of quantization parameters and features low computational complexity. All block-based video coding methods suffer from annoying blocking artifacts at low bit rates. We first acquire edge information for the frame, then calculate block activities from the DHT coefficients so as to classify smooth and coarse regions adaptively. Blocking artifacts are adaptively filtered in the DHT domain according to the block activities. Experimental results show that the proposed method is able to remove blocking artifacts effectively while preserving image details and object edges well, and that it achieves better visual quality than other methods.
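The sketch below computes a DHT-domain block activity for an 8×8 block; measuring activity as the sum of absolute AC coefficients is one simple choice, not necessarily the paper's exact definition.

```python
import numpy as np
from scipy.linalg import hadamard

def block_activity(block):
    """Activity of an 8x8 block from its 2-D discrete Hadamard transform."""
    H = hadamard(8) / np.sqrt(8.0)   # orthonormal Hadamard basis
    coeffs = H @ block @ H.T         # 2-D DHT of the block
    return np.abs(coeffs).sum() - np.abs(coeffs[0, 0])  # sum of |AC| coefficients
```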
We present a perceptually adaptive in-band preprocessing scheme for 3-D wavelet video coding. In our scheme, after the original video is decomposed by a 2-D spatial wavelet transform, a preprocessor is incorporated to remove some visually insignificant (noise-like) wavelet coefficients before the motion-compensated temporal filtering of each spatial subband. The preprocessing is guided by a wavelet-domain just-noticeable-distortion profile, which locally adapts to the spatial wavelet transform coefficients. Experimental results show that the proposed scheme can efficiently enhance the visual quality of the coded video at the same objective quality across different bit rates.
We propose an image contrast enhancement algorithm using the multi-scale edge representation of images. It has long been known that the human visual system (HVS) relies heavily on edges in the understanding and perception of scenes. Contrast in a grayscale image is measured by the differences between pixels on either side of edges, i.e., by the gradient magnitudes of those edges, and the multi-scale edges of an image are characterized by the local extrema of its wavelet coefficients across levels. Rebuilding an image from properly stretched extrema is therefore a promising way to enhance its contrast. We tackle this reconstruction problem with a straightforward interpolation method instead of the commonly used iterative projection process. Extensive experiments show that our algorithm is an efficient and effective contrast enhancement method.
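As a simplified stand-in for multi-scale edge stretching, the sketch below amplifies wavelet detail coefficients by a fixed gain and reconstructs; the paper instead stretches local wavelet extrema and reconstructs by interpolation, which this uniform-gain sketch only approximates.

```python
import numpy as np
import pywt

def enhance_contrast(image, gain=1.5, levels=3, wavelet='db2'):
    """Boost detail (edge) coefficients at every level, then reconstruct."""
    coeffs = pywt.wavedec2(np.asarray(image, dtype=float), wavelet, level=levels)
    boosted = [coeffs[0]] + [tuple(gain * d for d in details)
                             for details in coeffs[1:]]
    enhanced = pywt.waverec2(boosted, wavelet)
    return np.clip(enhanced, 0, 255)
```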
The work presented in this paper includes two parts. First, we measured the detectability and annoyance of the effect of frame dropping on perceived visual quality under different motion and frame-size conditions. Then, a new logistic function and an effective yet simple motion content representation are selected to model, in one formula, the relationship among motion, frame rate, and the negative impact of frame dropping on visual quality. The high Pearson and Spearman correlations between the MOS and the predicted MOSp, as well as the results of two other error metrics, confirm the suitability of the selected logistic function and motion content representation.
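A minimal sketch of fitting a logistic quality-versus-framerate curve with SciPy is shown below; the 3-parameter logistic form and the MOS values are illustrative assumptions and do not include the paper's motion-content term.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(framerate, a, b, c):
    """Generic 3-parameter logistic curve."""
    return a / (1.0 + np.exp(-b * (framerate - c)))

# Hypothetical MOS measurements at several frame rates for one motion level.
framerates = np.array([5., 10., 15., 25., 30.])
mos = np.array([2.1, 3.0, 3.8, 4.4, 4.5])

params, _ = curve_fit(logistic, framerates, mos, p0=[4.5, 0.3, 10.0])
mos_predicted = logistic(framerates, *params)
```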
In this paper, we propose a new video quality evaluation method based on multiple features and a radial basis function neural network. Multiple features are extracted from a degraded image sequence and its reference sequence, including error energy, activity-masking and luminance-masking features, as well as blockiness and blurring features. Based on these factors, we apply a radial basis function neural network as a classifier to produce quality assessment scores. After training on the subjective mean opinion score (MOS) data of the VQEG test sequences, the neural network model can be used to evaluate video quality with good correlation performance in terms of accuracy and consistency measurements.
In this paper, a just-noticeable-distortion (JND) profile based on the human visual system (HVS) is exploited to guide the motion search and to introduce an adaptive filter for the residue error after motion compensation in hybrid video coding (e.g., H.26x and MPEG-x). Because accurate JND estimation is important, a new spatial-domain JND estimator (the nonlinear additivity model for masking, NAMM for short) is first proposed. The obtained JND profile is then utilized to determine the extent of the motion search and whether a residue error after motion compensation needs to be cosine-transformed. Both theoretical analysis and experimental data indicate significant improvement in motion search speedup, perceptual visual quality, and, most remarkably, objective quality (i.e., PSNR).
This paper presents a new and general concept, PQSM (perceptual quality significance map), to be used in measuring visual distortion. It makes use of the selectivity characteristic of the HVS (human visual system), which pays more attention to certain areas/regions of a visual signal due to one or more of the following factors: salient features in the image/video, cues from domain knowledge, and association with other media (e.g., speech or audio). The PQSM is an array whose elements represent the relative perceptual-quality significance levels of the corresponding areas/regions of an image or video. Due to its generality, the PQSM can be incorporated into any visual distortion metric: to improve the effectiveness and/or efficiency of perceptual metrics, or even to enhance a PSNR-based metric. A three-stage PQSM estimation method is also proposed in this paper, with an implementation based on motion, texture, luminance, skin-color, and face mapping. Experimental results show that the scheme can improve the performance of current image/video distortion metrics.
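One simple way to fold a PQSM into a PSNR-style metric is to weight the squared errors by the per-pixel significance levels, as sketched below; the exact way the paper combines the PQSM with existing metrics may differ.

```python
import numpy as np

def pqsm_weighted_psnr(reference, distorted, pqsm):
    """PSNR computed from a PQSM-weighted mean squared error.
    `pqsm` is a non-negative per-pixel significance map (same shape as the frames)."""
    err = (reference.astype(float) - distorted.astype(float)) ** 2
    weighted_mse = np.sum(pqsm * err) / np.sum(pqsm)
    return 10.0 * np.log10(255.0 ** 2 / weighted_mse)
```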