With the advent of deep convolutional neural networks (DCNNs), visual saliency prediction has improved markedly. Nevertheless, multi-scale saliency-influential factors still need to be fully incorporated into current deep saliency frameworks for further improvement, and existing approaches to capturing multi-scale contextual features suffer from either heavy computation or limited performance gain. A lightweight yet powerful module for fully exploiting multi-scale contextual features is therefore desired. In this paper, we propose a DCNN-based visual saliency prediction model to approach this goal. Our model is inspired by GoogLeNet, which uses the inception module to capture multi-scale contextual features over various receptive fields. Specifically, we revise the original inception module, replacing its standard convolutions with dilated ones, to obtain stronger multi-scale feature extraction at a lower computational cost. The whole model is trained end-to-end and is efficient enough to run in real time. Experimental results on several challenging saliency benchmark datasets, including SALICON, MIT1003, and MIT300, demonstrate that the proposed model achieves state-of-the-art performance with competitive inference time.
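A minimal sketch of how such an inception-style block with dilated convolutions might look in PyTorch; the branch layout, channel counts, and dilation rates are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of an inception-style block that swaps standard
# convolutions for dilated ones; channel counts and dilation rates are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    def __init__(self, in_ch, branch_ch=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Each branch uses the same 3x3 kernel but a different dilation rate,
        # so the receptive field grows without extra parameters or downsampling.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        # Concatenate the multi-scale responses along the channel axis.
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: fuse multi-scale context on a 256-channel backbone feature map.
feats = torch.randn(1, 256, 60, 80)
out = DilatedInceptionBlock(in_ch=256)(feats)  # shape: (1, 4 * 64, 60, 80)
```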
In this paper, we introduce a visual pattern degradation based full-reference (FR) image quality assessment (IQA) method. Research on visual recognition indicates that the human visual system (HVS) is highly adaptive in extracting visual structures for scene understanding. Existing structure-degradation-based IQA methods mainly use local luminance contrast to represent structure and measure quality as the degradation of luminance contrast. In this paper, we suggest that structure includes not only luminance contrast but also orientation information, and we therefore analyze the orientation characteristics of structure. Inspired by the orientation selectivity mechanism in the primary visual cortex, we introduce a novel visual pattern to represent the structure of a local region. Quality is then measured as the degradation of both luminance contrast and the visual pattern. Experimental results on five benchmark databases demonstrate that the proposed visual pattern effectively represents visual structure and that the proposed IQA method outperforms existing IQA metrics.
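A generic sketch of the underlying idea, under my own assumptions: measure degradation in both local luminance contrast and local orientation, then pool the two similarity maps. The similarity formula and pooling below are placeholders, not the paper's exact metric.

```python
# Generic FR-IQA sketch: similarity of local luminance contrast and of local
# gradient orientation, pooled into one score. Formulas are assumptions.
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def _contrast(img, size=8):
    # Local standard deviation as a simple luminance-contrast measure.
    mean = uniform_filter(img, size)
    sq_mean = uniform_filter(img * img, size)
    return np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))

def _orientation(img):
    # Local gradient orientation from Sobel responses.
    return np.arctan2(sobel(img, axis=0), sobel(img, axis=1))

def pattern_iqa(ref, dist, c1=1e-3, c2=1e-3):
    ref, dist = ref.astype(np.float64), dist.astype(np.float64)
    # Luminance-contrast similarity (SSIM-style ratio with a stabilizer).
    cr, cd = _contrast(ref), _contrast(dist)
    contrast_sim = (2 * cr * cd + c1) / (cr * cr + cd * cd + c1)
    # Orientation similarity: cosine of the local orientation difference.
    orient_sim = (np.cos(_orientation(ref) - _orientation(dist)) + 1 + c2) / (2 + c2)
    return float(np.mean(contrast_sim * orient_sim))
```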
Local structure, e.g., the local binary pattern (LBP), is widely used in texture classification. However, LBP is too sensitive to disturbance. In this paper, we introduce a novel structure descriptor for texture classification. Research in cognitive neuroscience indicates that the primary visual cortex exhibits remarkable orientation selectivity for visual information extraction. Inspired by this, we investigate the orientation similarities among neighboring pixels and propose an orientation-selectivity-based pattern for local structure description. Experimental results on texture classification demonstrate that the proposed structure descriptor is quite robust to disturbance.
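A minimal sketch, under my own assumptions, of an orientation-based local pattern: instead of thresholding raw intensities as LBP does, each of the 8 neighbors contributes a bit saying whether its gradient orientation is close to the center pixel's. The angular threshold and neighborhood are illustrative choices, not the paper's definition.

```python
# Orientation-similarity local pattern (LBP-like), illustrative only.
import numpy as np
from scipy.ndimage import sobel

def orientation_pattern(img, angle_thresh=np.pi / 8):
    img = img.astype(np.float64)
    theta = np.arctan2(sobel(img, axis=0), sobel(img, axis=1))
    h, w = theta.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = theta[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = theta[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Wrap the orientation difference to [-pi, pi] before thresholding.
        diff = np.abs(np.angle(np.exp(1j * (neighbor - center))))
        codes |= (diff < angle_thresh).astype(np.uint8) << bit
    # A histogram of the 256 codes serves as the texture descriptor.
    return np.bincount(codes.ravel(), minlength=256)
```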
In general, inevitable side effects of block-based transform coding include grid noise in monotone areas, staircase noise along edges, ringing around strong edges, corner outliers, and edge corruption near block boundaries. We propose a comprehensive postprocessing method for removing all blocking-related artifacts in block-based discrete cosine transform compressed images within the framework of the overcomplete wavelet expansion (OWE) proposed by Mallat and Zhong [IEEE Trans. Pattern Anal. Mach. Intell. 14(7), 710–732 (1992)], which is translation invariant and can efficiently characterize signal singularities. We propose to represent the image by the wavelet transform modulus maxima extension (WTMME), extracted from the wavelet coefficients of a three-level OWE of the blocky image. The blockiness-related artifacts are modeled and detected through multiscale edge analysis of the image using both modulus and angle information. Both the WTMME and the angle image are then reconstructed using inter-/intraband correlation to suppress the influence of the distortions. Finally, the inverse OWE is performed to obtain the processed image. Because the algorithm makes no assumption that blockiness occurs at block boundaries, it is also applicable to video, where, due to motion estimation and compensation, grid noise may propagate into blocks. Extensive simulations and a comparative study with 21 existing relevant algorithms demonstrate the effectiveness of the proposed algorithm in terms of the subjective and objective quality of the resulting images.
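A high-level skeleton, built on my own assumptions rather than the paper's exact algorithm, of a deblocking pass in a shift-invariant wavelet domain: decompose with a stationary (undecimated) wavelet transform, compute modulus and angle of the detail bands, attenuate coefficients flagged as block-grid artifacts, and reconstruct. The artifact detector below is a hypothetical stand-in for the modulus/angle analysis described in the abstract.

```python
# Deblocking skeleton in a translation-invariant wavelet domain (illustrative).
import numpy as np
import pywt

def _artifact_mask(modulus, angle, thresh_ratio=0.1):
    # Hypothetical stand-in for the modulus/angle analysis: keep strong edges,
    # damp weak, near-axis-aligned responses typical of the 8x8 block grid.
    strong = modulus > thresh_ratio * modulus.max()
    axis_aligned = np.abs(np.sin(2 * angle)) < 0.2   # near 0 or 90 degrees
    return np.where(strong | ~axis_aligned, 1.0, 0.3)

def deblock_swt(img, wavelet="db2", level=3):
    # Image sides must be a multiple of 2**level for pywt.swt2.
    img = img.astype(np.float64)
    coeffs = pywt.swt2(img, wavelet, level=level)    # undecimated, shift-invariant
    cleaned = []
    for cA, (cH, cV, cD) in coeffs:
        modulus = np.hypot(cH, cV)                   # edge strength at this scale
        angle = np.arctan2(cV, cH)                   # edge orientation at this scale
        mask = _artifact_mask(modulus, angle)
        cleaned.append((cA, (cH * mask, cV * mask, cD * mask)))
    return pywt.iswt2(cleaned, wavelet)
```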
KEYWORDS: Video, Visualization, Cameras, Optical filters, Motion estimation, Video processing, Motion models, Visual process modeling, Video coding, Mobile devices
The visual saliency map represents the most attractive regions in a video. Automatic saliency map determination is important in mobile video applications such as autofocusing during video capture. It is well known that motion plays a critical role in visual attention modeling. Motion in video consists of the camera's motion and the foreground target's motion; in determining the visual saliency map, we are concerned with the latter. To achieve this, we estimate the camera/global motion and then separate the moving target from the background. Specifically, we propose a three-step procedure for visual saliency map computation: 1) motion vector (MV) field filtering, 2) background extraction, and 3) contrast map computation. In the first step, the mean value of the MV field is treated as the camera's motion, so the MVs of the background can be detected and eliminated and the saliency map can be roughly determined. In the second step, we further remove noisy image blocks in the background to refine the saliency map. In the third step, a contrast map is computed and integrated with the result of foreground extraction. All computations required by the proposed algorithm have low complexity, so it can be used on mobile devices. The accuracy and robustness of the proposed algorithm are supported by experimental results.
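A compact sketch of the three-step procedure described above, with my own choices for the thresholds, the block cleanup, and the contrast measure (the abstract does not specify these).

```python
# Three-step motion saliency sketch: MV filtering, background cleanup, contrast.
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def motion_saliency(mv_field, luma_blocks, mv_thresh=1.0):
    # mv_field: (H_blocks, W_blocks, 2) motion vectors per block.
    # luma_blocks: (H_blocks, W_blocks) mean luma per block.
    # Step 1: treat the mean MV as camera motion and remove it.
    residual = mv_field - mv_field.mean(axis=(0, 1), keepdims=True)
    motion_mag = np.linalg.norm(residual, axis=-1)
    foreground = motion_mag > mv_thresh
    # Step 2: clean isolated noisy blocks with a small median filter.
    foreground = median_filter(foreground.astype(np.uint8), size=3).astype(bool)
    # Step 3: block-wise luminance contrast, fused with the foreground mask.
    luma = luma_blocks.astype(np.float64)
    contrast = np.abs(luma - uniform_filter(luma, size=5))
    contrast /= contrast.max() + 1e-8
    return foreground * (0.5 + 0.5 * contrast)
```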
The work presented in this paper consists of two parts. First, we measured the detectability and annoyance of frame dropping's effect on perceived visual quality under different motion and frame-size conditions. Then, a new logistic function and an effective yet simple motion content representation were selected to model, in a single formula, the relationship among motion, frame rate, and the negative impact of frame dropping on visual quality. The high Pearson and Spearman correlations between the MOS and the predicted MOSp, as well as the results of the two other error metrics, confirm the success of the selected logistic function and motion content representation.
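The paper's exact logistic form and motion representation are not reproduced here; the following is only a generic sketch of fitting a logistic quality-vs-frame-rate model with a motion-dependent shift to MOS data, using synthetic toy numbers purely for illustration.

```python
# Generic logistic MOS model fit (illustrative form and synthetic data only).
import numpy as np
from scipy.optimize import curve_fit

def quality_model(x, a, b, c):
    framerate, motion = x
    # Quality drops logistically as frame rate falls; motion shifts the knee.
    return a / (1.0 + np.exp(-b * (framerate - c * motion)))

# Synthetic toy inputs: two motion levels, five frame rates each.
framerates = np.array([30, 15, 10, 7.5, 5, 30, 15, 10, 7.5, 5], dtype=float)
motion = np.array([0.2] * 5 + [0.8] * 5)
mos = np.array([4.5, 4.2, 3.8, 3.2, 2.5, 4.4, 3.6, 2.9, 2.3, 1.8])

params, _ = curve_fit(quality_model, (framerates, motion), mos,
                      p0=(5.0, 0.3, 5.0), maxfev=5000)
mos_p = quality_model((framerates, motion), *params)
pearson = np.corrcoef(mos, mos_p)[0, 1]   # correlation between MOS and MOSp
```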
This paper presents a new and general concept, the PQSM (Perceptual Quality Significance Map), for use in measuring visual distortion. It exploits the selectivity characteristic of the HVS (Human Visual System), which pays more attention to certain areas/regions of a visual signal due to one or more of the following factors: salient features in the image/video, cues from domain knowledge, and association with other media (e.g., speech or audio). The PQSM is an array whose elements represent the relative perceptual-quality significance levels of the corresponding areas/regions in images or video. Due to its generality, the PQSM can be incorporated into any visual distortion metric: to improve the effectiveness and/or efficiency of perceptual metrics, or even to enhance a PSNR-based metric. A three-stage PQSM estimation method is also proposed in this paper, with an implementation based on motion, texture, luminance, skin-color, and face mapping. Experimental results show that the scheme can improve the performance of current image/video distortion metrics.
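A minimal sketch, under my own fusion assumptions, of how a PQSM could modulate a PSNR-style metric: per-pixel squared error is weighted by the significance map before pooling. The individual feature maps and their weights are placeholders, not the paper's three-stage estimator.

```python
# PQSM fusion and PQSM-weighted PSNR (illustrative weights and fusion rule).
import numpy as np

def fuse_pqsm(motion_map, texture_map, luminance_map, skin_face_map,
              weights=(0.4, 0.2, 0.2, 0.2)):
    maps = [motion_map, texture_map, luminance_map, skin_face_map]
    pqsm = sum(w * m for w, m in zip(weights, maps))
    # Normalize to unit mean so the weighting does not bias the pooled error.
    return pqsm / (pqsm.mean() + 1e-8)

def pqsm_weighted_psnr(ref, dist, pqsm, peak=255.0):
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.mean(pqsm * err)          # significance-weighted MSE
    return 10.0 * np.log10(peak ** 2 / (wmse + 1e-12))
```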
In this paper, we propose a new video quality evaluation method based on multiple features and a radial basis function (RBF) neural network. The features are extracted from a degraded image sequence and its reference sequence, and include error energy, activity masking, and luminance masking, as well as blockiness and blurring features. Based on these features, we apply an RBF neural network as a classifier to produce quality assessment scores. After training with the subjective mean opinion score (MOS) data of the VQEG test sequences, the neural network model can be used to evaluate video quality with good correlation performance in terms of accuracy and consistency.
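A generic RBF-network sketch (centers from k-means, Gaussian hidden units, linear output layer fit by ridge regression); the feature-vector layout and training details are assumptions, not the paper's setup.

```python
# Generic RBF network for mapping feature vectors to quality scores.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import rbf_kernel

class RBFQualityRegressor:
    def __init__(self, n_centers=20, gamma=1.0, alpha=1e-3):
        self.n_centers, self.gamma, self.alpha = n_centers, gamma, alpha

    def fit(self, features, mos):
        # features: (n_sequences, n_features), e.g. [error energy, activity
        # masking, luminance masking, blockiness, blurring]; mos: subjective scores.
        self.centers_ = KMeans(n_clusters=self.n_centers, n_init=10,
                               random_state=0).fit(features).cluster_centers_
        hidden = rbf_kernel(features, self.centers_, gamma=self.gamma)
        self.out_ = Ridge(alpha=self.alpha).fit(hidden, mos)
        return self

    def predict(self, features):
        hidden = rbf_kernel(features, self.centers_, gamma=self.gamma)
        return self.out_.predict(hidden)
```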
In this paper, the just-noticeable distortion (JND) profile based on the human visual system (HVS) is exploited to guide motion search and to introduce an adaptive filter for the residue error after motion compensation in hybrid video coding (e.g., H.26x and MPEG-x). Because accurate JND estimation is important, a new spatial-domain JND estimator (the nonlinear additivity model for masking, NAMM for short) is first proposed. The obtained JND profile is then utilized to determine the extent of the motion search and whether a residue error after motion compensation needs to be cosine-transformed. Both theoretical analysis and experimental data indicate significant improvements in motion search speedup, perceptual visual quality, and, most remarkably, objective quality (i.e., PSNR).
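A simplified sketch of the JND-guided residue decision: a residue block whose samples all fall below the local JND threshold is perceptually negligible, so its transform and quantization can be skipped. The JND map itself (the paper's NAMM estimator) is taken as a given input here; the block-skip rule is my own illustrative simplification.

```python
# JND-guided residue filtering: zero blocks that are below visibility threshold.
import numpy as np

def filter_residue_blocks(residue, jnd_map, block=8):
    out = residue.astype(np.float64).copy()
    h, w = residue.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            r = residue[y:y + block, x:x + block]
            j = jnd_map[y:y + block, x:x + block]
            if np.all(np.abs(r) <= j):
                # Below the visibility threshold everywhere: zero the block so
                # no DCT/quantization bits are spent on it.
                out[y:y + block, x:x + block] = 0.0
    return out
```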