Pedestrian detection is a problem of particular interest in both academia and industry. However, most existing pedestrian detection methods fail to detect small-scale pedestrians, owing to their weak contrast and motion blur in images and videos. In this paper, we propose a multi-level feature fusion strategy for detecting multi-scale pedestrians, which works particularly well for small-scale pedestrians that are relatively far from the camera. The fusion strategy makes the shallow feature maps encode more semantic and global information, which is what small-scale pedestrian detection needs. In addition, we redesign the aspect ratio of the anchors to make them more robust for the pedestrian detection task. Extensive experiments on both the Caltech and CityPersons datasets demonstrate that our method outperforms state-of-the-art pedestrian detection algorithms. Our approach achieves an MR⁻² of 0.84%, 23.91% and 62.19% under the "Near", "Medium" and "Far" settings respectively on the Caltech dataset, and also offers a better speed-accuracy trade-off, at 0.28 seconds per 1024×2048 image, than competing methods on the CityPersons dataset.
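As an illustration of the fusion idea, here is a minimal PyTorch sketch of one common way to inject a deep, semantically rich feature map into a shallow, high-resolution one. The layer names, channel counts, and fusion-by-addition choice are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse a deep (semantic) feature map into a shallow (high-resolution) one.

    Channel counts are illustrative; the paper's exact design may differ."""
    def __init__(self, shallow_ch=256, deep_ch=512, out_ch=256):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)  # match channels
        self.smooth = nn.Conv2d(shallow_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        # Upsample the deep map to the shallow map's spatial size, then add,
        # so the shallow map inherits semantic and global context.
        deep_up = F.interpolate(self.reduce(deep), size=shallow.shape[-2:],
                                mode='bilinear', align_corners=False)
        return self.smooth(shallow + deep_up)
```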
In the field of image processing, obtaining a complete foreground that is not uniform in color or texture has been a challenging task. Unlike other methods, which segment the image using only low-level features, we present a segmentation framework in which high-level visual features, such as semantic information, are used. First, initial semantic labels are obtained using a nonparametric method. Then, a subset of the training images whose foregrounds are similar to that of the input image is selected, so that the semantic labels can be further refined according to this subset. Finally, the input image is segmented by integrating the object affinity with the refined semantic labels. Experiments on the challenging MSRC-21 dataset show state-of-the-art performance.
Object tracking is a challenging research task due to target appearance variation caused by deformation and occlusion. Keypoint-matching-based trackers can handle partial occlusion, but they are vulnerable to matching faults and inflexible to target deformation. In this paper, we propose a novel keypoint matching procedure to address these issues. First, the scale and orientation of corresponding keypoints are used to estimate the target's state. Second, a kernel function is employed to discard mismatched keypoints and thereby improve the estimation accuracy. Third, a model updating mechanism is applied to adapt to target deformation. Moreover, to avoid bad updates, backward matching is used to decide whether or not to update the target model. Extensive experiments on challenging image sequences show that our method performs favorably against state-of-the-art methods.
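For illustration, a minimal sketch of estimating the target's scale and rotation change from matched keypoints, using the keypoints' own size and angle attributes (as carried by OpenCV's cv2.KeyPoint) with medians for robustness to residual mismatches; the paper's exact estimator may differ.

```python
import numpy as np

def estimate_scale_rotation(kps_prev, kps_curr):
    """Estimate scale and rotation change of the target from matched keypoint
    pairs; each element is a cv2.KeyPoint with .size and .angle fields."""
    scales = [c.size / p.size for p, c in zip(kps_prev, kps_curr)]
    rotations = [c.angle - p.angle for p, c in zip(kps_prev, kps_curr)]
    # Medians tolerate a fraction of remaining mismatched pairs.
    return float(np.median(scales)), float(np.median(rotations))
```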
A performance metric for infrared and visible image fusion is proposed based on Weber's law. To characterize the stimulus of the source images, two Weber components are used: differential excitation, which reflects the spectral signal of visible and infrared images, and orientation, which captures the structural features of the scene. By comparing the corresponding Weber components of the infrared and visible images, each source pixel can be labeled as dominant in either intensity or structure. Pixels sharing the same dominance label are grouped, and the mutual information (MI) between the dominant source image and the fused image is calculated on the corresponding Weber components for each group. The final fusion metric is then obtained by weighting the group-wise MI values according to the number of pixels in each group. Experimental results demonstrate that the proposed metric performs well on popular image fusion cases and outperforms other image fusion metrics.
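The two Weber components can be sketched as follows, using the standard WLD-style formulation (differential excitation from 8-neighborhood intensity differences, orientation from the image gradient); the paper's exact variant may differ.

```python
import numpy as np
from scipy import ndimage

def weber_components(img):
    """Compute WLD-style Weber components: differential excitation
    (intensity stimulus) and orientation (structure)."""
    img = img.astype(np.float64) + 1e-6          # avoid division by zero
    # Sum of differences between each pixel and its 8 neighbors.
    kernel = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=np.float64)
    diff_sum = ndimage.convolve(img, kernel, mode='nearest')
    excitation = np.arctan(diff_sum / img)       # differential excitation
    gy = ndimage.sobel(img, axis=0)
    gx = ndimage.sobel(img, axis=1)
    orientation = np.arctan2(gy, gx)             # gradient orientation
    return excitation, orientation
```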
Action recognition is a very challenging task in the field of real-time video surveillance. Traditional action recognition models are built on spatio-temporal features and Bag-of-Features representations. Building on this model, recent research tends to introduce dense sampling to achieve better performance. However, such approaches are computationally intractable on large video datasets. Hence, some recent work has focused on feature reduction to speed up the algorithms without sacrificing accuracy.
In this paper, we propose a novel selective feature sampling strategy for action recognition. First, the optical flow field is estimated throughout the input video. Then, sparse FAST (Features from Accelerated Segment Test) points are selected within the motion regions, which are detected using the optical flows on temporally down-sampled image sequences. These selected sparse FAST points serve as seeds for generating 3D patches, from which a simplified LPM (Local Part Model) is formed, greatly speeding up the model. Moreover, MBHs (Motion Boundary Histograms) calculated from the optical flows are adopted in the framework to further improve efficiency. Experimental results on the UCF50 dataset and our artificial dataset show that our method runs closer to real time and achieves higher accuracy than other recently published competitive methods.
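A minimal OpenCV sketch of the sampling step: FAST keypoints are detected only inside regions with significant optical-flow magnitude. The Farnebäck flow method and the thresholds are illustrative assumptions; the original may use a different flow estimator and tuned parameters.

```python
import cv2
import numpy as np

def motion_fast_points(prev_gray, gray, flow_thresh=1.0):
    """Detect FAST keypoints restricted to moving regions; intended to be
    called on frames of the temporally down-sampled sequence."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    # Binary mask of motion regions derived from the flow magnitude.
    motion_mask = (magnitude > flow_thresh).astype(np.uint8) * 255
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(gray, mask=motion_mask)
    return keypoints, flow   # the flow is reused later for the MBH features
```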
Visual tracking is a challenging problem in computer vision. In recent years, a significant number of trackers have been proposed. Among them, tracking with dense spatio-temporal context has proved to be an efficient and accurate method. Unlike trackers with online-trained classifiers, which struggle to meet the requirements of real-time tracking, a tracker with spatio-temporal context can run at hundreds of frames per second using the Fast Fourier Transform (FFT). Nevertheless, the performance of the spatio-temporal context tracker relies heavily on the learning rate of the context, which limits its robustness.
In this paper, we propose a tracking method with dual spatio-temporal context trackers that hold different learning rates during tracking. The tracker with the high learning rate tracks the target smoothly as the target's appearance changes, while the tracker with the low learning rate perceives occlusions and continues tracking when the target re-emerges. To choose the target among the candidates produced by the two trackers, we adopt the Normalized Correlation Coefficient (NCC) to evaluate the confidence of each sample. Experimental results show that the proposed algorithm performs robustly against several state-of-the-art tracking methods.
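The NCC confidence used to arbitrate between the two trackers' candidates can be sketched as follows; extracting the patch around each candidate is left out.

```python
import numpy as np

def ncc(patch, template):
    """Normalized Correlation Coefficient between a candidate patch and the
    stored target template; assumes both arrays have the same shape."""
    a = patch.astype(np.float64).ravel()
    b = template.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# The candidate with the higher NCC against the template wins:
# best = max(candidate_patches, key=lambda p: ncc(p, template))
```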
Object tracking is a challenging task in computer vision. Most state-of-the-art methods maintain an object model and update it with new examples obtained from incoming frames in order to handle appearance variation. However, updating the object model frame-by-frame without any supervision mechanism inevitably introduces model drift. In this paper, we adopt a multi-expert tracking framework that is able to correct the effect of bad updates after they happen, such as those caused by severe occlusion; this is exactly the ability a robust tracking method should possess. The expert ensemble consists of a base tracker and its former snapshots. The tracking result is produced by the current tracker, which is selected by means of a simple loss function. We adopt an improved compressive tracker as the base tracker and modify it to fit the multi-expert framework. The proposed multi-expert tracking algorithm significantly improves the robustness of the base tracker, especially in scenes with frequent occlusions and illumination variations. Experiments on challenging video sequences, with comparisons to several state-of-the-art trackers, demonstrate the effectiveness of our method; moreover, our tracking algorithm runs in real time.
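A heavily simplified sketch of the multi-expert mechanism; predict, update, and the loss function are placeholders standing in for the base compressive tracker's actual interface, which the abstract does not specify.

```python
import copy

class MultiExpertTracker:
    """The ensemble is the base tracker plus periodic snapshots of it; each
    frame, every expert proposes a result and the lowest-loss one wins."""
    def __init__(self, base_tracker, loss_fn, snapshot_interval=50):
        self.experts = [base_tracker]
        self.loss_fn = loss_fn              # stand-in for the paper's loss
        self.snapshot_interval = snapshot_interval
        self.frame_idx = 0

    def track(self, frame):
        proposals = [(e, e.predict(frame)) for e in self.experts]
        expert, box = min(proposals,
                          key=lambda p: self.loss_fn(p[0], p[1], frame))
        expert.update(frame, box)           # only the selected expert updates
        self.frame_idx += 1
        if self.frame_idx % self.snapshot_interval == 0:
            self.experts.append(copy.deepcopy(expert))   # take a snapshot
        return box
```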
Detecting dim, small targets in infrared images and videos is one of the most important techniques in many computer vision applications, such as video surveillance and precise infrared imaging guidance. In this paper, we propose a real-time target detection approach for infrared imagery that combines saliency detection with local average filtering. First, we compute the log-amplitude spectrum of the infrared image. Second, we find the spikes of the amplitude spectrum using the cubic facet model and suppress the sharp spikes using local average filtering. Finally, the detection result in the spatial domain is obtained by reconstructing the 2D signal from the original phase and the filtered amplitude spectrum. Experimental results on infrared images with different types of backgrounds demonstrate the high efficiency and accuracy of the proposed method in detecting dim, small targets.
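A simplified sketch of the spectrum-domain pipeline, with a plain local average filter applied across the whole log-amplitude spectrum; the paper's cubic-facet spike detection step is omitted here, so all spikes are smoothed uniformly.

```python
import numpy as np
from scipy import ndimage

def detect_small_targets(img, kernel_size=3):
    """Smooth spikes in the log-amplitude spectrum, then reconstruct the
    image from the filtered amplitude and the original phase."""
    f = np.fft.fft2(img.astype(np.float64))
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    smoothed = ndimage.uniform_filter(log_amp, size=kernel_size)
    # Targets pop out where the original spectrum deviated from the smooth one.
    saliency = np.abs(np.fft.ifft2(np.exp(smoothed + 1j * phase))) ** 2
    return ndimage.gaussian_filter(saliency, sigma=2)
```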
KEYWORDS: Super resolution, Video, Associative arrays, Video coding, Video processing, Image processing, Visualization, Feature extraction, Information visualization
Methods for super-resolution can be classified into three categories: (i) interpolation-based methods, (ii) reconstruction-based methods, and (iii) learning-based methods. Learning-based methods usually perform best thanks to their learning process, but their high computational complexity prevents them from being applied to video super-resolution. We propose a fast sparsity-based video super-resolution algorithm that exploits inter-frame information. First, the background is extracted via existing methods, in this paper a Gaussian Mixture Model (GMM). Second, we construct background and foreground patch dictionaries by randomly sampling patches from high-resolution video. During video super-resolution, only the foreground regions are reconstructed with the foreground dictionary via sparse coding; the background is updated, and only its changed regions are reconstructed with the background dictionary in the same way. Finally, the background and foreground are fused to produce the super-resolution result. Experiments show that this makes sparsity-based video super-resolution much faster, with comparable or even better performance.
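A minimal sketch of the foreground reconstruction step, assuming coupled low/high-resolution patch dictionaries and using scikit-learn's sparse_encode; patch extraction, overlap blending, and the background update are omitted.

```python
from sklearn.decomposition import sparse_encode

def super_resolve_patches(lr_patches, dictionary_lr, dictionary_hr, alpha=0.1):
    """Sparse-code low-resolution foreground patches over the LR dictionary,
    then reuse the codes with the coupled HR dictionary.

    lr_patches:    (n_patches, lr_dim) vectorized foreground patches
    dictionary_lr: (n_atoms, lr_dim), dictionary_hr: (n_atoms, hr_dim)"""
    codes = sparse_encode(lr_patches, dictionary_lr, alpha=alpha)
    return codes @ dictionary_hr   # (n_patches, hr_dim) reconstructed patches
```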
Detecting abnormal human behaviors is one of the most challenging tasks in video surveillance for public security control. The Interaction Energy Potential model is an effective and competitive recently published method for detecting abnormal behaviors, but its model of abnormal behaviors is not accurate enough and therefore has some limitations. To solve this problem, we propose a novel Particle Motion model. First, we extract the foreground to improve the accuracy of interest point detection, since a complex background usually degrades its effectiveness considerably. Second, we detect the interest points using image features, so that the movement of each human target can be represented by the movements of the interest points detected on that target. We then track these interest points through the video, recording their positions and velocities; from these, the velocity angles, position angles and distances between each pair of points can be calculated. Finally, the proposed Particle Motion model computes an eigenvalue for each frame, and an adaptive threshold method is proposed to detect abnormal behaviors. Experimental results on the BEHAVE dataset and online videos show that our method detects fight and robbery events effectively and has promising performance.
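The pairwise quantities named above (velocity angles, position angles, and inter-point distances) can be computed in a vectorized way, for example:

```python
import numpy as np

def pairwise_motion_features(positions, velocities):
    """Per-frame pairwise features for tracked interest points.

    positions, velocities: (N, 2) arrays for N tracked points."""
    vel_angles = np.arctan2(velocities[:, 1], velocities[:, 0])   # per point
    diff = positions[None, :, :] - positions[:, None, :]          # (N, N, 2)
    distances = np.linalg.norm(diff, axis=2)                      # (N, N)
    pos_angles = np.arctan2(diff[..., 1], diff[..., 0])           # (N, N)
    # Relative velocity angle between each pair of points.
    rel_vel_angles = vel_angles[None, :] - vel_angles[:, None]    # (N, N)
    return distances, pos_angles, rel_vel_angles
```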
This paper proposes an efficient fusion method for multiple remote sensing images based on sparse representation, in which we mainly address the fusion rules for the sparse coefficients. In the proposed method, we first obtain the sparse coefficients of the different source images over three dictionaries. Considering their sparsity, the source coefficients can be divided into large-, middle-, and small-correlation classes. Based on an analysis and comparison of the possible combinations, the final coefficients are fused under different fusion rules according to their correlation class. Finally, the fused image is reconstructed by combining the fused coefficients with the trained dictionaries.
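For illustration, one plausible realization of correlation-based coefficient fusion; the thresholds and the per-class rules (averaging, max-activity selection, activity-weighted sum) are assumptions, not the paper's exact rules.

```python
import numpy as np

def fuse_sparse_coefficients(c1, c2, hi=0.7, lo=0.3):
    """Fuse two images' sparse coefficient matrices (one patch per row) by
    classifying each pair of coefficient vectors by correlation."""
    fused = np.empty_like(c1)
    for i in range(c1.shape[0]):
        a, b = c1[i], c2[i]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        corr = float(a @ b / denom) if denom > 0 else 0.0
        if corr > hi:                      # large correlation: average
            fused[i] = 0.5 * (a + b)
        elif corr < lo:                    # small correlation: max activity
            fused[i] = a if np.abs(a).sum() >= np.abs(b).sum() else b
        else:                              # middle: activity-weighted sum
            wa, wb = np.abs(a).sum(), np.abs(b).sum()
            fused[i] = (wa * a + wb * b) / (wa + wb + 1e-12)
    return fused
```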
Abnormal event detection in crowded scenes is one of the most challenging tasks in video surveillance for public security control. Different from previous learning-based work, we propose an unsupervised Interaction Power model with an adaptive threshold strategy that detects abnormal group activity by analyzing the steady state of individuals' behaviors in the crowded scene. First, the optical flow field of potential pedestrians is calculated only within the extracted foreground, to reduce the computational cost. Second, each pedestrian is divided into patches of the same size, and the interaction power of the pedestrians is represented by motion particles that describe the motion status at the center pixels of the patches; the motion status of each patch is computed from the optical flows of the pixels within it. For each motion particle, its interaction power, defined as the steady state of its current behavior, is computed over all of its neighboring motion particles. Finally, the dense crowd's steady state is represented as the collection of the motion particles' interaction powers. An adaptive threshold strategy is proposed to detect abnormal events by examining the frame power field, a fixed-size random sample of the motion particles' interaction powers. Experimental results on the standard UMN dataset and online videos show that our method detects crowd anomalies and achieves higher accuracy than other recently published competitive methods.
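A rough sketch of one way to realize a motion particle's interaction power, as the deviation of its motion from its neighbors' mean motion; the paper's exact power definition and neighborhood may differ.

```python
import numpy as np

def interaction_power(positions, velocities, radius=10.0):
    """For each motion particle, measure how far its velocity deviates from
    the mean velocity of its neighbors; a steady particle has low power."""
    n = positions.shape[0]
    power = np.zeros(n)
    for i in range(n):
        dist = np.linalg.norm(positions - positions[i], axis=1)
        neighbors = (dist < radius) & (dist > 0)
        if neighbors.any():
            power[i] = np.linalg.norm(
                velocities[i] - velocities[neighbors].mean(axis=0))
    return power
# A frame could then be flagged when a fixed-size random sample of these
# powers exceeds an adaptive threshold, e.g. mean + k * std over history.
```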
Accurate and fast detection of small infrared targets is of great importance for infrared precise guidance, early warning, video surveillance, etc. Based on the human visual attention mechanism, an automatic detection algorithm for small infrared targets is presented. In this paper, instead of searching for infrared targets directly, we model the regular patches that do not attract much attention from our visual system, inspired by the property that regular patches in the spatial domain correspond to spikes in the amplitude spectrum. Unlike recent approaches that use global spectral filtering, we define local maxima suppression using local spectral filtering to smooth the spikes in the amplitude spectrum, thereby making the infrared targets pop out. In the proposed method, we first compute the amplitude spectrum of the input infrared image. Second, we find the local maxima of the amplitude spectrum using the cubic facet model. Third, we suppress the local maxima by convolving the local spectrum with a low-pass Gaussian kernel of an appropriate scale. Finally, the detection result in the spatial domain is obtained by reconstructing the 2D signal from the original phase and the log-amplitude spectrum with the local maxima suppressed. Experiments on real-life IR images show that the proposed method has satisfying detection effectiveness and robustness; it also has high detection efficiency and can further be used for real-time detection and tracking.
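A compact sketch of the local maxima suppression step, with a simple window test standing in for the cubic facet model; reconstruction then uses the original phase with the filtered log-amplitude spectrum, as described above.

```python
import numpy as np
from scipy import ndimage

def suppress_local_maxima(log_amp, win=5, sigma=2.0):
    """Find local maxima of the log-amplitude spectrum and replace them with
    a Gaussian-smoothed value, leaving the rest of the spectrum untouched."""
    local_max = (log_amp == ndimage.maximum_filter(log_amp, size=win))
    smoothed = ndimage.gaussian_filter(log_amp, sigma=sigma)
    return np.where(local_max, smoothed, log_amp)

# Reconstruction: amplitude = np.exp(filtered_log_amp); combine with the
# original phase and apply the inverse FFT to obtain the detection map.
```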
In this paper, we propose a novel method to estimate the camera's ego-motion parameters directly from normal flows. Normal flows, the projections of the optical flows along the direction of the image intensity gradient, can be calculated directly from the image sequence without any artificial assumptions about the captured scene. Unlike many traditional approaches, which tackle the problem by establishing motion correspondences or estimating full optical flows, our method obtains the motion parameters directly from the spatio-temporal gradient of the image intensity. Hence, it requires no specific assumptions about the captured scene, such as smoothness or continuity constraints or the presence of distinct features. Our method has been tested on both synthetic image data and real image sequences, and the experimental results demonstrate that it is feasible and reliable.
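The normal flow itself follows directly from the spatio-temporal gradients: along the image gradient direction, the brightness constancy equation gives a flow component of -I_t/|∇I|. A minimal finite-difference sketch:

```python
import numpy as np

def normal_flow(frame0, frame1, eps=1e-3):
    """Compute the signed normal-flow magnitude from two consecutive frames;
    no smoothness or continuity assumptions are required."""
    I0 = frame0.astype(np.float64)
    I1 = frame1.astype(np.float64)
    It = I1 - I0                          # temporal derivative
    Iy, Ix = np.gradient(I0)              # spatial derivatives
    mag = np.sqrt(Ix**2 + Iy**2)
    valid = mag > eps                     # only where the gradient is reliable
    nf = np.zeros_like(I0)
    nf[valid] = -It[valid] / mag[valid]   # flow component along the gradient
    # The normal-flow direction is the unit gradient (Ix, Iy) / mag.
    return nf, Ix, Iy, valid
```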
Infrared images usually suffer from low contrast, blurred edges and a large amount of noise. To improve their quality, this paper presents a novel adaptive infrared image enhancement algorithm. First, the input image is decomposed via the nonsubsampled Contourlet transform (NSCT) to obtain the subband coefficients at different scales and directions. Next, the high-frequency coefficients are automatically classified into three categories by an adaptive classification method that analyzes each coefficient in its local neighborhood. A nonlinear mapping function is then applied to modify the coefficients, highlighting the edges while suppressing high-frequency noise. Finally, the enhanced image is reconstructed from the modified coefficients. Experimental results show that the proposed algorithm effectively enhances image contrast and highlights edges while avoiding image distortion and noise amplification.
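One plausible form of the three-category coefficient modification: suppress coefficients below a noise threshold, amplify those above an edge threshold, and keep the rest. The thresholds, gain, and hard classification here are illustrative assumptions; the paper classifies adaptively from the local neighborhood.

```python
import numpy as np

def enhance_coefficients(coeffs, t_noise, t_edge, gain=1.5):
    """Modify high-frequency subband coefficients in three classes:
    noise (zeroed), weak texture (kept), and edges (amplified)."""
    out = coeffs.copy()
    mag = np.abs(coeffs)
    out[mag < t_noise] = 0.0     # suppress high-frequency noise
    out[mag >= t_edge] *= gain   # highlight edges
    return out                   # middle class passes through unchanged
```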
This paper proposes a novel scene matching method that automatically detects unauthorized changes of a camera's field of view (FOV). The problem is substantially difficult because, in practice, FOV changes and scene content variations are mixed together. In this work, a local viewpoint-invariant descriptor is first proposed to measure the appearance similarity of the captured scenes. A structural similarity constraint is then adopted to further determine whether the current scene is unchanged despite content variation within it. Experimental results demonstrate that the proposed method works well in the presence of viewpoint changes, partial occlusion and structural similarities in real environments. The proposed scheme has proved practically applicable and reliable through its use in an actual intelligent surveillance system.
An autonomous system must be capable of estimating or controlling its own motion parameters. Numerous research works already address this task, but most are based on establishing motion correspondences or estimating full optical flows. Such solutions place restrictions on the scene: either enough distinct features must be present, or the texture must be dense. Different from these traditional works, and using no motion correspondences or epipolar geometry, we start from the normal flow data and make good use of every piece of it, since normal flows may be only sparsely available. We apply the spherical image model to avoid ambiguity in describing the camera motion. Since each normal flow defines a locus for the camera motion, intersecting the loci offered by different data points narrows down the possible camera motions and can even pinpoint the solution. A voting scheme in the φ-θ domain is applied to reduce the 3D voting space to a 2D one. We tested the algorithm on both synthetic image data and real image sequences, and the experimental results illustrate the potential of the method.
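A brute-force sketch of the φ-θ voting idea: candidate motion directions on the unit sphere are discretized, and each normal-flow datum votes for the directions consistent with its locus. The constraint predicate is hypothetical, standing in for the paper's actual locus test.

```python
import numpy as np

def vote_motion_direction(constraint_fn, data, n_phi=180, n_theta=90):
    """Accumulate votes over a discretized (phi, theta) direction grid;
    constraint_fn(datum, phi, theta) is a placeholder predicate that says
    whether a normal-flow datum is consistent with that direction."""
    acc = np.zeros((n_phi, n_theta))
    phis = np.linspace(0, 2 * np.pi, n_phi, endpoint=False)
    thetas = np.linspace(0, np.pi, n_theta)
    for d in data:
        for i, phi in enumerate(phis):
            for j, theta in enumerate(thetas):
                if constraint_fn(d, phi, theta):
                    acc[i, j] += 1
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return phis[i], thetas[j]     # direction with the most votes
```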