Most existing supervised machine learning frameworks assume that the training samples contain no mistakes or false interpretations. However, this assumption may not hold in practical applications. In particular, when human annotators are involved in providing training samples, the training set may contain errors. In this paper, we study the effect of imperfect training samples on the supervised machine learning framework. We focus on a mathematical framework that describes the learnability of noisy training data, and we study theorems that estimate the error bounds of the generated models and the required number of training samples. These bounds depend on the amount of training data and the probability that a training label is accurate. Based on these learnability results for imperfect annotation, we describe an autonomous learning framework that uses cross-modality information to learn concept models. For instance, visual concept models can be trained on the detection results of Automatic Speech Recognition, Closed Captions, or prior detectors of the same modality. These detection results on an unsupervised training set serve as imperfect labels for the models to be built. A prototype system based on this learning technique has been built, and experiments with it show promising results.
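To make the dependence on data amount and label accuracy concrete, a classical PAC-style bound for learning under label noise (Angluin and Laird, 1988) has exactly this shape; it is offered here as an illustration of the kind of theorem involved, not as this paper's own result. For a finite hypothesis class H, target error epsilon, failure probability delta, and label-noise rate eta < 1/2:

```latex
% Illustrative sample-complexity bound under classification noise
% (Angluin & Laird, 1988): the noisier the labels (\eta -> 1/2),
% the more training samples m are required.
m \;\ge\; \frac{2}{\epsilon^{2}\,(1 - 2\eta)^{2}}\,\ln\!\frac{2\,|\mathcal{H}|}{\delta}
```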
Model vector-based retrieval is a novel approach for video indexing that uses a semantic model vector signature describing the detection of a fixed set of concepts across a lexicon. The model vector basis is created using a set of independent binary classifiers that correspond to the semantic concepts. The model vectors are created by applying the binary detectors to video content and measuring the confidence of detection. Once the model vectors are extracted, simple techniques can be used to search for similar matches in a video database. However, since confidence scores alone do not capture information about the reliability of the underlying detectors, techniques are needed to ensure good performance in the presence of detectors of varying quality. In this paper, we examine the model vector-based retrieval framework for video and propose methods that use detector validity to improve matching performance. In particular, we develop a model vector distance metric that weighs the dimensions by detector validity scores, and we empirically evaluate retrieval effectiveness on a large video test collection using different methods of measuring and incorporating detector validity indicators.
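One natural form for such a metric is a validity-weighted Euclidean distance. The sketch below is an illustration under assumptions (the weighting scheme and validity scores are hypothetical), not necessarily the paper's exact metric.

```python
import numpy as np

def validity_weighted_distance(query, candidate, validity):
    """Distance between two model vectors, down-weighting unreliable detectors.

    query, candidate : arrays of per-concept detection confidences.
    validity         : per-detector validity scores in [0, 1]
                       (e.g., average precision on a validation set).
    """
    w = np.asarray(validity, dtype=float)
    diff = np.asarray(query) - np.asarray(candidate)
    return np.sqrt(np.sum(w * diff ** 2))

# Example: the third detector is unreliable, so its disagreement counts less.
q = np.array([0.9, 0.1, 0.8])
c = np.array([0.8, 0.2, 0.1])
print(validity_weighted_distance(q, c, validity=[0.9, 0.8, 0.2]))
```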
In this paper, we present our new results in news video story segmentation and classification in the context of the TRECVID 2003 video retrieval benchmarking event. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model is used to automatically discover relevant features contributing to the detection of story boundaries. One novel aspect of our method is the use of a feature wrapper to address different types of features -- asynchronous, discrete, continuous, and delta ones. We also developed several novel features related to prosody. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measures up to 0.76 on ABC news and 0.73 on CNN news), present how these multi-level, multi-modal features construct the probabilistic framework, and, more importantly, observe an interesting opportunity for further improvement.
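For binary story-boundary detection, a Maximum Entropy model coincides with logistic regression over the fused feature vector. The sketch below illustrates that fusion step on hypothetical features; the feature names and values are invented for illustration, and the paper's feature wrapper is more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical fused feature vectors for candidate boundary points.
# Columns might be: motion magnitude, face count, music/speech flag,
# pitch-reset prosody cue, text-segmentation score.
X = np.array([
    [0.2, 1, 0, 0.1, 0.3],   # mid-story point
    [0.9, 0, 1, 0.8, 0.9],   # true story boundary
    [0.1, 2, 0, 0.2, 0.2],
    [0.8, 0, 1, 0.7, 0.8],
])
y = np.array([0, 1, 0, 1])   # 1 = story boundary

# A binary Maximum Entropy model is logistic regression: the maximum-entropy
# distribution whose feature expectations match the training data.
model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])  # P(boundary | features)
```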
KEYWORDS: Video, Databases, Multimedia, Video compression, Semantic video, Video processing, Personal digital assistants, Internet, Cameras, Composites
A video personalization and summarization system is designed and implemented that incorporates the usage environment to dynamically generate a personalized video summary. The personalization system adopts a three-tier server-middleware-client architecture in order to select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. Our semantic metadata is provided through the VideoAnnEx MPEG-7 Video Annotation Tool. When the user initiates a request for content, the client communicates the MPEG-21 usage environment description along with the user query to the middleware. The middleware is powered by the personalization engine and the content adaptation engine. Our personalization engine includes the VideoSue Summarization on Usage Environment engine, which selects the optimal set of desired contents according to user preferences. Afterwards, the adaptation engine performs the required transformations and compositions of the selected contents for the specific usage environment using our VideoEd Editing and Composition Tool. Finally, two personalization and summarization systems are demonstrated: one for the IBM WebSphere Portal Server and one for pervasive PDA devices.
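Selecting an optimal subset of content under a usage-environment constraint can be cast as a 0/1 knapsack problem. The sketch below shows one plausible formulation with a duration budget; the segment names, scores, and the knapsack formulation itself are illustrative assumptions, not necessarily VideoSue's actual algorithm.

```python
def select_segments(segments, time_budget):
    """Pick segments maximizing total preference score within a duration budget.

    segments    : list of (name, duration_seconds, preference_score).
    time_budget : maximum total duration allowed by the usage environment.
    Classic 0/1 knapsack dynamic program keyed on total duration used.
    """
    best = {0: (0.0, [])}  # total_duration -> (score, chosen segment names)
    for name, dur, score in segments:
        # Snapshot items so each segment is considered at most once.
        for used, (s, chosen) in sorted(best.items(), reverse=True):
            t = used + dur
            if t <= time_budget and (t not in best or best[t][0] < s + score):
                best[t] = (s + score, chosen + [name])
    return max(best.values())  # (best score, segment list)

clips = [("goal", 30, 0.9), ("interview", 60, 0.5), ("intro", 20, 0.2)]
print(select_segments(clips, time_budget=80))  # -> (~1.1, ['goal', 'intro'])
```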
A model-based approach to video retrieval requires ground-truth data for training the models. This leads to the development of video annotation tools that allow users to annotate each shot in the video sequence as well as to identify and label scenes, events, and objects by applying the labels at the shot level. The annotation tool considered here also allows the user to associate object labels with an individual region in a key-frame image. However, the abundance of video data and the diversity of labels make annotation a difficult and overly expensive task. To combat this problem, we formulate the task of annotation in the framework of supervised training with partially labeled data by viewing it as an exercise in active learning. In this scenario, one first trains a classifier with a small set of labeled data, and subsequently updates the classifier by selecting the most informative, or most uncertain, subset of the available data set. Consequently, labels are automatically propagated to the as-yet-unlabeled data as well. The purpose of this paper is twofold. The first is to describe a video annotation tool that has been developed for annotating generic video sequences in the context of a recent video-TREC benchmarking exercise. The tool is semi-automatic in that it automatically propagates labels to similar shots, requiring the user only to confirm or reject the propagated labels. The second is to show how an active learning strategy can be implemented in this context to further improve the performance of the annotation tool. While many variants of active learning are conceivable, we specifically report results of experiments with support vector machine classifiers with polynomial kernels.
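A minimal sketch of one uncertainty-sampling round with a polynomial-kernel SVM follows; the feature dimensions, pool sizes, and simulated oracle are illustrative assumptions (in the real tool, the user plays the oracle by confirming or rejecting propagated labels).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical shot features: a small labeled seed set and a large unlabeled pool.
X_labeled = rng.normal(size=(20, 8))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(500, 8))

for _ in range(5):
    # Train an SVM with a polynomial kernel, as in the paper's experiments.
    clf = SVC(kernel="poly", degree=2).fit(X_labeled, y_labeled)

    # Uncertainty sampling: query the pool shots nearest the decision boundary.
    margins = np.abs(clf.decision_function(X_pool))
    query = np.argsort(margins)[:10]

    # Simulate the user/oracle with the same rule that generated seed labels.
    y_new = (X_pool[query, 0] > 0).astype(int)
    X_labeled = np.vstack([X_labeled, X_pool[query]])
    y_labeled = np.concatenate([y_labeled, y_new])
    X_pool = np.delete(X_pool, query, axis=0)

print(f"labeled set grew to {len(y_labeled)} examples")
```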
KEYWORDS: Video, Semantic video, Video compression, Personal digital assistants, Databases, Image segmentation, Multimedia, Human-machine interfaces, Mobile devices, Visualization
We have designed and implemented a video semantic summarization system, which includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder on PCs, and an application interface on color and black-and-white Palm-OS PDAs. We designed a video annotation tool, VideoAnn, to annotate semantic labels associated with video shots. Videos are first segmented into shots based on their visual-audio characteristics. They are played back using an interactive interface, which facilitates and speeds up the annotation process. Users can annotate the video content with the units of temporal shots or spatial regions. The annotated results are stored in the MPEG-7 XML format. We also designed and implemented a video transmission system, Universal Tuner, for wireless video streaming. This system transcodes MPEG-1/2 videos or live TV broadcast videos for black-and-white or indexed-color Palm OS devices. In our system, the complexity of the multimedia compression and decompression algorithms is adaptively partitioned between the encoder and decoder. On the client end, users can access the summarized video based on their preferences, time, and keywords, as well as the transmission bandwidth and the remaining battery power of the pervasive devices.
Handling packet loss or delay in mobile and/or Internet environments is a challenging problem for multimedia transmission. Using a connection-oriented protocol such as TCP may introduce intolerable delays due to retransmission. Using a datagram-oriented protocol such as UDP may leave the content only partially represented when packets are lost. In this paper, we propose a new method that uses our self-authentication-and-recovery images (SARI) to perform error detection and concealment in the UDP environment. The lost information in a SARI image can be approximately recovered based on the embedded watermark, which includes content-based authentication information and recovery information. Images or video frames are watermarked in advance, so that no additional mechanism is needed in the networking or encoding process. Because the recovery is not based on adjacent blocks, the proposed method can recover the corrupted area even when the information loss occurs in large or highly varying areas. Our experiments show the advantages of this technique in both transmission time savings and its broad application potential.
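A toy sketch of the recovery idea follows: coarse recovery bits for each block are hidden in a distant partner block rather than an adjacent one, so large contiguous losses remain concealable. The partner mapping, the use of the block mean as recovery data, and the side dictionary standing in for the embedded watermark are all illustrative simplifications of the actual SARI scheme.

```python
import numpy as np

def embed_recovery(blocks):
    """Toy SARI-style embedding: a coarse copy of each block (here just its
    mean, standing in for recovery bits) is hidden in a distant partner
    block, so concealment does not depend on adjacent blocks."""
    n = len(blocks)
    return {i: ((i + n // 2) % n, blocks[i].mean()) for i in range(n)}

def conceal(blocks, lost, payload):
    """Replace lost blocks with their embedded approximation."""
    out = [b.copy() for b in blocks]
    for i in lost:
        partner, coarse = payload[i]
        if partner not in lost:          # recovery bits survived in partner
            out[i][:] = coarse           # approximate reconstruction
    return out

blocks = [np.full((8, 8), v, float) for v in (10, 80, 150, 220)]
recovered = conceal(blocks, lost={1}, payload=embed_recovery(blocks))
print(recovered[1][0, 0])                # 80.0: lost block approximated
```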
KEYWORDS: Digital watermarking, Sensors, Fourier transforms, Databases, Signal to noise ratio, Image registration, Pattern recognition, Image processing, Signal processing, Signal detection
Many electronic watermarks for still images and video content are sensitive to geometric distortions. For example, simple rotation, scaling, and/or translation (RST) of an image can prevent detection of a public watermark. In this paper, we propose a watermarking algorithm that is robust to RST distortions. The watermark is embedded into a 1-dimensional signal obtained by first taking the Fourier transform of the image, resampling the Fourier magnitudes into log-polar coordinates, and then summing a function of those magnitudes along the log-radius axis. If the image is rotated, the resulting signal is cyclically shifted. If it is scaled, the signal is multiplied by some value. And if the image is translated, the signal is unaffected. We can therefore compensate for rotation with a simple search, and for scaling by using the correlation coefficient for the detection metric. False positive results on a database of 10,000 images are reported. Robustness results on a database of 2,000 images are described. It is shown that the watermark is robust to rotation, scale and translation. In addition, the algorithm shows resistance to cropping.
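The construction of the 1-D signal can be sketched as follows; the resampling resolution, interpolation choice, and the log(1+x) magnitude function are implementation assumptions, not the authors' exact code.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def rst_signature(image, n_angles=360, n_radii=128):
    """1-D watermark-domain signal: log-polar resampled Fourier magnitudes,
    summed along the log-radius axis. Rotation of the image becomes a
    cyclic shift of this signal; scaling multiplies it by a value;
    translation leaves it unchanged (the Fourier magnitude discards phase)."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    cy, cx = np.array(mag.shape) / 2.0
    theta = np.linspace(0, np.pi, n_angles, endpoint=False)  # magnitude is symmetric
    log_r = np.linspace(0, np.log(min(cy, cx) - 1), n_radii)
    r = np.exp(log_r)
    rows = cy + np.outer(r, np.sin(theta))
    cols = cx + np.outer(r, np.cos(theta))
    polar = map_coordinates(mag, [rows, cols], order=1)  # (n_radii, n_angles)
    return np.log1p(polar).sum(axis=0)  # sum a function of magnitudes over log-radius

img = np.zeros((256, 256)); img[96:160, 96:160] = 1.0  # toy image
print(rst_signature(img).shape)  # (360,); rotation -> cyclic shift of this signal,
# so the detector searches over shifts and uses correlation as the metric.
```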
In this paper, we propose a semi-fragile watermarking technique that accepts JPEG lossy compression of the watermarked image down to a pre-determined quality factor, and rejects malicious attacks. The authenticator can identify the positions of corrupted blocks and recover them with approximations of the original ones. In addition to JPEG compression, adjustments of the brightness of the image within reasonable ranges are also acceptable to the proposed authenticator. The security of the proposed method is achieved by using a secret block mapping function which controls the signature generating/embedding processes. Our authenticator is based on two invariant properties of DCT coefficients before and after JPEG compression. They are deterministic, so no probabilistic decision is needed in the system. The first property shows that if we modify a DCT coefficient to an integral multiple of a quantization step that is larger than the steps used in later JPEG compressions, then this coefficient can be exactly reconstructed after a later acceptable JPEG compression. The second is the invariant relationship between two coefficients in a block pair before and after JPEG compression. We therefore use the second property to generate the authentication signature, and use the first property to embed it as watermarks. There is no perceptible degradation between the watermarked image and the original. In addition to authentication signatures, we can also embed recovery bits for recovering approximate pixel values in corrupted areas. Our authenticator utilizes the compressed bitstream, and thus avoids rounding errors in reconstructing DCT coefficients. Experimental results show the effectiveness of this system. The system also guarantees no false alarms, i.e., no acceptable JPEG compression is rejected.
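The first invariant property reduces to a few lines of arithmetic, verified below; the quantization steps are illustrative, and the claim holds whenever the later JPEG step does not exceed the embedding step.

```python
import numpy as np

# Property: a DCT coefficient pre-quantized to an integer multiple of Q_embed
# survives any later JPEG quantization with step Q_jpeg <= Q_embed, because
# the JPEG rounding error is at most Q_jpeg / 2 < Q_embed / 2.
Q_embed, Q_jpeg = 16, 10          # illustrative quantization steps
coeff = 7 * Q_embed               # watermarked coefficient: multiple of Q_embed

jpeg_coeff = np.round(coeff / Q_jpeg) * Q_jpeg        # later JPEG compression
recovered = np.round(jpeg_coeff / Q_embed) * Q_embed  # re-quantize with Q_embed

assert recovered == coeff         # exact reconstruction -> watermark bit survives
print(coeff, jpeg_coeff, recovered)
```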
KEYWORDS: Video, Multimedia, Digital watermarking, Video compression, Image processing, Video processing, Tolerancing, Digital video recorders, Cameras, Image compression
Video authentication techniques are used to prove the originality of received video content and to detect malicious tampering. Existing authentication techniques protect every single bit of the video content and do not allow any form of manipulation. In real applications, this may not be practical. In several situations, compressed videos need to be further processed to accommodate various application requirements. Examples include bitrate scaling, transcoding, and frame rate conversion. The concept of asking each intermediate processing stage to add authentication codes is flawed in practical cases. In this paper, we extend our prior work on JPEG-surviving image authentication techniques to video. We first discuss issues of authenticating MPEG videos under various transcoding situations, including dynamic rate shaping, requantization, frame type conversion, and re-encoding. Different situations pose different technical challenges in developing robust authentication techniques. In the second part of this paper, we propose a robust video authentication system which accepts some MPEG transcoding processes but is able to detect malicious manipulations. It is based on unique invariant properties of the transcoding processes. Digital signature techniques as well as public key methods are used in our robust video authentication system.
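At the protocol level, the public-key step can be sketched as signing a digest of transcoding-invariant features; the feature string below is a hypothetical placeholder, and the RSA/PKCS#1 choice is an assumption for illustration, not necessarily the system's exact scheme.

```python
import hashlib
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Hypothetical invariant features extracted from an MPEG video: values that
# transcoding (rate shaping, requantization, ...) is known to preserve.
invariant_features = b"block-pair-relations:...|frame-types:..."
digest = hashlib.sha256(invariant_features).digest()

# The content owner signs the invariant-feature digest with a private key.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
signature = key.sign(digest, padding.PKCS1v15(), hashes.SHA256())

# A receiver re-extracts the invariants from the (possibly transcoded) video
# and verifies with the public key; malicious edits change the invariants and
# verification raises InvalidSignature.
key.public_key().verify(signature, digest, padding.PKCS1v15(), hashes.SHA256())
print("signature verified")
```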
Image authentication verifies the originality of an image by detecting malicious manipulations. This goal is different from that of image watermarking, which embeds into the image a signature surviving most manipulations. Existing methods for image authentication treat all types of manipulation equally (i.e., as unacceptable). However, some applications demand techniques that can distinguish acceptable manipulations (e.g., compression) from malicious ones. In this paper, we describe an effective technique for image authentication which can prevent malicious manipulations but allow JPEG lossy compression. The authentication signature is based on the invariance of the relationship between DCT coefficients at the same position in separate blocks of an image. This relationship is preserved when these coefficients are quantized in a JPEG compression process. Our proposed method can distinguish malicious manipulations from JPEG lossy compression regardless of how high the compression ratio is. We also show that, in different practical cases, the design of the authenticator depends on the number of recompressions and on whether the image is decoded into integral values in the pixel domain during the recompression process. Theoretical and experimental results indicate that this technique is effective for image authentication.
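The invariance rests on the fact that rounding with a common quantization step is monotonic, so the order of two coefficients at the same block position is preserved under compression; a quick numerical check follows (the coefficient values and steps are illustrative).

```python
import numpy as np

def jpeg_quantize(coeff, q):
    """Quantize-and-dequantize a DCT coefficient with step q, as JPEG does."""
    return np.round(coeff / q) * q

# Two DCT coefficients at the same frequency position in different blocks.
f_p, f_q = 37.4, 12.9
for q in (2, 8, 16, 32):  # any JPEG quantization step, shared by both blocks
    assert jpeg_quantize(f_p, q) >= jpeg_quantize(f_q, q)

# Because rounding with a common step is monotonic, f_p > f_q implies
# Q(f_p) >= Q(f_q) after any JPEG compression, so a signature built from
# these pairwise relations survives compression but not malicious edits.
print("order preserved under all tested steps")
```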