Nasser M. Nasrabadi
Director of Deep Learning Laboratory

Research

Research Interests


  • Biometrics: biometric recognition, biometric template protection, cross-modality, multimodal biometrics
  • Deep learning: dimensionality reduction, deep domain translation, coupled DNNs, coupled GANs, adversarial examples, deep fusion
  • Computer vision: domain adaptation, dictionary learning, object recognition, activity recognition, shape representation, multi-view face recognition
  • Machine learning: deep learning, dimensionality reduction, clustering, kernel methods, semi-supervised learning, transfer learning, active learning
  • Signal and image processing: sparse representation, compressive sampling, automatic target recognition, multispectral and hyperspectral imaging, acoustic and seismic signal processing

Biometrics and Identity Innovation Center 

Director: Nasser M. Nasrabadi
https://biic.wvu.edu/

Cognitive Computing Laboratory

Director: Nasser M. Nasrabadi 
https://cognitivecomputinglab.faculty.wvu.edu/


Research Projects

Cross Audio-to-Visual Speaker Identification in the Wild Using Deep Learning

Speaker recognition technology has achieved strong performance in some real-world applications, but its performance still degrades greatly in noisy environments. One approach to improving speaker recognition/identification is to combine video and audio sources, linking the visual features of lip motion with vocal features; the two modalities are correlated and convey complementary information. In this project, we are interested in identifying an individual face from a coupled video/audio clip of several individuals, based on data collected in an unrestricted environment (in the wild). We propose to use the visual lip-motion features of a face in a video clip, together with the co-recorded audio features from several speakers, to identify the individual who uttered the audio recorded along with the video. To solve this problem, we propose a hetero-associative deep neural network architecture (as shown in Fig. 1), which is a data-driven model and does not explicitly model phonemes or visemes (the visual equivalent of phonemes). A speech-to-video hetero-associative deep network will be used, where the network has learned to reconstruct the visual lip features given only speech features as the input. The visual lip feature vector generated by our deep network for an input test speech signal will be compared with a gallery of individual visual lip features for speaker identification. The proposed speech-to-video deep network will be trained on our current WVU voice and video training dataset, using the corresponding audio and video features from individuals as inputs to the network. For the audio signal we will use mel-frequency cepstral coefficients (MFCCs); for video, we will extract static and temporal visual features of the lip motion.

Fig. 1
Fig. 1.  Outline of the hetero-associative deep neural network architecture. The block diagram illustrates speech-to-lip-motion feature mapping using a five-layer feed-forward hetero-associative deep neural network, where the network has learned to reconstruct the visual lip features as output given only speech features (i.e., mel-frequency cepstral coefficients) as the input. The visual lip feature vector generated by this network for an input test speech signal is compared with a gallery of individual visual lip features for speaker identification.
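As a rough illustration of the mapping described above (not the project's actual implementation), the sketch below trains a small feed-forward network that maps MFCC speech features to lip-motion features and then identifies a speaker by comparing the predicted lip features against a gallery. All dimensions, data, and the layer count are placeholder assumptions.

```python
# Minimal sketch, assuming placeholder feature dimensions and synthetic data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechToLipNet(nn.Module):
    def __init__(self, mfcc_dim=39, lip_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(            # feed-forward speech-to-lip mapping
            nn.Linear(mfcc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, lip_dim),
        )

    def forward(self, mfcc):
        return self.net(mfcc)

model = SpeechToLipNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic stand-ins for paired (speech, lip) training features.
speech_feats = torch.randn(512, 39)
lip_feats = torch.randn(512, 64)

for _ in range(10):                           # reconstruction training loop
    opt.zero_grad()
    loss = F.mse_loss(model(speech_feats), lip_feats)
    loss.backward()
    opt.step()

# Identification: compare the predicted lip features for a probe utterance
# with a gallery of per-speaker lip features via cosine similarity.
gallery = torch.randn(10, 64)                 # one lip-feature vector per speaker
probe = model(torch.randn(1, 39))
speaker_id = F.cosine_similarity(probe, gallery).argmax().item()
print("best-matching speaker index:", speaker_id)
```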

Nonlinear Mapping using Deep Learning for Thermal-to-Visible Night-Time Face Recognition

Infrared thermal cameras are important for night-time surveillance and security applications. They are especially useful in night-time scenarios when the subject is far away from the camera. The motivation behind thermal face recognition is the need for enhanced intelligence-gathering capabilities in darkness, where active illumination is impractical and surveillance with visible cameras is not feasible. However, the acquired thermal face images have to be identified using images from existing visible face databases. Therefore, cross-spectral face matching between the thermal and visible spectra is a much desired capability. In cross-modal face recognition, identifying a thermal probe image against a visible face database is especially difficult because of the wide modality gap between thermal and visible physical phenomenology. In this project we address the cross-spectral (thermal vs. visible) and cross-distance (50 m, 100 m, and 150 m vs. 1 m standoff) face-matching problem for night-time face recognition (FR) applications. Previous research has mainly concentrated on extracting hand-crafted features (e.g., SIFT, SURF, HOG, LBP, wavelets, Gabor jets, kernel functions) by assuming that the two modalities share the same extracted features. However, the relationship between the two modalities is highly nonlinear. In this project we will investigate nonlinear mapping techniques based on deep neural network learning procedures (shown in Fig. 2) to bridge the modality gap between the visible and thermal spectra while preserving the subject identity information. The nonlinear coupled DNN features will then be used by an FR classifier.

Fig. 2
Fig. 2.  Illustrates a coupled deep convolutional neural network used to bridge the modality gap between a visible and a thermal face while preserving the subject identity information. The CP-DCNN consists of two DCNN-based autoencoders; the top network in the figure is dedicated to the visible spectrum and the bottom one to the long-wave infrared spectrum, and the two are coupled together at the latent feature layer for the purpose of matching the two modalities.
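The following is a highly simplified sketch of the coupling idea, not the CP-DCNN itself: two small CNN encoders (visible and thermal) are trained so that features of the same subject from the two spectra land close together in a shared latent space. Architectures, dimensions, and data are assumptions.

```python
# Illustrative sketch only; shapes and the coupling loss are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(feat_dim=128):
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, feat_dim),
    )

vis_enc, thm_enc = make_encoder(), make_encoder()
params = list(vis_enc.parameters()) + list(thm_enc.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

# Placeholder batch of paired visible/thermal face crops of the same subjects.
vis_batch = torch.randn(8, 1, 64, 64)
thm_batch = torch.randn(8, 1, 64, 64)

opt.zero_grad()
z_vis = F.normalize(vis_enc(vis_batch), dim=1)
z_thm = F.normalize(thm_enc(thm_batch), dim=1)

# Coupling loss at the shared latent layer: matched visible/thermal pairs
# should map to nearby points in the common feature space.
coupling_loss = F.mse_loss(z_vis, z_thm)
coupling_loss.backward()
opt.step()

# At test time, a thermal probe feature would be matched against visible
# gallery features (e.g., by cosine similarity) for cross-spectral FR.
```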

Face Recognition for Mobile Passive Video Acquisition at a Distance Scenario

In this scenario we consider a stationary or moving body-worn camera that is observing or monitoring a crowded scene at a distance, as shown in Fig. 3. Our objective is to detect faces and perform face recognition at a distance, either on the sensor or at a remote site. We propose to do this by applying a super-resolution algorithm to the video footage in order to increase the number of pixels on the target, and by identifying key-frames that are then passed to a face detection algorithm to obtain face-chips, as shown in Fig. 3. Once face-chips are extracted, they are matched against a search list on the sensor, or transmitted to a remote dedicated server with a much larger search list to identify each face-chip. This problem is more challenging than the close-up scenario because there are a large number of faces and the resolution on each face may not be sufficient for identification. To improve the video resolution, we propose to use super-resolution algorithms to increase the number of pixels on the face. There may also be multiple body-worn cameras observing the same physical event from different vantage points. We propose to exploit this multi-view scenario to improve face recognition by using multi-view face recognition techniques. In the multi-view scenario, association procedures between the same face-chips extracted from different body-worn cameras also need to be developed.

Fig. 3
Fig. 3.  The first column illustrates a face at different resolutions, the second column shows the face super-resolved by a factor of four, and the third column shows the face super-resolved by a factor of eight.
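As a baseline illustration (not the project's super-resolution algorithm), the sketch below shows an SRCNN-style approach: bicubic upsampling of a low-resolution face chip followed by a small CNN that refines the result, increasing the number of pixels on the face. Shapes and the network are placeholder assumptions.

```python
# A minimal super-resolution sketch under assumed shapes; untrained weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRNet(nn.Module):
    def __init__(self, scale=4):
        super().__init__()
        self.scale = scale
        self.refine = nn.Sequential(
            nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 5, padding=2),
        )

    def forward(self, lr_face):
        # Upsample the low-resolution face chip, then refine it with the CNN.
        up = F.interpolate(lr_face, scale_factor=self.scale, mode="bicubic",
                           align_corners=False)
        return up + self.refine(up)           # residual correction

sr = TinySRNet(scale=4)
low_res_chip = torch.randn(1, 1, 24, 24)      # e.g., a 24x24 face chip
high_res_chip = sr(low_res_chip)              # 4x more pixels per side
print(high_res_chip.shape)                    # torch.Size([1, 1, 96, 96])
```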

Use of Body-worn Video Cameras for Facial and Activity Recognition Used by Law Enforcement

A body-worn camera is a small device clipped to the uniform, or possibly the headgear, of a security guard, agent, soldier, or police officer. It records video of the area in front of the wearer and audio of the surrounding environment. The deployment of body-worn cameras in police forces has gained particular attention in recent years. The footage from body-worn cameras can be used for biometric tasks performed instantly on the sensor (live) or off the sensor (remotely). For example, a live face recognition (FR) algorithm running on a smart body-worn camera can be used in day- or night-time police traffic-stop scenarios. FR can be used by soldiers to identify insurgents as they come across individuals and crowds. Security agents can passively observe crowd gatherings and collect video for real-time or off-line face identification.

However, the video footage from body-worn cameras presents new issues and challenges that do not exist, or have not been addressed, in traditional video FR with hand-held or stationary surveillance devices. Video from a body-worn camera is very shaky because of the rapid body movements of the officer or soldier, who records continuously while performing his or her normal duties and therefore cannot deliberately capture all the relevant activities. The video may sometimes be focused not on the scene at large but on nearby objects. The poor image quality of many body-worn camera videos effectively renders them useless for identifying a person or an action. Video stabilization, background clutter/subtraction, video summarization, selection of representative key-frames, face detection/identification, video anomaly detection, and action recognition are therefore serious problems for wearable-camera biometrics, and they are addressed in this project.

We envision a smart body-worn camera that will capture, detect, and match face images locally on-device against a watch list of suspected personnel stored within the camera system. The device can also transmit key-frames from the video footage, or detect face regions (face-chips) and transmit them to a remote server that can launch a search against much larger offline databases. It can also record video events for post-event analysis, for officer/soldier accountability, or for multi-platform surveillance using video-analytics toolboxes. We believe that enhancing the functions of body-worn cameras in this way can significantly improve the capability, situational awareness, and operational effectiveness of soldiers, police, and security officers in challenging environments. Adding FR and video-analytics functions to smart body-worn cameras will also be very useful for special covert operations and SWAT teams, in addition to war fighters.

Fig. 4
Fig. 4.  Illustrates a smart body-worn camera used by a law enforcement officer at a checkpoint to provide an on-line face recognition capability for identifying individuals arriving at a military gate. The officer's body-worn camera automatically takes a photo of the driver (left image), and the algorithm built into the camera extracts the face region (right image) for identification.
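To make the on-device matching step concrete, here is a hedged sketch of comparing a detected face-chip embedding against a small watch list of enrolled embeddings with a similarity threshold. The embedding model, dimensionality, threshold, and names are all hypothetical; the project's actual matcher is not shown.

```python
# Minimal watch-list matching sketch under assumed 128-D embeddings.
import numpy as np

def match_against_watchlist(probe_embedding, watchlist, threshold=0.6):
    """Return (best_id, score) if the best cosine similarity exceeds the
    threshold, otherwise (None, score)."""
    probe = probe_embedding / np.linalg.norm(probe_embedding)
    names = list(watchlist.keys())
    gallery = np.stack([watchlist[n] for n in names])
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ probe                        # cosine similarities
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return names[best], float(scores[best])
    return None, float(scores[best])

# Hypothetical embeddings standing in for a face recognition network's output.
rng = np.random.default_rng(0)
watchlist = {"subject_A": rng.normal(size=128), "subject_B": rng.normal(size=128)}
probe = watchlist["subject_B"] + 0.05 * rng.normal(size=128)
print(match_against_watchlist(probe, watchlist))
```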

Multi-Sensor Classification for Border Patrol Using Seismic, Acoustic, and PIR Sensors

The objective of this project is to detect and classify different targets (e.g., humans, vehicles, and animals led by humans), where unattended ground sensors (UGS), i.e., seismic, acoustic, and PIR sensors, are used to capture the characteristic target signatures. For example, as a human or an animal moves across the border, the oscillatory motions of the body appendages provide the respective characteristic signatures. The efficacy of UGS systems is often limited by high false-alarm rates because the onboard data-processing algorithms may not be able to correctly discriminate between different types of targets (e.g., humans from animals). Power consumption is a critical consideration in UGS systems; therefore, power-efficient sensing modalities, low-power signal-processing algorithms, and efficient methods for exchanging information between UGS nodes are needed. In the detection and classification problem at hand, the targets usually include humans, vehicles, and animals. Discriminating human footstep signals from other targets and noise sources is a challenging task, because the signal-to-noise ratio of footsteps decreases rapidly with the distance between the sensor and the pedestrian. Furthermore, footstep signals may vary significantly across people and environments. Often the weak, noise-contaminated signatures of humans and light vehicles are not clearly distinguishable from each other, in contrast to heavy vehicles, which radiate loud signatures. In this project we demonstrate the effectiveness of using multiple sensors, rather than a single sensor, for discriminating between human and human-animal footsteps. We propose a nonlinear technique for multi-sensor classification, which relies on sparsely representing a test sample in terms of all the training samples in a feature space induced by a kernel function. Our approach takes into account correlations as well as complementary information between homogeneous/heterogeneous sensors simultaneously, while considering joint sparsity within each sensor's observations in the feature space. This approach can be seen as a generalized model of multitask and multivariate Lasso in the feature space, where data from all the sensors representing the same physical event are jointly represented by a sparse linear combination of the training data. Experiments will be conducted on real data sets (from ARL), and the results will be compared with conventional discriminative classifiers to verify the effectiveness of the proposed methods for automatic border patrol, where it is required to discriminate between human and animal footsteps.

Fig. 5
Fig. 5.  This figure illustrates the use of several disposable sensors, such as acoustic, seismic, passive IR, and ultrasound sensors, referred to as unattended ground sensors (UGS). The data plotted for each sensor represent the signal recorded when a single person is walking nearby, approximately 10-15 meters away. The picture on the right shows the sensor arrangement in the field.
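The sketch below is a greatly simplified stand-in for the joint-sparsity formulation described above: each sensor's test sample is sparsely coded over that sensor's training dictionary, and per-class reconstruction residuals are summed across sensors before choosing the class. The kernel/joint-sparsity machinery is omitted, and all dictionaries, labels, and data are synthetic assumptions.

```python
# Simplified multi-sensor sparse-representation classification sketch.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
classes = ["human", "human_with_animal", "vehicle"]
n_train_per_class, feat_dim = 20, 50
sensors = ["seismic", "acoustic", "pir"]

# One (feat_dim x n_train) dictionary per sensor, columns labeled by class.
D = {s: rng.normal(size=(feat_dim, n_train_per_class * len(classes))) for s in sensors}
labels = np.repeat(np.arange(len(classes)), n_train_per_class)

def classify(test_sample):
    residuals = np.zeros(len(classes))
    for s in sensors:
        lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=5000)
        lasso.fit(D[s], test_sample[s])           # sparse code over the dictionary
        x = lasso.coef_
        for c in range(len(classes)):
            xc = np.where(labels == c, x, 0.0)    # keep only class-c coefficients
            residuals[c] += np.linalg.norm(test_sample[s] - D[s] @ xc)
    return classes[int(np.argmin(residuals))]     # minimum total residual wins

# A noisy copy of training column 5 (a "human" column) in every sensor.
test = {s: D[s][:, 5] + 0.1 * rng.normal(size=feat_dim) for s in sensors}
print(classify(test))                             # typically "human"
```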

Deep Transfer Learning for Automatic Target Classification: MWIR to LWIR

Transfer learning is a powerful tool that can mitigate the divergence across different domains, such as MWIR versus LWIR, through knowledge transfer. Recent research on transfer learning has exploited deep neural network structures for discriminative feature representation to better tackle cross-domain disparity. Cross-domain disparity can be due to differences between the source and target distributions, or to different modalities, such as going from midwave IR (MWIR) to longwave IR (LWIR). However, few of these techniques jointly learn deep features and train a classifier in a unified transfer-learning framework. To this end, we propose a task-driven deep transfer-learning framework for automatic target classification, in which the deep features and the classifier are obtained simultaneously for optimal classification performance. The proposed deep structure can therefore generate more discriminative features by using the classifier performance as a guide, and the classifier performance in turn improves because it is optimized on more discriminative deep features. The developed supervised formulation is a task-driven scheme that provides better learned features for the target classification task. By assigning pseudo-labels to target data using semi-supervised algorithms, we can transfer knowledge from the source (i.e., MWIR) to the target (i.e., LWIR) through the deep structures. Experimental results on a real database of MWIR and LWIR targets demonstrate the superiority of the proposed algorithm compared with competing methods.

Fig. 6
Fig. 6.  Illustration of our proposed (L+1)-layer coupled deep neural network (here L = 2). Two coupled deep structures are built to learn deep features for the source domain X_s and the target domain X_t, respectively, capturing the rich information within the two domains. A weighted reconstruction scheme is adopted to couple the outputs of the two deep networks, H_s^(L) and H_t^(L), where each source sample is reconstructed by target samples with different probabilities. Meanwhile, a classifier is jointly trained on both the labeled source features H_s^(L) and the unlabeled target features H_t^(L) in a semi-supervised fashion.
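As a hedged, simplified sketch of the task-driven idea (not the coupled network of Fig. 6), the code below trains a source (MWIR) encoder and a target (LWIR) encoder with a shared classifier, combining a supervised loss on labeled source data, a crude feature-alignment term, and a pseudo-label term on unlabeled target data. Network sizes, loss weights, and data are assumptions.

```python
# Simplified task-driven transfer-learning sketch with synthetic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, feat_dim = 5, 64
src_enc = nn.Sequential(nn.Linear(100, feat_dim), nn.ReLU())
tgt_enc = nn.Sequential(nn.Linear(100, feat_dim), nn.ReLU())
classifier = nn.Linear(feat_dim, num_classes)
opt = torch.optim.Adam(
    list(src_enc.parameters()) + list(tgt_enc.parameters()) +
    list(classifier.parameters()), lr=1e-3)

# Synthetic stand-ins for MWIR (labeled) and LWIR (unlabeled) feature vectors.
x_src, y_src = torch.randn(32, 100), torch.randint(0, num_classes, (32,))
x_tgt = torch.randn(32, 100)

for _ in range(5):
    opt.zero_grad()
    h_src, h_tgt = src_enc(x_src), tgt_enc(x_tgt)
    cls_loss = F.cross_entropy(classifier(h_src), y_src)       # task-driven term
    couple_loss = F.mse_loss(h_src.mean(0), h_tgt.mean(0))     # crude feature alignment
    pseudo = classifier(h_tgt).argmax(1).detach()              # pseudo-labels for target
    pseudo_loss = F.cross_entropy(classifier(h_tgt), pseudo)
    (cls_loss + 0.1 * couple_loss + 0.1 * pseudo_loss).backward()
    opt.step()
```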

Using LWIR Hyperspectral Imagery for Camouflaged Sniper Detection

Hyperspectral imagery (HSI) provides both spatial and spectral information. In principle, different materials produce different spectral responses. In this project we propose to develop HSI algorithms to build an effective system for camouflaged sniper detection using LWIR HSI cameras. HSI spectral signatures make it possible to detect camouflaged sniper activity in the infrared spectrum. We will develop unique hyperspectral unmixing methods to detect camouflaged targets, extending our previous technique for detecting camouflaged military targets to sniper detection. We will study the performance limits of the proposed algorithm as well as its physical limitations. Several activity scenarios of a camouflaged sniper will be studied, and the effect of the sniper's camouflage type on detection performance will be investigated.

Fig. 7A. Before detection    Fig. 7B. After detection
Fig. 7.  Illustrates the use of hyperspectral imagery for detecting a camouflaged sniper. The first picture shows an image containing a camouflaged sniper; using our hyperspectral algorithm, the camouflaged sniper pixels are detected, as shown in the second picture.
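For orientation, the sketch below shows a standard spectral matched filter, a common HSI detection baseline rather than the unmixing method proposed here: each pixel of the cube is scored against a known target (camouflage) spectral signature using the background statistics. The cube, signature, and threshold are synthetic placeholders.

```python
# Baseline spectral matched-filter sketch on a synthetic LWIR HSI cube.
import numpy as np

rng = np.random.default_rng(2)
rows, cols, bands = 64, 64, 100
cube = rng.normal(size=(rows, cols, bands))          # HSI cube stand-in
target_sig = rng.normal(size=bands)                  # camouflage-material signature

pixels = cube.reshape(-1, bands)
mu = pixels.mean(axis=0)
cov = np.cov(pixels, rowvar=False) + 1e-3 * np.eye(bands)   # regularized covariance
cov_inv = np.linalg.inv(cov)

d = target_sig - mu
w = cov_inv @ d / (d @ cov_inv @ d)                  # matched-filter weights
scores = (pixels - mu) @ w                           # per-pixel detection scores
detection_map = scores.reshape(rows, cols) > 0.5     # threshold chosen arbitrarily
print("detected pixels:", int(detection_map.sum()))
```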

Restoration of Distorted Latent Fingerprints

Automatic fingerprint technology, as deployed in the FBI's NGI system, has become a highly accurate method for identifying individuals. However, challenging problems remain with low-quality latent fingerprints that are unintentionally left by a subject at crime scenes. Latent fingerprints are typically partial, blurred, and noisy; they exhibit poor ridge quality and contain large overlap between the foreground area (the friction ridge pattern) and structured or random noise in the background. In practice, latent fingerprints are analyzed with the help of forensic examiners, who perform a manual latent fingerprint identification procedure. Since this process is time-consuming, forensic experts tend to restrict their matching process to a limited number of suspects. In this project, we propose to improve the quality of latent fingerprints by removing the aforementioned distortions so that standard feature-extraction methods become more reliable.

Fig. 8A. Process
Fig. 8B. Results
Fig. 8.  (A) The block diagram on the top shows our proposed deep conditional generative adversarial network (DCGAN), which restores a fingerprint from its distorted latent input. Our DCGAN consists of a convolutional neural network generator that reconstructs the distorted input latent fingerprint and a CNN-based discriminator that distinguishes between the reconstructed fingerprint and its ground truth; the discriminator helps to improve the performance of the generator. (B) The bottom image shows our preliminary results on several distorted latent fingerprints (top row), the corresponding ground-truth fingerprints (middle row), and the reconstructed fingerprints (bottom row).
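A compact sketch of conditional-GAN-style restoration in the spirit of Fig. 8 is given below: a convolutional generator maps a distorted latent fingerprint to a restored one, and a discriminator judges (input, output) pairs, with an adversarial term plus an L1 reconstruction term for the generator. The architectures, loss weights, and data are placeholder assumptions, not the project's network.

```python
# Minimal conditional-GAN restoration sketch on synthetic fingerprint tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(                                   # generator: distorted -> restored
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Tanh())
D = nn.Sequential(                                   # discriminator on (condition, image) pairs
    nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

distorted = torch.randn(4, 1, 64, 64)                # latent fingerprint stand-ins
clean = torch.randn(4, 1, 64, 64)                    # ground-truth stand-ins

for _ in range(3):
    # Discriminator step: real pairs vs. generated pairs.
    fake = G(distorted).detach()
    d_real = D(torch.cat([distorted, clean], dim=1))
    d_fake = D(torch.cat([distorted, fake], dim=1))
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator and stay close to ground truth.
    fake = G(distorted)
    d_fake = D(torch.cat([distorted, fake], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)) + \
             100.0 * F.l1_loss(fake, clean)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```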

Multimodal Biometrics for Personnel Identification

This project aims to develop a multimodal classification method that uses multiple modalities, such as fingerprint, iris, face, and speech, for personnel identification. The goal is to develop a truly joint classification algorithm that simultaneously uses all the modalities from an individual to make a decision. This approach differs from classical algorithms that extract handcrafted features from each modality and perform feature fusion, or from approaches based on fusing the scores obtained from different modalities. Motivated by the success of deep neural networks for single modalities, we propose a multimodal classifier based on a DNN framework (Fig. 9), which is trained on all modalities simultaneously to perform classification. This algorithm should outperform classical fusion algorithms, since all the modalities are used simultaneously to make the final decision. This project will empower the FBI with a multimodal biometric identifier.

Fig. 9
Fig. 9.  Shows the deep neural network architecture for a multimodal biometric classifier. It consists of several domain-specific deep networks (the first learning stage), each dedicated to a particular input modality and acting as a domain-specific feature extractor; the second learning stage is made of several fusion layers that combine the features from the first stage, followed by a classifier.
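The sketch below illustrates the two-stage structure described in Fig. 9 in miniature: modality-specific networks produce embeddings that are concatenated and passed through fusion layers ending in an identity classifier. Feature dimensions, layer sizes, and data are illustrative assumptions.

```python
# Minimal two-stage multimodal classifier sketch with placeholder dimensions.
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, modal_dims, num_subjects, embed=64):
        super().__init__()
        # Stage 1: one domain-specific feature extractor per modality.
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, embed))
            for name, dim in modal_dims.items()})
        # Stage 2: fusion layers followed by the identity classifier.
        self.fusion = nn.Sequential(
            nn.Linear(embed * len(modal_dims), 128), nn.ReLU(),
            nn.Linear(128, num_subjects))

    def forward(self, inputs):
        feats = [self.encoders[name](x) for name, x in inputs.items()]
        return self.fusion(torch.cat(feats, dim=1))

modal_dims = {"fingerprint": 256, "iris": 256, "face": 512, "speech": 128}
model = MultimodalClassifier(modal_dims, num_subjects=100)
batch = {name: torch.randn(4, dim) for name, dim in modal_dims.items()}
logits = model(batch)                                 # shape: (4, 100)
print(logits.shape)
```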

Facial Attribute-guided Sketch to Photo Synthesizer

In this project, we propose to improve our previously developed sketch-to-photo synthesizer by incorporating the soft-biometric information (facial attributes) provided by an eyewitness into our algorithm. A typical forensic or composite sketch contains a rough spatial topology of the suspect's face and lacks soft complementary information such as gender, ethnicity, and skin and hair color. Furthermore, most previous work on sketch-based photo synthesis has not directly incorporated soft-biometric information into the algorithm. Therefore, in this project a software algorithm will be developed that takes a hand-drawn sketch of a face, together with its associated facial attributes, as input to a deep network and produces a synthesized face photo that can be fed into a face recognition algorithm to identify the sketch against a gallery of mug shots. Our deep network, shown in Fig. 10, is based on a deep convolutional cycle generative adversarial network (CycleGAN), which was developed in a previous project. Our CycleGAN will now be modified to incorporate the soft attributes provided by an eyewitness. In this project we investigate several CycleGAN architectures with different facial attributes to map the whole sketch to a realistic photo. The quality of the synthesized photo will be evaluated with subjective measures as well as by the performance of an off-the-shelf face recognition system. The software can be integrated into the image-processing pipeline of the FBI face recognition program, where incoming sketches can be used to synthesize realistic photo images.

Fig. 10
Fig. 10.  Shows the architecture of our facial attribute-guided sketch-to-photo synthesizer. The top part represents a cCycleGAN network architecture for the sketch-to-photo synthesizer, where G_y is the sketch-to-photo generator and D_a and D_y are the discriminators for the attributes and the photo, respectively. The bottom part represents a cCycleGAN network architecture for the photo-to-sketch synthesizer, where G_x is the photo-to-sketch generator and D_x is the discriminator for the sketch.
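As a hedged sketch of one piece of this idea, the code below shows a conditional generator that consumes a sketch plus a facial-attribute vector broadcast as extra input channels and produces a photo-like image. The full cCycleGAN (cycle consistency and the attribute/photo discriminators) is omitted, and all shapes and attribute encodings are assumptions.

```python
# Attribute-conditioned generator sketch; shapes and attributes are placeholders.
import torch
import torch.nn as nn

class AttributeGuidedGenerator(nn.Module):
    def __init__(self, n_attrs=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + n_attrs, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, sketch, attrs):
        # Broadcast each attribute to a constant spatial map and concatenate
        # it with the sketch, so the generator sees the soft-biometric cues.
        b, _, h, w = sketch.shape
        attr_maps = attrs.view(b, -1, 1, 1).expand(b, attrs.shape[1], h, w)
        return self.net(torch.cat([sketch, attr_maps], dim=1))

G_y = AttributeGuidedGenerator(n_attrs=8)             # sketch -> photo generator
sketch = torch.randn(2, 1, 128, 128)                  # grayscale sketch stand-in
attrs = torch.randint(0, 2, (2, 8)).float()           # binary attribute vector
photo = G_y(sketch, attrs)                            # (2, 3, 128, 128) RGB output
print(photo.shape)
```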

Textured Contact Lens Detection for Mobile Biometrics

Textured contact lenses, also known as cosmetic lenses, obscure the physiological iris texture during image acquisition for biometric recognition. Both transparent (soft) and textured (cosmetic) lenses have previously been shown to degrade the performance of iris recognition algorithms, which is a major problem for law enforcement when acquiring iris images. Many existing techniques detect textured contact lenses using feature-based approaches; most of them extract a particular feature descriptor (SIFT, HOG, LBP, BSIF, LPQ, Gabor, wavelet, Zernike moments) or a combination of descriptors for analyzing iris texture. It is not clear from the literature which feature descriptors provide the best discrimination between a fake and a real iris. The primary goal of this proposal is to develop an algorithm that selects the optimal feature descriptors for robust detection of textured contact lenses in an unconstrained environment (i.e., mobile biometrics) for law enforcement applications. Our hypothesis is that an optimal combination of specific feature descriptors will perform better than any single descriptor. As shown in Fig. 11, we propose a method that solves the problems of feature selection and fusion simultaneously. Our model is composed of two main components: 1) a set of dedicated feature-descriptor neural networks and 2) a fusion neural network that selects the optimal feature descriptors based on a feature group-sparsity constraint. During training of the whole network, the group-sparsity constraint discards non-informative feature descriptors. Each feature descriptor considered in our model is assigned as input to its corresponding feature network. The feature networks are comprised of several MLPs (corresponding to N feature descriptors) and one CNN (an eight-layer VGG-based network). The output of each feature network is a feature-embedding vector sent to a softmax binary classifier (fake or real).

Fig. 11
Fig. 11.  Sparse fusion of multiple feature descriptors for detection. Our model is composed of two main components: 1) a set of feature deep networks (several multi-layer perceptrons and one convolutional neural network) and 2) an MLP-based fusion network. Each feature descriptor considered in our model is assigned as input to its corresponding feature network. The feature networks are comprised of N = 6 MLPs (corresponding to the six features: BSIF, LBP, CoA-LBP, HoG, DAISY, and SID) and one CNN (an eight-layer VGG-based network). The output of each feature network is a feature-embedding vector. After applying the group-sparsity constraint, the feature embeddings are concatenated to form the input to the MLP-based fusion network.
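To illustrate the group-sparsity mechanism in isolation, the sketch below groups the fusion network's first-layer weights by feature descriptor and adds an L2,1 (group-lasso) penalty, which drives whole descriptor groups toward zero so that non-informative descriptors are effectively discarded. The descriptor embeddings here are random stand-ins for the feature-network outputs, and all sizes and weights are assumptions.

```python
# Group-sparsity (L2,1) feature-selection sketch on placeholder embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

descriptors = ["BSIF", "LBP", "CoA-LBP", "HoG", "DAISY", "SID"]
embed_dim = 32                                        # per-descriptor embedding size
fusion = nn.Sequential(
    nn.Linear(embed_dim * len(descriptors), 64), nn.ReLU(),
    nn.Linear(64, 2))                                 # fake vs. real logits
opt = torch.optim.Adam(fusion.parameters(), lr=1e-3)

def group_lasso(weight, n_groups, group_size):
    # Sum of L2 norms of the weight columns belonging to each descriptor group.
    penalty = 0.0
    for g in range(n_groups):
        cols = weight[:, g * group_size:(g + 1) * group_size]
        penalty = penalty + cols.norm(p=2)
    return penalty

x = torch.randn(16, embed_dim * len(descriptors))     # concatenated embeddings
y = torch.randint(0, 2, (16,))                        # fake (0) / real (1) labels

for _ in range(10):
    opt.zero_grad()
    loss = F.cross_entropy(fusion(x), y) + \
           0.01 * group_lasso(fusion[0].weight, len(descriptors), embed_dim)
    loss.backward()
    opt.step()
```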

Face Recognition Using Thermal Polarization for Night Surveillance

Face recognition (FR) in the visible spectrum is sensitive to illumination variations and is not practical for low-light or night-time surveillance. In contrast, thermal imaging is ideal for night-time surveillance and intelligence-gathering operations, but it lacks the textural details that can be obtained from polarimetric information. In this work, we use thermal polarimetric imaging to further improve thermal face recognition performance. A multi-polarimetric dictionary is designed that simultaneously exploits information in the visible and all the polarimetric (Stokes) images to map the image set into a common surrogate feature space (the sparse coefficients, in the sparse-representation framework). A classifier designed in this common feature space, obtained from a gallery of visible images, is used to identify polarimetric image probes for cross-spectral face recognition. The innovation in our approach is the development of a common surrogate feature space via our novel polarimetric dictionary design.

Fig. 12 
Fig. 12.  Shows the signature of a face in the visible spectrum and the different long-wave infrared polarization Stokes faces: S_0 represents the classical LWIR face, S_1 represents the difference between the horizontal and vertical polarization states, S_2 represents the difference between the diagonal polarization states, and DoLP denotes the degree of linear polarization of the face.
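The sketch below is a highly simplified illustration of the "common surrogate feature space" idea: a visible dictionary and a polarimetric dictionary share the same atom ordering, each image is sparsely coded over its own modality's dictionary, and the resulting coefficient vectors are compared directly for cross-spectral matching. The coupled dictionary learning step is not shown, and the dictionaries and images are synthetic assumptions.

```python
# Sparse-coefficient matching sketch with synthetic coupled dictionaries.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_atoms, dim = 40, 200
D_vis = rng.normal(size=(dim, n_atoms))              # visible-spectrum dictionary
D_pol = rng.normal(size=(dim, n_atoms))              # polarimetric (Stokes) dictionary

def sparse_code(dictionary, signal, alpha=0.05):
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(dictionary, signal)
    return model.coef_                               # surrogate feature vector

# A visible gallery face and a polarimetric probe that (by construction)
# share the same underlying sparse code over their respective dictionaries.
true_code = np.zeros(n_atoms)
true_code[[3, 17, 25]] = [1.0, -0.7, 0.5]
gallery_vis = D_vis @ true_code + 0.01 * rng.normal(size=dim)
probe_pol = D_pol @ true_code + 0.01 * rng.normal(size=dim)

a = sparse_code(D_vis, gallery_vis)
b = sparse_code(D_pol, probe_pol)
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
print(f"coefficient-space similarity: {similarity:.3f}")
```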