Research
I am interested in applications of machine learning and data science in computer vision and natural language processing.
My research focuses mainly on the intersection of these areas, with applications such as
video event detection and recounting, image/video description generation, and image/video question answering.
|
|
Consensus-based Sequence Training for Video Captioning
Sang Phan,
Gustav Eje Henter,
Yusuke Miyao,
Shin'ichi Satoh
arXiv Preprint, 2017, [Code]
We propose a Consensus-based Sequence Training (CST) scheme to generate video captions. First, CST performs an RL-like pre-training, but with captions from the training data
replacing model samples. Second, CST applies REINFORCE for fine-tuning using
the consensus (average reward) among training captions
as the baseline estimator. The two stages of CST allow objective mismatch
and exposure bias to be assessed separately, and together
establish a new state-of-the-art on the task.
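As a rough illustration of the fine-tuning stage, the sketch below (in Python, with hypothetical names and toy numbers) shows how a consensus baseline could enter a REINFORCE-style loss: the sampled caption's reward is compared against the average reward of the ground-truth captions for the same video, and the resulting advantage scales the caption's log-probability. This is a simplified illustration, not the released implementation.

```python
import numpy as np

def consensus_baseline_loss(sample_logprob, sample_reward, gt_rewards):
    """REINFORCE-style surrogate loss with a consensus baseline (illustrative sketch).

    sample_logprob : summed log-probability of the sampled caption
    sample_reward  : e.g. CIDEr score of the sampled caption
    gt_rewards     : rewards of the ground-truth captions for the same video
    """
    baseline = np.mean(gt_rewards)          # consensus (average reward) among training captions
    advantage = sample_reward - baseline    # how much the sample beats the consensus
    # Minimizing this pushes up the probability of captions that score
    # above the consensus of the training captions.
    return -advantage * sample_logprob

# Toy usage with made-up numbers
loss = consensus_baseline_loss(sample_logprob=-12.3,
                               sample_reward=0.85,
                               gt_rewards=[0.7, 0.9, 0.8])
print(loss)
```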
|
|
MANet: A Modal Attention Network for Describing Videos
Sang Phan,
Yusuke Miyao,
Shin'ichi Satoh
ACM Multimedia, 2017 (Grand Challenge Paper) -- Honorable Mention Award
We propose a Modal Attention Network (MANet) to learn dynamically weighted combinations of multimodal features (audio, image, motion, and text) for video captioning. Our MANet extends the standard encoder-decoder
network by adapting the attention mechanism to video modalities.
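The sketch below is a minimal, self-contained illustration of the modal attention idea: each modality feature receives a score conditioned on the current decoder state, the scores are softmax-normalized into dynamic weights, and the weighted sum becomes the fused input to the decoder. The dimensions, the bilinear scoring function, and all names are simplifying assumptions, not the exact MANet architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modal_attention(decoder_state, modality_feats, W):
    """Weight per-modality features by attention scores (illustrative only).

    decoder_state  : (d,) current decoder hidden state
    modality_feats : dict of modality name -> (d,) feature vector
                     (audio, image, motion, text), assumed already projected
    W              : (d, d) scoring matrix, a stand-in for learned parameters
    """
    names = list(modality_feats)
    feats = np.stack([modality_feats[n] for n in names])   # (M, d)
    scores = feats @ W @ decoder_state                      # one score per modality
    weights = softmax(scores)                               # dynamic modality weights
    fused = weights @ feats                                  # weighted combination
    return fused, dict(zip(names, weights))

# Toy usage with random vectors standing in for real features
rng = np.random.default_rng(0)
d = 8
feats = {m: rng.normal(size=d) for m in ["audio", "image", "motion", "text"]}
fused, w = modal_attention(rng.normal(size=d), feats, rng.normal(size=(d, d)))
print(w)
```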
|
|
Evaluation of multiple features for violent scenes detection
Vu Lam,
Sang Phan,
Duy-Dinh Le,
Duc Anh Duong,
Shin'ichi Satoh
Multimedia Tools and Applications, 2017 (Journal)
We evaluated the performance of various features in violent scenes detection. The evaluated
features included global and local image features, motion features, audio features, VSD
concept features, and deep learning features. We also compared two popular encoding strategies:
Bag-of-Words and Fisher vector.
|
|
Video Event Detection by Exploiting Word Dependencies from Image Captions
Sang Phan,
Yusuke Miyao,
Duy-Dinh Le,
Shin'ichi Satoh
COLING, 2016 (Oral)
We propose a new approach to obtaining relationships between concepts by exploiting the syntactic dependencies between words in image captions.
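As a toy illustration of the underlying idea, the snippet below uses spaCy (an assumed parser choice, not necessarily the one used in the paper) to turn an image caption into (head, relation, dependent) triples that can serve as candidate concept relationships.

```python
import spacy

# Load a small English pipeline; the parser choice is an assumption
# for illustration only.
nlp = spacy.load("en_core_web_sm")

def caption_dependencies(caption):
    """Extract (head, relation, dependent) triples from a caption."""
    doc = nlp(caption)
    return [(tok.head.text, tok.dep_, tok.text)
            for tok in doc if tok.dep_ not in ("ROOT", "punct")]

print(caption_dependencies("A man is riding a horse on the beach"))
# e.g. [('riding', 'nsubj', 'man'), ('riding', 'dobj', 'horse'), ...]
```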
|
|
Generating Video Description using Sequence-to-sequence Model with Temporal Attention
Natsuda Laokulrat,
Sang Phan,
Noriki Nishida,
Raphael Shu,
Yo Ehara,
Naoaki Okazaki,
Yusuke Miyao,
Hideki Nakayama
COLING, 2016 (Oral), [Code]
We combine a sequence-to-sequence approach with a temporal attention mechanism for video captioning.
|
|
Multimedia Event Detection Using Event-Driven Multiple Instance Learning
Sang Phan,
Duy-Dinh Le,
Shin'ichi Satoh
ACM Multimedia, 2015 (Poster)
We propose to use Event-driven Multiple Instance Learning (EDMIL) to learn the key evidence for event detection. The key evidence is obtained by matching the concepts detected in each video segment against the evidential description of the event.
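A minimal sketch of the instance-labeling idea follows: each video segment (an instance) is scored by how strongly its detected concepts overlap with the event's evidential description, and high-overlap segments are treated as key evidence. The threshold, data layout, and function name are illustrative assumptions rather than the paper's exact procedure.

```python
def key_evidence_instances(segment_concepts, event_evidence, threshold=0.3):
    """Pick segments whose detected concepts match the event description.

    segment_concepts : list of dicts mapping concept name -> detection score,
                       one dict per video segment (instance)
    event_evidence   : set of concept names from the event's evidential description
    Segments above the threshold are treated as key evidence; the rest are
    left as ambiguous instances for the MIL learner.
    """
    key = []
    for i, concepts in enumerate(segment_concepts):
        match = sum(score for name, score in concepts.items()
                    if name in event_evidence)
        if match >= threshold:
            key.append(i)
    return key

# Toy usage for a "birthday party" style event
segments = [{"cake": 0.8, "candles": 0.6}, {"car": 0.7}, {"balloons": 0.5}]
evidence = {"cake", "candles", "balloons", "singing"}
print(key_evidence_instances(segments, evidence))  # -> [0, 2]
```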
|
|
Sum-max Video Pooling for Complex Event Recognition
Sang Phan,
Duy-Dinh Le,
Shin'ichi Satoh
ICIP, 2014 (Poster)
We leverage the layered structure of video to propose a new pooling method, named sum-max video pooling, which combines the advantages of sum pooling and max pooling for video event detection.
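A toy sketch of the idea follows, under one plausible reading of the method: features are sum-pooled within each video segment and the resulting segment vectors are max-pooled into a video-level representation; the exact ordering and normalization in the paper may differ.

```python
import numpy as np

def sum_max_pool(frame_feats, segment_ids):
    """Hierarchical sum-max pooling over a video (illustrative sketch).

    frame_feats : (num_frames, dim) array of per-frame feature vectors
    segment_ids : (num_frames,) array assigning each frame to a segment
    """
    # Sum-pool within each segment...
    segment_vecs = [frame_feats[segment_ids == s].sum(axis=0)
                    for s in np.unique(segment_ids)]
    # ...then max-pool across segments to get the video-level vector.
    return np.max(np.stack(segment_vecs), axis=0)

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2-d features
seg = np.array([0, 0, 1, 1, 2, 2])                 # 3 segments of 2 frames
print(sum_max_pool(feats, seg))                    # -> [18. 20.]
```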
|
|
Multimedia Event Detection Using Segment-Based Approach for Motion Feature
Sang Phan,
Thanh Duc Ngo,
Vu Lam,
Son Tran,
Duy-Dinh Le,
Shin'ichi Satoh
PCM, 2012 (Oral), [Journal version]
We propose a segment-based approach for video representation: original videos are divided into segments for feature
extraction and classification, while evaluation is still performed
at the video level. Experimental results on the
TRECVID Multimedia Event Detection 2010 dataset demonstrate the
effectiveness of our approach.
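The sketch below illustrates the overall pipeline under simplifying assumptions: the video is split into fixed-length segments, each segment is pooled and scored by a (here hypothetical) classifier, and the segment scores are aggregated into a single video-level score used for evaluation.

```python
import numpy as np

def video_score(frame_feats, segment_len, segment_classifier, aggregate=np.max):
    """Segment-based scoring with video-level evaluation (illustrative sketch).

    frame_feats        : (num_frames, dim) per-frame motion features
    segment_len        : number of frames per segment
    segment_classifier : callable mapping a pooled segment feature to a score
                         (a stand-in for the trained event classifier)
    """
    scores = []
    for start in range(0, len(frame_feats), segment_len):
        segment = frame_feats[start:start + segment_len]
        scores.append(segment_classifier(segment.mean(axis=0)))
    # Aggregate segment scores (max here, as one plausible choice) so the
    # final decision is still made at the video level.
    return aggregate(scores)

# Toy usage with random features and a hypothetical linear classifier
rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 16))
w = rng.normal(size=16)
toy_clf = lambda x: float(x @ w)
print(video_score(feats, segment_len=25, segment_classifier=toy_clf))
```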
|
|