Sang Phan

Greetings! I am a project researcher at Satoh Lab, National Institute of Informatics, Tokyo, Japan, where I work on medical imaging with Prof. Shin'ichi Satoh. I have also been working on applications of Computer Vision and Natural Language Processing with Prof. Yusuke Miyao at Miyao Lab.

I did my PhD at SOKENDAI (The Graduate University for Advanced Studies), where I was advised by Prof. Shin'ichi Satoh and Duy-Dinh Le, and funded by the NII Scholarship. Prior to that, I did my master's and bachelor's at the University of Science - Vietnam National University Ho Chi Minh City.

Email  /  CV  /  Biography  /  DBLP  /  Google Scholar  /  LinkedIn

Research

My research focuses on the intersection of computer vision and natural language processing, an area that has drawn growing attention from researchers in both fields. Several exciting applications are emerging from this joint research, such as video event detection and recounting, image/video description generation, and image/video question answering.

Consensus-based Sequence Training for Video Captioning
Sang Phan, Gustav Eje Henter, Yusuke Miyao, Shin'ichi Satoh
arXiv Preprint, 2017, [Code]

We propose a Consensus-based Sequence Training (CST) scheme to generate video captions. First, CST performs an RL-like pre-training, but with captions from the training data replacing model samples. Second, CST applies REINFORCE for fine-tuning using the consensus (average reward) among training captions as the baseline estimator. The two stages of CST allow objective mismatch and exposure bias to be assessed separately, and together establish a new state-of-the-art on the task.
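
As a rough illustration of the fine-tuning stage described above, the sketch below computes a consensus baseline as the average reward among a video's training captions and plugs it into a standard REINFORCE surrogate loss for one sampled caption. The function names and the reward values are hypothetical and chosen only for illustration; the released code may organize this differently.

```python
from typing import List


def consensus_baseline(training_caption_rewards: List[float]) -> float:
    """Average reward among the training captions of one video."""
    return sum(training_caption_rewards) / len(training_caption_rewards)


def reinforce_loss(sample_log_prob: float, sample_reward: float, baseline: float) -> float:
    """REINFORCE surrogate loss for one sampled caption.

    Minimizing this raises the log-probability of samples whose reward
    exceeds the baseline and lowers it for samples that fall below it.
    """
    return -(sample_reward - baseline) * sample_log_prob


# Hypothetical rewards (e.g. a caption metric), for illustration only.
baseline = consensus_baseline([0.6, 0.8, 1.0])
loss = reinforce_loss(sample_log_prob=-2.0, sample_reward=0.9, baseline=baseline)
```

Because the baseline depends only on the training captions, it can be computed once per video before fine-tuning rather than estimated from model samples.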

MANet: A Modal Attention Network for Describing Videos
Sang Phan, Yusuke Miyao, Shin'ichi Satoh
ACM Multimedia, 2017 (Grand Challenge Paper) -- Honorable Mention Award

We propose a Modal Attention Network (MANet) to learn dynamic weighting combinations of multimodal features (audio, image, motion, and text) for video captioning. Our MANet extends the standard encoder-decoder network by adapting the attention mechanism to video modalities.
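
The idea of dynamically weighting modalities can be sketched as a softmax over one relevance score per modality, followed by a weighted sum of the modality feature vectors. This is a minimal sketch under assumptions: the scores are given as plain numbers here, whereas in an encoder-decoder model they would typically be computed from the decoder state, and all names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modal_attention(features: List[List[float]], scores: List[float]) -> List[float]:
    """Combine per-modality feature vectors with attention weights.

    features[i] is the feature vector of modality i (all of equal length);
    scores[i] is its unnormalized relevance score.
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Two hypothetical modalities with equal scores: the result is their mean.
combined = modal_attention([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```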

Evaluation of multiple features for violent scenes detection
Vu Lam, Sang Phan, Duy-Dinh Le, Duc Anh Duong, Shin'ichi Satoh
Multimedia Tools and Applications, 2017 (Journal)

We evaluated the performance of various features in violent scenes detection. The evaluated features included global and local image features, motion features, audio features, VSD concept features, and deep learning features. We also compared two popular encoding strategies: Bag-of-Words and Fisher vector.

Video Event Detection by Exploiting Word Dependencies from Image Captions
Sang Phan, Yusuke Miyao, Duy-Dinh Le, Shin'ichi Satoh
COLING, 2016 (Oral)

We propose a new approach to obtain the relationships between concepts by exploiting the syntactic dependencies between words in image captions.

Generating Video Description using Sequence-to-sequence Model with Temporal Attention
Natsuda Laokulrat, Sang Phan, Noriki Nishida, Raphael Shu, Yo Ehara, Naoaki Okazaki, Yusuke Miyao, Hideki Nakayama
COLING, 2016 (Oral), [Code]

We combine a sequence-to-sequence approach with a temporal attention mechanism for video captioning.

Multimedia Event Detection Using Event-Driven Multiple Instance Learning
Sang Phan, Duy-Dinh Le, Shin'ichi Satoh
ACM Multimedia, 2015 (Poster)

We propose Event-driven Multiple Instance Learning (EDMIL) to learn the key evidence for event detection. The key evidence is obtained by matching the detected concepts against the evidential description of the event.

Sum-max Video Pooling for Complex Event Recognition
Sang Phan, Duy-Dinh Le, Shin'ichi Satoh
ICIP, 2014 (Poster)

We leverage the layered structure of video to propose a new pooling method, named sum-max video pooling, to combine the advantages of sum pooling and max pooling for video event detection.
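
One way to picture such a layered pooling is sketched below. It assumes, purely for illustration, that frame features are sum-pooled within fixed-length segments and the resulting segment vectors are max-pooled across the video; the exact layering and the segment length are assumptions, and the function names are hypothetical.

```python
from typing import List


def sum_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise sum of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def max_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def sum_max_pool(frame_features: List[List[float]], segment_length: int) -> List[float]:
    """Sum-pool frames inside each segment, then max-pool across segments."""
    segments = [
        frame_features[start:start + segment_length]
        for start in range(0, len(frame_features), segment_length)
    ]
    return max_pool([sum_pool(segment) for segment in segments])


# Four hypothetical 2-D frame features, grouped into segments of two frames.
video_vector = sum_max_pool([[1, 0], [2, 1], [0, 5], [1, 1]], segment_length=2)
```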

Multimedia Event Detection Using Segment-Based Approach for Motion Feature
Sang Phan, Thanh Duc Ngo, Vu Lam, Son Tran, Duy-Dinh Le, Shin'ichi Satoh
PCM, 2012 (Oral), [Journal version]

We propose a segment-based approach for video representation. Videos are divided into segments for feature extraction and classification, while evaluation is still performed at the video level. Experimental results on the TRECVID Multimedia Event Detection 2010 dataset demonstrate the effectiveness of our approach.

