Advancing Multi-Modal Deep Learning: Towards Language-Grounded Visual Understanding