Towards Multimodal and Context-Aware Emotion Perception