Towards Generalist Vision-Language Models for Videos in Embodied AI