Towards Grounded Multimodal Enterprise Document Understanding