Architectural and runtime enhancements for dynamically controlled multi-level concurrency on GPUs