Hardware-Algorithm Co-Design for Energy-Efficient and Low-Latency Domain-Specific Machine Learning Systems