It would be useful to include instruction-following benchmarks to explicitly evaluate the instruction-following capability of LLMs. Some candidate datasets are Self-instruct, SuperNaturalInstructions, and Natural Instructions ("Cross-task generalization via natural language crowdsourcing instructions").