Adding instruction-following benchmarks

It would be useful to include instruction-following benchmarks to explicitly evaluate the instruction-following capability of LLMs. Candidate datasets include [Self-instruct](https://arxiv.org/abs/2212.10560), [SuperNaturalInstructions](https://arxiv.org/pdf/2204.07705), and Natural Instructions ("Cross-task generalization via natural language crowdsourcing instructions").