Traditionally, RL algorithms rely on predefined reward functions that provide feedback to the model based on its actions. However, designing an optimal reward function can be challenging, as it requires domain expertise and may not always capture the nuances of complex tasks, especially in case of chatbots. This is where…