Q-learning is a reinforcement learning technique for learning the optimal policy in a Markov decision process. The agent iteratively updates the Q-value of each state-action pair using the Bellman optimality equation until the Q-function converges to the optimal Q-function, Q*. This iterative approach resembles value iteration, but it works from sampled transitions rather than from a known model of the environment. The optimal policy then follows by choosing, in each state, the action with the highest optimal Q-value.
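The iterative update described above can be sketched in a few lines of Python. This is a minimal tabular illustration, not a definitive implementation: the function name `q_update`, the learning rate `alpha`, the discount factor `gamma`, and the `done` flag are assumptions introduced here and are not named in the text above.

```python
def q_update(q, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place; return the new Q(s, a).

    q is a list of rows, one per state, each holding one value per action.
    alpha (learning rate) and gamma (discount factor) are assumed values.
    """
    # Value of the best action in the next state; zero if the episode ended.
    best_next = 0.0 if done else max(q[next_state])
    # Bellman target: immediate reward plus discounted best next-state value.
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical toy problem: 3 states, 2 actions, all Q-values start at zero.
q = [[0.0, 0.0] for _ in range(3)]
first = q_update(q, state=0, action=1, reward=1.0, next_state=1)
```

Repeating this update over many sampled transitions is what moves the table toward Q*, provided every state-action pair keeps being visited and the learning rate is decayed appropriately.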
What is the objective of Q-learning?
In Q-learning, what is the trade-off between exploration and exploitation?
What is an epsilon-greedy strategy?