**Summary** Policy iteration is an iterative algorithm for finding the optimal policy in an MDP. It starts with an arbitrary policy and alternates two steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which constructs a new policy that is greedy with respect to that value function. Policy iteration converges to the optimal policy provided the policy evaluation step converges. The algorithm can be sped up by running only a limited number of sweeps of policy evaluation per iteration; in the extreme case of a single sweep, it reduces to value iteration. Hybrid methods combine value iteration and policy iteration, trading off the cost of each iteration against the accuracy of the intermediate value estimates.
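To make the two steps concrete, here is a minimal sketch in Python (NumPy) of both algorithms on a finite MDP with a transition tensor `P[s, a, s']` and expected rewards `R[s, a]`. The array shapes, function names, and the tiny example MDP at the bottom are illustrative assumptions, not something taken from these notes.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: exact policy evaluation + greedy policy improvement.

    P: transition probabilities, shape (S, A, S) with P[s, a, s'] = Pr(s' | s, a)
    R: expected rewards, shape (S, A)
    Returns (policy, V) where policy[s] is the chosen action in state s.
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy

    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V.
        P_pi = P[np.arange(n_states), policy]     # (S, S) transitions under pi
        R_pi = R[np.arange(n_states), policy]     # (S,)  rewards under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to the action values.
        Q = R + gamma * P @ V                     # (S, A) action values
        new_policy = Q.argmax(axis=1)

        if np.array_equal(new_policy, policy):    # stable policy => optimal
            return policy, V
        policy = new_policy


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration: the special case of one greedy Bellman backup per sweep."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new


# Tiny 2-state, 2-action MDP (made-up numbers, purely to exercise the code).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
print(policy_iteration(P, R))
print(value_iteration(P, R))
```

The only structural difference between the two functions is the evaluation step: policy iteration solves for the current policy's value function exactly before improving, while value iteration performs a single greedy Bellman backup per sweep, which is where the "single iteration of policy evaluation" view of value iteration comes from.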
**Review questions**
1. Which algorithm does not converge to the optimal value function if you only run one iteration of policy evaluation?
2. Which of the following is not an advantage of policy iteration over value iteration?
3. Which of the following is not a step in policy iteration?