A state-action pair $(s_t, a_t)$ is passed through the encoder $\phi$, and the corresponding future state is passed through the encoder $\psi$. The outputs of the encoders are used to compute the similarity score and the intrinsic reward.
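The computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names, the MLP architectures, the embedding size, and the use of a dot product as the similarity score are all assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class StateActionEncoder(nn.Module):
    """Encoder phi: maps a (state, action) pair to an embedding."""

    def __init__(self, state_dim, action_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class FutureStateEncoder(nn.Module):
    """Encoder psi: maps a future state to an embedding."""

    def __init__(self, state_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state):
        return self.net(state)


def intrinsic_reward(phi, psi, state, action, future_state):
    """Similarity between phi(s_t, a_t) and psi of the future state,
    used here as the intrinsic reward (dot-product similarity assumed)."""
    with torch.no_grad():
        z_sa = phi(state, action)        # embedding of the state-action pair
        z_fut = psi(future_state)        # embedding of the future state
        return (z_sa * z_fut).sum(dim=-1)  # per-sample similarity score
```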
Evolution of the C-TeC reward during training. The figure shows how the intrinsic reward changes over the course of training as the set of visited future states grows. The black circle in the lower-left corner marks the starting state. Early in training (3M steps), higher rewards are assigned to states near the start. As training progresses and the agent explores farther, the reward increases for more distant regions. All reward values are normalized for visualization.