Does the mathematical proof in the paper "Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk" support the CPPO code in this project? I do not understand the variables cvarlam and nu. What is the relationship between cvarlam and CVaR, and why can nu be used as the threshold for a bad trajectory? As I understand it, nu is a cumulative reward over all steps of a trajectory, whereas ep_ret + v - r is the reward of a single step. Are nu and ep_ret + v - r comparable?
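To make the question concrete, here is how I understand the threshold in plain CVaR terms. This is only a minimal sketch of my reading, not the project's code: the function name empirical_var_cvar, the sample returns, and the value of alpha are all made up for illustration. In this sketch the VaR plays the role I assume nu has, and it is compared against whole-episode returns, which is why I am unsure about comparing it with ep_ret + v - r.

```python
import math


def empirical_var_cvar(returns, alpha):
    """Lower-tail VaR and CVaR of episode returns at level alpha (0 < alpha < 1).

    Hypothetical helper for illustration only; it is not taken from the project.
    """
    ordered = sorted(returns)                     # worst returns first
    k = max(1, math.ceil(alpha * len(ordered)))   # size of the worst alpha-fraction
    tail = ordered[:k]
    var = tail[-1]                                # threshold: largest return inside the tail
    cvar = sum(tail) / k                          # mean of the worst alpha-fraction
    return var, cvar


# Made-up cumulative returns, one number per complete episode.
episode_returns = [10.0, 12.0, 3.0, 15.0, 1.0, 11.0, 14.0, 13.0]
var, cvar = empirical_var_cvar(episode_returns, alpha=0.25)

# Episodes I would call "bad": whole-episode return at or below the threshold.
bad_episodes = [g for g in episode_returns if g <= var]
print(var, cvar, bad_episodes)
```

Here both sides of the comparison are returns of complete episodes. My question is whether ep_ret + v - r is meant to be an estimate of the same kind of quantity.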