Model training issue with the makespan-optimizing reward function #31
Open
Description
Hi, Hongzi
I noticed your code supports a makespan-optimized policy by setting args.learn_obj to 'makespan'. However, when I train with the recommended small-scale setting (200 streaming jobs on 8 agents) for 3000 episodes, the model does not seem to converge the way it does with the average-JCT objective. The figures below show the actor_loss and average_reward_per_second curves collected during training. The average_reward_per_second stays around -1, which appears to be because the reward equals the negative makespan, i.e. the same total time it is later divided by. Could you suggest any setting I might have missed to get the training to converge?
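To make the "-1" observation concrete, here is a minimal sketch of how I read the makespan reward; the class and variable names below are my own for illustration, not the repository's. Each scheduling step is penalized by the wall-clock time elapsed since the previous step, so the per-episode sum of rewards is the negative makespan, and dividing that sum by the same elapsed time gives roughly -1 by construction:

```python
class MakespanReward:
    """Sketch: accumulate -(elapsed time) at every scheduling event."""

    def __init__(self):
        self.prev_time = 0.0

    def step(self, curr_time):
        # Reward for this step: negative time elapsed since the last event.
        reward = -(curr_time - self.prev_time)
        self.prev_time = curr_time
        return reward


if __name__ == '__main__':
    calc = MakespanReward()
    event_times = [5.0, 12.0, 30.0, 100.0]  # hypothetical scheduling events
    total_reward = sum(calc.step(t) for t in event_times)

    makespan = event_times[-1]
    print(total_reward)             # -100.0 == -makespan
    print(total_reward / makespan)  # -1.0, i.e. why the curve sits at -1
```

If that reading is correct, the average_reward_per_second curve sitting at -1 is expected under the makespan objective and does not by itself indicate whether training is converging.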