Authors: Kateryna Nekhomiazh and Bohdan Naida
This repository hosts our course project for CSC2542: Topics in Knowledge Representation and Reasoning at the University of Toronto.
- Two-Agent OfficeWorld Environment: Based on the PettingZoo library. Explore here.
- RL Agents: Find the agents here.
- Reward Machines: Includes Reward Machine implementation and environment wrappers. More details.
The foundational code for this project, particularly the single-agent OfficeWorld environment and Reward Machine implementation for the single-agent setting, was sourced from Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning by Icarte et al.
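For readers new to Reward Machines, the idea can be sketched as a small finite-state machine whose transitions emit rewards. The sketch below is illustrative only: the class, the state names, and the event labels ("coffee", "office") are assumptions made for this example and are not taken from this repository or from the code by Icarte et al.

```python
class RewardMachine:
    """Finite-state machine whose transitions emit rewards (illustrative sketch)."""

    def __init__(self, initial_state, transitions, terminal_states):
        self.initial_state = initial_state
        # Maps (state, event) -> (next_state, reward).
        self.transitions = dict(transitions)
        self.terminal_states = frozenset(terminal_states)

    def step(self, state, events):
        """Advance on the first matching event; otherwise stay put with zero reward."""
        if state in self.terminal_states:
            return state, 0.0
        for event in sorted(events):  # sorted for deterministic behaviour
            key = (state, event)
            if key in self.transitions:
                return self.transitions[key]
        return state, 0.0

    def is_terminal(self, state):
        return state in self.terminal_states


# Hypothetical "deliver coffee to the office" task.
coffee_rm = RewardMachine(
    initial_state="no_coffee",
    transitions={
        ("no_coffee", "coffee"): ("has_coffee", 0.0),
        ("has_coffee", "office"): ("delivered", 1.0),
    },
    terminal_states={"delivered"},
)


def run(rm, trace):
    """Feed a sequence of event sets to the machine and sum the rewards."""
    state, total = rm.initial_state, 0.0
    for events in trace:
        state, reward = rm.step(state, events)
        total += reward
    return state, total
```

In this sketch, visiting the office before picking up coffee yields nothing, because the machine only rewards the "office" event once it is in the "has_coffee" state.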
- Clone the repository:
git clone
- Install dependencies:
pip install -r requirements.txt
To play the human version of the game, run:
To modify map or objects placement, refer to
, to modify Reward Machines type, refer to
. Make sure to choose corresponding options.
Note: MAP_2 RMs are the RMs for maps with decorations, and MAP_3 RMs are the RMs for maps without decorations. The map numbers in MapCollection are unrelated to MAP_2 and MAP_3 in the RM files; this naming will be changed in the future.
To run the experiments, run:
Training configurations are in
To evaluate policies against other agents, run:
Evaluation configurations are in
- predator_prey: bool, True - predator-prey option (agent 2's task is to catch agent 1 while agent 1 delivers coffee to the office), False - both agents' task is to deliver coffee to the office
- map_type: MapType.SIMPLIFIED or MapType.BASE, simplified - 6 by 9, base - 12 by 9
- use_crms: (bool, bool), True - add CRM, False - do not add CRM
- can_be_in_same_cell: bool, True - agents can be in the same cell, False - they cannot
- coffee_type: CoffeeType.SINGLE or CoffeeType.UNLIMITED, single - one coffee per coffee machine
- agent_types = (AgentType, AgentType), each one of AgentType.MINMAX, AgentType.QLEARNING or AgentType.RANDOM
- total_timesteps: int, total number of steps
- max_episode_length: int, upper bound on the episode length (in timesteps)
- print_freq: int, how often progress is printed/logged
- q_init: float
- learning_rate: float
- discount_factor: float
- exploration_rate: float
- exploration_decay_after: ExplorationDecay.EPISODE or ExplorationDecay.STEP, specifies when the exploration rate decays: either after each episode (ExplorationDecay.EPISODE) or after each step (ExplorationDecay.STEP)
- n_episodes_for_decay: int, the number of episodes over which the exploration rate is decayed exponentially. If exploration_decay_after is set to ExplorationDecay.STEP, the exploration rate is instead decayed exponentially over 70% of total_timesteps
- group: str, used to group runs in wandb
- details: str, details of the training, captured in the .log file name and in the run name in wandb
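The two exploration decay modes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the enum values, the function names, and in particular the final_rate floor are hypothetical, since the configuration list does not say what value the exploration rate decays towards.

```python
from enum import Enum


class ExplorationDecay(Enum):
    # Mirrors the two options named above; the project's enum may differ.
    EPISODE = "episode"
    STEP = "step"


def decay_horizon(mode, n_episodes_for_decay, total_timesteps):
    """Number of decay ticks: episodes in EPISODE mode, 70% of all steps in STEP mode."""
    if mode is ExplorationDecay.STEP:
        return (7 * total_timesteps) // 10  # integer arithmetic for 70%
    return n_episodes_for_decay


def exploration_rate_at(tick, initial_rate, final_rate, horizon):
    """Exponentially interpolate from initial_rate down to final_rate.

    `tick` counts episodes or steps, depending on the mode.
    `final_rate` is an assumed floor (hypothetical parameter).
    """
    if horizon <= 0 or tick >= horizon:
        return final_rate
    return initial_rate * (final_rate / initial_rate) ** (tick / horizon)
```

For example, with total_timesteps=1000 in STEP mode the horizon is 700 steps, and the rate is held at final_rate for the remaining 300 steps.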