[FEAT]: Unify evaluation prompt and episode rendering for human readers #164
Open
Description
Human readers could potentially serve as judges. However, the discrepancy between how the evaluation prompt is composed and how episodes are rendered for human readers makes this difficult.
Additional Information
I am specifically talking about:
```python
class EpisodeLog(JsonModel):
    ...
    def render_for_humans(self) -> tuple[list[AgentProfile], list[str]]:
        ...
```
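For context, this method is what a human judge would read. A minimal usage sketch, assuming `EpisodeLog` is importable from `sotopia.database` and that an episode is looked up via redis_om's `JsonModel.get()` (the primary key below is hypothetical):

```python
from sotopia.database import EpisodeLog

# Hypothetical primary key; retrieval via redis_om's JsonModel.get().
episode = EpisodeLog.get("example-episode-pk")
agent_profiles, rendered_messages = episode.render_for_humans()

# A human judge sees whatever formatting render_for_humans() applied.
for line in rendered_messages:
    print(line)
```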
And there is a discrepancy with how the evaluation prompt is composed:
```python
@gin.configurable
@beartype
async def __acall__(
    self,
    turn_number: int,
    messages: list[tuple[str, Message]] | None,
    history: str = "",
    temperature: float = 0.0,
) -> list[tuple[str, tuple[tuple[str, int | float | bool], str]]]:
    # filter did nothing
    if not history and messages:
        messages_filtered = [
            (x, y)
            for x, y in messages
            if "did nothing" not in y.to_natural_language()
        ]
        history = "\n".join(
            [
                (
                    f"{x} {y.to_natural_language()}"
                    if x != "Environment"
                    else y.to_natural_language()
                )
                for x, y in messages_filtered
            ]
        )
```
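One way to unify the two paths would be a single rendering helper that both `render_for_humans` and the evaluator's `__acall__` call into, so the same filtering and speaker-prefix rules apply everywhere. A minimal sketch, not an existing sotopia API; the `Message` import path and helper name are assumptions:

```python
from sotopia.messages import Message  # assumed import path


def render_messages(messages: list[tuple[str, Message]]) -> list[str]:
    """Render (speaker, message) pairs with one shared set of rules."""
    rendered: list[str] = []
    for speaker, message in messages:
        text = message.to_natural_language()
        # Same "did nothing" filter the evaluator applies today.
        if "did nothing" in text:
            continue
        # Same speaker-prefix convention as the evaluation prompt.
        rendered.append(text if speaker == "Environment" else f"{speaker} {text}")
    return rendered
```

With such a helper, `__acall__` could build `history = "\n".join(render_messages(messages))` and `render_for_humans` could return the same list, so the evaluation prompt and the human-readable episode never drift apart.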