[FEAT]: Unify evaluation prompt and episode rendering for human readers #164

Open
@XuhuiZhou

Description

Human readers have the potential to serve as judges. However, the discrepancy between how the evaluation prompt is composed and how episodes are rendered for human readers gets in the way of that.

Additional Information

I am specifically talking about:

class EpisodeLog(JsonModel):
    ...

    def render_for_humans(self) -> tuple[list[AgentProfile], list[str]]:
        ...

This differs from how the evaluation prompt is composed:

    @gin.configurable
    @beartype
    async def __acall__(
        self,
        turn_number: int,
        messages: list[tuple[str, Message]] | None,
        history: str = "",
        temperature: float = 0.0,
    ) -> list[tuple[str, tuple[tuple[str, int | float | bool], str]]]:
        # Filter out turns where an agent "did nothing".
        if not history and messages:
            messages_filtered = [
                (x, y)
                for x, y in messages
                if "did nothing" not in y.to_natural_language()
            ]
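            # Join the remaining turns into a single history string;
            # environment messages are included without a speaker prefix.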
            history = "\n".join(
                [
                    (
                        f"{x} {y.to_natural_language()}"
                        if x != "Environment"
                        else y.to_natural_language()
                    )
                    for x, y in messages_filtered
                ]
            )
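
One way to unify the two would be a single shared helper that both render_for_humans and the evaluator's __acall__ call, so human judges read exactly the text the LLM evaluator is scored on. Below is a minimal sketch, not a proposal for sotopia's actual API: render_history and NaturalLanguageMessage are hypothetical names, and the only assumption about the message classes is that they expose to_natural_language(), as in the snippet above.

from typing import Protocol


class NaturalLanguageMessage(Protocol):
    # Hypothetical structural type: anything with to_natural_language().
    def to_natural_language(self) -> str: ...


def render_history(
    messages: list[tuple[str, NaturalLanguageMessage]],
    drop_did_nothing: bool = True,
) -> str:
    """Render (sender, message) pairs into one history string.

    Both the evaluation prompt and render_for_humans could build on this,
    keeping what humans read in sync with what the LLM evaluator sees.
    """
    lines: list[str] = []
    for sender, message in messages:
        text = message.to_natural_language()
        # Mirror the evaluator's filtering of "did nothing" turns.
        if drop_did_nothing and "did nothing" in text:
            continue
        # Environment messages are shown without a speaker prefix,
        # matching the evaluation-prompt composition above.
        lines.append(text if sender == "Environment" else f"{sender} {text}")
    return "\n".join(lines)

With something like this, __acall__ could set history = render_history(messages), and render_for_humans could return the same lines (collected before joining, or split on newlines), so the two views cannot drift apart.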
