[FEAT]: Unify evaluation prompt and episode rendering for human readers #164
Open
Description
Human readers could potentially serve as judges. However, the discrepancy between how the evaluation prompt is composed and how episodes are rendered for human readers makes this difficult.
Additional Information
I am specifically talking about:
```python
class EpisodeLog(JsonModel):
    ...
    def render_for_humans(self) -> tuple[list[AgentProfile], list[str]]:
        ...
```
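For context, this method is what a human judge would read. A minimal usage sketch, assuming `EpisodeLog` is importable from `sotopia.database` and that an episode is looked up via redis_om's `JsonModel.get()` (the primary key below is hypothetical):

```python
from sotopia.database import EpisodeLog

# Hypothetical primary key; retrieval via redis_om's JsonModel.get().
episode = EpisodeLog.get("example-episode-pk")
agent_profiles, rendered_messages = episode.render_for_humans()

# A human judge sees whatever formatting render_for_humans() applied.
for line in rendered_messages:
    print(line)
```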
And there is a discrepancy with how the evaluation prompt is composed:
```python
@gin.configurable
@beartype
async def __acall__(
    self,
    turn_number: int,
    messages: list[tuple[str, Message]] | None,
    history: str = "",
    temperature: float = 0.0,
) -> list[tuple[str, tuple[tuple[str, int | float | bool], str]]]:
    # filter did nothing
    if not history and messages:
        messages_filtered = [
            (x, y)
            for x, y in messages
            if "did nothing" not in y.to_natural_language()
        ]
        history = "\n".join(
            [
                (
                    f"{x} {y.to_natural_language()}"
                    if x != "Environment"
                    else y.to_natural_language()
                )
                for x, y in messages_filtered
            ]
        )
```
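One way to unify the two paths would be a single rendering helper that both `render_for_humans` and the evaluator's `__acall__` call into, so the same filtering and speaker-prefix rules apply everywhere. A minimal sketch, not an existing sotopia API; the `Message` import path and helper name are assumptions:

```python
from sotopia.messages import Message  # assumed import path


def render_messages(messages: list[tuple[str, Message]]) -> list[str]:
    """Render (speaker, message) pairs with one shared set of rules."""
    rendered: list[str] = []
    for speaker, message in messages:
        text = message.to_natural_language()
        # Same "did nothing" filter the evaluator applies today.
        if "did nothing" in text:
            continue
        # Same speaker-prefix convention as the evaluation prompt.
        rendered.append(text if speaker == "Environment" else f"{speaker} {text}")
    return rendered
```

With such a helper, `__acall__` could build `history = "\n".join(render_messages(messages))` and `render_for_humans` could return the same list, so the evaluation prompt and the human-readable episode never drift apart.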