WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting (2024)


We create particularly challenging task templates by combining actions across multiple domains. Figure 4 shows the distribution of the number of actions required to complete each task in our database. Many tasks require multiple actions, and some require up to 12. Here is an example of a complex multi-domain task:

If our website page views fell by more than 10% in the past week, schedule a 30-minute meeting with Sam called "Urgent Analytics Update" at the earliest free time tomorrow. Otherwise email them saying "Site traffic was stable the past week, nice work."

We can define a simple function to automatically find the correct outcome. However, the agent must combine Analytics, Calendar and Email tools to complete this task. Appendix A.2 provides more examples of tasks.
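As an illustration of how such a ground-truth function might look for the example task above, here is a minimal sketch. It is not the authors' implementation: the function names, the argument names passed to `create_event` and `send_email`, and the 09:00 to 18:00 working day are all assumptions made for this example.

```python
from datetime import datetime, timedelta

def earliest_free_slot(busy, day_start, day_end, duration):
    """Return the first gap of at least `duration` between busy intervals, or None."""
    cursor = day_start
    for start, end in sorted(busy):
        # Only count free time that lies inside the working day.
        if min(start, day_end) - cursor >= duration:
            return cursor
        cursor = max(cursor, end)
    return cursor if day_end - cursor >= duration else None

def expected_outcome(views_prev_week, views_past_week, busy, day_start, day_end):
    """Ground-truth action for the example task, as a (tool_name, arguments) pair.

    The argument names below are hypothetical, not the benchmark's real parameters.
    """
    change = (views_past_week - views_prev_week) / views_prev_week
    if change < -0.10:  # page views fell by more than 10%
        start = earliest_free_slot(busy, day_start, day_end, timedelta(minutes=30))
        return ("create_event", {"name": "Urgent Analytics Update",
                                 "participant": "Sam",
                                 "start": start,
                                 "duration_minutes": 30})
    return ("send_email", {"recipient": "Sam",
                           "body": "Site traffic was stable the past week, nice work."})

# "Tomorrow" in this toy example, with two existing meetings in Sam's calendar.
day = datetime(2023, 12, 1)
busy = [(day.replace(hour=9), day.replace(hour=10)),
        (day.replace(hour=10, minute=15), day.replace(hour=11))]
tool, args = expected_outcome(1000, 880, busy,
                              day.replace(hour=9), day.replace(hour=18))
```

With a 12% drop in page views the sketch selects `create_event`; the 15-minute gap at 10:00 is too short, so the meeting starts at 11:00. With a drop of 10% or less it selects `send_email` instead.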

3.3 Task Execution

Agents execute tasks using 26 tools across the five domains, which are summarised in Table 4. Each tool consists of a function that interacts with the sandbox databases, and documentation (a docstring) showing the agent how to use the tool. Each docstring contains a high-level description, parameters, return values, an example of tool usage, and any limits. For example, search_events(start_time, end_time) returns a maximum of five events. If the agent needs to find a large number of events, it must search multiple times with different time windows and then concatenate the results. Appendix A.5 contains the full docstring for each tool.
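The windowed search described above can be sketched as follows. This is a minimal illustration under stated assumptions: `make_search_events` builds a stand-in for the real tool over an in-memory list, and the use of `datetime` arguments, half-open windows and the halving strategy are choices made for this example, not details taken from the benchmark.

```python
from datetime import datetime, timedelta

MAX_RESULTS = 5  # the result cap described in the tool's docstring

def make_search_events(events):
    """Build a stand-in for the search_events tool (hypothetical implementation)."""
    def search_events(start_time, end_time):
        hits = [e for e in events if start_time <= e["start"] < end_time]
        return hits[:MAX_RESULTS]  # silently truncate, like a capped tool
    return search_events

def search_all_events(search_events, start_time, end_time,
                      min_window=timedelta(minutes=1)):
    """Collect every event in [start_time, end_time) despite the result cap.

    If a window returns a full page, it may have been truncated, so the window
    is split in half and each half is searched separately.
    """
    hits = search_events(start_time, end_time)
    if len(hits) < MAX_RESULTS or end_time - start_time <= min_window:
        return hits
    mid = start_time + (end_time - start_time) / 2
    return (search_all_events(search_events, start_time, mid, min_window)
            + search_all_events(search_events, mid, end_time, min_window))

# Twelve hourly events on one day: more than a single call can return.
first = datetime(2023, 11, 20, 8)
events = [{"id": i, "start": first + timedelta(hours=i)} for i in range(12)]
search_events = make_search_events(events)

day_start = datetime(2023, 11, 20)
day_end = datetime(2023, 11, 21)
single_call = search_events(day_start, day_end)
all_events = search_all_events(search_events, day_start, day_end)
```

Here a single call returns only five of the twelve events, while the windowed search recovers all of them. Half-open windows keep an event on a boundary from being counted twice when the results are concatenated.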

Email	Calendar	Web Analytics	CRM	Projects
get_email_info	get_event_info	get_visitor_info	get_customer_info	get_task_info
search_emails	search_events	count_traffic_source	search_customers	search_tasks
send_email	create_event	count_engaged_users	update_customer	create_task
delete_email	delete_event	count_total_visits	add_customer	delete_task
forward_email	update_event	average_visit_duration	delete_customer	update_task
reply_email		create_plot

3.4 Outcome-Centric Evaluation

Figure 3 shows that the correct outcome, which is the ground truth, is always known. We then evaluate whether the outcome resulting from the agent’s actions matches the ground truth. We call this methodology outcome-centric evaluation. Figure 5 compares our outcome-centric evaluation against prior works, which evaluate the agent’s function calls.

An agent can follow any action path, provided the resulting sandbox databases match the ground truth outcome. For example, the agent sometimes makes an error, recovers from it, and then takes the correct action:

Task:
Make a task on the Front end board for Sam to improve conversion.
Ground truth tool use:
create_task(name="improve conversion", board="Front end", assigned_to="Sam")
Agent’s tool use:
create_task(name="improve conversion", board="Front End", assigned_to="Sam")
Observation: "‘Front End’ board does not exist, but ‘Front end’ does..."
create_task(name="improve conversion", board="Front end", assigned_to="Sam")

Evaluation methods based on matching the function calls could unfairly find the agent had failed due to the extra calls. However, outcome-centric evaluation recognises that the agent was able to recover because the final change in state matches the ground truth outcome. As a result, the agent is not unfairly penalised.
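The difference between the two evaluation styles can be sketched with a toy sandbox. Everything below is an assumption made for illustration: the in-memory `db` dictionary, the set of valid boards, and the helper names are hypothetical and are not the benchmark's code.

```python
import copy

VALID_BOARDS = {"Front end", "Back end"}  # hypothetical board names

def create_task(db, name, board, assigned_to):
    """Toy version of the tool: reject unknown boards, otherwise add a task."""
    if board not in VALID_BOARDS:
        return f"'{board}' board does not exist"
    db["tasks"].append({"name": name, "board": board, "assigned_to": assigned_to})
    return "task created"

def run(calls, initial_db):
    """Apply a list of create_task calls to a copy of the sandbox and return it."""
    db = copy.deepcopy(initial_db)
    for kwargs in calls:
        create_task(db, **kwargs)
    return db

def outcome_match(agent_calls, truth_calls, initial_db):
    """Outcome-centric check: compare final sandbox states, not call sequences."""
    return run(agent_calls, initial_db) == run(truth_calls, initial_db)

def call_match(agent_calls, truth_calls):
    """Function-call check: the call sequences must be identical."""
    return agent_calls == truth_calls

truth = [{"name": "improve conversion", "board": "Front end", "assigned_to": "Sam"}]
agent = [{"name": "improve conversion", "board": "Front End", "assigned_to": "Sam"},
         {"name": "improve conversion", "board": "Front end", "assigned_to": "Sam"}]
initial = {"tasks": []}

by_outcome = outcome_match(agent, truth, initial)
by_calls = call_match(agent, truth)
```

In this sketch the failed first call leaves the sandbox unchanged, so the outcome check passes while the call-sequence check fails on the extra call.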

Some prior benchmarks such as Gaia (Mialon et al., 2023) and ToolQA (Zhuang et al., 2023) also have tasks with unique outcomes, but they are limited to information retrieval. WorkBench is the first dataset to evaluate tasks that require actions in this manner, due to our outcome-centric evaluation methodology.


4 Results

We assess the performance of LLM agents using the ReAct framework (Yao et al., 2023). This enables the LLM to perform multiple action steps and update its action plan based on results from previous steps.
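To make the act-then-observe loop concrete, here is a minimal sketch in which a scripted policy stands in for the LLM. It is a simplification rather than the ReAct framework itself: `react_loop`, `scripted_policy` and the single-argument `create_task` are hypothetical names introduced for this example, and the scenario reuses the board-name recovery example from Section 3.4.

```python
def react_loop(policy, tools, task, max_steps=5):
    """Alternate between choosing an action and observing its result."""
    history = []  # list of (action, args, observation) tuples
    for _ in range(max_steps):
        action, args = policy(task, history)
        if action == "finish":
            break
        observation = tools[action](**args)
        history.append((action, args, observation))
    return history

def scripted_policy(task, history):
    """Hypothetical stand-in for the LLM: retry after reading the error message."""
    if not history:
        return "create_task", {"board": "Front End"}
    last_observation = history[-1][2]
    if "does not exist" in last_observation:
        return "create_task", {"board": "Front end"}
    return "finish", {}

def create_task(board):
    """Toy tool that only accepts the exact board name."""
    if board != "Front end":
        return f"'{board}' board does not exist, but 'Front end' does"
    return "task created"

trace = react_loop(scripted_policy, {"create_task": create_task},
                   "Make a task on the Front end board for Sam to improve conversion.")
```

The trace contains two steps: the first call fails, and the policy uses that observation to choose a corrected second call before finishing.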

4.1 Performance Metrics

Our primary metric is accuracy. This is the percentage of tasks where the outcome of the agent’s actions matches the expected outcome, which is the ground truth.

Our secondary metric is side effects (this computer science term refers to a program altering variables outside its local environment). Some tools have negative consequences if used incorrectly, such as sending emails to the wrong person. If the agent’s actions modify the sandbox databases in a way that does not match the ground truth exactly, we consider this a side effect. If the agent fails to complete the task, but does not alter the sandbox databases, then there are no side effects.
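The two metrics can be computed from three sandbox snapshots per task, as in the minimal sketch below. The record layout (`initial`, `final`, `truth`) and the function name `score` are assumptions made for this example; the snapshots are toy lists of sent emails.

```python
def score(results):
    """Compute accuracy and side-effect rate from per-task sandbox snapshots.

    Each result holds the sandbox state before the task ("initial"), after the
    agent's actions ("final"), and the ground truth outcome ("truth").
    """
    n = len(results)
    correct = sum(r["final"] == r["truth"] for r in results)
    # A side effect: the sandbox changed, but not into the ground truth state.
    side_effects = sum(r["final"] != r["initial"] and r["final"] != r["truth"]
                       for r in results)
    return {"accuracy": correct / n, "side_effects": side_effects / n}

results = [
    # Correct: the agent sent the expected email.
    {"initial": [], "final": ["email to Sam"], "truth": ["email to Sam"]},
    # Failed without touching the sandbox: wrong, but no side effect.
    {"initial": [], "final": [], "truth": ["email to Sam"]},
    # Failed and sent an email to the wrong person: a side effect.
    {"initial": [], "final": ["email to Alex"], "truth": ["email to Sam"]},
    # Correct: no action was required and none was taken.
    {"initial": [], "final": [], "truth": []},
]
metrics = score(results)
```

For these four toy tasks the sketch gives an accuracy of 0.5 and a side-effect rate of 0.25.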

4.2 Comparing Large Language Models

Table 5 compares five LLMs: GPT-3.5 (Brown et al., 2020), GPT-4 (OpenAI et al., 2023), Claude-2 (Anthropic, 2023), Llama2-70B (Touvron et al., 2023) and Mixtral-8x7B (Jiang et al., 2024). GPT-4 greatly outperforms other models. For the worse-performing models, the main errors are insufficient context window length and failing to follow the ReAct framework.

	GPT-4	GPT-3.5	Claude-2	Llama2-70B	Mixtral-8x7B
Accuracy (required tools)	49%	14%	23%	3%	20%
Accuracy (all tools)	43%	0%	26%	0%	16%

Our benchmark is challenging for all models, including GPT-4. Given how poorly other models perform, we restrict further analysis to GPT-4. When we give this agent all 26 tools, rather than just the required toolkits, accuracy falls from 49% to 43%. This suggests the agent is negatively affected by redundant tools, which is consistent with prior studies (Hao et al., 2023). The next sections explore in further depth why the GPT-4 agent fails.

4.3 Performance across domains

Table 4.3 compares the GPT-4 agent’s performance across our five individual domains and on tasks that require tools from multiple domains. Accuracy varies from 23% on CRM tasks to 65% on Calendar tasks. The agent is capable of combining tools across multiple domains: its accuracy on these tasks (40%) is similar to its average accuracy on single-domain tasks (43%).

	Analytics	Calendar	CRM	Email	Project Management	Multi-Domain
Number of tasks	120	110	80	90	80	210
Accuracy (↑)	39%	65%	23%	48%	39%	40%
Side Effects (↓)	54%	22%	6%	6%	4%	29%
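The single-domain average quoted in the text can be checked against the table with a few lines of arithmetic. The sketch below is only a consistency check on the rounded percentages shown above; the text does not state whether the average is weighted by the number of tasks, so both versions are computed.

```python
# Values copied from the table above (accuracy as rounded percentages).
tasks = {"Analytics": 120, "Calendar": 110, "CRM": 80,
         "Email": 90, "Project Management": 80, "Multi-Domain": 210}
accuracy = {"Analytics": 39, "Calendar": 65, "CRM": 23,
            "Email": 48, "Project Management": 39, "Multi-Domain": 40}

single = [d for d in tasks if d != "Multi-Domain"]

# Unweighted mean: every single-domain category counts equally.
unweighted = sum(accuracy[d] for d in single) / len(single)

# Task-weighted mean: every single-domain task counts equally.
weighted = (sum(accuracy[d] * tasks[d] for d in single)
            / sum(tasks[d] for d in single))

# Task-weighted mean over all tasks, including multi-domain ones.
total_tasks = sum(tasks.values())
overall = sum(accuracy[d] * tasks[d] for d in tasks) / total_tasks
```

Under these assumptions the unweighted single-domain mean is 42.8%, which rounds to the 43% quoted in the text, while the task-weighted mean is about 44.0%. The task-weighted mean over all 690 tasks is about 42.8%. Because the inputs are already rounded, these figures are approximate.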
