Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions (#2) · Issues · Nannette Odriscoll / h-2meta · GitLab

Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, regardless of not supporting tool use natively, and I was rather impressed by preliminary outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only prepares the actions but also formulates the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outshines Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% right, and 35.237.164.2 other designs by an even larger margin:

The experiment followed design usage guidelines from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was utilized). You can discover more assessment details here.

Approach

DeepSeek-R1's strong coding abilities enable it to act as a representative without being explicitly trained for . By allowing the design to create actions as Python code, it can flexibly connect with environments through code execution.

Tools are carried out as Python code that is consisted of straight in the prompt. This can be a simple function definition or fakenews.win a module of a larger package - any valid Python code. The design then generates code actions that call these tools.

Results from performing these actions feed back to the model as follow-up messages, driving the next actions till a final answer is reached. The agent structure is an easy iterative coding loop that moderates the discussion in between the model and its environment.

Conversations

DeepSeek-R1 is used as chat model in my experiment, where the design autonomously pulls additional context from its environment by using tools e.g. by utilizing a search engine or fetching information from web pages. This drives the conversation with the environment that continues till a last answer is reached.

On the other hand, o1 designs are understood to perform improperly when utilized as chat models i.e. they do not try to pull context throughout a conversation. According to the connected short article, o1 models perform best when they have the complete context available, with clear guidelines on what to do with it.

Initially, I likewise attempted a complete context in a single timely method at each action (with arise from previous actions included), however this led to substantially lower scores on the GAIA subset. Switching to the conversational approach explained above, surgiteams.com I had the ability to reach the reported 65.6% efficiency.

This raises a fascinating concern about the claim that o1 isn't a chat model - possibly this observation was more pertinent to older o1 models that did not have tool usage abilities? After all, isn't tool usage support an important mechanism for making it possible for models to pull extra context from their environment? This conversational technique certainly appears efficient for DeepSeek-R1, yogicentral.science though I still need to carry out similar experiments with o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, forum.batman.gainedge.org it is remarkable that generalization to agentic jobs with tool usage through code actions works so well. This ability to generalize to agentic jobs reminds of recent research by DeepMind that shows that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't investigated because work.

Despite its capability to generalize to tool use, opentx.cz DeepSeek-R1 frequently produces long thinking traces at each action, compared to other designs in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks in some cases take a long period of time to complete. Further RL on agentic tool use, be it via code actions or not, could be one option to improve performance.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning model often switches in between different reasoning ideas without sufficiently checking out promising paths to reach a proper solution. This was a major factor for extremely long reasoning traces produced by DeepSeek-R1. This can be seen in the recorded traces that are available for download.

Future experiments

Another typical application of reasoning models is to use them for preparing just, while using other designs for creating code actions. This could be a potential new function of freeact, if this separation of roles shows helpful for more complex jobs.

I'm likewise curious about how reasoning models that already support tool usage (like o1, o3, ...) carry out in a single-agent setup, surgiteams.com with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise utilizes code actions, look interesting.