Prompt Generation Guide
In the architecture of Large Language Model (LLM)-based Reinforcement Learning Agents, the Prompt serves as the sole medium for LLMs to interact with the environment. Unlike traditional agents that directly receive numerical states or output discrete action IDs, LLMs "perceive" the environment (observations) and "express" their decisions (actions) through prompts in text format.
Core Concepts
In our framework, the generation of prompts adheres to several key principles:
- LLM Input is Text: Whether the environment's original observation is an image, a grid, or another structure, it will ultimately be converted into a text format that LLMs can understand.
- Prompts are Dynamic and Contextual: A prompt is not merely the current environmental observation; it also includes historical dialogue, previous actions, received rewards, and other information, forming a coherent conversational context.
- Prompts Follow a Structured Conversational Format: Prompts typically follow the LLM's chat template (e.g., System/User/Assistant roles) to help the LLM better understand the intent of each part.
- Prompts Can Guide the LLM's Behavior: Through precise instructions, output format requirements, and Chain-of-Thought (CoT) prompting, prompts can steer the LLM toward responses in the expected style.
The generation of prompts is primarily managed by the `_format_messages` method within the `EnvManager` class.
Prompt Generation Steps and Rules
The `_format_messages` method is the core of prompt generation. It receives `env_output` (containing the current observation and historical information) and transforms it into LLM input based on a series of rules.
Step 1: Initialize the Conversation and Basic Instructions
Prompt generation begins by constructing the skeleton of a conversation, including system instructions and the first user instruction.
```python
messages = [
    # System Prompt: defines the role and goal of the LLM
    {"role": "system", "content": "You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game."},
    # First User Prompt: contains the overall introduction to the environment and initial instructions
    {"role": "user", "content": first_user_content},
]
```
- System Prompt: This is a fixed instruction used to set the LLM's general role ("helpful assistant," "good game player") and overall goal ("aiming to get high reward"), which provides the LLM with a global guiding principle for action.
- First User Prompt (`first_user_content`): This is the most critical initialization part; it introduces in detail the environment's rules, symbol meanings, available actions, and response format. Its content is pre-generated by the `EnvManager._init_prefix_lookup` method, which combines `env_instruction`, `grid_vocab`, and `action_lookup` from the environment configuration (a sketch follows below).
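For illustration, here is a minimal sketch of that assembly. The function name and exact string layout are assumptions, not the framework's actual code, but with the Sokoban config in the next section it reproduces the example `first_user_content` string shown there:

```python
from typing import Dict

def build_first_user_content(
    env_instruction: str,
    grid_vocab: Dict[str, str],
    action_lookup: Dict[int, str],
) -> str:
    """Hypothetical assembly of the first user prompt from static env config."""
    parts = [env_instruction]
    if grid_vocab:
        # e.g. "#: wall, _: empty, O: target, ..."
        symbols = ", ".join(f"{s}: {m}" for s, m in grid_vocab.items())
        parts.append(f"The meaning of each symbol in the state is:\n{symbols}")
    if action_lookup:
        # e.g. "Up, Down, Left, Right" (sorted by action ID)
        actions = ", ".join(action_lookup[k] for k in sorted(action_lookup))
        parts.append(f"Your available actions are:\n{actions}")
    return "\n".join(parts)
```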
Sokoban Example: Generating the First User Prompt
Assume the `SokobanEnvConfig` is configured as follows:
```yaml
env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
grid_vocab:
  "#": "wall"
  "_": "empty"
  "O": "target"
  "√": "box on target"
  "X": "box"
  "P": "player"
  "S": "player on target"
action_lookup:
  1: "Up"
  2: "Down"
  3: "Left"
  4: "Right"
```
Then, `first_user_content` (i.e., the first User Prompt) will be constructed as a string similar to the following:
```text
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
```
This Prompt block gives the LLM a comprehensive description of the Sokoban rules, the meaning of each visual symbol, and the executable actions, providing a foundation for subsequent decision-making.
Step 2: Iterate Through Environment History to Build Multi-turn Conversation Context
After the initial Prompt, `_format_messages` iterates through `env_output['history']`, adding the observation, LLM response, and reward from each previous step to the conversation, forming a continuous context.
```python
# Iterate through environment history to build the multi-turn conversation Prompt
for idx, content in enumerate(env_output["history"]):
    # 1. Add the turn number
    messages[-1]["content"] += f"\nTurn {idx + 1}:\n"
    # 2. Add the environment state
    if "state" in content:
        FORMAT_PROMPT = (
            "<think> [Your thoughts] </think> <answer> [your answer] </answer>"
            if self.pipeline_config.enable_think
            else "<answer> [your answer] </answer>"
        )
        LENGTH_PROMPT = f"Max response length: {self.env_config_lookup[env_output['env_id']]['max_tokens']} words (tokens)."
        messages[-1]["content"] += (
            f"State:\n{content['state']}\n"
            f"You have {content['actions_left']} actions left. "
            f"Always output: {FORMAT_PROMPT} with no extra text. "
            f"Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. {LENGTH_PROMPT}\n"
            f"Decide the next action:\n"
        )
    # 3. Add the LLM's previous response as an assistant message
    if "llm_raw_response" in content:
        messages.append({"role": "assistant", "content": content["llm_response"]})
    # 4. Add the reward as a user message
    if "reward" in content:
        messages.append({"role": "user", "content": f"Reward:\n{content['reward']}\n"})
```
- Turn Number: `\nTurn {idx + 1}:\n` explicitly labels the current turn of the conversation, helping the LLM understand the temporal sequence.
- Environment State: The environmental observation for the current turn. For Sokoban, this is the grid layout in text form.
- Actions Remaining: `You have {content['actions_left']} actions left` informs the LLM of the action budget for the current turn, aiding long-term planning.
- Forced Output Format (`FORMAT_PROMPT`): `<think> [Your thoughts] </think> <answer> [your answer] </answer>` when `enable_think` is true, otherwise `<answer> [your answer] </answer>`. This compels the LLM to return its thoughts and final action in a structured style.
- `LENGTH_PROMPT`: Hints at the maximum length of the LLM's response.
- `Strictly follow this format...`: Emphasizes the importance of the format and warns that non-conforming responses will be marked as 'INVALID'.
- LLM Response (`assistant` role): The action generated by the LLM in the previous turn is added to the history as an assistant message.
- Reward (`user` role): The reward feedback from the environment for the LLM's previous action is added to the history as a user message, providing the RL signal.
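To make this loop concrete, here is a minimal, self-contained sketch driven by a hypothetical `env_output`. The field names match the snippet above, but `enable_think=False` is assumed and the length hint is simplified away, so this is an illustration rather than the framework's actual code:

```python
# Hypothetical env_output, shaped like the dict the loop above consumes.
env_output = {
    "env_id": 0,
    "history": [
        {
            "state": "#####\n#__O#\n#P_X#\n#___#\n#####",
            "actions_left": 100,
            "llm_raw_response": "<answer>Right</answer>",
            "llm_response": "<answer>Right</answer>",
            "reward": -0.1,
        },
        # Latest turn: only the new state so far; the LLM has not answered yet.
        {"state": "#####\n#__O#\n#_PX#\n#___#\n#####", "actions_left": 99},
    ],
}

messages = [
    {"role": "system", "content": "You're a helpful assistant. ..."},
    {"role": "user", "content": "You are solving the Sokoban puzzle. ..."},
]

FORMAT_PROMPT = "<answer> [your answer] </answer>"  # enable_think=False assumed

for idx, content in enumerate(env_output["history"]):
    messages[-1]["content"] += f"\nTurn {idx + 1}:\n"
    if "state" in content:
        messages[-1]["content"] += (
            f"State:\n{content['state']}\n"
            f"You have {content['actions_left']} actions left. "
            f"Always output: {FORMAT_PROMPT} with no extra text.\n"
            f"Decide the next action:\n"
        )
    if "llm_raw_response" in content:
        messages.append({"role": "assistant", "content": content["llm_response"]})
    if "reward" in content:
        messages.append({"role": "user", "content": f"Reward:\n{content['reward']}\n"})

for m in messages:
    print(m["role"].upper(), m["content"], sep="\n")
```

The resulting `messages` list mirrors the Turn 1 / Turn 2 transcript rendered as a flat string in the Sokoban example below.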
Sokoban Example: Multi-turn Prompt Construction
Assume the initial state of the environment is:
```text
#####
#__O#   <- Target O
#P_X#   <- Player P, Box X
#___#
#####
```
- Turn 1 (LLM receives Prompt for the first time)
Before the LLM generates its first action, the Prompt it receives might look like this (simplified format; the actual conversion uses `apply_chat_template`):
```text
<|im_start|>system
You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game.<|im_end|>
<|im_start|>user
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
Turn 1:
State:
#####
#__O#
#P_X#
#___#
#####
You have 100 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
```
The LLM might generate: `<answer>Right</answer>`.
- Turn 2 (LLM receives new state and reward)
Assume the LLM chose Right. After the environment responds, the player moves one cell to the right (it is now directly next to the box), and the reward is -0.1. The new state is:
```text
#####
#__O#
#_PX#
#___#
#####
```
At this point, the LLM will receive a Prompt containing all interactions from the first turn:
```text
<|im_start|>system
You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game.<|im_end|>
<|im_start|>user
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
Turn 1:
State:
#####
#__O#
#P_X#
#___#
#####
You have 100 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
<answer>Right</answer><|im_end|>
<|im_start|>user
Reward:
-0.1
<|im_end|>
<|im_start|>user
Turn 2:
State:
#####
#__O#
#_PX#
#___#
#####
You have 99 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
```
In this way, the LLM sees the complete conversation history each time, including its own decisions and the environment's feedback, which is crucial for learning and long-term planning.
Step 3: Apply the Chat Template to Generate the Final Prompt Text
The final step converts the constructed `messages` list into the single Prompt string that the LLM actually receives.
```python
# Apply the chat template to generate the final Prompt text
if self.processor:  # for multi-modal models using a ProcessorMixin
    text = self.processor.apply_chat_template(messages, add_generation_prompt=(not prepare_for_update), tokenize=False)
else:  # for text-only models using a PreTrainedTokenizer
    text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=(not prepare_for_update), tokenize=False)

# Force the LLM to start from a specific token (in inference mode)
if not prepare_for_update:
    if self.pipeline_config.enable_think:
        text += "<think>"   # force the LLM to think before answering
    else:
        text += "<answer>"  # force the LLM to answer directly

# Clean up special tokens
text = text.replace("<|im_end|>\n", "<|im_end|>")
```
- `apply_chat_template`: A method provided by the Hugging Face transformers library. It flattens the `messages` list into a single string according to the specific format of the LLM being used (e.g., Qwen's `<|im_start|>role\ncontent<|im_end|>` structure).
- `add_generation_prompt`: In inference mode (`not prepare_for_update`), this parameter appends a generation marker such as `<|im_start|>assistant\n` to the end of the Prompt, explicitly telling the LLM that it is now its turn to generate as the assistant role.
- Forcing a Starting Token: When the LLM performs inference, to ensure its output strictly follows the predefined format, we append a specific starting token such as `<think>` or `<answer>` to the end of the Prompt. This is a form of response prefilling (conditional generation):
  - Guides the LLM to Continue: An LLM predicts the next most probable token in a given sequence. When we place a marker like `<answer>` at the end of the Prompt, the LLM treats it as an incomplete sequence and naturally continues generating after the marker.
  - Enforces Format Adherence: Since the Prompt has already specified that the response must be in the format `<answer>[your answer]</answer>`, placing `<answer>` at the end of the Prompt effectively pre-fills part of the response. Given this incomplete format, the LLM is "forced" to generate the `[your answer]` part and ultimately complete the closing `</answer>` tag.
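As a small standalone demonstration of the template-plus-prefill mechanics (assuming a Qwen-style instruct model; any Hugging Face model with a chat template works the same way):

```python
from transformers import AutoTokenizer

# Any chat model works; Qwen is used here because its <|im_start|> format
# matches the examples in this guide.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You're a helpful assistant."},
    {"role": "user", "content": "Decide the next action:"},
]

# add_generation_prompt=True appends the assistant header
# ("<|im_start|>assistant\n"), telling the model it must now speak.
text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Prefill the opening tag so generation continues inside the required format.
text += "<answer>"
print(text)
```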
The Complete Prompt Generation Process
1. Environment Configuration (`SokobanEnvConfig`): Defines the static information of the environment (instructions, symbol meanings, action names).
2. `_init_prefix_lookup`: During `EnvManager` initialization, combines the static information from the environment configuration into `first_user_content`.
3. `_format_messages`:
   a. Called when starting a new turn or receiving new environmental feedback.
   b. Uses the System Prompt and `first_user_content` as the beginning of the conversation.
   c. Iterates through `env_output['history']`, successively adding turn numbers, environment states, remaining actions, historical LLM responses, rewards, and other dynamic information.
   d. Repeats the mandatory format requirements after the environment state in each turn.
   e. Uses `tokenizer.apply_chat_template` to convert the structured `messages` list into the final Prompt string accepted by the LLM.
   f. In inference mode, appends the forced starting token `<think>` or `<answer>`.
4. LLM Inference: The LLM receives this carefully constructed Prompt string, performs inference, and generates a response.
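On the return path, the framework must recover a discrete action from the tagged text and mark malformed replies as 'INVALID', as the prompt warns. The parser below is a minimal sketch of that step, assuming the `<answer>` format used throughout this guide; it is not the framework's actual implementation:

```python
import re

VALID_ACTIONS = frozenset({"Up", "Down", "Left", "Right"})

def parse_action(llm_raw_response: str) -> str:
    """Extract the action from '<answer>...</answer>'; return 'INVALID' on any mismatch."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", llm_raw_response, re.DOTALL)
    if match is None:
        return "INVALID"
    action = match.group(1).strip()
    return action if action in VALID_ACTIONS else "INVALID"

assert parse_action("<think>go right</think> <answer>Right</answer>") == "Right"
assert parse_action("I will move right.") == "INVALID"
```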
Through this layered, structured, and dynamic Prompt generation mechanism, our framework effectively combines complex environments with the powerful language capabilities of LLMs, enabling them to understand tasks, perceive environments, learn rules, and execute complex operations.