Prompt Generation Guide
In the architecture of Large Language Model (LLM)-based Reinforcement Learning Agents, the Prompt serves as the sole medium for LLMs to interact with the environment. Unlike traditional agents that directly receive numerical states or output discrete action IDs, LLMs "perceive" the environment (observations) and "express" their decisions (actions) through prompts in text format.
Core Concepts
In our framework, the generation of prompts adheres to several key principles:
- LLM Input is Text: Whether the environment's original observation is an image, a grid, or another structure, it will ultimately be converted into a text format that LLMs can understand.
- Prompts are Dynamic and Contextual: A prompt is not merely the current environmental observation; it also includes historical dialogue, previous actions, received rewards, and other information, forming a coherent conversational context.
- Prompts Follow a Structured Conversational Format: Prompts typically follow the LLM's chat template (e.g., System/User/Assistant roles) to help the LLM better understand the intent of each part.
- Prompts Can Guide the LLM's Behavior: Through precise instructions, output format requirements, and Chain-of-Thought (CoT) prompting, prompts can steer the LLM toward responses in the expected style.
The generation of prompts is primarily managed by the `_format_messages` method within the `EnvManager` class.
Prompt Generation Steps and Rules
The `_format_messages` method is the core of prompt generation. It receives `env_output` (containing the current observation and historical information) and transforms it into LLM input based on a series of rules.
Step 1: Initialize the Conversation and Basic Instructions
Prompt generation begins by constructing the skeleton of a conversation, including system instructions and the first user instruction.
```python
messages = [
    # System Prompt: defines the role and goal of the LLM
    {"role": "system", "content": "You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game."},
    # First User Prompt: contains the overall introduction to the environment and initial instructions
    {"role": "user", "content": first_user_content},
]
```
- System Prompt: This is a fixed instruction used to set the LLM's general role ("helpful assistant," "good game player") and overall goal ("aiming to get high reward"), which provides the LLM with a global guiding principle for action.
- First User Prompt (`first_user_content`): This is the most critical initialization part; it introduces in detail the environment's rules, symbol meanings, available actions, and response format. Its content is pre-generated by the `EnvManager._init_prefix_lookup` method, which combines `env_instruction`, `grid_vocab`, and `action_lookup` from the environment configuration (a sketch follows below).
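For illustration, here is a minimal sketch of that assembly. The function name and exact string layout are assumptions, not the framework's actual code, but with the Sokoban config in the next section it reproduces the example `first_user_content` string shown there:

```python
from typing import Dict

def build_first_user_content(
    env_instruction: str,
    grid_vocab: Dict[str, str],
    action_lookup: Dict[int, str],
) -> str:
    """Hypothetical assembly of the first user prompt from static env config."""
    parts = [env_instruction]
    if grid_vocab:
        # e.g. "#: wall, _: empty, O: target, ..."
        symbols = ", ".join(f"{s}: {m}" for s, m in grid_vocab.items())
        parts.append(f"The meaning of each symbol in the state is:\n{symbols}")
    if action_lookup:
        # e.g. "Up, Down, Left, Right" (sorted by action ID)
        actions = ", ".join(action_lookup[k] for k in sorted(action_lookup))
        parts.append(f"Your available actions are:\n{actions}")
    return "\n".join(parts)
```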
Sokoban Example: Generating the First User Prompt
Assume the `SokobanEnvConfig` is configured as follows:
```yaml
env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
grid_vocab:
  "#": "wall"
  "_": "empty"
  "O": "target"
  "√": "box on target"
  "X": "box"
  "P": "player"
  "S": "player on target"
action_lookup:
  1: "Up"
  2: "Down"
  3: "Left"
  4: "Right"
```
Then, `first_user_content` (i.e., the first User Prompt) will be constructed as a string similar to the following:
```text
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
```
This Prompt block gives the LLM a comprehensive description of the Sokoban rules, the meaning of each visual symbol, and the executable actions, providing a foundation for subsequent decision-making.
Step 2: Iterate Through Environment History to Build Multi-turn Conversation Context
After the initial Prompt, `_format_messages` iterates through `env_output['history']`, adding the observation, LLM response, and reward from each previous step to the conversation, forming a continuous context.
```python
# Iterate through environment history to build the multi-turn conversation Prompt
for idx, content in enumerate(env_output["history"]):
    # 1. Add the turn number
    messages[-1]["content"] += f"\nTurn {idx + 1}:\n"
    # 2. Add the environment state
    if "state" in content:
        FORMAT_PROMPT = (
            "<think> [Your thoughts] </think> <answer> [your answer] </answer>"
            if self.pipeline_config.enable_think
            else "<answer> [your answer] </answer>"
        )
        LENGTH_PROMPT = f"Max response length: {self.env_config_lookup[env_output['env_id']]['max_tokens']} words (tokens)."
        messages[-1]["content"] += (
            f"State:\n{content['state']}\n"
            f"You have {content['actions_left']} actions left. "
            f"Always output: {FORMAT_PROMPT} with no extra text. "
            f"Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. {LENGTH_PROMPT}\n"
            f"Decide the next action:\n"
        )
    # 3. Add the LLM's previous response as an assistant message
    if "llm_raw_response" in content:
        messages.append({"role": "assistant", "content": content["llm_response"]})
    # 4. Add the reward as a user message
    if "reward" in content:
        messages.append({"role": "user", "content": f"Reward:\n{content['reward']}\n"})
```
- Turn Number: `\nTurn {idx + 1}:\n` explicitly labels the current turn of the conversation, helping the LLM understand the temporal sequence.
- Environment State: The environmental observation for the current turn. For Sokoban, this is the grid layout in text form.
- Actions Remaining: `You have {content['actions_left']} actions left` informs the LLM of the action budget for the current turn, aiding long-term planning.
- Forced Output Format (`FORMAT_PROMPT`): `<think> [Your thoughts] </think> <answer> [your answer] </answer>` when `enable_think` is true, otherwise `<answer> [your answer] </answer>`. This compels the LLM to return its thoughts and final action in a structured style.
- `LENGTH_PROMPT`: Hints at the maximum length of the LLM's response.
- `Strictly follow this format...`: Emphasizes the importance of the format and warns that non-conforming responses will be marked as 'INVALID'.
- LLM Response (`assistant` role): The action generated by the LLM in the previous turn is added to the history as an assistant message.
- Reward (`user` role): The reward feedback from the environment for the LLM's previous action is added to the history as a user message, providing the RL signal.
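To make this loop concrete, here is a minimal, self-contained sketch driven by a hypothetical `env_output`. The field names match the snippet above, but `enable_think=False` is assumed and the length hint is simplified away, so this is an illustration rather than the framework's actual code:

```python
# Hypothetical env_output, shaped like the dict the loop above consumes.
env_output = {
    "env_id": 0,
    "history": [
        {
            "state": "#####\n#__O#\n#P_X#\n#___#\n#####",
            "actions_left": 100,
            "llm_raw_response": "<answer>Right</answer>",
            "llm_response": "<answer>Right</answer>",
            "reward": -0.1,
        },
        # Latest turn: only the new state so far; the LLM has not answered yet.
        {"state": "#####\n#__O#\n#_PX#\n#___#\n#####", "actions_left": 99},
    ],
}

messages = [
    {"role": "system", "content": "You're a helpful assistant. ..."},
    {"role": "user", "content": "You are solving the Sokoban puzzle. ..."},
]

FORMAT_PROMPT = "<answer> [your answer] </answer>"  # enable_think=False assumed

for idx, content in enumerate(env_output["history"]):
    messages[-1]["content"] += f"\nTurn {idx + 1}:\n"
    if "state" in content:
        messages[-1]["content"] += (
            f"State:\n{content['state']}\n"
            f"You have {content['actions_left']} actions left. "
            f"Always output: {FORMAT_PROMPT} with no extra text.\n"
            f"Decide the next action:\n"
        )
    if "llm_raw_response" in content:
        messages.append({"role": "assistant", "content": content["llm_response"]})
    if "reward" in content:
        messages.append({"role": "user", "content": f"Reward:\n{content['reward']}\n"})

for m in messages:
    print(m["role"].upper(), m["content"], sep="\n")
```

The resulting `messages` list mirrors the Turn 1 / Turn 2 transcript rendered as a flat string in the Sokoban example below.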
Sokoban Example: Multi-turn Prompt Construction
Assume the initial state of the environment is:
```text
#####
#__O#   <- Target O
#P_X#   <- Player P, Box X
#___#
#####
```
- Turn 1 (LLM receives Prompt for the first time)
Before the LLM generates its first action, the Prompt it receives might look like this (simplified format; the actual conversion uses `apply_chat_template`):
```text
<|im_start|>system
You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game.<|im_end|>
<|im_start|>user
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
Turn 1:
State:
#####
#__O#
#P_X#
#___#
#####
You have 100 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
```
The LLM might generate: `<answer>Right</answer>`.
- Turn 2 (LLM receives new state and reward)
Assume the LLM chose Right. After the environment responds, the player moves one cell to the right (it is now directly next to the box), and the reward is -0.1. The new state is:
```text
#####
#__O#
#_PX#
#___#
#####
```
At this point, the LLM will receive a Prompt containing all interactions from the first turn:
```text
<|im_start|>system
You're a helpful assistant. You are a good game player. You are aiming to get high reward in the game.<|im_end|>
<|im_start|>user
You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>
The meaning of each symbol in the state is:
#: wall, _: empty, O: target, √: box on target, X: box, P: player, S: player on target
Your available actions are:
Up, Down, Left, Right
Turn 1:
State:
#####
#__O#
#P_X#
#___#
#####
You have 100 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
<answer>Right</answer><|im_end|>
<|im_start|>user
Reward:
-0.1
<|im_end|>
<|im_start|>user
Turn 2:
State:
#####
#__O#
#_PX#
#___#
#####
You have 99 actions left. Always output: <answer> [your answer] </answer> with no extra text. Strictly follow this format, history response that do not follow the format will be set as 'INVALID'. Max response length: 100 words (tokens).
Decide the next action:<|im_end|>
<|im_start|>assistant
```
In this way, the LLM sees the complete conversation history each time, including its own decisions and the environment's feedback, which is crucial for learning and long-term planning.
Step 3: Apply the Chat Template to Generate the Final Prompt Text
The final step converts the constructed `messages` list into the single Prompt string that the LLM actually receives.
```python
# Apply the chat template to generate the final Prompt text
if self.processor:  # for multi-modal models using a ProcessorMixin
    text = self.processor.apply_chat_template(messages, add_generation_prompt=(not prepare_for_update), tokenize=False)
else:  # for text-only models using a PreTrainedTokenizer
    text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=(not prepare_for_update), tokenize=False)

# Force the LLM to start from a specific token (in inference mode)
if not prepare_for_update:
    if self.pipeline_config.enable_think:
        text += "<think>"   # force the LLM to think before answering
    else:
        text += "<answer>"  # force the LLM to answer directly

# Clean up special tokens
text = text.replace("<|im_end|>\n", "<|im_end|>")
```
- `apply_chat_template`: A method provided by the Hugging Face transformers library. It flattens the `messages` list into a single string according to the specific format of the LLM being used (e.g., Qwen's `<|im_start|>role\ncontent<|im_end|>` structure).
- `add_generation_prompt`: In inference mode (`not prepare_for_update`), this parameter appends a generation marker such as `<|im_start|>assistant\n` to the end of the Prompt, explicitly telling the LLM that it is now its turn to generate as the assistant role.
- Forcing a Starting Token: When the LLM performs inference, to ensure its output strictly follows the predefined format, we append a specific starting token such as `<think>` or `<answer>` to the end of the Prompt. This is a form of response prefilling (conditional generation):
  - Guides the LLM to Continue: An LLM predicts the next most probable token in a given sequence. When we place a marker like `<answer>` at the end of the Prompt, the LLM treats it as an incomplete sequence and naturally continues generating after the marker.
  - Enforces Format Adherence: Since the Prompt has already specified that the response must be in the format `<answer>[your answer]</answer>`, placing `<answer>` at the end of the Prompt effectively pre-fills part of the response. Given this incomplete format, the LLM is "forced" to generate the `[your answer]` part and ultimately complete the closing `</answer>` tag.
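As a small standalone demonstration of the template-plus-prefill mechanics (assuming a Qwen-style instruct model; any Hugging Face model with a chat template works the same way):

```python
from transformers import AutoTokenizer

# Any chat model works; Qwen is used here because its <|im_start|> format
# matches the examples in this guide.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You're a helpful assistant."},
    {"role": "user", "content": "Decide the next action:"},
]

# add_generation_prompt=True appends the assistant header
# ("<|im_start|>assistant\n"), telling the model it must now speak.
text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Prefill the opening tag so generation continues inside the required format.
text += "<answer>"
print(text)
```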
The Complete Prompt Generation Process
1. Environment Configuration (`SokobanEnvConfig`): Defines the static information of the environment (instructions, symbol meanings, action names).
2. `_init_prefix_lookup`: During `EnvManager` initialization, combines the static information from the environment configuration into `first_user_content`.
3. `_format_messages`:
   a. Called when starting a new turn or receiving new environmental feedback.
   b. Uses the System Prompt and `first_user_content` as the beginning of the conversation.
   c. Iterates through `env_output['history']`, successively adding turn numbers, environment states, remaining actions, historical LLM responses, rewards, and other dynamic information.
   d. Repeats the mandatory format requirements after the environment state in each turn.
   e. Uses `tokenizer.apply_chat_template` to convert the structured `messages` list into the final Prompt string accepted by the LLM.
   f. In inference mode, appends the forced starting token `<think>` or `<answer>`.
4. LLM Inference: The LLM receives this carefully constructed Prompt string, performs inference, and generates a response.
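On the return path, the framework must recover a discrete action from the tagged text and mark malformed replies as 'INVALID', as the prompt warns. The parser below is a minimal sketch of that step, assuming the `<answer>` format used throughout this guide; it is not the framework's actual implementation:

```python
import re

VALID_ACTIONS = frozenset({"Up", "Down", "Left", "Right"})

def parse_action(llm_raw_response: str) -> str:
    """Extract the action from '<answer>...</answer>'; return 'INVALID' on any mismatch."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", llm_raw_response, re.DOTALL)
    if match is None:
        return "INVALID"
    action = match.group(1).strip()
    return action if action in VALID_ACTIONS else "INVALID"

assert parse_action("<think>go right</think> <answer>Right</answer>") == "Right"
assert parse_action("I will move right.") == "INVALID"
```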
Through this layered, structured, and dynamic Prompt generation mechanism, our framework effectively combines complex environments with the powerful language capabilities of LLMs, enabling them to understand tasks, perceive environments, learn rules, and execute complex operations.