
An AI agent optimization framework for end-side devices is released, achieving up to 97% in-domain accuracy.

AIxiv is this site's column for publishing academic and technical content. Over the past few years, it has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

This paper was written by the NEXA AI team in collaboration with the MIT-IBM Watson AI Lab. The first author, Wei Chen, is co-founder, CEO, and chief scientist of NEXA AI; he holds a PhD from Stanford University and has extensive experience in artificial intelligence research. Co-author Zhiyuan Li is co-founder and CTO of NEXA AI, a Stanford University alumnus with years of front-line R&D experience in on-device AI at Google and Amazon Lab126. The other two co-authors are Zhen Guo and Yikang Shen of MIT and IBM.

AI agents are becoming increasingly important, with the ability to make autonomous decisions and solve problems. To function effectively, these agents require a planning process that determines the best course of action and then executes the planned actions.

In this paper, we propose an efficient on-device plan-action framework that separates planning and action execution into two components: a planning agent optimized for edge devices (Octo-planner) and an action agent that uses the Octopus model to execute functions. Octo-planner first responds to a user query by decomposing the task into a series of sub-steps, which are then executed by the Octopus action agent. To optimize performance on resource-constrained devices, we employ model fine-tuning instead of in-context learning, reducing computational cost and energy consumption while improving response time.

Our approach involves using GPT-4 to generate diverse planning queries and responses based on available functions, with subsequent validation to ensure data quality. We fine-tuned the Phi-3 Mini model on a curated dataset, achieving a 97% success rate in an in-domain test environment.

To address multi-domain planning challenges, we developed a multi-LoRA training method that merges LoRA weights trained on different subsets of functions. This approach flexibly handles complex multi-domain queries while maintaining computational efficiency on resource-constrained devices.

  • Paper: https://arxiv.org/pdf/2406.18082

  • Demo: https://www.nexa4ai.com/octo-planner#video

  • Model Page: https://huggingface.co/NexaAIDev/octopus-planning

1 Introduction


Artificial intelligence (AI) agents have significantly transformed various industries by enabling autonomous decision-making and improving operational efficiency. These agents rely on a critical planning process that involves determining the best course of action, executing the planned actions, and summarizing the results. Large language models (LLMs) such as Gemini-Pro and GPT-4 show potential in this area.

Although these models face challenges with complex planning tasks and struggle to reach human-level performance, they are effective at handling simple tasks, which enables practical applications. One such application is the AI assistant tools from companies like MultiOn, Simular AI, and Adept AI, which leverage LLMs to provide intelligent assistance across various fields.

Additionally, consumer-oriented AI hardware products such as the Rabbit R1, Humane AI Pin, and Limitless Pendant integrate LLMs into user-friendly devices, making smart assistants more accessible and gaining significant traction. The success of an AI agent depends on the performance of the underlying LLM. Agents using pretrained models without fine-tuning on task demonstrations had relatively low success rates, ranging from 12% for desktop applications to 46% for mobile applications, while agents leveraging fine-tuned models performed better on tasks similar to their training data, achieving success rates of up to 80%.

However, LLM-based AI agents are costly to run because of high computational requirements and infrastructure expenses, which limits widespread adoption. The lack of on-device AI agents rules out applications that require real-time processing, offline functionality, or enhanced privacy. On-device AI agents offer benefits including reduced latency, offline operation, lower costs, and improved data security. Although action models such as Octopus V2 achieve over 95% accuracy in function calling, a device-side planning model is still missing. General agent frameworks rely on single-model in-context learning, requiring lengthy function descriptions and planning instructions in every prompt. This approach is impractical for on-device models with limited context length, and it causes high latency and battery drain on edge devices.

In this paper, we introduce Octo-planner, an on-device planning agent that addresses the key challenges of efficiency, adaptability, and resource constraints. Our plan-action framework separates planning and action execution into two components: a planning agent optimized for use on edge devices, or Octo-planner, and an action agent that executes functions using the Octopus model.

By prioritizing fine-tuning over few-shot prompting, we reduce computational costs and minimize key-value (KV) caching requirements. Our approach uses GPT-4 to generate and validate planning data, which is then used to fine-tune Phi-3 Mini for on-device deployment. In-domain testing shows that this fine-tuning raises planning success to 97%. To address the multi-domain planning challenge, we develop a multi-LoRA training method that merges LoRA weights trained on different subsets of functions. This approach flexibly handles complex multi-domain queries while maintaining computational efficiency on resource-constrained devices.

By focusing on predefined functions for simple tasks and leveraging fine-tuning, we aim to make AI agents more practical, accessible, and cost-effective in real-world applications.

This work aims to contribute to ongoing efforts to make AI more accessible and useful. By bridging the gap between the potential of AI agents and the limitations of edge computing, we hope to promote the adoption of smart on-device assistants in various fields. By open sourcing our approach, we hope to inspire further innovation in on-device AI and expand the scope of advanced planning capabilities.

2 Related Work

Planning agents: Language models have become central to planning-agent systems. Proprietary models such as OpenAI's Assistant API excel at generating plans from user queries and available functions. Recent advances further expand the capabilities of language models in planning. The ReAct framework integrates planning and action in a limited action space, while research from Alibaba Group highlights the effectiveness of separate planning and action models on complex tasks. In robotics, language models are also increasingly used for task-level planning. Notable examples include SayCan, which uses an LLM to decompose high-level tasks into concrete subtasks, and Video Language Planning (VLP), which augments long-horizon planning with a text-to-video dynamics model. The wide range of applications of language models in planning systems, from general policies to specific robotic tasks, highlights their increasingly important and adaptable role in a variety of decision-making processes.

Fine-tuning as an alternative to long contexts: Fine-tuning language models to internalize specific prompts or contextual information can reduce input length and increase efficiency. This involves training models on carefully curated task-specific datasets. The technique is particularly valuable for models with limited context windows, as it improves query-processing efficiency without sacrificing response quality. The success of fine-tuning depends heavily on diverse, high-quality datasets that ensure the model generalizes across varied prompt wordings. Implemented properly, fine-tuning can simplify application-specific interactions and resolve context-length constraints and computational challenges in real-world deployments.

LoRA and Multi-LoRA: Low-rank adaptation (LoRA) efficiently adapts pre-trained language models to specific tasks. Unlike full fine-tuning, which updates all parameters, LoRA freezes the pre-trained weights and adds trainable low-rank matrices at each layer, significantly reducing trainable parameters and computational requirements. Multi-LoRA extends this concept so that multiple task-specific adapters can be trained, then combined or switched at inference time, allowing a single base model to handle a variety of tasks efficiently. Building on these methods, researchers have developed several variants addressing different aspects of model adaptation: LoRA+ optimizes learning rates, VeRA uses random projections, AdaLoRA implements adaptive rank, DoRA decomposes weights, and Delta-LoRA updates the pre-trained weights. These variants aim to further improve efficiency or performance in specific scenarios.
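To make the low-rank update concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer; the dimensions, rank, and scaling are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, in_features: int, out_features: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)  # down-projection A
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # up-projection B, zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(in_features=2048, out_features=2048, r=64, alpha=256)
print(layer(torch.randn(1, 2048)).shape)  # torch.Size([1, 2048])
```

Because the B matrix starts at zero, training begins exactly at the base model's behavior, and the trainable parameter count drops from in_features × out_features to r × (in_features + out_features).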

3 Method

This section introduces our framework for on-device planning-action agents. We first describe how the planning and action agents are integrated to enable efficient problem solving. We then detail our dataset design and training process for the planning agent, including support for a wide range of functions and plug-and-play capability for additional function sets. Finally, we outline the benchmarks used to evaluate agent performance.

3.1 Plan and Action Agent Framework

Our plan-action approach differs from general agent frameworks by splitting planning and action execution into two components. This separation increases modularity and enables dedicated optimization of each component. The framework operates as follows:

Planning phase: Given a user query q, our planning model πplan decomposes the task into a series of sub-steps. Formally:

{τ1, τ2, ..., τn} = πplan(q; F)    (1)

where F is the set of available function descriptions and τi is the i-th execution step. πplan internalizes F during instruction fine-tuning.

Action phase: For each step in the execution sequence, we use the action model πaction. At step i, given the current state observation Oi, the action model executes:

Oi+1 = πaction(τi, Oi)    (2)

where Oi+1 and τi+1 are passed to the next step to continue execution. This iterative process ensures coherent progression of task sub-steps.
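To make the two-stage execution concrete, here is a minimal Python sketch of the plan-then-act loop; the planner and action_model callables are hypothetical stand-ins, since the article does not specify the actual Octo-planner/Octopus interfaces.

```python
from typing import Callable, Dict, List

def run_agent(query: str,
              planner: Callable[[str], List[str]],
              action_model: Callable[[str, Dict], Dict]) -> Dict:
    """Plan once up front, then execute sub-steps sequentially,
    threading observation O_i from step i into step i+1."""
    steps = planner(query)  # {tau_1, ..., tau_n} = pi_plan(q; F), per Eq. (1)
    observation: Dict = {}  # O_0: initial state
    for step in steps:
        observation = action_model(step, observation)  # O_{i+1} = pi_action(tau_i, O_i), per Eq. (2)
    return observation
```

Note that all planning happens before any action executes; the limitations section below discusses how this differs from interleaved frameworks such as ReAct.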

For the action model, we use the Octopus model designed for device-side function calls. Figure 2 illustrates the difference between our plan-action framework and the single-model LLM agent.


Figure 2: Comparison of single LLM agent and plan-action agent frameworks. (Left) Single LLM agent: Unified model for task planning and action execution. (Right) Plan-Action Agent: A specialized planning model decomposes a task into subtasks, while a separate action model executes each subtask in turn.

The modular design of our framework offers several advantages:

  • Specialization: Separating planning and action execution allows each model to be optimized for its specific role, improving performance on complex tasks.

  • Scalability: Planning and action capabilities can be scaled independently, efficiently adapting to tasks of varying complexity.

  • Explainability: Explicit separation of stages improves the transparency of the decision-making process.

  • Adaptability: Easier to integrate domain-specific knowledge or constraints into either phase without requiring system-wide changes.

3.2 Planning Dataset

Our framework uses the Octopus model as the action model, so only the planning agent needs to be trained. The planning agent is fine-tuned with the following dataset format:

[Dataset format template; shown as an image in the original article.]

Special tokens of the kind used in chat-model pretraining are optional. We set n to 1-5, based on our finding that most tasks in mobile apps consist of fewer than five steps. The dataset generation and curation process comprises:

1. Dataset collection: Given the available functions F, we use a large language model (GPT-4) to generate diverse queries answerable by these functions, raising the model's temperature setting to ensure query diversity. Responses are then generated in the specified dataset format. Function descriptions are used during generation but are not included in the final dataset; instead, the planning model internalizes this function information during training.

2. Data validation: We use the same language model as a validation tool to assess the correctness of query-response pairs. Although the initial generation contained some errors, the model effectively classified generated content as valid or invalid, allowing us to filter out erroneous output and maintain dataset quality (see the sketch below).
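This two-stage pipeline can be sketched with the OpenAI Python client; the prompts, sample counts, and validity criterion here are illustrative assumptions, not the authors' exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_examples(function_descriptions: str, n: int = 10) -> list:
    """Stage 1: sample diverse query/plan pairs at raised temperature."""
    prompt = ("Given these callable functions:\n" + function_descriptions +
              "\nWrite one user query answerable with them, then a numbered plan of 1-5 steps.")
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # raised temperature for query diversity
        )
        examples.append(resp.choices[0].message.content)
    return examples

def is_valid(example: str, function_descriptions: str) -> bool:
    """Stage 2: reuse the same model to label each pair valid or invalid."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   "Functions:\n" + function_descriptions +
                   "\n\nExample:\n" + example +
                   "\n\nDoes the plan correctly answer the query using only these functions? "
                   "Answer exactly VALID or INVALID."}],
        temperature=0.0,  # deterministic judgment
    )
    return resp.choices[0].message.content.strip().upper().startswith("VALID")

# dataset = [e for e in generate_examples(descriptions) if is_valid(e, descriptions)]
```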

Example data points for different numbers of sub-steps are shown below:

[Example data points for different numbers of sub-steps; shown as an image in the original article.]
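Since the original example image is not reproduced here, the following is a purely hypothetical data point illustrating the query-plus-steps shape described above; the function names are invented for this sketch and do not come from the paper.

```python
# Hypothetical two-step planning example (function names invented for illustration).
example = {
    "query": "Take a selfie and send it to Alice with the caption 'On my way'.",
    "response": [
        "1. Use take_photo(camera='front') to capture the selfie.",
        "2. Use send_message(recipient='Alice', attachment=photo, text='On my way') to deliver it.",
    ],
}
```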

See Figure 3 for a visualization of the dataset collection. Example functions are described in Appendix 7.1.

[Figure 3: visualization of the dataset collection process; shown as an image in the original article.]

3.3 Benchmark Design

Our evaluation relies on a carefully constructed test dataset. This dataset is designed to represent the complexity of real-world planning, using a multi-stage approach that combines automatic generation, expert validation, and empirical testing.

The process starts with an initial dataset of 1000 data points generated automatically with GPT-4. These data points then undergo a rigorous quality-assurance process to ensure completeness and relevance. The assessment criteria are as follows (a sketch of automating these checks appears after the list):

  • Each step must correspond to an existing function;

  • The order of steps must be correct.
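These two criteria lend themselves to a simple automated check; the step format and function registry below are assumptions for illustration, not the authors' actual tooling.

```python
import re

# F: an illustrative registry of available function names.
AVAILABLE_FUNCTIONS = {"take_photo", "send_message", "set_alarm"}

def check_plan(steps: list, expected_order: list) -> bool:
    """Criterion 1: every step must reference an existing function.
    Criterion 2: the referenced functions must appear in the expected order."""
    called = []
    for step in steps:
        match = re.search(r"([a-z_]+)\(", step)  # assumes each step contains 'func(...)'
        if match is None or match.group(1) not in AVAILABLE_FUNCTIONS:
            return False
        called.append(match.group(1))
    return called == expected_order
```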

To ensure the reliability of the assessment, we have included an additional human verification stage. This phase involves selecting a subset of examples for end-to-end model execution, thereby validating the accuracy of the results and conducting a comprehensive assessment of model performance.

To evaluate our proposed planning model, we use GPT-4 as an oracle to judge the correctness of generated plans. This choice is based on empirical observations showing that GPT-4 performs effectively in our specific use case.

4 Experimental Design

Our experimental design evaluates the performance of Octo-planner in on-device AI agent planning. Our goal is to identify optimal configurations for deploying efficient and accurate planning models on resource-constrained devices while maintaining adaptability to new domains and functions. Our experiments focus on four key areas:

  1. Performance and efficiency trade-offs between full fine-tuning and LoRA.

  2. Accuracy of Multi-LoRA when processing different sets of functions simultaneously.

  3. Performance comparison of various base models and scales.

  4. The impact of dataset size on accuracy, ranging from 100 to 1000 training examples.

We perform supervised fine-tuning on a curated dataset, using Phi-3 Mini and a few alternatives as base models. Training covers both full fine-tuning and LoRA. For all experiments, we set the dataset size to 800 times the number of available functions and fine-tune on an NVIDIA A100 GPU. We use the same optimized hyperparameters for both techniques: learning rate 5×10⁻⁶, batch size 4, warm-up ratio 0.2, training for 2 epochs. For LoRA, we set target_modules to "all-linear".
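These hyperparameters map directly onto standard Hugging Face transformers/peft configuration; the sketch below shows one plausible setup, with the model checkpoint and dataset construction as assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint for Phi-3 Mini
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA variant; skip these two lines for full fine-tuning.
lora_config = LoraConfig(r=64, lora_alpha=256, target_modules="all-linear")
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="octo-planner-ft",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    warmup_ratio=0.2,
    num_train_epochs=2,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()  # dataset loading and tokenization omitted for brevity
```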

5 Results

5.1 Full Fine-tuning vs. LoRA

Table 1 compares our planning model under full fine-tuning and LoRA. Our experiments show significant differences between the two methods. Full fine-tuning achieves the highest performance, at 98.1% accuracy. In contrast, LoRA's performance depends on rank: at rank 64 with alpha 256, LoRA reaches 85.1% accuracy, while at rank 16 with alpha 32, accuracy drops to 72.9%. These results highlight the trade-off between model performance and computational efficiency when using LoRA. Although full fine-tuning provides better accuracy, LoRA offers an attractive alternative in terms of resource efficiency, with performance depending on rank configuration.

Table 1: Full fine-tuning vs. LoRA benchmark

  Method                       Accuracy
  Full fine-tuning             98.1%
  LoRA (rank 64, alpha 256)    85.1%
  LoRA (rank 16, alpha 32)     72.9%

5.2 Multi-LoRA training and merging

Although LoRA weights trained on a specific set of functions are effective in that domain, real-world applications often need to handle new or expanded function sets. To address this challenge, we propose merging LoRA weights, each trained on a different subset of functions, into the same base model. This creates a composite model that combines knowledge from multiple function sets, providing a scalable solution for complex multi-domain queries in resource-constrained environments.
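One way to realize this merging is with the peft library's weighted-adapter utility; the adapter paths, weights, and combination type below are illustrative, and the authors' exact merging procedure may differ.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Load one LoRA adapter per function domain (paths are placeholders).
model = PeftModel.from_pretrained(base, "loras/android", adapter_name="android")
model.load_adapter("loras/ecommerce", adapter_name="ecommerce")

# Merge the domain adapters into a single composite adapter and activate it.
model.add_weighted_adapter(
    adapters=["android", "ecommerce"],
    weights=[1.0, 1.0],
    adapter_name="android_ecommerce",
    combination_type="linear",
)
model.set_adapter("android_ecommerce")
```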

To evaluate this approach, we constructed a benchmark dataset by randomly selecting functions for each LoRA domain and combining them into workflows. Queries and plans are generated by GPT-4. For example, when testing two merged LoRAs, the query might involve Android functions, e-commerce functions, or both with equal probability.

The following code block shows an example query in our benchmark dataset and the corresponding inference results for the multi-LoRA merge model:

[Example query and multi-LoRA inference result; shown as an image in the original article.]

Table 2 shows the performance results of our multi-LoRA merge technique. Each individual LoRA was trained with consistent hyperparameters: rank 64, lora_alpha 256, target_modules set to “all-linear”. Single-domain Android function set LoRA achieves 85.1% accuracy. When combining LoRA from both domains (Android and e-commerce), the accuracy drops slightly to 82.2%. Accuracy drops with further merging as follows: 78.9% for three domains (added video streaming) and 69.7% for four domains (added travel). These results reveal a tendency for accuracy to gradually decrease as we integrate more function sets, especially after adding a third domain.

Table 2: Multi-LoRA benchmark

  Merged domains                      Accuracy
  1 (Android)                         85.1%
  2 (Android + e-commerce)            82.2%
  3 (+ video streaming)               78.9%
  4 (+ travel)                        69.7%

5.3 Full fine-tuning using different base models

Table 3 reports benchmark accuracy after full fine-tuning of different base models. Google's Gemma 2B achieved 85.6% accuracy, while the larger Gemma 7B excelled with 99.7%. Microsoft's Phi-3 Mini also performed strongly, at 98.1%. These results demonstrate that our framework adapts to a variety of on-device LLMs, with larger models generally achieving higher accuracy.

Table 3: Full fine-tuning with different base models

  Base model    Accuracy
  Gemma 2B      85.6%
  Gemma 7B      99.7%
  Phi-3 Mini    98.1%

5.4 Full fine-tuning using different dataset sizes

Our default training dataset contains 1000 data points, evenly distributed across 1-5 step sequences (200 each) to represent varying task complexity. We study the impact of dataset size on model performance to optimize the efficiency of function-set integration and to manage the cost of synthetic data generation. Table 4 shows the baseline accuracy for different training dataset sizes:

Table 4: Accuracy by training dataset size

  Training examples    Accuracy
  1000                 98.1%
  500                  92.5%
  250                  85.3%
  100                  78.1%

The results show a clear correlation between dataset size and accuracy. The full 1000-point dataset achieved 98.1% accuracy, while reducing it to 500 data points lowered accuracy to 92.5%. Further reductions to 250 and 100 data points yielded 85.3% and 78.1%, respectively. These findings suggest using training datasets of more than 1000 data points for optimal performance.

6 Conclusion

This article introduces Octo-planner, a device-side planning agent designed to work with mobile agents such as Octopus V2.

By separating planning from action execution, we increase specialization and adaptability. Our approach fine-tunes Phi-3 Mini, a 3.8-billion-parameter LLM, to run natively on edge devices, achieving a 97% success rate in in-domain testing. We reduced computational requirements, improved latency and battery life, and implemented multi-LoRA techniques for extending model capabilities without full retraining. Octo-planner contributes to solving AI deployment issues such as data privacy, latency, and offline functionality, and represents progress toward practical, sophisticated AI agents for personal devices.

By open sourcing our model weights, we aim to drive innovation in on-device AI, facilitating the development of efficient, privacy-respecting applications that enhance daily life without compromising performance or security.

7 Limitations and Future Work

While our current model performs effectively in the specific mobile phone use case, it has limitations in terms of broader applicability.

Unlike frameworks like ReAct, which alternate between planning steps and executing actions based on real-time feedback, our model does all the planning upfront. This pre-planned approach is more efficient at handling simple tasks, but may be less adaptable in complex or unpredictable scenarios where conditions may change during execution.

Future work will focus on exploring iterative planning methods based on real-time observations to improve adaptability in dynamic environments. We also plan to integrate our planning model with diverse action models to extend its capabilities beyond mobile applications to areas such as the Internet of Things, robotics, and smart home systems. These advances will address current limitations, expand the versatility of our on-device planning models, and bridge the gap between efficient, localized AI processing and complex real-world needs.
