Story
To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation
Key takeaway
Researchers developed a way to teach AI language models to generate code that uses private software libraries, which rarely appear in public training data. This advance could make AI-generated code more practical and powerful for real-world software projects.
Quick Explainer
PriCoder enables large language models (LLMs) to better utilize private libraries for code generation, an important capability that LLMs often lack due to the scarcity of private library data in training corpora. The core idea is to automatically synthesize diverse and high-quality training samples that teach the LLM how to effectively invoke private library APIs. This is achieved through a two-step process: progressive graph evolution to iteratively generate more complex coding requirements, and multidimensional graph pruning to filter out low-quality samples. By alternating between these steps, PriCoder constructs a synthetic dataset that allows the LLM to learn how to use private library APIs without relying on external sources of knowledge.
Deep Dive
Technical Deep Dive: Improving LLMs for Private-Library Code Generation
Overview
Large Language Models (LLMs) have shown strong potential for code generation, but they struggle to use private libraries effectively. This is because private libraries are rarely included in public training corpora, leaving LLMs with limited prior knowledge. This paper proposes PriCoder, an approach that enables LLMs to automatically learn how to invoke private-library APIs through synthesized training data.
Problem & Context
- LLMs typically lack prior knowledge of private libraries, which are widely used in real-world software development but rarely included in public training corpora.
- Existing approaches rely on Retrieval-Augmented Generation (RAG) to inject relevant API knowledge into the prompt. However, this is insufficient: even when given complete API specifications, LLMs still struggle to invoke these APIs correctly.
- Directly synthesizing training data about private libraries is challenging, as LLMs tend to generate overly basic requirements and low-quality samples.
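To make the RAG baseline concrete: it boils down to retrieving private-library API documentation and prepending it to the generation prompt. The sketch below is a minimal illustration of that idea; the function names, the keyword-overlap retriever, and the toy API docs are all assumptions for illustration, not the actual implementation of any baseline in the paper.

```python
# Minimal sketch of a naive RAG baseline for private-library code generation.
# All names and the retrieval heuristic are illustrative, not PriCoder's API.

def retrieve_api_docs(requirement: str, api_docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank API doc strings by naive keyword overlap with the requirement."""
    req_words = set(requirement.lower().split())
    scored = []
    for name, doc in api_docs.items():
        overlap = len(req_words & set(doc.lower().split()))
        scored.append((overlap, name, doc))
    scored.sort(reverse=True)  # highest keyword overlap first
    return [f"{name}: {doc}" for _, name, doc in scored[:top_k]]

def build_prompt(requirement: str, api_docs: dict[str, str]) -> str:
    """Inject the retrieved API knowledge into the generation prompt."""
    context = "\n".join(retrieve_api_docs(requirement, api_docs))
    return f"Relevant private-library APIs:\n{context}\n\nTask: {requirement}\nCode:"

# Toy private-library documentation (hypothetical API names)
docs = {
    "fastsum": "fastsum(xs) returns the sum of a list of numbers",
    "fastsort": "fastsort(xs) returns a sorted copy of the list",
}
prompt = build_prompt("sum a list of numbers", docs)
```

The paper's point is that even with this context injected (or with the complete specification, as in the Oracle setting), the model still tends to misuse the APIs, which motivates fine-tuning on synthesized usage examples instead.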
Methodology
PriCoder models the data synthesis process as constructing a graph, and uses two key operators:
- Progressive Graph Evolution: Progressively synthesizes more diverse training samples by starting from basic API nodes and iteratively evolving them into complex coding requirements.
- Multidimensional Graph Pruning: Verifies the synthesized samples for syntax, executability, and overall functionality, removing low-quality samples.
By alternating between these two operators, PriCoder constructs a high-diversity, high-quality synthetic dataset to fine-tune LLMs for private-library-oriented code generation.
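The alternation between the two operators can be sketched as a simple loop: evolve the current samples into more complex ones, then prune any that fail verification. The code below is a conceptual sketch only; the real evolution step uses an LLM to rewrite requirements, and the real pruning covers more dimensions (including overall functionality). Here evolution is stubbed out, and pruning is reduced to two checks the paper names: syntax and executability.

```python
# Conceptual sketch of PriCoder-style alternation between progressive
# evolution and multidimensional pruning. The evolve step is a stub;
# names and heuristics are illustrative assumptions, not the paper's code.
import ast

def passes_syntax(code: str) -> bool:
    """Pruning dimension 1: the sample must parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_execution(code: str) -> bool:
    """Pruning dimension 2: the sample must run without raising."""
    try:
        exec(code, {})  # fresh namespace per sample
        return True
    except Exception:
        return False

def prune(samples: list[str]) -> list[str]:
    """Multidimensional pruning: keep only samples passing every check."""
    return [s for s in samples if passes_syntax(s) and passes_execution(s)]

def evolve(samples: list[str]) -> list[str]:
    """Stub for progressive evolution; a real system would call an LLM here."""
    return samples + [s + "\nresult = result" for s in samples if "result" in s]

seeds = ["result = sum([1, 2, 3])", "result = sum(]"]  # second seed is malformed
dataset = seeds
for _ in range(2):  # alternate: evolve, then prune
    dataset = prune(evolve(dataset))
```

After the first round, the malformed seed (and anything derived from it) is filtered out, so only verified samples are carried into later rounds of evolution.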
Data & Experimental Setup
- Constructed two new benchmarks, NdonnxEval and NumbaEval, based on recently released libraries (ndonnx and numba-cuda) to enable a more rigorous evaluation.
- Evaluated PriCoder on three mainstream LLMs: DeepSeek-6.7B, Qwen-7B, and LLaMA-8B.
- Compared PriCoder against multiple baselines, including Naive RAG, EpiGen, CAPIR, and an Oracle setting.
- Assessed both private-library-oriented code generation and general code generation capabilities.
Results
- PriCoder substantially improves private-library-oriented code generation, yielding gains of over 20% in pass@1 in many settings.
- These gains come with negligible impact on the models' general code generation capabilities. In some cases, PriCoder even enhances general capabilities.
- Both Progressive Graph Evolution and Multidimensional Graph Pruning are essential components, as removing either leads to significant performance degradation.
- Increasing the scale of synthesized data and using a stronger model for synthesis both help improve the effectiveness of PriCoder.
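For readers unfamiliar with the metric: pass@1 is commonly computed with the standard unbiased pass@k estimator, where for each problem n samples are generated, c of them pass the tests, and pass@k = 1 - C(n-c, k) / C(n, k). The source does not spell out its exact evaluation protocol, so the snippet below is a sketch of the conventional formula, not the paper's code.

```python
# Unbiased pass@k estimator, as commonly used in code-generation evaluation.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k drawn samples passes), given n samples with c passing."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct: pass@1 equals the fraction of passing samples
p1 = pass_at_k(10, 3, 1)
```

For k = 1 the estimator reduces to the fraction of generations that pass, which is why pass@1 is read as a per-attempt success rate.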
Interpretation
- LLMs struggle to effectively invoke private-library APIs even when provided with complete API knowledge, suggesting that the key bottleneck is not merely acquiring the right information, but learning how to use it.
- PriCoder's automated data synthesis approach can enable LLMs to learn private-library knowledge and master API invocation without human intervention.
- The negligible impact on general code generation and the robustness to synthesis model choice indicate that PriCoder is a practical and effective solution for enterprises to adapt LLMs to their private-library ecosystems.
Limitations & Uncertainties
- The benchmarks, while carefully constructed, are still proxies for real-world private libraries. Evaluating on genuine enterprise private libraries would provide more reliable insights.
- The study focuses on code generation tasks and does not explore other potential applications of PriCoder, such as program understanding or debugging.
What Comes Next
- Explore ways to further improve the quality and diversity of synthesized data, potentially by incorporating more sophisticated graph operators or leveraging additional feedback signals.
- Investigate the application of PriCoder to other domains beyond code generation, such as using private APIs for data analysis or scientific computing.
- Extend PriCoder to handle evolving private libraries, where the synthesis process needs to continuously adapt to library changes over time.