Open-source framework for holistic, structured repository-level documentation across multilingual codebases

FSoft-AI4Code/CodeWiki

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

AI-Powered Repository Documentation Generation • Multi-Language Support • Architecture-Aware Analysis

Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams

Quick Start • CLI Commands • Output Structure • Paper


Quick Start

1. Install CodeWiki

    # Install from source
    pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

    # Verify installation
    codewiki --version

2. Configure Your Environment

CodeWiki supports multiple models via an OpenAI-compatible SDK layer.

    codewiki config set \
      --api-key YOUR_API_KEY \
      --base-url https://api.anthropic.com \
      --main-model claude-sonnet-4 \
      --cluster-model claude-sonnet-4

3. Generate Documentation

    # Navigate to your project
    cd /path/to/your/project

    # Generate documentation
    codewiki generate

    # Generate with HTML viewer for GitHub Pages
    codewiki generate --github-pages --create-branch

That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.


What is CodeWiki?

CodeWiki is an open-source framework for automated repository-level documentation across seven programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.

Key Innovations

| Innovation | Description | Impact |
|---|---|---|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |

Supported Languages

🐍 Python • ☕ Java • 🟨 JavaScript • 🔷 TypeScript • ⚙️ C • 🔧 C++ • 🪟 C#


CLI Commands

Configuration Management

    # Set up your API configuration
    codewiki config set \
      --api-key <your-api-key> \
      --base-url <provider-url> \
      --main-model <model-name> \
      --cluster-model <model-name>

    # Show current configuration
    codewiki config show

    # Validate your configuration
    codewiki config validate

Documentation Generation

    # Basic generation
    codewiki generate

    # Custom output directory
    codewiki generate --output ./documentation

    # Create git branch for documentation
    codewiki generate --create-branch

    # Generate HTML viewer for GitHub Pages
    codewiki generate --github-pages

    # Enable verbose logging
    codewiki generate --verbose

    # Full-featured generation
    codewiki generate --create-branch --github-pages --verbose
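For scripted use (for example in CI), the flags listed above can be assembled programmatically. The following is a minimal sketch, not part of CodeWiki itself: `build_generate_command` is a hypothetical helper name, and it only uses the flags shown in this section.

```python
import shlex


def build_generate_command(output=None, create_branch=False,
                           github_pages=False, verbose=False):
    """Assemble an argv list for `codewiki generate` from keyword options.

    Hypothetical helper: it only knows the flags listed in this README.
    """
    argv = ["codewiki", "generate"]
    if output is not None:
        argv += ["--output", str(output)]
    if create_branch:
        argv.append("--create-branch")
    if github_pages:
        argv.append("--github-pages")
    if verbose:
        argv.append("--verbose")
    return argv


if __name__ == "__main__":
    argv = build_generate_command(output="./documentation", verbose=True)
    # Show the shell-quoted command line without executing it.
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.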

Configuration Storage

  • API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service)
  • Settings: ~/.codewiki/config.json
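To inspect the non-secret settings from a script, the file can be read as ordinary JSON. This is a sketch under the assumption that `~/.codewiki/config.json` is a plain JSON document; its key names are not documented here, so the example only lists whatever keys it finds. `load_settings` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Location stated in the README; the file's internal layout is an assumption.
DEFAULT_CONFIG_PATH = Path.home() / ".codewiki" / "config.json"


def load_settings(path=DEFAULT_CONFIG_PATH):
    """Return the parsed settings file, or an empty dict if it does not exist."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    settings = load_settings()
    # Print only the key names; API keys live in the system keychain, not here.
    print(sorted(settings))
```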

Documentation Output

Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.

Textual Documentation

  • Repository overview with architecture guide
  • Module-level documentation with API references
  • Usage examples and implementation patterns
  • Cross-module interaction analysis

Visual Artifacts

  • System architecture diagrams (Mermaid)
  • Data flow visualizations
  • Dependency graphs and module relationships
  • Sequence diagrams for complex interactions
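As an illustration of what a Mermaid dependency graph looks like in text form, the sketch below renders a small dependency mapping as Mermaid flowchart source. It is not CodeWiki's generator: `to_mermaid` and the module names are hypothetical, and node names are assumed to be simple identifiers that do not collide with Mermaid keywords such as `graph` or `end`.

```python
def to_mermaid(dependencies):
    """Render {module: [modules it depends on]} as Mermaid flowchart source."""
    lines = ["graph TD"]
    for module in sorted(dependencies):
        targets = sorted(dependencies[module])
        if not targets:
            # A module with no outgoing edges is still declared as a node.
            lines.append(f"    {module}")
        for target in targets:
            lines.append(f"    {module} --> {target}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical module layout, for illustration only.
    deps = {
        "cli": ["core"],
        "core": ["parser", "analyzer"],
        "parser": [],
        "analyzer": [],
    }
    print(to_mermaid(deps))
```

Sorting the modules and their targets keeps the output stable across runs, which makes diffs of generated diagrams easier to review.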

Output Structure

    ./docs/
    ├── overview.md               # Repository overview (start here!)
    ├── module1.md                # Module documentation
    ├── module2.md                # Additional modules...
    ├── module_tree.json          # Hierarchical module structure
    ├── first_module_tree.json    # Initial clustering result
    ├── metadata.json             # Generation metadata
    └── index.html                # Interactive viewer (with --github-pages)
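A quick way to sanity-check a generated output directory is to confirm that the files named in the tree above are present and that the JSON files parse. The sketch below does only that; `summarize_docs` is a hypothetical helper, and it makes no assumption about the schema inside the JSON files.

```python
import json
from pathlib import Path

# File names taken from the output tree above.
JSON_FILES = ("module_tree.json", "first_module_tree.json", "metadata.json")


def summarize_docs(docs_dir="./docs"):
    """Report which of the expected output files exist in docs_dir."""
    docs = Path(docs_dir)
    pages = sorted(p.name for p in docs.glob("*.md"))
    json_parses = {}
    for name in JSON_FILES:
        path = docs / name
        if path.exists():
            try:
                json.loads(path.read_text(encoding="utf-8"))
                json_parses[name] = True
            except json.JSONDecodeError:
                json_parses[name] = False
    return {
        "has_overview": "overview.md" in pages,
        "module_pages": [n for n in pages if n != "overview.md"],
        "json_parses": json_parses,   # only files that exist are listed
        "has_viewer": (docs / "index.html").exists(),
    }


if __name__ == "__main__":
    print(summarize_docs("./docs"))
```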

Experimental Results

CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.

Performance by Language Category

| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
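The Improvement column is the absolute difference between the two scores in percentage points, not a relative change. The short check below recomputes it from the values in the table; the variable and function names are illustrative only.

```python
# (CodeWiki, DeepWiki) scores in percent, copied from the table above.
SCORES = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}


def improvement(codewiki, deepwiki):
    """Difference in percentage points, rounded to two decimals."""
    return round(codewiki - deepwiki, 2)


for category, (codewiki, deepwiki) in SCORES.items():
    print(f"{category}: {improvement(codewiki, deepwiki):+.2f}")
```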

Results on Representative Repositories

| Repository | Language | LOC | CodeWiki-Sonnet-4 | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |

View comprehensive results: see the paper for the complete evaluation on 21 repositories spanning all supported languages.


How It Works

Architecture Overview

CodeWiki employs a three-stage process for comprehensive documentation generation:

  1. Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.

  2. Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.

  3. Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.
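The interplay between decomposition and recursive delegation can be illustrated with a toy planner. This is a simplified sketch, not CodeWiki's actual algorithm: the `Module` type, the lines-of-code `budget`, and the rule "delegate when a module with children exceeds the budget" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical node in a module tree (not CodeWiki's data model)."""
    name: str
    loc: int = 0                                  # lines of code owned directly
    children: list["Module"] = field(default_factory=list)


def total_loc(module):
    """Lines of code in this module plus all of its descendants."""
    return module.loc + sum(total_loc(child) for child in module.children)


def plan(module, budget, depth=0):
    """Return (depth, name, action) tuples in top-down order.

    A module that fits the budget, or that cannot be split further, is
    documented directly. Otherwise its children are handled recursively
    and the parent is summarized from their results.
    """
    if not module.children or total_loc(module) <= budget:
        return [(depth, module.name, "document directly")]
    steps = [(depth, module.name, "delegate, then summarize")]
    for child in module.children:
        steps.extend(plan(child, budget, depth + 1))
    return steps


if __name__ == "__main__":
    repo = Module("repo", children=[
        Module("core", 500, [Module("parser", 3000), Module("analyzer", 2500)]),
        Module("cli", 800),
    ])
    for depth, name, action in plan(repo, budget=2000):
        print("  " * depth + f"{name}: {action}")
```

In this toy version the budget stands in for whatever limit makes a module too large to document in one pass, so small subtrees are handled whole and only oversized ones are split.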

Data Flow

    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │   Codebase      │───▶│  Hierarchical    │───▶│  Multi-Agent    │
    │   Analysis      │    │  Decomposition   │    │  Processing     │
    └─────────────────┘    └──────────────────┘    └─────────────────┘
                                    │                       │
                                    ▼                       ▼
    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │  Visual         │◀───│  Multi-Modal     │◀───│  Structured     │
    │  Artifacts      │    │  Synthesis       │    │  Content        │
    └─────────────────┘    └──────────────────┘    └─────────────────┘

Requirements

  • Python 3.12+
  • Node.js (for Mermaid diagram validation)
  • LLM API access (Anthropic Claude, OpenAI, etc.)
  • Git (for branch creation features)
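Before installing, the local prerequisites can be checked with a few lines of standard-library Python. This is a minimal sketch: `check_requirements` is a hypothetical helper, and it assumes Node.js and Git are exposed on `PATH` as `node` and `git`. It does not check LLM API access.

```python
import shutil
import sys


def check_requirements(version=None, which=shutil.which):
    """Return a dict mapping each local prerequisite to True or False.

    `version` and `which` are injectable so the check can be tested
    without depending on the machine it runs on.
    """
    version = tuple(version or sys.version_info[:2])
    return {
        "python>=3.12": version >= (3, 12),
        "node": which("node") is not None,   # used for Mermaid validation
        "git": which("git") is not None,     # used for --create-branch
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```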

Additional Resources

Documentation & Guides

Academic Resources

  • Paper - Full research paper with detailed methodology and results
  • Citation - How to cite CodeWiki in your research

Citation

If you use CodeWiki in your research, please cite:

    @misc{hoang2025codewikievaluatingaisability,
      title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
      author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
      year={2025},
      eprint={2510.24428},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.24428},
    }


License

This project is licensed under the MIT License.