
laleye/langquality

LangQuality - Language Quality Toolkit for Low-Resource Languages

PyPI version | CI Status | Coverage | License: MIT | Python 3.8+ | Documentation

A modular, extensible Python toolkit for analyzing the quality of text and audio datasets for low-resource languages. LangQuality helps researchers and developers ensure high-quality datasets for training NLP models (ASR, machine translation, language models) across diverse languages.

✨ Key Features

🌍 Multi-Language Support via Language Packs

  • Language-agnostic architecture: Works with any language through configurable Language Packs
  • Pre-built packs: Fongbe, French, English, and more
  • Easy customization: Create your own Language Pack in minutes
  • Community-driven: Share and discover Language Packs from the community

🔍 Comprehensive Quality Analysis

  • Structural Analysis: Sentence length distribution, outlier detection, statistical metrics
  • Linguistic Analysis: Readability scores, lexical complexity, morphological features
  • Diversity Analysis: Vocabulary richness (TTR), n-gram distributions, duplicate detection
  • Domain Analysis: Thematic balance, under/over-represented categories
  • Gender Bias Detection: Gender representation, stereotype detection, balance metrics
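To make a few of these metrics concrete, here is a minimal sketch of a type-token ratio, a duplicate check, and an interquartile-range outlier test on sentence length. It is not LangQuality's implementation: the function names, the whitespace tokenization, and the `k=1.5` fence multiplier are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import quantiles


def type_token_ratio(sentences):
    """Unique tokens divided by total tokens (lowercased, whitespace-split)."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def find_duplicates(sentences):
    """Return normalized sentences that occur more than once."""
    # Normalization here is only lowercasing and collapsing whitespace.
    counts = Counter(" ".join(s.lower().split()) for s in sentences)
    return sorted(s for s, n in counts.items() if n > 1)


def length_outliers(sentences, k=1.5):
    """Return sentences whose word count lies outside the IQR fences."""
    lengths = [len(s.split()) for s in sentences]
    q1, _, q3 = quantiles(lengths, n=4)  # needs at least two sentences
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [s for s, n in zip(sentences, lengths) if n < low or n > high]
```

A real Language Pack would likely replace the whitespace split with its own tokenizer, which matters a great deal for languages where whitespace is a poor word boundary.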

🔌 Extensible Plugin System

  • Custom analyzers: Add your own analysis modules without modifying core code
  • Automatic discovery: Drop plugins into a directory and they're automatically loaded
  • Language-specific analyzers: Create analyzers tailored to specific languages

📊 Rich Output Formats

  • Interactive Dashboard: Beautiful HTML visualizations with Plotly
  • Actionable Recommendations: Prioritized suggestions based on best practices
  • Multiple Exports: JSON, CSV, PDF reports, execution logs
  • Per-sentence annotations: Quality scores and flags for each sentence

🚀 Quick Start

Installation

    # Install from PyPI
    pip install langquality

    # Install with all optional dependencies
    # (quoted so that shells such as zsh do not expand the brackets)
    pip install "langquality[all]"

    # Download language models (if using spaCy-based packs)
    python -m spacy download fr_core_news_md  # For French
    python -m spacy download en_core_web_md   # For English

Basic Usage

Analyze a dataset with a specific language:

    # Analyze Fongbe data
    langquality analyze --input data/fongbe_sentences --output results --language fon

    # Analyze French data
    langquality analyze --input data/french_sentences --output results --language fra

    # Analyze English data
    langquality analyze --input data/english_sentences --output results --language eng

View Results

    # Open the interactive dashboard
    open results/dashboard.html
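The `open` command is specific to macOS. On other platforms, one option is Python's standard `webbrowser` module. The sketch below assumes the dashboard is written to `results/dashboard.html` as in the command above; the helper names `dashboard_uri` and `open_dashboard` are made up for this example and are not part of LangQuality.

```python
import webbrowser
from pathlib import Path


def dashboard_uri(results_dir="results"):
    """Build a file:// URI for dashboard.html inside the results directory."""
    return (Path(results_dir) / "dashboard.html").resolve().as_uri()


def open_dashboard(results_dir="results"):
    """Ask the default browser to open the dashboard; returns True on success."""
    return webbrowser.open(dashboard_uri(results_dir))
```

Calling `open_dashboard("results")` after an analysis run should bring up the same page as the shell command.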

Python API

    from langquality.pipeline import PipelineController
    from langquality.language_packs import LanguagePackManager
    from langquality.data import GenericDataLoader

    # Load a language pack
    pack_manager = LanguagePackManager()
    language_pack = pack_manager.load_language_pack("fon")

    # Load your data
    loader = GenericDataLoader(language_pack)
    sentences = loader.load_from_csv("data/sentences.csv")

    # Run analysis
    controller = PipelineController(language_pack)
    results = controller.run(sentences)

    # Access results
    print(f"Total sentences: {results.structural.total_sentences}")
    print(f"Average readability: {results.linguistic.avg_readability_score}")

📦 Language Packs

Language Packs are self-contained configurations that adapt LangQuality to specific languages. Each pack includes:

  • Language-specific configuration (tokenization, thresholds, etc.)
  • Linguistic resources (lexicons, stopwords, gender terms, etc.)
  • Optional custom analyzers
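As a mental model, the three parts of a pack can be pictured as a small data structure. The sketch below is purely illustrative: `PackSketch` and its fields are hypothetical and do not mirror the classes in `langquality.language_packs`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PackSketch:
    """Hypothetical in-memory view of a Language Pack (not the real class)."""

    code: str  # language code, e.g. "fon"
    # Language-specific configuration: tokenization, thresholds, etc.
    config: Dict[str, object] = field(default_factory=dict)
    # Linguistic resources keyed by name: lexicon, stopwords, gender terms, etc.
    resources: Dict[str, Set[str]] = field(default_factory=dict)
    # Names of optional custom analyzers shipped with the pack.
    analyzers: List[str] = field(default_factory=list)

    def missing_resources(self, required):
        """Return the required resource names this pack does not provide."""
        return [name for name in required if name not in self.resources]
```

A check like `missing_resources` is one way an analyzer's declared requirements could be matched against what a pack actually ships.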

Available Language Packs

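| Language      | Code | Status        | Resources                                            |
|---------------|------|---------------|------------------------------------------------------|
| Fongbe        | fon  | ✅ Stable     | Full (lexicon, gender terms, ASR vocabulary)         |
| French        | fra  | ✅ Stable     | Full (lexicon, stopwords, gender terms, professions) |
| English       | eng  | ✅ Stable     | Full (lexicon, stopwords, gender terms, professions) |
| Minangkabau   | min  | 🚧 Minimal    | Basic configuration only                             |
| Your Language | xxx  | 💡 Create one! | See Language Pack Guide                              |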

Managing Language Packs

    # List installed packs
    langquality pack list

    # Show pack details
    langquality pack info fon

    # Create a new pack template
    langquality pack create <language_code>

    # Validate a pack
    langquality pack validate path/to/pack

Creating Your Own Language Pack

Creating a Language Pack for your language is straightforward:

  1. Generate a template:

    langquality pack create <your_language_code>
  2. Configure the pack: Edit config.yaml with language-specific settings

  3. Add resources (optional): Add lexicons, stopwords, or other linguistic resources

  4. Test it:

    langquality pack validate path/to/your_pack
    langquality analyze --input test_data --output results --language <your_language_code>

See the Language Pack Guide for detailed instructions.

📖 Documentation

🎯 Use Cases

LangQuality is designed for researchers and developers working with low-resource languages:

  • ASR Dataset Preparation: Ensure text quality before audio recording
  • Machine Translation: Validate parallel corpora quality
  • Language Model Training: Assess dataset diversity and balance
  • Corpus Linguistics: Analyze linguistic properties of text collections
  • Data Curation: Filter and improve existing datasets

🔧 Advanced Features

Custom Configuration

Override default thresholds and settings:

langquality analyze --input data --output results --language fon --config my_config.yaml

Example configuration:

    thresholds:
      structural:
        min_words: 5
        max_words: 15
      diversity:
        target_ttr: 0.65
      gender:
        target_ratio: [0.45, 0.55]
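To illustrate how thresholds like these might be applied, the following sketch mirrors the example configuration as a Python dictionary and checks values against it. The key names come from the example; everything else is an assumption, in particular that `target_ttr` is a minimum and that `target_ratio` is an inclusive range for the feminine share of gendered mentions. LangQuality's own interpretation may differ.

```python
# Python mirror of the example configuration; key names are copied from it.
THRESHOLDS = {
    "structural": {"min_words": 5, "max_words": 15},
    "diversity": {"target_ttr": 0.65},
    "gender": {"target_ratio": [0.45, 0.55]},
}


def length_flag(sentence, thresholds=THRESHOLDS):
    """Classify a sentence by whitespace word count against min/max_words."""
    n_words = len(sentence.split())
    limits = thresholds["structural"]
    if n_words < limits["min_words"]:
        return "too_short"
    if n_words > limits["max_words"]:
        return "too_long"
    return "ok"


def meets_ttr_target(ttr, thresholds=THRESHOLDS):
    """Assumption: target_ttr is a minimum acceptable type-token ratio."""
    return ttr >= thresholds["diversity"]["target_ttr"]


def gender_balanced(feminine, masculine, thresholds=THRESHOLDS):
    """Assumption: target_ratio is an inclusive [low, high] range for the
    feminine share of all gendered mentions."""
    total = feminine + masculine
    if total == 0:
        return True  # nothing to balance
    low, high = thresholds["gender"]["target_ratio"]
    return low <= feminine / total <= high
```

Passing a different `thresholds` dictionary to these functions corresponds to overriding the defaults with `--config`.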

Custom Analyzers

Create custom analyzers for specialized analysis:

    from langquality.analyzers import Analyzer


    class ToneAnalyzer(Analyzer):
        """Analyze tone and sentiment of sentences."""

        def analyze(self, sentences):
            # Your analysis logic goes here; this placeholder stands in
            # for whatever metrics your analyzer computes.
            metrics = {}
            return metrics

        def get_requirements(self):
            return ["tone_lexicon"]  # Required resources

Place your analyzer in the plugins directory and it will be automatically discovered.
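For readers curious how directory-based discovery can work in general, here is a minimal sketch built on `importlib`. It is not LangQuality's loader: `discover_analyzers` is a hypothetical helper, and the `Analyzer` class below is a stand-in so that the example runs without the package installed.

```python
import importlib.util
import inspect
from pathlib import Path


class Analyzer:
    """Stand-in base class for this sketch only."""

    def analyze(self, sentences):
        raise NotImplementedError

    def get_requirements(self):
        return []


def discover_analyzers(plugin_dir, base=Analyzer):
    """Import every *.py file in plugin_dir and collect subclasses of base."""
    found = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Sketch convenience: pre-bind the base class so plugin files can
        # subclass `Analyzer` without importing it themselves.
        module.Analyzer = base
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base:
                found[name] = obj
    return found
```

Importing a plugin file executes its code, so a loader of this kind should only be pointed at directories you trust.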

See Creating Analyzers for details.

🤝 Contributing

We welcome contributions from the community! You can help by:

  • 🌍 Creating a Language Pack for your language
  • 🔧 Adding new analyzers or features
  • 📝 Improving documentation
  • 🐛 Reporting bugs or issues
  • 💡 Suggesting enhancements

Please see our Contributing Guide for:

  • Code of Conduct
  • Development setup
  • Contribution workflow
  • Coding standards
  • Testing requirements

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Ensure tests pass: pytest
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

👥 Community

Join our community to get help, share ideas, and collaborate:

Support Channels

📊 Project Status

LangQuality is actively maintained and under continuous development. See our CHANGELOG for recent updates and our Roadmap for planned features.

Current version: 1.0.0 (Stable)

📜 License

LangQuality is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, including commercial applications.

🙏 Acknowledgments

LangQuality evolved from the Fongbe Data Quality Pipeline, originally developed to support dataset creation for Fongbe, a low-resource language from Benin. We're grateful to:

  • The linguistic community working on African language preservation and NLP development
  • Contributors who have created Language Packs and shared their expertise
  • The open-source NLP community for tools and libraries that make this work possible

📚 Citation

If you use LangQuality in your research, please cite:

    @software{langquality_toolkit,
      title={LangQuality: Language Quality Toolkit for Low-Resource Languages},
      author={LangQuality Community},
      year={2024},
      url={https://github.com/langquality/langquality},
      version={1.0.0}
    }

🔗 Related Projects


Made with ❤️ for low-resource language communities worldwide

Get Started | Documentation | Community | Contributing
