Welcome to CodeXpert, a state-of-the-art framework designed to analyze, explain, and optimize Python codebases. This repository leverages CodeLlama, LangChain, and FAISS to deliver a seamless, interactive experience for code comprehension and improvement.
The Code Analysis Pipeline provides an automated solution for:
- Code Understanding: Analyze Python code for functionality and structure.
- Knowledge Extraction: Generate clear and actionable insights using LLMs.
- Code Optimization: Suggest performance improvements and best practices.
- Technical Education: Simplify complex code concepts for learners and professionals.
- Document Loading & Splitting:
  - Recursively scans the specified directory for Python files.
  - Splits large files into manageable chunks for efficient processing.
- Semantic Embedding Generation:
  - Extracts embeddings using a HuggingFace embedding model.
- Vector Store Creation:
  - Builds a FAISS vector store for semantic search and retrieval.
- Question Answering (QA):
  - Processes user queries through a QA chain with a retriever.
- Code Analysis & Explanation:
  - Analyzes results using CodeLlama and simplifies explanations with templates.
- Improvement Suggestions:
  - Leverages LLMs to suggest actionable optimizations.
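The splitting stage above can be sketched in plain Python. This is an illustrative stand-in for the pipeline's actual splitter; the chunk-size and overlap values here are assumptions, not the project's defaults:

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so that context is preserved
    across chunk boundaries (illustrative stand-in for a LangChain splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

source = "def add(a, b):\n    return a + b\n" * 20
chunks = split_into_chunks(source, chunk_size=100, overlap=20)
print(len(chunks), all(len(c) <= 100 for c in chunks))
```

The overlap keeps a function signature and its body from being severed cleanly at a chunk boundary, which improves downstream retrieval quality.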
- Recursive Document Loading: Processes entire directories with customizable file extensions.
- Text Splitting: Splits large files into smaller chunks for precise embeddings.
- Advanced Embedding Models: Uses HuggingFace embeddings for high-quality vector representations.
- Efficient Retrieval: Semantic search powered by FAISS.
- LLM-Powered Analysis: Code analysis and explanations via CodeLlama.
- Optimization Suggestions: Provides practical tips for code improvements.
- Seamless Integration: Designed to integrate with other AI tools and pipelines.
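Under the hood, the retrieval feature is a nearest-neighbour search over embedding vectors. The following is a minimal cosine-similarity sketch with hand-made toy vectors standing in for real embeddings; FAISS performs the same operation at scale with optimized indexes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], store, top_k: int = 1):
    """Return the top_k stored documents most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Toy "vector store": (document, embedding) pairs.
store = [
    ("loads documents", [1.0, 0.1, 0.0]),
    ("splits text",     [0.0, 1.0, 0.1]),
    ("builds index",    [0.1, 0.0, 1.0]),
]
print(retrieve([0.9, 0.2, 0.0], store))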
| Technology | Purpose |
|---|---|
| LangChain | Modular framework for building LLM-based workflows. |
| FAISS | Vector similarity search for efficient code retrieval. |
| CodeLlama | Advanced code understanding via LLMs. |
| HuggingFace Hub | Hosting and serving LLMs and embeddings. |
| Python | Primary programming language. |
```shell
git clone https://github.com/MohammedNasserAhmed/CodeXpert.git
cd CodeXpert
```
Install the required libraries with:

```shell
pip install -r requirements.txt
```
Create a `.env` file or export these variables directly:

```shell
MODEL=<YOUR_LLAMA_MODEL_VERSION>
HUGGINGFACEHUB_API_TOKEN=<Your_HuggingFace_Token>
REPO_ID=<Your_HuggingFace_Repo_ID>
CODEBASE_DIR=<Path_to_Your_Codebase>
EMBEDDING_MODEL=<HuggingFace_Embedding_Model>
```
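These variables can then be read in Python with `os.getenv`; the fallback defaults below are illustrative placeholders, not values the project ships with:

```python
import os

def load_config() -> dict:
    """Collect pipeline settings from the environment, falling back to
    placeholder defaults (illustrative; adjust to your deployment)."""
    return {
        "model": os.getenv("MODEL", "codellama"),
        "hf_token": os.getenv("HUGGINGFACEHUB_API_TOKEN", ""),
        "repo_id": os.getenv("REPO_ID", ""),
        "codebase_dir": os.getenv("CODEBASE_DIR", "."),
        "embedding_model": os.getenv("EMBEDDING_MODEL",
                                     "sentence-transformers/all-MiniLM-L6-v2"),
    }

os.environ["CODEBASE_DIR"] = "/tmp/my_project"  # simulate an exported variable
print(load_config()["codebase_dir"])
```

Keeping secrets such as the HuggingFace token in the environment (or a `.env` file excluded by `.gitignore`) avoids committing them to version control.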
Run the application:

```shell
python app.py
```
Provide a query like:

```
How can I replace FAISS with Chroma?
```
```
+------------------+     +---------------+     +---------------------+
| Document Loader  |---->| Text Splitter |---->| Embedding Generator |
+------------------+     +---------------+     +---------------------+
                                                          |
                                                          v
                                        +----------------------------------+
                                        |        FAISS Vector Store        |
                                        +----------------------------------+
                                                          |
                                                          v
                                        +----------------------------------+
                                        |     Retrieval-Based QA Chain     |
                                        +----------------------------------+
                                                          |
                                                          v
                                  +---------------------------------------------+
                                  | CodeLlama Agent for Analysis & Explanations |
                                  +---------------------------------------------+
                                                          |
                                                          v
                                        +----------------------------------+
                                        | Suggestions for Code Improvement |
                                        +----------------------------------+
```
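The stages in the diagram can be wired together in miniature. Here a toy bag-of-words "embedding" and a brute-force in-memory store stand in for the real HuggingFace embeddings and FAISS index; the class and method names are illustrative, not the project's actual API:

```python
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count (stand-in for a real model)."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    """Overlap score between two bag-of-words 'vectors'."""
    return sum(a[word] * b[word] for word in a)

class MiniPipeline:
    def __init__(self, documents: list[str]):
        # "Vector store": each chunk paired with its embedding.
        self.store = [(doc, embed(doc)) for doc in documents]

    def query(self, question: str) -> str:
        """Retrieval step: return the chunk most similar to the question."""
        q = embed(question)
        return max(self.store, key=lambda item: similarity(q, item[1]))[0]

docs = [
    "load_document recursively scans a directory for python files",
    "split_text breaks large files into overlapping chunks",
    "vector_store builds a faiss index for semantic search",
]
pipeline = MiniPipeline(docs)
print(pipeline.query("how are large files split into chunks"))
```

In the real pipeline the retrieved chunks are then passed to the CodeLlama agent, which generates the explanation and improvement suggestions.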
```
CodeXpert/
│
├── codexpert/
│   ├── components/
│   │   ├── load_document.py      # Handles document loading from the codebase
│   │   ├── split_text.py         # Splits documents into manageable chunks
│   │   ├── get_embeddings.py     # Generates embeddings using HuggingFace models
│   │   ├── codellama_agent.py    # Code analysis agent powered by Llama models
│   │   ├── vector_store.py       # Manages FAISS vector store initialization
│   │   └── llm_agent.py          # Handles LLM setup and question-answering
│   │
│   └── config/
│       └── constants.py          # Contains configurations like API tokens and file paths
│
├── tests/                        # Contains unit tests for all components
│   ├── test_load_document.py     # Tests for the document loader
│   ├── test_split_text.py        # Tests for the text splitter
│   ├── test_get_embeddings.py    # Tests for the embedding generator
│   ├── test_codellama_agent.py   # Tests for the CodeLlama agent
│   ├── test_vector_store.py      # Tests for the FAISS vector store
│   └── test_llm_agent.py         # Tests for the LLM setup and QA chain
│
├── .gitignore                    # Specifies files and folders to ignore in version control
├── requirements.txt              # Dependencies required for the project
└── README.md                     # Project documentation (you are here!)
```
To verify the functionality of the components, use `pytest`:
Run all tests:

```shell
pytest CodeXpert/tests/
```
Run tests with detailed output:

```shell
pytest -v
```
Run tests for a specific component:

```shell
pytest CodeXpert/tests/test_<component_name>.py
```
Generate a coverage report (requires `pytest-cov`):

```shell
pip install pytest-cov
pytest --cov=CodeXpert/codexpert
```
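A component test might take the following shape; the `split_text` function here is a self-contained stand-in, since the real test files import the actual components from `codexpert/`:

```python
# Hypothetical sketch of a splitter test; the real tests live in
# CodeXpert/tests/ and exercise the actual component modules.
def split_text(text: str, chunk_size: int) -> list[str]:
    """Stand-in splitter so the test is self-contained."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def test_split_text_respects_chunk_size():
    chunks = split_text("a" * 250, chunk_size=100)
    assert len(chunks) == 3
    assert all(len(c) <= 100 for c in chunks)
    assert "".join(chunks) == "a" * 250  # nothing is lost in splitting

test_split_text_respects_chunk_size()
print("ok")
```

`pytest` discovers any function whose name starts with `test_`, so tests written in this shape run automatically under the commands above.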
- Developers: Enhance understanding of complex codebases.
- Educators: Provide clear code explanations for learners.
- Researchers: Analyze algorithmic code for optimization.
- Organizations: Maintain clean, optimized, and well-documented repositories.
- File Types: Ensure the target codebase contains supported extensions (e.g., `.py`).
- Environment Setup: Use a virtual environment to isolate dependencies.
- Model Performance: Adjust embedding and LLM parameters for optimal results.
We welcome contributions! If you'd like to improve the pipeline, please:
- Fork this repository.
- Create a new branch for your feature or fix.
- Submit a pull request with a detailed description.
This project is licensed under the Apache License. See the `LICENSE` file for details.
Feel free to reach out for questions or feedback:
- Email: abunasserip@gmail.com
- LinkedIn: @M.N.Gaber
Special thanks to:
- HuggingFace for hosting world-class AI models.
- LangChain for simplifying LLM workflows.
- FAISS for fast and efficient retrieval.
Ready to revolutionize code analysis? Dive in today and supercharge your development process!