Code Quality
Libraries that facilitate consistent coding practices:
- Scalafmt - Code formatter
- Scalafix - Code refactoring tool
- WartRemover - Static code analyzer
- Trunk - Code quality tool
- ArchUnit - Architecture testing tool
Large Language Models (LLMs) have gained significant popularity in recent years due to their remarkable question-answering capabilities. However, when tackling a large codebase, the quality of their answers varies, largely because the model cannot focus on the relevant context. To address this challenge, this work presents a Retrieval-Augmented Generation (RAG) pipeline designed to facilitate code retrieval and understanding within GitHub repositories. The approach empowers LLMs by combining non-parametric memory (retrieved code snippets) with parametric memory (pre-trained LLM weights) to generate insightful answers.
The project emphasizes the engineering process, following an agile methodology and documenting development throughout. Notably, the system uses open-source technologies such as Ollama and Qdrant, enabling various open LLMs to index and retrieve code snippets locally, without reliance on proprietary services.
The work aims to flatten the steep learning curve associated with understanding large codebases and to provide insightful explanations of complex coding concepts. The library is built in Scala on top of the Langchain4j framework and integrates with the LLM through interfaces built with Gradio and Scala.js. The library is available at https://github.com/atomwalk12/git-inspector.
Dependencies that ease development:
The project also uses Conventional Commits, and the Gemini chatbot reviews pull requests.
Below are the summarized requirements, while the full requirements document is available here.
Acceptance tests that verify correct behavior according to the specification document are present under the test directory, and the traceability matrix is available here.
Two indexes are produced for each Git repository, one per data modality (code and natural language instructions). At runtime, the user can choose which collection to retrieve from via the QueryRoutingStrategy class, which assesses the query using a separate LLM.
The code index supports multiple file extension types. To narrow the search to specific code file types, the LLMClassificationFilterStrategy class can be used; it enables runtime filtering that restricts the search to a single extension type.
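To make the routing and filtering ideas concrete, here is a minimal, self-contained sketch. It is illustrative only: the real QueryRoutingStrategy and LLMClassificationFilterStrategy classes may expose different signatures, and a keyword heuristic stands in for the separate classifier LLM.

```scala
object RoutingSketch:

  enum Collection:
    case Code, Text

  // Stand-in for the separate LLM that classifies the query.
  // A keyword heuristic plays the role of the model's answer here.
  def classifyWithLlm(query: String): Collection =
    val codeHints = List("function", "class", "def ", "implement", "bug")
    if codeHints.exists(query.toLowerCase.contains) then Collection.Code
    else Collection.Text

  // Runtime filter narrowing the code search to one extension type.
  def extensionFilter(ext: String)(fileName: String): Boolean =
    fileName.endsWith(ext)

@main def routingDemo(): Unit =
  val query = "How is the Splitter class implemented?"
  println(s"route '$query' -> ${RoutingSketch.classifyWithLlm(query)}")

  val files = List("Main.scala", "utils.py", "README.md")
  println(files.filter(RoutingSketch.extensionFilter(".scala")))
```

In the actual pipeline the classification step is an LLM call, so the routing decision adapts to the query's intent rather than fixed keywords.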
The parsing algorithm was ported from the LangChain Python library, available here. Below is pseudocode describing the key ideas:
This recursive text-splitting algorithm breaks text into manageable chunks:
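The key ideas can be sketched in Scala as follows. This is a sketch in the spirit of LangChain's RecursiveCharacterTextSplitter, not the library's actual implementation: the separator list, naming, and merge details are illustrative. The algorithm tries separators in order (paragraph, line, word), recurses into pieces that are still too large, and greedily merges small neighbours back together.

```scala
object RecursiveSplitter:

  // Separators tried from coarsest to finest.
  val separators = List("\n\n", "\n", " ")

  def split(text: String, chunkSize: Int,
            seps: List[String] = separators): List[String] =
    if text.length <= chunkSize then List(text)
    else seps match
      case Nil =>
        // No separator left: hard-cut into fixed-size windows.
        text.grouped(chunkSize).toList
      case sep :: rest =>
        val parts = text.split(java.util.regex.Pattern.quote(sep)).toList
        if parts.length == 1 then
          split(text, chunkSize, rest) // separator absent, try a finer one
        else
          // Recurse into oversized parts, then merge small neighbours
          // while they still fit inside one chunk.
          val pieces = parts.flatMap(p => split(p, chunkSize, rest))
          merge(pieces, sep, chunkSize)

  private def merge(pieces: List[String], sep: String,
                    chunkSize: Int): List[String] =
    pieces.foldLeft(List.empty[String]) { (acc, p) =>
      acc.lastOption match
        case Some(last) if last.length + sep.length + p.length <= chunkSize =>
          acc.init :+ (last + sep + p)
        case _ => acc :+ p
    }
```

For example, `RecursiveSplitter.split("a b c", 3)` falls through to the word separator and yields `List("a b", "c")`: the first two words merge back into one chunk because they fit, while the third would overflow it.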
The application allows local inference using the Ollama backend with minimal configuration; however, it can also be used with a pay-to-use API such as Gemini or Anthropic. This requires setting the appropriate API key in the application.conf file.
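A hypothetical sketch of what such a configuration might look like; the actual key names used by the project may differ, so check the repository's reference configuration before copying this:

```hocon
# Illustrative application.conf fragment -- key names are assumptions.
llm {
  provider = "ollama"            # or "gemini" / "anthropic"
  api-key  = ${?GEMINI_API_KEY}  # optional env substitution, only needed
                                 # for the pay-to-use providers
}
```

Using the `${?ENV_VAR}` HOCON substitution keeps the key out of version control.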
Visualizations of code and text embeddings across different repositories:
- Distribution of code embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Similarity analysis of code embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Distribution of text embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Similarity analysis of text embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
The questionnaire complements the requirements document and is available here. The final average score is 85%.
User Evaluation of the Web Interface
The release folder on GitHub contains the files necessary to run the application; however, you still need to set up the environment by installing the dependencies listed under the prerequisites section.
The release files are available in the releases section.
@article{vasile2025gitinspector,
author = {Razvan Florian Vasile},
title = {Git Inspector! Inspecting Github Repositories with Open-Source LLM},
year = {2025},
}