Git Inspector! Inspecting Github Repositories with Open-Source LLMs

University of Bologna
Git Inspector Teaser

Git Inspector is a tool for searching and understanding codebases using open-source LLMs.

Abstract

Large Language Models (LLMs) have gained significant popularity in recent years due to their remarkable question answering capabilities. However, when tackling a large codebase, answer quality varies, largely because the model cannot focus on the relevant context. To address this challenge, this work presents a Retrieval Augmented Generation (RAG) pipeline designed to facilitate code retrieval and understanding for GitHub repositories. The approach combines non-parametric memory (retrieved code snippets) with parametric memory (pre-trained LLM weights) to generate informative answers.

The project emphasizes the engineering process: it follows the agile methodology and documents the development process. The system relies on open-source technologies such as Ollama and Qdrant, so a range of open LLMs can be used with local indexing and retrieval of code snippets, without reliance on proprietary services.

The work aims to reduce the steep learning curve associated with understanding large codebases and to provide insightful explanations of complex coding concepts. The library is built in Scala on top of the Langchain4j framework, and exposes the LLM through interfaces built with Gradio and Scala.js. The library is available at https://github.com/atomwalk12/git-inspector.

Video

Workflow

An illustration of the development process

Tools

Code Quality

Libraries that facilitate consistent coding practices:

  • Scalafmt - Code formatter
  • Scalafix - Code refactoring tool
  • Wartremover - Static code analyzer
  • Trunk - Code quality tool
  • ArchUnit - Architecture testing tool

Libraries

Dependencies that ease development:

  • Langchain4j - Framework for building agents
  • Ollama - LLM provider
  • Qdrant - Vector database
  • Laminar - Web framework built on top of Scala.js
  • Gradio - Web interface for LLMs

The project also uses Conventional Commits and the Gemini chatbot to review pull requests.

Requirements

The requirements are summarized below; the full requirements document is available here.

Key requirements for the Git Inspector library

Project Requirements

Acceptance Tests

Acceptance tests under the test directory verify that the system behaves according to the specification document. The traceability matrix is available here.

Features

Conditional Index Retrieval

Two indexes are produced for each Git repository, one for each of the two data modalities (code and natural language instructions). Using the QueryRoutingStrategy class, the choice of which collection to retrieve information from can be made at runtime. This is done by assessing the query with a separate LLM.
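
The routing idea can be sketched as follows. This is a minimal illustration, not the library's implementation: IndexKind, QueryClassifier, RoutingSketch and KeywordClassifier are hypothetical names, and the keyword-based stub only stands in for the separate LLM so that the sketch runs without a model.

```scala
// Hypothetical sketch: none of these names are taken from the Git Inspector codebase.
sealed trait IndexKind
object IndexKind {
  case object Code extends IndexKind
  case object Text extends IndexKind
}

// Stand-in for the separate LLM that assesses the query: it receives a prompt
// and returns a free-form answer.
trait QueryClassifier {
  def complete(prompt: String): String
}

final class RoutingSketch(classifier: QueryClassifier) {
  private val promptTemplate =
    "Answer with exactly one word, 'code' or 'text'. " +
      "Which index should answer this query?\nQuery: "

  // Falls back to the code index when the answer cannot be interpreted.
  def route(query: String): IndexKind = {
    val answer = classifier.complete(promptTemplate + query).trim.toLowerCase
    if (answer.startsWith("text")) IndexKind.Text else IndexKind.Code
  }
}

// Keyword-based stub so the sketch runs without a model (an assumption for
// illustration only; a real classifier would call an LLM).
object KeywordClassifier extends QueryClassifier {
  private val marker = "Query: "

  def complete(prompt: String): String = {
    // Look only at the user query, not at the instructions in the prompt.
    val query = prompt.substring(prompt.indexOf(marker) + marker.length)
    if (query.toLowerCase.contains("install")) "text" else "code"
  }
}
```

With the stub, `new RoutingSketch(KeywordClassifier).route(...)` sends installation questions to the text index and everything else to the code index. Defaulting to one index when the model's answer is unusable is a design choice of this sketch.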


Filtering by extension type

The code index supports multiple extension types. To narrow the search to a specific code file type, use the LLMClassificationFilterStrategy class, which filters at runtime so that the search includes only one specific extension type.
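
A minimal sketch of extension filtering, under stated assumptions: CodeChunk and ExtensionFilterSketch are hypothetical names, the `classify` function stands in for the LLM call, and the filter is applied in memory here, whereas a real system might push it down to the vector database instead.

```scala
// Hypothetical sketch: types and names are illustrative, not the library's API.
final case class CodeChunk(path: String, content: String) {
  // File extension without the dot, lower-cased; empty when the file has none.
  def extension: String = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot <= 0) "" else name.substring(dot + 1).toLowerCase
  }
}

object ExtensionFilterSketch {
  // `classify` stands in for the LLM call: given the query and the extensions
  // present in the index, it returns the extension it considers relevant.
  def filter(
      query: String,
      chunks: Seq[CodeChunk],
      classify: (String, Set[String]) => Option[String]
  ): Seq[CodeChunk] = {
    val available = chunks.map(_.extension).filter(_.nonEmpty).toSet
    classify(query, available).map(_.toLowerCase.stripPrefix(".")) match {
      // Keep only chunks of the chosen type, provided it exists in the index.
      case Some(ext) if available.contains(ext) => chunks.filter(_.extension == ext)
      // Unknown or missing answer: leave the search unrestricted.
      case _ => chunks
    }
  }
}
```

Leaving the search unrestricted when the classifier returns an extension that is not in the index avoids empty results caused by a wrong guess.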


Parsing algorithm

The parsing algorithm was ported from the Langchain Python library available here. Below is pseudocode describing the key ideas:

Recursive text splitting algorithm

Algorithm Description

This recursive text splitting algorithm efficiently parses text into manageable chunks:

  1. Choose a high-priority separator from the available options
  2. Split the text using the selected separator
  3. Process each resulting split:
    • Small chunks are buffered for later merging
    • Large chunks are processed recursively with remaining separators
  4. Finally, merge any remaining small chunks in the buffer
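
The four steps above can be sketched in Scala as follows. This is a simplified illustration rather than the ported code: RecursiveSplitterSketch is a hypothetical name, separators are matched literally, chunk overlap is omitted, and text that no separator can break is cut into fixed-size slices.

```scala
object RecursiveSplitterSketch {

  // Greedily joins small pieces with `sep` so that each merged chunk stays
  // within `chunkSize` characters. Pieces are assumed to be non-empty.
  private def merge(pieces: Seq[String], sep: String, chunkSize: Int): Vector[String] = {
    val merged = Vector.newBuilder[String]
    var current = ""
    for (piece <- pieces) {
      if (current.isEmpty) current = piece
      else if (current.length + sep.length + piece.length <= chunkSize)
        current = current + sep + piece
      else {
        merged += current
        current = piece
      }
    }
    if (current.nonEmpty) merged += current
    merged.result()
  }

  // `separators` is ordered from highest to lowest priority.
  def split(text: String, separators: List[String], chunkSize: Int): Vector[String] = {
    require(chunkSize > 0, "chunkSize must be positive")
    // Step 1: pick the first (highest-priority) separator that occurs in the text.
    separators.dropWhile(sep => sep.isEmpty || !text.contains(sep)) match {
      case Nil =>
        // No separator applies: cut the text into fixed-size slices.
        text.grouped(chunkSize).toVector
      case sep :: remaining =>
        // Step 2: split on the chosen separator (literal match, not a regex).
        val splits = text.split(java.util.regex.Pattern.quote(sep)).toVector.filter(_.nonEmpty)
        val chunks = Vector.newBuilder[String]
        var buffer = Vector.empty[String]
        // Step 3: buffer small splits, recurse into large ones.
        for (piece <- splits) {
          if (piece.length <= chunkSize) buffer = buffer :+ piece
          else {
            chunks ++= merge(buffer, sep, chunkSize)
            buffer = Vector.empty
            chunks ++= split(piece, remaining, chunkSize)
          }
        }
        // Step 4: merge whatever is left in the buffer.
        chunks ++= merge(buffer, sep, chunkSize)
        chunks.result()
    }
  }
}
```

For example, `RecursiveSplitterSketch.split(text, List("\n\n", "\n", " "), 200)` tries paragraph breaks first, then line breaks, then spaces. Flushing the buffer before recursing keeps the output chunks in their original order.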

Local Inference with optional pay-to-use inference

The application supports local inference through the Ollama backend with minimal configuration. It can also be used with a pay-to-use API from Gemini or Anthropic, which requires setting the correct API key in the application.conf file.
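
The choice between local and pay-to-use inference could be modelled roughly as below. Everything here is an assumption made for illustration: the Backend type, the BackendSelectionSketch object, and the "provider" and "<provider>.api-key" setting names are hypothetical and are not the keys used in application.conf.

```scala
// Hypothetical sketch: names and setting keys are illustrative only.
sealed trait Backend
object Backend {
  case object Ollama extends Backend // local inference, no key required
  final case class Gemini(apiKey: String) extends Backend
  final case class Anthropic(apiKey: String) extends Backend
}

object BackendSelectionSketch {
  // `settings` stands in for values read from a configuration file.
  def select(settings: Map[String, String]): Either[String, Backend] = {
    // A hosted provider is only usable when a non-blank API key is present.
    def keyFor(provider: String): Either[String, String] =
      settings
        .get(s"$provider.api-key")
        .map(_.trim)
        .filter(_.nonEmpty)
        .toRight(s"Missing API key for provider '$provider'")

    // Local inference is the default when no provider is configured.
    settings.getOrElse("provider", "ollama").toLowerCase match {
      case "ollama"    => Right(Backend.Ollama)
      case "gemini"    => keyFor("gemini").map(Backend.Gemini(_))
      case "anthropic" => keyFor("anthropic").map(Backend.Anthropic(_))
      case other       => Left(s"Unknown provider '$other'")
    }
  }
}
```

Returning an Either makes a missing key an explicit configuration error rather than a failure at the first request.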

Evaluation

Visualizations of code and text embeddings across different repositories

System Usability Scale (SUS)

The questionnaire complements the requirements document and is available here. The final average score is 85%.
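
For reference, the standard SUS scoring rule can be sketched as follows, assuming the questionnaire follows the usual ten-item format with answers on a 1 to 5 scale. SusScoreSketch is a hypothetical name, and no response data from this evaluation is reproduced here.

```scala
object SusScoreSketch {
  // Score for one respondent: ten answers on a 1-5 scale, in questionnaire order.
  def score(responses: Seq[Int]): Double = {
    require(responses.length == 10, "SUS uses exactly ten items")
    require(responses.forall(r => r >= 1 && r <= 5), "answers must be in 1..5")
    val contributions = responses.zipWithIndex.map { case (r, i) =>
      // Items 1, 3, 5, 7, 9 (even index) are positively worded and contribute r - 1;
      // items 2, 4, 6, 8, 10 are negatively worded and contribute 5 - r.
      if (i % 2 == 0) r - 1 else 5 - r
    }
    // The sum ranges from 0 to 40; scaling by 2.5 maps it to 0-100.
    contributions.sum * 2.5
  }

  // Mean score over all respondents.
  def average(all: Seq[Seq[Int]]): Double = {
    require(all.nonEmpty, "at least one respondent is required")
    all.map(score).sum / all.size
  }
}
```

The reported figure would correspond to `average` applied to all respondents' answers.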

SUS Questionnaire Bar Chart

SUS results chart

User Evaluation of the Web Interface

Related Notes

The release folder on GitHub contains the files needed to run the application. The environment must still be set up by installing the dependencies listed under the prerequisites section.

The release files are available in the releases section.

BibTeX

@article{vasile2025gitinspector,
  author    = {Razvan Florian Vasile},
  title     = {Git Inspector! Inspecting Github Repositories with Open-Source LLMs},
  year      = {2025},
}