Code Quality
Libraries that facilitate consistent coding practices:
- Scalafmt - Code formatter
- Scalafix - Code refactoring tool
- WartRemover - Static code analyzer
- Trunk - Code quality tool
- ArchUnit - Architecture testing tool
Large Language Models (LLMs) have gained significant popularity in recent years due to their remarkable question-answering capabilities. However, when tackling a large codebase, the quality of their answers varies, largely because the model cannot focus on the relevant context. To address this challenge, this work presents a Retrieval-Augmented Generation (RAG) pipeline designed to facilitate code retrieval and understanding within GitHub repositories. The approach empowers LLMs by combining non-parametric memory (retrieved code snippets) with parametric memory (pre-trained LLM weights) to generate insightful answers.
The project emphasizes the engineering process, following an agile methodology and documenting development throughout. Notably, the system uses open-source technologies such as Ollama and Qdrant, enabling various open LLMs to index and retrieve code snippets locally, without reliance on proprietary services.
The work aims to flatten the steep learning curve associated with understanding large codebases and to provide insightful explanations of complex coding concepts. The library is built in Scala on top of the Langchain4j framework and integrates with the LLM through interfaces built with Gradio and Scala.js. The library is available at https://github.com/atomwalk12/git-inspector.
Dependencies that ease development:
The project also uses Conventional Commits, and the Gemini chatbot reviews pull requests.
Below are the summarized requirements, while the full requirements document is available here.
Acceptance tests that verify correct behavior according to the specification document are present under the test directory, and the traceability matrix is available here.
Two indexes are produced for each Git repository, one per data modality (code and natural language instructions). At runtime, the user can choose which collection to retrieve from via the QueryRoutingStrategy class, which assesses the query using a separate LLM.
The code index supports multiple file extension types. To narrow the search to specific code file types, the LLMClassificationFilterStrategy class can be used; it enables runtime filtering that restricts the search to a single extension type.
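To make the routing and filtering ideas concrete, here is a minimal, self-contained sketch. It is illustrative only: the real QueryRoutingStrategy and LLMClassificationFilterStrategy classes may expose different signatures, and a keyword heuristic stands in for the separate classifier LLM.

```scala
object RoutingSketch:

  enum Collection:
    case Code, Text

  // Stand-in for the separate LLM that classifies the query.
  // A keyword heuristic plays the role of the model's answer here.
  def classifyWithLlm(query: String): Collection =
    val codeHints = List("function", "class", "def ", "implement", "bug")
    if codeHints.exists(query.toLowerCase.contains) then Collection.Code
    else Collection.Text

  // Runtime filter narrowing the code search to one extension type.
  def extensionFilter(ext: String)(fileName: String): Boolean =
    fileName.endsWith(ext)

@main def routingDemo(): Unit =
  val query = "How is the Splitter class implemented?"
  println(s"route '$query' -> ${RoutingSketch.classifyWithLlm(query)}")

  val files = List("Main.scala", "utils.py", "README.md")
  println(files.filter(RoutingSketch.extensionFilter(".scala")))
```

In the actual pipeline the classification step is an LLM call, so the routing decision adapts to the query's intent rather than fixed keywords.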
The parsing algorithm was ported from the LangChain Python library, available here. Below is pseudocode describing the key ideas:
This recursive text-splitting algorithm breaks text into manageable chunks:
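The key ideas can be sketched in Scala as follows. This is a sketch in the spirit of LangChain's RecursiveCharacterTextSplitter, not the library's actual implementation: the separator list, naming, and merge details are illustrative. The algorithm tries separators in order (paragraph, line, word), recurses into pieces that are still too large, and greedily merges small neighbours back together.

```scala
object RecursiveSplitter:

  // Separators tried from coarsest to finest.
  val separators = List("\n\n", "\n", " ")

  def split(text: String, chunkSize: Int,
            seps: List[String] = separators): List[String] =
    if text.length <= chunkSize then List(text)
    else seps match
      case Nil =>
        // No separator left: hard-cut into fixed-size windows.
        text.grouped(chunkSize).toList
      case sep :: rest =>
        val parts = text.split(java.util.regex.Pattern.quote(sep)).toList
        if parts.length == 1 then
          split(text, chunkSize, rest) // separator absent, try a finer one
        else
          // Recurse into oversized parts, then merge small neighbours
          // while they still fit inside one chunk.
          val pieces = parts.flatMap(p => split(p, chunkSize, rest))
          merge(pieces, sep, chunkSize)

  private def merge(pieces: List[String], sep: String,
                    chunkSize: Int): List[String] =
    pieces.foldLeft(List.empty[String]) { (acc, p) =>
      acc.lastOption match
        case Some(last) if last.length + sep.length + p.length <= chunkSize =>
          acc.init :+ (last + sep + p)
        case _ => acc :+ p
    }
```

For example, `RecursiveSplitter.split("a b c", 3)` falls through to the word separator and yields `List("a b", "c")`: the first two words merge back into one chunk because they fit, while the third would overflow it.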
The application allows local inference using the Ollama backend with minimal configuration; however, it can also be used with a pay-to-use API such as Gemini or Anthropic. This requires setting the appropriate API key in the application.conf file.
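A hypothetical sketch of what such a configuration might look like; the actual key names used by the project may differ, so check the repository's reference configuration before copying this:

```hocon
# Illustrative application.conf fragment -- key names are assumptions.
llm {
  provider = "ollama"            # or "gemini" / "anthropic"
  api-key  = ${?GEMINI_API_KEY}  # optional env substitution, only needed
                                 # for the pay-to-use providers
}
```

Using the `${?ENV_VAR}` HOCON substitution keeps the key out of version control.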
Visualizations of code and text embeddings across different repositories:
- Distribution of code embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Similarity analysis of code embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Distribution of text embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
- Similarity analysis of text embeddings across the OpenAI Scala client, SciPy, and Scala 3 repositories
The questionnaire complements the requirements document and is available here. The final average score is 85%.
User Evaluation of the Web Interface
The release folder on GitHub contains the files necessary to run the application; however, you still need to set up the environment by installing the dependencies listed under the prerequisites section.
The release files are available in the releases section.
@article{vasile2025gitinspector,
author = {Razvan Florian Vasile},
title = {Git Inspector! Inspecting Github Repositories with Open-Source LLM},
year = {2025},
}