Summary

| Req ID | Description |
|--------|-------------|
| BR1 | Allow users to efficiently search and understand code within Git repositories. |
| BR2 | Improve developer productivity by facilitating code search/understanding workflows. |
| FR1.1 | As a user, I can specify a Git repository URL to inspect its code. |
| FR1.2 | As a user, I can search for code using keywords or natural language queries. |
| FR1.3 | As a user, I can view the search results with code snippets and links to the original files in the repository. |
| FR1.4 | As a user, I can filter search results by programming language. |
| FR1.5 | As a user, I can view the context around a code snippet in the search results. |
| FR1.6 | As a user, I can ask code-related questions via chat, and the chat history is preserved. |
| FR2.1 | As a developer, I need the system to fetch and clone Git repositories from provided URLs. |
| FR2.2 | As a developer, I need the system to index the code of the fetched repositories, to generate fast responses. |
| FR2.3 | As a developer, I need the system to use a vector database to store code embeddings for semantic search. |
| FR2.4 | As a developer, I need the system to integrate with an LLM to process natural language queries. |
| NFR1 | The system will index code for targeted repositories within 10 seconds on the specified hardware. |
| NFR2 | The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users. |
| NFR3 | The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks. |
| NFR4 | A 2D visualization tool will display code embeddings to help analyze and improve indexing and search. |
| IR1 | The system should be implemented in Scala, following functional programming principles. |
| IR2 | The system should use Qdrant as the vector database for code embeddings. |
| IR3 | The system should integrate with Ollama for LLM functionalities. |
| IR4 | The system should follow a layered architecture approach, ensuring better modularity. |

Requirements Specification

Below is a more thorough description of the requirements:

Code Search Productivity (BR1)

- Choice: Enable developers to efficiently search and understand code within Git repositories.
  - Rationale: Developers spend significant time searching through codebases, and improving this process directly impacts productivity.
  - Validation Criteria:
    - At least 85\% of test users report improved workflow efficiency in post-usage surveys (SUS).
    - Average query-to-result time under 10 seconds for the predetermined repositories.
  - Implementation Considerations:
    - Ensure integration with common development workflows. This can be done using Gradio interfaces.
    - Focus on search result quality and relevance (i.e. analyze the effectiveness of the generated embeddings).
  - Related Requirements:
    - FR1.2 (Search)
    - FR1.5 (Context)
    - FR1.6 (Chat)

Improving Code Understanding (BR2)

- Choice: Improve developer productivity by facilitating code search/understanding workflows.
  - Rationale: Understanding existing code is often more time-consuming than writing new code.
  - Validation Criteria:
    - Code explanations rated as "accurate and helpful" by at least 70\% of test users (SUS).
  - Implementation Considerations:
    - Implement contextual code explanations (i.e. use separate models that understand code and natural language).
    - Visualize relationships between embeddings using a 2D visualization tool.
    - Prioritize speed and accuracy in responses by allowing users to select any open-source model.
  - Related Requirements:
    - FR1.6 (Chat)
    - FR2.4 (LLM integration)
    - NFR2 (Usability)
    - NFR4 (Visualization)

Repository URL Input Interface (FR1.1)

- Choice: As a user, I can specify a Git repository URL to inspect its code, so that I can access and analyze specific codebases I'm interested in.
  - Rationale: The system needs a secure and user-friendly way to accept Git repository URLs.
  - Validation Criteria:
    - The acceptance tests parse valid GitHub URLs successfully.
    - Error feedback displayed within 2 seconds of validation failure.
    - URL validation completes within half a second for all inputs.
  - Implementation Considerations:
    - Implement URL validation and display clear feedback in case of parsing errors.
  - Related Requirements:
    - FR2.1 (Repository Cloning)
    - NFR3 (Security)
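
The URL validation step above can be sketched as follows. This is a minimal illustration in Python (chosen only for brevity), assuming that only HTTPS GitHub URLs of the form `https://github.com/<owner>/<repo>` are accepted; the function name `parse_repo_url` and the exact pattern are hypothetical, not the project's actual rules.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: https://github.com/<owner>/<repo>, optional ".git" suffix and trailing slash.
_GITHUB_URL = re.compile(
    r"^https://github\.com/"
    r"(?P<owner>[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)/"
    r"(?P<repo>[A-Za-z0-9._-]+?)(?:\.git)?/?$"
)


def parse_repo_url(url: str) -> Tuple[Optional[Tuple[str, str]], Optional[str]]:
    """Return ((owner, repo), None) on success or (None, error_message) on failure."""
    match = _GITHUB_URL.match(url.strip())
    if match is None:
        # A clear message that the UI can show as parsing feedback.
        return None, "Expected a URL like https://github.com/<owner>/<repo>"
    return (match.group("owner"), match.group("repo")), None
```

Returning an error message instead of raising keeps the feedback path explicit, so the interface can display it directly next to the input field.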

Code Search Using Markdown (FR1.2)

- Choice: As a user, I can search for code using keywords or natural language, so that I can quickly find relevant code sections without manually browsing through files.
  - Rationale: This allows users to search for code using natural language, making it easier to find relevant code sections.
  - Validation Criteria:
    - Search results return in under 2 seconds for the predetermined repositories.
    - Language filtering correctly categorizes at least 95\% of code files.
  - Implementation Considerations:
    - Use an embedding model to convert natural language queries to vectors.
    - Support filtering by language, content type and extension.
  - Related Requirements:
    - FR2.2 (Code Indexing)
    - FR2.3 (Vector Database)
    - NFR1 (Performance)
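
To illustrate how a query vector and a language filter combine, here is a minimal in-memory sketch in Python. It assumes the query has already been converted to a vector by an embedding model; the toy three-dimensional vectors, the `index` entries, and the function names are hypothetical stand-ins for what the vector database would do.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: Sequence[float], index: List[Dict],
           language: Optional[str] = None, top_k: int = 3) -> List[Dict]:
    """Filter by language metadata first, then rank the remainder by similarity."""
    candidates = [e for e in index if language is None or e["language"] == language]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return ranked[:top_k]


# Hypothetical index entries; real embeddings have hundreds of dimensions.
index = [
    {"path": "src/Main.scala", "language": "scala", "vector": [0.9, 0.1, 0.0]},
    {"path": "app/ui.py", "language": "python", "vector": [0.8, 0.2, 0.1]},
    {"path": "docs/README.md", "language": "markdown", "vector": [0.0, 0.1, 0.9]},
]
```

Calling `search([1.0, 0.0, 0.0], index, language="python")` only ranks entries whose metadata matches the filter.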

Search Results Display System (FR1.3)

- Choice: As a user, I can view the search results with code snippets and links to the original files in the repository, so that I can efficiently evaluate search results and navigate to the full context when needed.
  - Validation Criteria:
    - 95\% of users can correctly identify file locations from the display, as assessed via the SUS survey.
    - Code snippets maintain proper indentation and formatting (Python, Scala frontend).
  - Implementation Considerations:
    - Display code snippets with syntax highlighting (Python).
    - Show file path and location information.
  - Related Requirements:
    - FR1.2 (Search)
    - FR1.5 (Context)
    - NFR2 (Usability)
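
A link back to the original file can be built from the file path and line range. The sketch below (Python, for illustration) assumes the repository is hosted on GitHub and uses GitHub's `blob` URL form with `#L<start>-L<end>` line anchors; the function name and parameters are hypothetical.

```python
from urllib.parse import quote


def snippet_link(owner: str, repo: str, ref: str, path: str,
                 start_line: int, end_line: int) -> str:
    """Build a GitHub URL that points at a line range of a file on a given branch or commit."""
    if start_line == end_line:
        anchor = f"#L{start_line}"
    else:
        anchor = f"#L{start_line}-L{end_line}"
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"https://github.com/{owner}/{repo}/blob/{quote(ref)}/{quote(path)}{anchor}"
```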

Code Search using Code Embeddings (FR1.4)

- Choice: As a user, I can filter search results by programming language, so that I can focus on code written in languages relevant to my current task.
  - Rationale: Developers often need to restrict searches to specific languages or file types.
  - Validation Criteria:
    - Language detection accuracy >95\% across all common programming languages (see parser impl.).
    - Multiple simultaneous filters function correctly in 100\% of cases, as assessed by the acceptance tests.
  - Implementation Considerations:
    - Detect and classify programming languages during indexing.
    - Create efficient language metadata for fast filtering.
    - Support multiple simultaneous language filters.
    - Include language identification in UI.
  - Related Requirements:
    - FR1.2 (Search)
    - FR1.3 (Search Results)
    - FR2.2 (Code Indexing)
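
One simple way to classify languages during indexing is by file extension. The following Python sketch assumes extension-based detection and a small hypothetical mapping; the actual parser may use a different strategy or a larger table.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Set

# Hypothetical mapping; extend it with the languages the indexer needs to support.
EXTENSION_TO_LANGUAGE = {
    ".scala": "scala",
    ".py": "python",
    ".java": "java",
    ".ts": "typescript",
    ".md": "markdown",
}


def detect_language(path: str) -> Optional[str]:
    """Return the language for a file path, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())


def filter_by_languages(paths: Iterable[str], languages: Set[str]) -> List[str]:
    """Keep only paths whose detected language is in the selected set (multiple filters)."""
    return [p for p in paths if detect_language(p) in languages]
```

Storing the detected language as metadata at indexing time means that filtering later is a set membership check rather than a re-parse of the file.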

Code Context Visualization (FR1.5)

- Choice: As a user, I can view the context around a code snippet in the search results, so that I can better understand how the code fits into the broader implementation.
  - Rationale: This allows users to view the context around a code snippet in the search results, making it easier to understand how the code fits into the broader implementation.
  - Validation Criteria:
    - Fetching and loading a repository completes within 5 seconds for the predetermined repositories.
    - 90\% of users report sufficient context for understanding code purpose (SUS).
  - Implementation Considerations:
    - Display the entire code being indexed in the search results.
    - When answering questions, display relevant snippets from the codebase.
    - Allow the user to switch between the full text and retrieved snippets.
  - Related Requirements:
    - FR1.3 (Search Results)
    - NFR2 (Usability)
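
Showing the context around a retrieved snippet amounts to widening its line range within the full file text. A minimal Python sketch, assuming 1-based inclusive line numbers and a hypothetical `padding` parameter:

```python
from typing import List, Tuple


def context_window(source: str, start_line: int, end_line: int,
                   padding: int = 3) -> Tuple[int, List[str]]:
    """Return (first_line_number, lines) covering the snippet plus `padding` lines on each side."""
    lines = source.splitlines()
    first = max(1, start_line - padding)         # clamp at the top of the file
    last = min(len(lines), end_line + padding)   # clamp at the bottom of the file
    return first, lines[first - 1:last]
```

Returning the first line number alongside the lines lets the display keep accurate line numbering when switching between the full text and the retrieved snippet.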

Model with Past Chat History (FR1.6)

- Choice: As a user, I can ask code-related questions via chat, and the chat history is preserved, so that I can have a continuous conversation with the system.
  - Rationale: This allows users to have a continuous conversation with the system, making it easier to understand how the code fits into the broader implementation.
  - Validation Criteria:
    - Context-aware responses remain relevant for at least 2 consecutive related questions.
    - Chat history can be cleared by regenerating the index.
  - Implementation Considerations:
    - Maintain chat history within session scope.
    - Structure LLM prompts to include chat history and retrieved code.
  - Related Requirements:
    - FR2.4 (LLM Integration)
    - FR2.3 (Vector Database)
    - NFR2 (Usability)
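
The two implementation considerations above (session-scoped history, prompts that combine history with retrieved code) can be sketched as follows. This Python example is illustrative only: the `ChatSession` class, the prompt layout, and the `max_turns` limit are assumptions, not the project's actual prompt format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ChatSession:
    """Holds (role, text) turns for one session; cleared when the index is regenerated."""
    history: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def clear(self) -> None:
        self.history.clear()


def build_prompt(session: ChatSession, question: str,
                 snippets: List[str], max_turns: int = 6) -> str:
    """Combine retrieved code, the most recent turns, and the new question into one prompt."""
    recent = session.history[-max_turns:] if max_turns > 0 else []
    parts = ["You are a code assistant. Answer using the code context below.",
             "## Code context"]
    parts.extend(snippets)
    parts.append("## Conversation so far")
    parts.extend(f"{role}: {text}" for role, text in recent)
    parts.append(f"user: {question}")
    return "\n".join(parts)
```

Limiting the prompt to the most recent turns keeps it bounded while still covering the "2 consecutive related questions" criterion.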

Repository Cloning and Management (FR2.1)

- Choice: As a developer, I need the system to fetch and clone Git repositories from provided URLs, so that I can work with up-to-date code without performing these operations manually.
  - Rationale: This allows developers to work with up-to-date code without performing these operations manually.
  - Validation Criteria:
    - Predetermined repositories clone successfully within 30 seconds.
    - UI remains responsive (no blocking) during 100\% of cloning operations.
    - Invalid repository URLs are handled gracefully.
  - Implementation Considerations:
    - Use Uithub for extracting the repository code.
    - Implement caching mechanism for previously cloned repositories.
    - Add repository verification to ensure valid Git URLs.
  - Related Requirements:
    - FR2.2 (Code Indexing)
    - NFR1 (Performance)
    - NFR3 (Security)
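
The caching consideration above can be illustrated with a small wrapper around whatever function actually retrieves the repository. In this Python sketch the `fetch` callable is injected and hypothetical, and the key normalization (lowercasing, dropping `.git` and trailing slashes) is an assumption about which URLs should count as the same repository.

```python
from typing import Callable, Dict


class RepositoryCache:
    """Caches fetched repository content per normalized URL."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # hypothetical fetcher: URL -> repository text
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(url: str) -> str:
        key = url.strip().lower().rstrip("/")
        return key[:-4] if key.endswith(".git") else key

    def get(self, url: str) -> str:
        key = self._key(url)
        if key not in self._store:       # only fetch on a cache miss
            self._store[key] = self._fetch(url)
        return self._store[key]
```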

Vector Database Generation for RAG (FR2.2)

- Choice: As a developer, I need the system to index the code of the fetched repositories, so that I can perform fast and accurate searches across the entire codebase.
  - Validation Criteria:
    - The clusters generated by the embeddings are well defined, suggesting successful embedding generation.
    - Metadata correctly captures language and file type.
  - Implementation Considerations:
    - Use Qdrant for caching code embeddings.
    - Utilize metadata to enhance search result relevance.
  - Related Requirements:
    - FR2.2 (Code Indexing)
    - NFR1 (Performance)
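
Before embeddings are generated, each file is typically split into chunks that carry metadata. The Python sketch below assumes fixed-size line-based chunking with a hypothetical `max_lines` parameter and metadata field names; the actual indexer may split on syntactic boundaries instead.

```python
from typing import Dict, List


def chunk_file(path: str, source: str, language: str, max_lines: int = 40) -> List[Dict]:
    """Split a file into line-based chunks, attaching the metadata used for filtering."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        body = lines[start:start + max_lines]
        chunks.append({
            "text": "\n".join(body),
            "path": path,
            "language": language,
            "start_line": start + 1,          # 1-based, inclusive
            "end_line": start + len(body),
        })
    return chunks
```

Each chunk's `text` would then be embedded, with the remaining fields stored alongside the vector as metadata.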

Semantic Search (FR2.3)

- Choice: As a developer, I need the system to use a vector database to store code embeddings, so that I can perform semantic searches that understand code context beyond simple keyword matching.
  - Validation Criteria:
    - Queries complete in less than 300ms for the predetermined repositories.
    - Vector similarity scores correctly correlate with semantic relevance as determined by the cluster analysis.
  - Implementation Considerations:
    - Configure Qdrant collection schema for code embeddings.
    - Use metadata for filtering specific file types.
  - Related Requirements:
    - FR1.2 (Search)
    - FR2.2 (Code Indexing)
    - NFR1 (Performance)
    - IR2 (Qdrant)
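
As an illustration of the collection schema and metadata filtering, the following Python sketch builds the JSON bodies for two Qdrant REST calls (creating a collection and running a filtered search) without sending them. The collection name, vector size, and the `language` payload key are assumptions for the example.

```python
from typing import Dict, Optional, Sequence, Tuple

Request = Tuple[str, str, Dict]  # (HTTP method, path, JSON body)


def create_collection_request(name: str, vector_size: int,
                              distance: str = "Cosine") -> Request:
    """Body for creating a collection with a fixed vector size and distance metric."""
    return "PUT", f"/collections/{name}", {
        "vectors": {"size": vector_size, "distance": distance},
    }


def search_request(name: str, query_vector: Sequence[float],
                   language: Optional[str] = None, limit: int = 5) -> Request:
    """Body for a similarity search, optionally restricted by a payload field."""
    body: Dict = {"vector": list(query_vector), "limit": limit, "with_payload": True}
    if language is not None:
        # Assumed payload key "language", written at indexing time.
        body["filter"] = {"must": [{"key": "language", "match": {"value": language}}]}
    return "POST", f"/collections/{name}/points/search", body
```

The vector size must match the output dimension of the embedding model in use, so it is passed in rather than hard-coded.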

LLM Integration for Natural Language Queries (FR2.4)

- Choice: As a developer, I need the system to integrate with an LLM to process natural language queries, so that I can interact with the codebase using plain English rather than specialized query syntax.
  - Validation Criteria:
    - Ollama integration successfully handles queries within tests without errors.
  - Implementation Considerations:
    - Implement prompt engineering techniques to guide responses (e.g. conditional RAG, search by file type).
    - Design context management for large repositories (limit the number of tokens being processed).
  - Related Requirements:
    - FR1.6 (Chat Functionality)
    - NFR1 (Performance)
    - NFR3 (Security)
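
Limiting the number of tokens sent to the LLM can be sketched as a budget applied to the retrieved chunks. This Python example assumes the chunks arrive sorted by relevance and uses a rough four-characters-per-token estimate, which is a heuristic assumption rather than the tokenizer of any particular model.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: List[str], max_tokens: int) -> List[str]:
    """Keep the most relevant chunks, in order, until the token budget would be exceeded."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:  # assumed to be sorted from most to least relevant
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```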

System Performance Optimization (NFR1)

- Choice: The system will index code for targeted repositories within 10 seconds on the specified hardware.
  - Success Criteria:
    - Search queries return results in under 40 seconds for the predetermined repositories.
    - Embedding generation completes in under 30 seconds for the predetermined repositories.
    - Chat responses for simple search (without code context) arrive within 20 seconds for all tests.
  - Implementation Considerations:
    - Embeddings are stored in the Qdrant vector database.
    - The embeddings for retrieving chunks are generated using Ollama.
  - Related Requirements:
    - FR2.1 (Repository Cloning)
    - FR2.2 (Code Indexing)
    - FR1.2 (Search)
    - FR1.6 (Chat)
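
The time limits stated in this requirement can be checked with a small timing helper. In this Python sketch the budget values come from the criteria above, while the step names and helper functions are hypothetical.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Budgets in seconds, taken from the requirement and its success criteria.
BUDGETS_SECONDS = {
    "indexing": 10.0,
    "search": 40.0,
    "embedding_generation": 30.0,
    "simple_chat": 20.0,
}


def measure(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def over_budget(measurements: Dict[str, float]) -> List[str]:
    """Return the names of measured steps that exceeded their budget."""
    return sorted(name for name, elapsed in measurements.items()
                  if elapsed > BUDGETS_SECONDS[name])
```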

System Usability Testing (NFR2)

- Choice: The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users.
  - Rationale: This ensures that the system is usable, as evaluated by a group of users.
  - Validation Criteria:
    - First-time users find the interface easy to use without assistance in >70\% of cases (SUS).
    - 80\% of users rate UI intuitiveness as "good" or "excellent" (SUS).
  - Implementation Considerations:
    - Implement clean, intuitive UI. Rely on established UX design patterns by using Gradio components.
  - Related Requirements:
    - FR1.3 (Search Results)
    - FR1.5 (Context)
    - FR1.6 (Chat)
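
For reference, the SUS score is computed from ten responses on a 1 to 5 scale: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a value between 0 and 100. A minimal Python sketch of that calculation, with hypothetical function names:

```python
from typing import List, Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Standard SUS score (0-100) for one participant's ten 1-5 responses."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs exactly ten responses in the range 1-5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


def mean_sus(all_responses: List[Sequence[int]]) -> float:
    """Average SUS score across participants, compared against the 70+ target."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)
```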

User Interface Security (NFR3)

- Choice: The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks.
  - Validation Criteria:
    - 100\% of malformed or malicious URLs are rejected before processing.
  - Implementation Considerations:
    - Validate repository URLs against known valid patterns.
    - Test against standard GitHub URLs.
  - Related Requirements:
    - FR1.1 (Repository Input)
    - FR1.2 (Search)
    - FR1.6 (Chat)
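
Sanitizing a search query before it is rendered can be sketched with HTML escaping. This Python example uses the standard library's `html.escape`; the length limit and the removal of non-printable characters are additional assumptions, and the function name is hypothetical.

```python
import html


def sanitize_query(query: str, max_length: int = 500) -> str:
    """Drop control characters, trim and truncate, then escape HTML-significant characters."""
    cleaned = "".join(ch for ch in query if ch.isprintable() or ch in "\n\t")
    return html.escape(cleaned.strip()[:max_length], quote=True)
```

For example, `sanitize_query('<script>alert("x")</script>')` returns `&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;`, which a browser displays as text instead of executing.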

Embedding Visualization Requirement (NFR4)

- Choice: A 2D visualization tool will display code embeddings to help analyze and improve indexing and search.
  - Validation Criteria:
    - Visualization correctly clusters similar code types.
    - Report analysis identifies strategies for improving search quality.
  - Implementation Considerations:
    - Implement dimension reduction techniques (t-SNE, UMAP) for 2D visualization.
    - Use the visualization to make informed decisions about search quality.
  - Related Requirements:
    - FR2.2 (Code Indexing)
    - FR2.3 (Vector Database)

Scala Implementation Requirement (IR1)

- Choice: The system should be implemented in Scala, following functional programming principles.
  - Success Criteria:
    - Scala tools are used to ensure consistency.
    - Functional programming patterns are used to ensure consistent code quality.
  - Implementation Considerations:
    - Use appropriate abstraction mechanisms (strategies, factories, memoization, etc.).
    - Implement error handling using functional approaches (Try, Option).
  - Related Requirements:
    - FR2.2 (Code Indexing)
    - FR2.3 (Vector Database)
    - FR2.4 (LLM Integration)

Qdrant Vector Database Requirement (IR2)

- Choice: The system should use Qdrant as the vector database for code embeddings.
  - Rationale: Qdrant provides efficient vector search capabilities with filtering options needed for code search.
  - Validation Criteria:
    - Qdrant client wrapper handles all required vector operations.
    - Collection schemas are designed for code embeddings and text embeddings.
  - Implementation Considerations:
    - Implement the AIServices wrapper around the Qdrant module. Configure distance metrics.
  - Related Requirements:
    - FR2.3 (Vector Database)
    - NFR1 (Performance)

Ollama Integration Requirement (IR3)

- Choice: The system should integrate with Ollama for LLM functionalities.
  - Rationale: Ollama provides locally hosted LLMs, which improves privacy and can reduce latency.
  - Success Criteria:
    - The application successfully communicates with Ollama.
  - Implementation Considerations:
    - Implement the AIServices wrapper around the Ollama module.
    - Use prompt templates optimized for code understanding.
  - Related Requirements:
    - FR1.6 (Chat)
    - FR2.4 (LLM Integration)
    - NFR1 (Performance)
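
Communication with Ollama goes through its local HTTP API. The Python sketch below builds a request for the `/api/generate` endpoint without sending it; it assumes Ollama is listening on its default address `http://localhost:11434`, and the model name `llama3` and the function name are placeholders.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generation request for Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a running Ollama instance, passing the request to `urllib.request.urlopen` and decoding the JSON body yields the generated text in its `response` field.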

Layered Architecture (IR4)

- Choice: The system should follow a layered architecture approach, ensuring better modularity.
  - Success Criteria:
    - All components are separated into Presentation, Application, Domain, and Infrastructure layers.
    - The separation is assessed via ArchUnit tests.