Project Tasks

Based on the Requirements Specifications document, the project was divided into 5 separate sprints, each with its focus described below:

- Sprint 1: project setup
- Sprint 2: design patterns and indexing
- Sprint 3: search and chat interface
- Sprint 4: performance, visualization and documentation
- Sprint 5: final report and presentation

| ID | Task | Priority | Related Requirements | Status | Sprint |
|---|---|---|---|---|---|
| I1 | Build Configuration: Initialize SBT project with Scala 3.6.4; Configure assembly plugin for JAR creation; Set up test environment with ScalaTest; Configure code coverage & documentation | HIGHEST | IR1, IR4, NFR1 | ✓ | S1 |
| I2 | Code Quality Tools: Set up Scalafmt with formatting rules; Implement Wartremover for code analysis; Configure Scalafix and SemanticDB; Set up Trunk and Gemini bot | HIGH | IR1, NFR2 | ✓ | S1 |
| I3 | Git Workflow: Implement git hooks system; Set up semantic release system | MEDIUM | NFR1, NFR2 | ✓ | S1 |
| I4 | Project Infrastructure: Set up logging infrastructure; Configure CI/CD pipeline; Define high-level architecture | HIGHEST | IR4, NFR1, NFR2 | ✓ | S1 |
| I5 | Core Domain Model: Design repository data model; Design initial API contracts; Generate API documentation | HIGH | IR1, IR4, FR2.1 | ✓ | S1, S5 |
| I6 | Basic Git Operations: Implement repository loading; Extract repository metadata; Create error handling | HIGH | FR1.1, FR2.1, NFR3 | ✓ | S1 |
| Foundation | | | | | |
| F1 | Layered Architecture Implementation: Implement modules from architecture diagram; Apply design patterns from lectures (dependency injection, layered architecture, strategy, factory, etc.) | HIGHEST | IR4, IR1 | ✓ | S2 |
| F2 | Repository Input and Processing: Accept/validate Git repository URLs; Fetch repository contents; Support file type filtering | HIGHEST | FR1.1, FR2.1, NFR3, BR1, NFR1 | ✓ | S2 |
| F3 | Code Indexing System: Process code into searchable representations; Generate/store code embeddings; Integrate with LangChain4j for vector storage | HIGHEST | FR2.2, FR2.3, IR2, IR3, BR2, NFR1 | ✓ | S2 |
| Core Value | | | | | |
| C1 | Natural Language Code Search: Enable semantic search across codebase; Support language/extension filtering; Display relevant results with context | HIGH | FR1.2, FR1.4, FR1.3, FR1.5, BR1, NFR2 | ✓ | S3 |
| C2 | Code Understanding Chat Interface: Provide chat interface for code questions; Retrieve relevant code context for queries; Generate context-aware responses | HIGH | FR1.6, FR1.5, FR2.4, BR2, NFR2 | ✓ | S3 |
| Additional | | | | | |
| E1 | User Interface Implementation: Create intuitive frontend for all functionality; Support Scala.js and Python (Gradio) interfaces; Ensure responsive and accessible design | MEDIUM | NFR2, IR1, NFR3 | ✓ | S3 |
| E2 | Performance Optimization: Optimize repository lookup speed; Implement caching/reuse of embeddings; Ensure responsive search/chat experience | MEDIUM | NFR1, FR2.2, FR2.3, NFR2 | ✓ | S4 |
| E3 | Security Implementation: Sanitize user inputs; Secure RESTful API; Secure data storage | MEDIUM | NFR3, FR2.3 | ✓ | S4 |
| E4 | Testing and Quality Assurance: Implement comprehensive testing strategy; Define quality metrics (see requirements document) | MEDIUM | IR1, NFR1, NFR2 | ✓ | S4, S5 |
| E5 | Visualization and Analysis: Visualize code embeddings for quality analysis; Provide metrics on search effectiveness; Generate insights for documentation | MEDIUM | IR5, FR1.2, BR2 | ✓ | S5 |
| E6 | Documentation and Reporting: Create user documentation; Generate project report; Document architecture and design decisions | MEDIUM | NFR2, IR4 | ✓ | S5 |

Traceability Matrix

The following table shows evidence for the requirements in the Requirements Specifications document. ...


Code Search Productivity (BR1)

- Choice: Enable developers to efficiently search and understand code within Git repositories.
- Rationale: Developers spend significant time searching through codebases, and improving this process directly impacts productivity.
- Validation Criteria:
  - At least 85% of test users report improved workflow efficiency in post-usage surveys (SUS).
  - Average query-to-result time under 10 seconds for the predetermined repositories.
- Implementation Considerations:
  - Ensure integration with common development workflows. This can be done using Gradio interfaces.
  - Focus on search result quality and relevance (i.e., analyze the effectiveness of the generated embeddings).
- Related Requirements:
  - FR1.2 (Search)
  - FR1.5 (Context)
  - FR1.6 (Chat)

Improving Code Understanding (BR2)

- Choice: Improve developer productivity by facilitating code search/understanding workflows.
- Rationale: Understanding existing code is often more time-consuming than writing new code.
- Validation Criteria:
  - Code explanations rated as "accurate and helpful" by at least 70% of test users (SUS).
- Implementation Considerations:
  - Implement contextual code explanations (i.e., use separate models that understand code and natural language).
  - Visualize the relationships between embeddings using a 2D visualization tool.
  - Prioritize speed and accuracy in responses by allowing the users to select any open-source model.
- Related Requirements:
  - FR1.6 (Chat)
  - FR2.4 (LLM integration)
  - NFR2 (Usability)
  - NFR4 (Visualization)

Repository URL Input Interface (FR1.1)

- Choice: As a user, I can specify a Git repository URL to inspect its code, so that I can access and analyze specific codebases I'm interested in.
- Rationale: The system needs a secure and user-friendly way to fetch Git repository URLs.
- Validation Criteria:
  - The acceptance tests parse valid GitHub URLs successfully.
  - Error feedback is displayed within 2 seconds of validation failure.
  - URL validation completes within half a second for all inputs.
- Implementation Considerations:
  - Implement URL validation and display clear feedback in case of parsing errors.
- Related Requirements:
  - FR2.1 (Repository Cloning)
  - NFR3 (Security)

Code Search Using Markdown (FR1.2)

- Choice: As a user, I can search for code using keywords or natural language, so that I can quickly find relevant code sections without manually browsing through files.
- Rationale: This allows users to search for code using natural language, making it easier to find relevant code sections.
- Validation Criteria:
  - Search results return in under 2 seconds for the predetermined repositories.
  - Language filtering correctly categorizes at least 95% of code files.
- Implementation Considerations:
  - Use an embedding model to convert natural language queries to vectors.
  - Support filtering by language, content type and extension.
- Related Requirements:
  - FR2.2 (Code Indexing)
  - FR2.3 (Vector Database)
  - NFR1 (Performance)

Search Results Display System (FR1.3)

- Choice: As a user, I can view the search results with code snippets and links to the original files in the repository, so that I can efficiently evaluate search results and navigate to the full context when needed.
- Validation Criteria:
  - 95% of users can correctly identify file locations from the display, as assessed via the SUS survey.
  - Code snippets maintain proper indentation and formatting (Python, Scala frontend).
- Implementation Considerations:
  - Display code snippets with syntax highlighting (Python).
  - Show file path and location information.
- Related Requirements:
  - FR1.2 (Search)
  - FR1.5 (Context)
  - NFR2 (Usability)

Code Search using Code Embeddings (FR1.4)

- Choice: As a user, I can filter search results by programming language, so that I can focus on code written in languages relevant to my current task.
- Rationale: Developers often need to restrict searches to specific languages or file types.
- Validation Criteria:
  - Language detection accuracy >95% across all common programming languages (see the parser implementation).
  - Multiple simultaneous filters function correctly in 100% of cases, as assessed by the acceptance tests.
- Implementation Considerations:
  - Detect and classify programming languages during indexing.
  - Create efficient language metadata for fast filtering.
  - Support multiple simultaneous language filters.
  - Include language identification in UI.
- Related Requirements:
  - FR1.2 (Search)
  - FR1.3 (Search Results)
  - FR2.2 (Code Indexing)

Code Context Visualization (FR1.5)

- Choice: As a user, I can view the context around a code snippet in the search results, so that I can better understand how the code fits into the broader implementation.
- Rationale: This allows users to view the context around a code snippet in the search results, making it easier to understand how the code fits into the broader implementation.
- Validation Criteria:
  - Fetching a repository loads within 5 seconds for the predetermined repositories.
  - 90% of users report sufficient context for understanding code purpose (SUS).
- Implementation Considerations:
  - Display the entire code being indexed in the search results.
  - When answering questions, display relevant snippets from the codebase.
  - Allow the user to switch between the full text and retrieved snippets.
- Related Requirements:
  - FR1.3 (Search Results)
  - NFR2 (Usability)

Model with Past Chat History (FR1.6)

- Choice: As a user, I can ask code-related questions via chat, and the chat history is preserved, so that I can have a continuous conversation with the system.
- Rationale: This allows users to have a continuous conversation with the system, making it easier to understand how the code fits into the broader implementation.
- Validation Criteria:
  - Context-aware responses remain relevant for at least 2 consecutive related questions.
  - Chat history can be cleared by regenerating the index.
- Implementation Considerations:
  - Maintain chat history within session scope.
  - Structure LLM prompts to include chat history and retrieved code.
- Related Requirements:
  - FR2.4 (LLM Integration)
  - FR2.3 (Vector Database)
  - NFR2 (Usability)

Repository Cloning and Management (FR2.1)

- Choice: As a developer, I need the system to fetch and clone Git repositories from provided URLs, so that I can work with up-to-date code without performing these operations manually.
- Rationale: This allows developers to work with up-to-date code without performing these operations manually.
- Validation Criteria:
  - Predetermined repositories clone successfully within 30 seconds.
  - UI remains responsive (no blocking) during 100% of cloning operations.
  - Invalid repository URLs are handled gracefully.
- Implementation Considerations:
  - Use Uithub for extracting the repository code.
  - Implement caching mechanism for previously cloned repositories.
  - Add repository verification to ensure valid Git URLs.
- Related Requirements:
  - FR2.2 (Code Indexing)
  - NFR1 (Performance)
  - NFR3 (Security)

Vector Database Generation for RAG (FR2.2)

- Choice: As a developer, I need the system to index the code of the fetched repositories, so that I can perform fast and accurate searches across the entire codebase.
- Validation Criteria:
  - The clusters generated by the embeddings are well defined, suggesting successful embedding generation.
  - Metadata correctly captures language and file type.
- Implementation Considerations:
  - Use Qdrant for caching code embeddings.
  - Utilize metadata to enhance search result relevance.
- Related Requirements:
  - FR2.2 (Code Indexing)
  - NFR1 (Performance)

Semantic Search (FR2.3)

- Choice: As a developer, I need the system to use a vector database to store code embeddings, so that I can perform semantic searches that understand code context beyond simple keyword matching.
- Validation Criteria:
  - Queries complete in less than 300ms for the predetermined repositories.
  - Vector similarity scores correctly correlate with semantic relevance as determined by the cluster analysis.
- Implementation Considerations:
  - Configure Qdrant collection schema for code embeddings.
  - Use metadata for filtering specific file types.
- Related Requirements:
  - FR1.2 (Search)
  - FR2.2 (Code Indexing)
  - NFR1 (Performance)
  - IR2 (Qdrant)

LLM Integration for Natural Language Queries (FR2.4)

- Choice: As a developer, I need the system to integrate with an LLM to process natural language queries, so that I can interact with the codebase using plain English rather than specialized query syntax.
- Validation Criteria:
  - Ollama integration successfully handles queries within tests without errors.
- Implementation Considerations:
  - Implement prompt engineering techniques to guide responses (i.e., conditional RAG, search by file type).
  - Design context management for large repositories (limit the number of tokens being processed).
- Related Requirements:
  - FR1.6 (Chat Functionality)
  - NFR1 (Performance)
  - NFR3 (Security)

System Performance Optimization (NFR1)

- Choice: The system will index code for targeted repositories within 10 seconds on the specified hardware.
- Success Criteria:
  - Search queries return results in under 40 seconds for the predetermined repositories.
  - Embedding generation completes in under 30 seconds for the predetermined repositories.
  - Chat responses for simple search (without code context) arrive within 20 seconds for all tests.
- Implementation Considerations:
  - Embeddings are stored in the Qdrant vector database.
  - The embeddings for retrieving chunks are generated using Ollama.
- Related Requirements:
  - FR2.1 (Repository Cloning)
  - FR2.2 (Code Indexing)
  - FR1.2 (Search)
  - FR1.6 (Chat)

System Usability Testing (NFR2)

- Choice: The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users.
- Rationale: This ensures that the system is usable, as evaluated by a group of users.
- Validation Criteria:
  - First-time users find the interface easy to use without assistance in >70% of cases (SUS).
  - 80% of users rate UI intuitiveness as "good" or "excellent" (SUS).
- Implementation Considerations:
  - Implement clean, intuitive UI. Rely on established UX design patterns by using Gradio components.
- Related Requirements:
  - FR1.3 (Search Results)
  - FR1.5 (Context)
  - FR1.6 (Chat)

User Interface Security (NFR3)

- Choice: The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks.
- Validation Criteria:
  - 100% of malformed/malicious URLs rejected before processing.
- Implementation Considerations:
  - Validate repository URLs against known valid patterns.
  - Test against standard GitHub URLs.
- Related Requirements:
  - FR1.1 (Repository Input)
  - FR1.2 (Search)
  - FR1.6 (Chat)

Embedding Visualization Requirement (NFR4)

- Choice: A 2D visualization tool will display code embeddings to help analyze and improve indexing and search.
- Validation Criteria:
  - Visualization correctly clusters similar code types.
  - Report analysis identifies strategies for improving search quality.
- Implementation Considerations:
  - Implement dimension reduction techniques (t-SNE, UMAP) for 2D visualization.
  - Use the visualization to make informed decisions about the search quality.
- Related Requirements:
  - FR2.2 (Code Indexing)
  - FR2.3 (Vector Database)

Scala Implementation Requirement (IR1)

- Choice: The system should be implemented in Scala, following functional programming principles.
- Success Criteria:
  - Scala tools are used to ensure consistency.
  - Functional programming patterns are used to ensure consistent code quality.
- Implementation Considerations:
  - Use appropriate abstraction mechanisms (strategies, factories, memoization, etc.).
  - Implement error handling using functional approaches (Try, Option).
- Related Requirements:
  - FR2.2 (Code Indexing)
  - FR2.3 (Vector Database)
  - FR2.4 (LLM Integration)

Qdrant Vector Database Requirement (IR2)

- Choice: The system should use Qdrant as the vector database for code embeddings.
- Rationale: Qdrant provides efficient vector search capabilities with filtering options needed for code search.
- Validation Criteria:
  - Qdrant client wrapper handles all required vector operations.
  - Collection schemas designed for code embeddings and text embeddings.
- Implementation Considerations:
  - Implement the AIServices wrapper around the Qdrant module. Configure distance metrics.
- Related Requirements:
  - FR2.3 (Vector Database)
  - NFR1 (Performance)

Ollama Integration Requirement (IR3)

- Choice: The system should integrate with Ollama for LLM functionalities.
- Rationale: Ollama provides locally-hosted LLM models, yielding good privacy and reduced latency.
- Success Criteria:
  - The application successfully communicates with Ollama.
- Implementation Considerations:
  - Implement the AIServices wrapper around the Ollama module.
  - Use prompt templates optimized for code understanding.
- Related Requirements:
  - FR1.6 (Chat)
  - FR2.4 (LLM Integration)
  - NFR1 (Performance)

Layered Architecture (IR4)

- Choice: The system should follow a layered architecture approach, ensuring better modularity.
- Success Criteria:
  - All components separated into Presentation, Application, Domain, and Infrastructure layers.
  - The separation is assessed via ArchUnit tests.
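
The layer separation named in IR4 can be illustrated with a small dependency check. The sketch below is written in Python for brevity and is not the project's ArchUnit test suite. It assumes, purely for illustration, that each module name starts with its layer and that a module may only depend on its own layer or on layers listed after it in IR4. The module names in `deps` are hypothetical.

```python
from typing import Dict, List, Tuple

# Layers in the order listed in IR4 (assumption: dependencies may only point
# to the same layer or to a layer that appears later in this list).
LAYERS: List[str] = ["presentation", "application", "domain", "infrastructure"]


def layer_of(module: str) -> str:
    """Return the layer encoded as the first segment of a module name."""
    layer = module.split(".", 1)[0]
    if layer not in LAYERS:
        raise ValueError(f"unknown layer for module {module!r}")
    return layer


def violations(dependencies: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List (source, target) pairs where the target sits in an earlier layer."""
    rank = {name: position for position, name in enumerate(LAYERS)}
    found = []
    for source, targets in dependencies.items():
        for target in targets:
            if rank[layer_of(target)] < rank[layer_of(source)]:
                found.append((source, target))
    return found


# Hypothetical module graph with one deliberate violation.
deps = {
    "presentation.ChatView": ["application.ChatService"],
    "application.ChatService": ["domain.Repository", "infrastructure.QdrantStore"],
    "domain.Repository": ["presentation.ChatView"],
}
print(violations(deps))  # [('domain.Repository', 'presentation.ChatView')]
```

A check of this kind fails as soon as a lower layer imports from a higher one, which is the property the ArchUnit tests are described as assessing.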

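The combination of semantic search (FR2.3) and language filtering (FR1.4) can be sketched as a metadata filter followed by a cosine-similarity ranking. This is a minimal in-memory illustration in Python with toy three-dimensional vectors. In the system described here, storage and similarity search are delegated to Qdrant, and the `Chunk` type, file paths, and vectors below are hypothetical.

```python
import math
from typing import List, NamedTuple, Optional, Sequence, Set


class Chunk(NamedTuple):
    """An indexed code chunk with language metadata (hypothetical shape)."""
    path: str
    language: str
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: Sequence[float], chunks: List[Chunk],
           languages: Optional[Set[str]] = None, top_k: int = 3) -> List[Chunk]:
    """Filter by language metadata first, then rank by similarity."""
    candidates = [c for c in chunks if languages is None or c.language in languages]
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:top_k]


# Toy index; real embeddings have hundreds of dimensions.
index = [
    Chunk("src/Main.scala", "scala", [0.9, 0.1, 0.0]),
    Chunk("app/ui.py", "python", [0.8, 0.2, 0.1]),
    Chunk("README.md", "markdown", [0.0, 0.1, 0.9]),
]
hits = search([1.0, 0.0, 0.0], index, languages={"scala", "python"})
print([h.path for h in hits])  # ['src/Main.scala', 'app/ui.py']
```

Passing several languages at once corresponds to the "multiple simultaneous language filters" consideration in FR1.4.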

| Req ID | Description |
|---|---|
| BR1 | Allow users to efficiently search and understand code within Git repositories. |
| BR2 | Improve developer productivity by facilitating code search/understanding workflows. |
| FR1.1 | As a user, I can specify a Git repository URL to inspect its code. |
| FR1.2 | As a user, I can search for code using keywords or natural language queries. |
| FR1.3 | As a user, I can view the search results with code snippets and links to the original files in the repository. |
| FR1.4 | As a user, I can filter search results by programming language. |
| FR1.5 | As a user, I can view the context around a code snippet in the search results. |
| FR1.6 | As a user, I can ask code-related questions via chat, and the chat history is preserved. |
| FR2.1 | As a developer, I need the system to fetch and clone Git repositories from provided URLs. |
| FR2.2 | As a developer, I need the system to index the code of the fetched repositories, to generate fast responses. |
| FR2.3 | As a developer, I need the system to use a vector database to store code embeddings for semantic search. |
| FR2.4 | As a developer, I need the system to integrate with an LLM to process natural language queries. |
| NFR1 | The system will index code for targeted repositories within 10 seconds on the specified hardware. |
| NFR2 | The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users. |
| NFR3 | The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks. |
| NFR4 | A 2D visualization tool will display code embeddings to help analyze and improve indexing and search. |
| IR1 | The system should be implemented in Scala, following functional programming principles. |
| IR2 | The system should use Qdrant as the vector database for code embeddings. |
| IR3 | The system should integrate with Ollama for LLM functionalities. |
| IR4 | The system should follow a layered architecture approach, ensuring better modularity. |
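
To make FR1.1 and NFR3 from the table above concrete, the following Python sketch validates a repository URL against a fixed pattern before anything is fetched. The pattern is an assumption chosen for illustration (HTTPS GitHub URLs of the form `https://github.com/<owner>/<repo>`, optionally ending in `.git` or a slash), and `parse_repository_url` is a hypothetical helper, not the project's validator.

```python
import re
from typing import Optional, Tuple

# Assumed allow-list pattern: only plain HTTPS GitHub repository URLs pass.
_GITHUB_URL = re.compile(
    r"https://github\.com/([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9._-]+?)(?:\.git)?/?"
)


def parse_repository_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for an accepted URL, or None if it is rejected."""
    match = _GITHUB_URL.fullmatch(url.strip())
    if match is None:
        return None
    owner, repo = match.groups()
    return owner, repo


print(parse_repository_url("https://github.com/scala/scala3"))          # ('scala', 'scala3')
print(parse_repository_url("https://github.com/scala/scala3<script>"))  # None
```

Matching the whole string against an allow-list, rather than searching for known bad fragments, means anything unexpected (markup, other schemes, extra path segments) is rejected by default.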


| Requirement | Design element / Implementation / Evidence | Done |
|---|---|---|
| BR1: Search Productivity | Project-wide, BusinessRequirementsSuite.scala, SUS Questionnaire, Embedding Diagrams | ✓ |
| BR2: Improve Code Understanding | Project-wide, createTextEmbeddingModel, createCodeEmbeddingModel, Python/Scala Frontend, SUS Questionnaire, Embedding Diagrams | ✓ |
| FR1.1: Repository URL Input Interface | GithubWrapperService.scala, UserFunctionalRequirementsSuite | ✓ |
| FR1.2: Code Search Using Markdown | QdrantEmbeddingStore.scala, GithubWrapperService.scala, UserFunctionalRequirementsSuite | ✓ |
| FR1.3: Search Results Display System | Scala frontend, Python frontend, SUS Questionnaire | ✓ |
| FR1.4: Code Search using Code | QdrantEmbeddingStore.scala, GithubWrapperService.scala, UserFunctionalRequirementsSuite | ✓ (see related FR1.2) |
| FR1.5: Code Context Visualization | Scala frontend, Python frontend, RepositoryWithLanguages, GithubWrapperService, UserFunctionalRequirementsSuite, SUS Questionnaire | ✓ |
| FR1.6: Model with Past Chat History | Pipeline.scala, RAGComponentFactory.scala, UserFunctionalRequirementsSuite | ✓ |
| FR2.1: Repository Cloning | GithubWrapperService.scala, FetchingService.scala, SystemFunctionalRequirementsSuite | ✓ |
| FR2.2: Vector Database Generation | IngestorService.scala, CacheService.scala, QdrantEmbeddingStore.scala, SystemFunctionalRequirementsSuite | ✓ |
| FR2.3: Vector Database Implementation | QdrantEmbeddingStore.scala, GithubWrapperService.scala, UserFunctionalRequirementsSuite | ✓ |
| FR2.4: LLM Integration for Code | QueryRoutingStrategy.scala, QueryFilterService.scala, ChatService.scala, SystemFunctionalRequirementsSuite | ✓ |
| NFR1: Performance Optimization | ChatService.scala, CacheService.scala, IngestorService.scala, GithubWrapperService.scala, NonFunctionalRequirementsSuite | ✓ |
| NFR2: System Usability Optimization | GithubWrapperService.scala, Scala frontend, Python frontend, NonFunctionalRequirementsSuite, SUS Questionnaire | ✓ |
| NFR3: User Interface Security | Scala frontend, Python frontend, NonFunctionalRequirementsSuite | ✓ |
| NFR4: Embedding Visualization | IngestorService.scala, Final Report | ✓ |
| IR1: Scala Implementation (declarative programming) | Project-wide, Adherence to the Gemini style guide | ✓ |
| IR2: Qdrant Vector Database | IngestorService.scala, ComponentFactory.scala, application.conf | ✓ |
| IR3: Ollama Integration | QueryRoutingStrategy.scala, QueryFilterService.scala, ChatService.scala, application.conf | ✓ |
| IR4: Layered Architecture | Project-wide, ArchUnit tests | ✓ |
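
Several rows above cite the SUS Questionnaire as evidence, and NFR2 sets a target of 70+ from at least 5 users. The Python sketch below shows how a SUS score is conventionally computed from one participant's ten answers on a 1 to 5 scale: odd-numbered items contribute the answer minus 1, even-numbered items contribute 5 minus the answer, and the sum is multiplied by 2.5. The helper names and the sample answers are illustrative only and are not survey data from this project.

```python
from statistics import mean
from typing import List, Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Standard SUS scoring for one participant (ten answers, each 1 to 5)."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten answers on a 1-5 scale")
    total = 0
    for position, answer in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if position % 2 == 1 else (5 - answer)
    return total * 2.5


def meets_nfr2(all_responses: List[Sequence[int]],
               threshold: float = 70.0, min_users: int = 5) -> bool:
    """Check the NFR2 target: enough participants and a mean score at or above the threshold."""
    if len(all_responses) < min_users:
        return False
    return mean(sus_score(r) for r in all_responses) >= threshold


# Illustrative answers only.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
print(sus_score([3] * 10))                         # 50.0
```

Averaging per-participant scores, rather than averaging raw answers first, keeps the odd/even inversion applied consistently to every questionnaire.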
