Summary
Req ID | Description |
---|---|
BR1 | Allow users to efficiently search and understand code within Git repositories. |
BR2 | Improve developer productivity by facilitating code search/understanding workflows. |
FR1.1 | As a user, I can specify a Git repository URL to inspect its code. |
FR1.2 | As a user, I can search for code using keywords or natural language queries. |
FR1.3 | As a user, I can view the search results with code snippets and links to the original files in the repository. |
FR1.4 | As a user, I can filter search results by programming language. |
FR1.5 | As a user, I can view the context around a code snippet in the search results. |
FR1.6 | As a user, I can ask code-related questions via chat, and the chat history is preserved. |
FR2.1 | As a developer, I need the system to fetch and clone Git repositories from provided URLs. |
FR2.2 | As a developer, I need the system to index the code of the fetched repositories, to generate fast responses. |
FR2.3 | As a developer, I need the system to use a vector database to store code embeddings for semantic search. |
FR2.4 | As a developer, I need the system to integrate with an LLM to process natural language queries. |
NFR1 | The system will index code for targeted repositories within 10 seconds on the specified hardware. |
NFR2 | The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users. |
NFR3 | The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks. |
NFR4 | A 2D visualization tool will display code embeddings to help analyze and improve indexing and search. |
IR1 | The system should be implemented in Scala, following functional programming principles. |
IR2 | The system should use Qdrant as the vector database for code embeddings. |
IR3 | The system should integrate with Ollama for LLM functionalities. |
IR4 | The system should follow a layered architecture approach, ensuring better modularity. |
Requirements Specification
Below is a more through description of the requirements:
Code Search Productivity (BR1)
- Choice: Enable developers to efficiently search and understand code within Git repositories.
- Rationale: Developers spend significant time searching through codebases, and improving this process directly impacts productivity.
- Validation Criteria:
- At least 85\% of test users report improved workflow efficiency in post-usage surveys (SUS).
- Average query-to-result time under 10 seconds for the predetermined repositories.
- Implementation Considerations:
- Ensure integration with common development workflows. This can be done using Gradio interfaces.
- Focus on search result quality and relevance (i.e. analyze the effectiveness of the generated embeddings).
- Related Requirements:
- FR1.2 (Search)
- FR1.5 (Context)
- FR1.6 (Chat)
Improving Code Understanding (BR2)
- Choice: Improve developer productivity by facilitating code search/understanding workflows.
- Rationale: Understanding existing code is often more time-consuming than writing new code.
- Validation Criteria:
- Code explanations rated as "accurate and helpful" by at least 70\% of test users (SUS).
- Implementation Considerations:
- Implement contextual code explanations (i.e. use separate models that understand code and natural language).
- Provide relationship visualization between embedding by using a 2D visualization tool.
- Prioritize speed and accuracy in responses by allowing the users to select any open source model.
- Related Requirements:
- FR1.6 (Chat)
- FR2.4 (LLM integration)
- NFR2 (Usability)
- NFR4 (Visualization)
Repository URL Input Interface (FR1.1)
- Choice: As a user, I can specify a Git repository URL to inspect its code, so that I can access and analyze specific codebases I'm interested in.
- Rationale: The system needs a secure and user-friendly way to fetch Git repository URLs.
- Validation Criteria:
- The acceptance tests parse valid GitHub URLs successfully.
- Error feedback displayed within 2 seconds of validation failure.
- URL validation completes within half a second for all inputs.
- Implementation Considerations:
- Implement URL validation and display clear feedback in case of parsing errors.
- Related Requirements:
- FR2.1 (Repository Cloning)
- NFR3 (Security)
Code Search Using Markdown (FR1.2)
- Choice: As a user, I can search for code using keywords or natural language, so that I can quickly find relevant code sections without manually browsing through files.
- Rationale: This allows users to search for code using natural language, making it easier to find relevant code sections.
- Validation Criteria:
- Search results return in under 2 seconds for the predetermined repositories.
- Language filtering correctly categorizes at least 95\% of code files.
- Implementation Considerations:
- Use embedding model to convert natural language queries to vectors.
- Support filtering by language, content type and extension.
- Related Requirements:
- FR2.2 (Code Indexing)
- FR2.3 (Vector Database)
- NFR1 (Performance)
Search Results Display System (FR1.3)
- Choice: As a user, I can view the search results with code snippets and links to the original files in the repository, so that I can efficiently evaluate search results and navigate to the full context when needed.
- Validation Criteria:
- 95\% of users can correctly identify file locations from the display assessed via the SUS survey.
- Code snippets maintain proper indentation and formatting (Python, Scala frontend).
- Implementation Considerations:
- Display code snippets with syntax highlighting (Python).
- Show file path and location information.
- Related Requirements:
- FR1.2 (Search)
- FR1.5 (Context)
- NFR2 (Usability)
Code Search using Code Embeddings (FR1.4)
- Choice: As a user, I can filter search results by programming language, so that I can focus on code written in languages relevant to my current task.
- Rationale: Developers often need to restrict searches to specific languages or file types.
- Validation Criteria:
- Language detection accuracy >95\% across all common programming languages (see partser impl.).
- Multiple simultaneous filters function correctly in 100\% as assessed by the acceptance tests.
- Implementation Considerations:
- Detect and classify programming languages during indexing.
- Create efficient language metadata for fast filtering.
- Support multiple simultaneous language filters.
- Include language identification in UI.
- Related Requirements:
- FR1.2 (Search)
- FR1.3 (Search Results)
- FR2.2 (Code Indexing)
Code Context Visualization (FR1.5)
- Choice: As a user, I can view the context around a code snippet in the search results, so that I can better understand how the code fits into the broader implementation.
- Rationale: This allows users to view the context around a code snippet in the search results, making it easier to understand how the code fits into the broader implementation.
- Validation Criteria:
- Fetching a repository loads within 5 seconds for the predetermined repositories.
- 90\% of users report sufficient context for understanding code purpose (SUS)
- Implementation Considerations:
- Display the entire code being indexed in the search results.
- When answering questions, display relevant snippets from the codebase.
- Allow the user to switch between the full text and retrieved snippets.
- Related Requirements:
- FR1.3 (Search Results)
- NFR2 (Usability)
Model with Past Chat History (FR1.6)
- Choice: As a user, I can ask code-related questions via chat, and the chat history is preserved, so that I can have a continuous conversation with the system.
- Rationale: This allows users to have a continuous conversation with the system, making it easier to understand how the code fits into the broader implementation.
- Validation Criteria:
- Context-aware responses remain relevant for at least 2 consecutive related questions.
- Chat history can be cleared by regenerating the index.
- Implementation Considerations:
- Maintain chat history within session scope.
- Structure LLM prompts to include chat history and retrieved code.
- Related Requirements:
- FR2.4 (LLM Integration)
- FR2.3 (Vector Database)
- NFR2 (Usability)
Repository Cloning and Management (FR2.1)
- Choice: As a developer, I need the system to fetch and clone Git repositories from provided URLs, so that I can work with up-to-date code without performing these operations manually.
- Rationale: This allows developers to work with up-to-date code without performing these operations manually.
- Validation Criteria:
- Predetermined repositories clone successfully within 30 seconds.
- UI remains responsive (no blocking) during 100\% of cloning operations.
- Invalid repository URLs are handled gracefully.
- Implementation Considerations:
- Use Uithub for extracting the repository code.
- Implement caching mechanism for previously cloned repositories.
- Add repository verification to ensure valid Git URLs.
- Related Requirements:
- FR2.2 (Code Indexing)
- NFR1 (Performance)
- NFR3 (Security)
Vector Database Generation for RAG (FR2.2)
- Choice: As a developer, I need the system to index the code of the fetched repositories, so that I can perform fast and accurate searches across the entire codebase.
- Validation Criteria:
- The clusters generated by the embeddings are well defined, suggesting successful embedding generation.
- Metadata correctly captures language and file type.
- Implementation Considerations:
- Use Qdrant for caching code embeddings.
- Utilize metadata to enhance search results relevance.
- Related Requirements:
- FR2.2 (Code Indexing)
- NFR1 (Performance)
Semantic Search (FR2.3)
- Choice: As a developer, I need the system to use a vector database to store code embeddings, so that I can perform semantic searches that understand code context beyond simple keyword matching.
- Validation Criteria:
- Queries complete in less than 300ms for the predetermined repositories.
- Vector similarity scores correctly correlate with semantic relevance as determined by the cluster analysis.
- Implementation Considerations:
- Configure Qdrant collection schema for code embeddings.
- Use metadata for filtering specific file types.
- Related Requirements:
- FR1.2 (Search)
- FR2.2 (Code Indexing)
- NFR1 (Performance)
- IR2 (Qdrant)
LLM Integration for Natural Language Queries (FR2.4)
- Choice: As a developer, I need the system to integrate with an LLM to process natural language queries, so that I can interact with the codebase using plain English rather than specialized query syntax.
- Validation Criteria:
- Ollama integration successfully handles queries within tests without errors.
- Implementation Considerations:
- Implement prompt engineering techniques to guide responses (i.e. conditional RAG, search by file type).
- Design context management for large repositories (limit the amount of tokens being processed).
- Related Requirements:
- FR1.6 (Chat Functionality)
- NFR1 (Performance)
- NFR3 (Security)
System Performance Optimization (NFR1)
- Choice: The system will index code for targeted repositories within 10 seconds on the specified hardware.
- Success Criteria:
- Search queries return results in under 40 seconds for the predetermined repositories.
- Embedding generation completes in under 30 seconds for the predetermined repositories.
- Chat responses for simple search (without code context) arrives within 20 seconds for all tests.
- Implementation Considerations:
- Embeddings are stored in the Qdrant vector database.
- The embeddings for retrieving chunks are generated using Ollama.
- Related Requirements:
- FR2.1 (Repository Cloning)
- FR2.2 (Code Indexing)
- FR1.2 (Search)
- FR1.6 (Chat)
System Usability Testing (NFR2)
- Choice: The system should achieve a System Usability Scale (SUS) score of 70+ based on at least 5 target users.
- Rationale: This ensures that the system is usable, as evaluated by a group of users.
- Validation Criteria:
- First-time users find the interface easy to use without assistance in >70\% of cases (SUS)
- 80\% of users rate UI intuitiveness as "good" or "excellent" (SUS)
- Implementation Considerations:
- Implement clean, intuitive UI. Rely on established UX design patterns by using Gradio components.
- Related Requirements:
- FR1.3 (Search Results)
- FR1.5 (Context)
- FR1.6 (Chat)
User Interface Security (NFR3)
- Choice: The system should sanitize user search query inputs to prevent Cross-Site Scripting (XSS) attacks.
- Validation Criteria:
- 100\% of malformed/malicious URLs rejected before processing
- Implementation Considerations:
- Validate repository URLs against known valid patterns.
- Test against standard Github URLs.
- Related Requirements:
- FR1.1 (Repository Input)
- FR1.2 (Search)
- FR1.6 (Chat)
Embedding Visualization Requirement (NFR4)
- Choice: A 2D visualization tool will display code embeddings to help analyze and improve indexing and search.
- Validation Criteria:
- Visualization correctly clusters similar code types.
- Report analysis identifies strategies for improving search quality.
- Implementation Considerations:
- Implement dimension reduction techniques (t-SNE, UMAP) for 2D visualization.
- Used to make informed decisions about the search quality.
- Related Requirements:
- FR2.2 (Code Indexing)
- FR2.3 (Vector Database)
Scala Implementation Requirement (IR1)
- Choice: The system should be implemented in Scala, following functional programming principles.
- Success Criteria:
- Scala tools are used to ensure consisent.
- Functional programming patterns are used to ensure consistent code quality.
- Implementation Considerations:
- Use appropriate abstraction mechanisms (strategies, factories, memoization, etc.)
- Implement error handling using functional approaches (Try, Option)
- Related Requirements:
- FR2.2 (Code Indexing)
- FR2.3 (Vector Database)
- FR2.4 (LLM Integration)
Qdrant Vector Database Requirement (IR2)
- Choice: The system should use Qdrant as the vector database for code embeddings.
- Rationale: Qdrant provides efficient vector search capabilities with filtering options needed for code search.
- Validation Criteria:
- Qdrant client wrapper handles all required vector operations.
- Collection schemas designed for code embeddings and text embeddings.
- Implementation Considerations:
- Implement the AIServices wrapper around the Qdrant module. Configure distance metrics.
- Related Requirements:
- FR2.3 (Vector Database)
- NFR1 (Performance)
Ollama Integration Requirement (IR3)
- Choice: The system should integrate with Ollama for LLM functionalities.
- Rationale: Ollama provides locally-hosted LLM models, yielding good privacy and reduced latency.
- Success Criteria:
- The application successfully communicates with Ollama.
- Implementation Considerations:
- Implement the AIServices wrapper around the Ollama module.
- Use prompt templates optimized for code understanding.
- Related Requirements:
- FR1.6 (Chat)
- FR2.4 (LLM Integration)
- NFR1 (Performance)
Layered Architecture (IR4)
- Choice: The system should follow a layered architecture approach, ensuring better modularity.
- Success Criteria:
- All components separated into Presentation, Application, Domain, and Infrastructure layers.
- The separation is assessed via ArchUnit tests.