fix benchmarking, add results
gmickel committed Aug 16, 2024
1 parent 5ccbc4c commit 6f0be7f
Showing 6 changed files with 317 additions and 171 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -170,3 +170,4 @@ demotask.md

# benchmark reports
benchmark/reports/
!benchmark/reports/*_reference.md
59 changes: 36 additions & 23 deletions README.md
@@ -19,7 +19,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte
[Templates](#-templates)
[Configuration](#-configuration)
[API](#-api)
[Benchmarking](#-benchmarking)
[Benchmarking](#benchmarking)
[Contributing](#-contributing)
[Roadmap](#-roadmap)
[FAQ](#-faq)
@@ -28,7 +28,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte

CodeWhisper is a powerful tool that bridges the gap between your codebase and Large Language Models (LLMs). It serves two primary functions:

1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration.
1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration. CodeWhisper's generations are state-of-the-art (SOTA), outperforming other AI code-generation tools in benchmarks. See [Benchmarking](#benchmarking) for more details.

2. **Precision-Guided Context Curation for LLMs**: Harness the power of human insight to feed AI exactly what it needs. Quickly transform carefully selected parts of your codebase into rich, relevant context for LLMs, ensuring more accurate and project-aligned results.

@@ -112,26 +112,27 @@ While CodeWhisper excels at performing individual coding tasks and even large fe

## ✨ Key Features

| Feature | Description |
| ----------------------------------------------- | ----------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |
| Feature | Description |
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🚀 SOTA generations                              | CodeWhisper's generations are SOTA and outperform other AI code-generation tools in benchmarks. See [Benchmarking](#benchmarking) for more details.  |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |

## 📺 Video

@@ -387,7 +388,7 @@ For more detailed instructions on using the GitHub integration and other CodeWhi

CodeWhisper can be used programmatically in your Node.js projects. For detailed API documentation and examples, please refer to [USAGE.md](USAGE.md).

## 🏋️ Benchmarking
## Benchmarking

CodeWhisper includes a benchmarking tool that evaluates its performance on Exercism Python exercises, letting you assess the capabilities of different AI models and configurations.

@@ -424,6 +425,18 @@ Reports are saved in `benchmark/reports/` with timestamped filenames.

For full details on running benchmarks, interpreting results, and available options, please refer to the [Benchmark README](./benchmark/README.md).
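
For example, the runs reported in the results tables were started with invocations like the following (a sketch based on the commands recorded in those tables; `--workers` sets the number of concurrent workers, `--no-plan` presumably skips the planning step, and `--model` selects the model to evaluate):

```bash
# Run the Exercism benchmark with 5 concurrent workers, skipping the planning step
./benchmark/run_benchmark.sh --workers 5 --no-plan

# The same run against a specific model (IDs exactly as they appear in the results tables)
./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06
```

As the benchmark README notes, too many concurrent workers can trigger rate limiting, so adjust `--workers` accordingly.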

### Results

CodeWhisper's performance has been evaluated across different models using the Exercism Python exercises. Below is a summary of the benchmark results:

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ---------------------------------------------------- |
| claude-3-5-sonnet-20240620 | 80.27% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |

"Tests Passed" shows the proportion of Exercism tests each model completed successfully, while the time and cost columns show the resources each run required.

This table will be updated as we benchmark more models and configurations, giving a fuller comparison to help you choose the model that best fits your needs.

## 🤝 Contributing

We welcome contributions to CodeWhisper! Please read our [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
15 changes: 15 additions & 0 deletions benchmark/README.md
@@ -7,6 +7,21 @@ This benchmark tool is designed to evaluate the performance of CodeWhisper on Ex
- Running the full benchmark will use a significant number of tokens.
- Using too many concurrent workers is likely to cause rate-limiting issues.

## Results

CodeWhisper's performance has been evaluated across different models using the Exercism Python exercises. Below is a summary of the benchmark results:

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
| claude-3-5-sonnet-20240620 | 80.27% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
| gpt-4o-2024-08-06 | 81.51% | 986.68 | 1.6800 | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |

The full reports used to generate these results are available in the `benchmark/reports/` directory.
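
As a quick sketch (assuming a POSIX shell), you can check which reports are tracked: generated reports are gitignored, while `*_reference.md` reference reports are whitelisted by the `.gitignore` rule added in this commit.

```bash
# List generated (gitignored) and committed reference reports side by side
ls benchmark/reports/

# Show which .gitignore rule ignores each generated report
# (whitelisted reference reports print nothing)
git check-ignore -v benchmark/reports/*
```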

"Tests Passed" shows the proportion of Exercism tests each model completed successfully, while the time and cost columns show the resources each run required.

This table will be updated as we benchmark more models and configurations, giving a fuller comparison to help you choose the model that best fits your needs.

## Usage

1. Build the Docker image:
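
   (A hypothetical invocation; the exact command, image tag, and Dockerfile location are in the full benchmark README:)

   ```bash
   # Hypothetical build command: image tag and Dockerfile path are assumptions;
   # see the full benchmark README for the actual invocation
   docker build -t codewhisper-benchmark ./benchmark
   ```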