PSEEDR

Benchmarking the Giants: Together AI Adds Commercial Model Support to Evaluations

Coverage of together-blog

· PSEEDR Editorial

A new update allows developers to pit open-source fine-tunes directly against GPT-4, Gemini, and Claude within a single workflow.

In a recent post, the team at Together AI announced a substantial expansion to their evaluation infrastructure, "Together Evaluations." The platform, originally designed to assess models within the open-source ecosystem, now supports direct benchmarking against top-tier commercial APIs from OpenAI, Anthropic, and Google.

The Context

For AI engineers and product leaders, the choice between a proprietary foundation model (like GPT-4 or Claude 3) and an open-source alternative (like Llama 3 or Mixtral) is rarely straightforward. While public leaderboards provide general rankings, they often fail to reflect performance on domain-specific tasks. Consequently, teams are forced to build custom evaluation pipelines that interface with several providers, normalize the results, and calculate cost-performance ratios. This fragmentation slows down the iteration cycle and complicates the "build vs. buy" analysis.
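To make the fragmentation concrete, here is a minimal sketch of the kind of custom harness such teams end up writing: fan the same prompts out to each provider, score the answers, and normalize everything into a comparable quality/cost/latency record. The function names, the `EvalResult` record, and the per-call pricing are illustrative assumptions, not Together AI's API; in a real pipeline, `generate` would wrap each provider's SDK client.

```python
# Illustrative sketch of a hand-rolled multi-provider eval pipeline.
# The provider callables are stand-ins, not real API clients.
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    quality: float    # fraction of prompts judged correct (0-1)
    cost_usd: float   # total spend for the run
    latency_s: float  # mean seconds per prompt

def run_eval(model_name, generate, judge, prompts, usd_per_call):
    """generate: prompt -> answer; judge: (prompt, answer) -> bool."""
    correct, elapsed = 0, 0.0
    for p in prompts:
        t0 = time.perf_counter()
        answer = generate(p)
        elapsed += time.perf_counter() - t0
        correct += judge(p, answer)
    n = len(prompts)
    return EvalResult(model_name, correct / n, usd_per_call * n, elapsed / n)

def quality_per_dollar(r: EvalResult) -> float:
    # One normalized cost-performance ratio for cross-provider comparison.
    return r.quality / r.cost_usd if r.cost_usd else float("inf")
```

Each provider needs its own client code, auth, retry logic, and pricing table before this loop even runs, which is precisely the overhead a unified evaluation platform is meant to absorb.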

The Gist

Together AI's update aims to consolidate this fragmented workflow. By integrating support for external commercial models, the platform now serves as a unified arena for model selection. Developers can upload their specific evaluation datasets and run them simultaneously against a diverse lineup: their own fine-tuned models, standard open-source checkpoints, and proprietary commercial APIs.

The core argument presented in the announcement is that effective model selection requires a holistic view of three variables: quality, cost, and performance (latency/throughput). By centralizing these metrics, Together Evaluations allows users to objectively verify if a smaller, cheaper open-source model can match the quality of a larger commercial model for their specific use case, or conversely, confirm when a proprietary model is worth the premium.
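One way to operationalize that three-variable view is a simple selection rule: among candidates whose measured quality is within a tolerance of the best score, prefer the cheapest, using latency as a tie-breaker. The sketch below assumes this rule and invents its model names and figures for illustration; none of it comes from the announcement or from real benchmark data.

```python
# Hypothetical selection rule over (quality, cost, latency) triples.
# All names and numbers below are illustrative, not benchmark results.

def pick_model(results, quality_tolerance=0.02):
    """results: list of (name, quality, usd_per_1k_requests, p50_latency_s)."""
    best_quality = max(q for _, q, _, _ in results)
    # Keep every model whose quality is "close enough" to the best.
    acceptable = [r for r in results if r[1] >= best_quality - quality_tolerance]
    # Among near-best models, cheapest wins; latency breaks ties.
    return min(acceptable, key=lambda r: (r[2], r[3]))

candidates = [
    ("commercial-large",  0.91, 30.0, 1.8),  # hypothetical figures
    ("open-source-ft-8b", 0.90,  2.0, 0.6),
    ("open-source-70b",   0.89,  9.0, 1.1),
]
print(pick_model(candidates)[0])  # -> open-source-ft-8b
```

With a 2-point quality tolerance the cheap fine-tune wins; tighten the tolerance to zero and the rule picks the commercial model instead, which is exactly the "worth the premium" verdict the platform is meant to surface.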

Why It Matters

This development is significant because it lowers the barrier to rigorous testing. Instead of relying on intuition or generic benchmarks, organizations can now perform data-driven audits of their model strategy. It facilitates a direct comparison where a fine-tuned open-source model can be vetted against the industry standard (e.g., GPT-4o) to determine if the investment in fine-tuning yields a better ROI than simply calling an external API.

For teams currently navigating the complexities of model selection or considering a move from proprietary APIs to self-hosted open-source alternatives, this update offers a streamlined path to validation. We recommend reading the full announcement to understand the specific workflows enabled by this release.

Read the full post on the Together AI Blog

Key Takeaways

  • Together Evaluations now supports benchmarking against OpenAI, Anthropic, and Google models.
  • The platform enables side-by-side comparison of proprietary APIs, open-source models, and custom fine-tunes.
  • Evaluations focus on a trifecta of metrics: output quality, operational cost, and inference performance.
  • The update removes the need for developers to build custom evaluation harnesses for multi-provider testing.

