Make way, Devin: Cosine's Genie takes the AI programming crown

It was not long ago that the startup Cognition caused a stir with Devin, an AI-based software developer that runs on OpenAI’s GPT-4 foundation large language model (LLM) on the backend and can independently write and edit code from natural-language instructions.

But Devin emerged in March 2024 – five months ago – an eternity in the fast-moving field of generative AI.

Now Cosine, another startup whose name begins with “C” and which was launched through the prestigious San Francisco startup accelerator Y Combinator, has announced its own autonomous AI engineer, Genie. The company says Genie significantly outperforms Devin, scoring 30% on the third-party benchmark SWE-Bench compared to Devin’s 13.8%, and even beats the 19% scored by Amazon’s Q and Factory’s Code Droid.

Screenshot from the Cosine website showing Genie’s performance on SWE-Bench compared to other AI coding engineering models.
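To put those figures in perspective, here is a minimal Python sketch that computes the gap between Genie and each rival, using only the scores quoted in this article (30.08% from Pullen's post, 13.8% and 19% from Cosine's comparison). The dictionary keys and the `compare` helper are illustrative names, not part of SWE-Bench or any Cosine tooling.

```python
# SWE-Bench scores as quoted in this article (percent of tasks resolved).
scores = {
    "Genie": 30.08,
    "Amazon Q": 19.0,
    "Factory Code Droid": 19.0,
    "Devin": 13.8,
}


def compare(baseline: str, candidate: str = "Genie") -> tuple[float, float]:
    """Return (percentage-point gap, ratio) of candidate over baseline."""
    gap = scores[candidate] - scores[baseline]
    ratio = scores[candidate] / scores[baseline]
    return round(gap, 2), round(ratio, 2)


for rival in ("Amazon Q", "Factory Code Droid", "Devin"):
    gap, ratio = compare(rival)
    print(f"Genie vs {rival}: +{gap} points ({ratio}x)")
```

By this arithmetic, Genie's reported score is a little more than double Devin's and roughly one and a half times the 19% mark.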

“This model is so much more than a benchmark score: it has been trained from the beginning to think and behave like a human SWE (software engineer),” wrote Cosine’s co-founder and CEO Alistair Pullen in a post on his account on the social network X.

I’m excited to share that we’ve built the world’s most capable AI software engineer, scoring 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it’s been trained from the start to think and behave like a human SWE. pic.twitter.com/OyvqKLxcGV
— Alistair (@AlistairPullen) 12 August 2024

What is Genie and what can it do?

Genie is an advanced AI software development model designed to autonomously handle a wide range of coding tasks under the direction of human engineers or managers, from bug fixing and feature creation to code refactoring and validation through comprehensive testing.

It operates either completely autonomously or in collaboration with users and is designed to give the feeling of working with an experienced colleague.

“We’ve been pursuing a dream of building something that can truly automatically perform end-to-end programming tasks without intervention and with a high degree of reliability – an artificial coworker. Genie is the first step toward achieving just that,” Pullen wrote in the Cosine blog post announcing Genie’s performance and limited, invite-only availability.

The AI can write software in a variety of languages. Its technical report lists 15 of them as data sources, including:

JavaScript
Python
TypeScript
TSX
Java
C#
C++
C
Rust
Scala
Kotlin
Swift
Golang
PHP
Ruby

Cosine claims that Genie can emulate the cognitive processes of human engineers.

“My thesis here is simple: let it watch a human engineer do his job and mimic that process,” Pullen explained in the blog post.

The code generated by Genie is stored in the user’s GitHub repository, meaning Cosine retains neither a copy nor the security risks that come with holding one.

In addition, Cosine’s software platform is already integrated with Slack and system notifications, which Genie can use to update users on its status, ask questions or flag issues, just like a good human colleague would.

“Genie can also ask users clarifying questions and respond to reviews/comments on the PRs (pull requests) it generates,” Pullen wrote to VentureBeat. “We’re trying to get Genie to behave like a peer, so it makes the most sense to get the model to use the channels a peer would use.”

Supported by a long-context OpenAI model

Unlike many AI models that are based on foundational models supplemented by a few tools, Genie was developed through a proprietary process that involves training and fine-tuning a long token output AI model from OpenAI.

“The model we used is a (currently) not generally available GPT-4o variant that we were allowed to train as part of OpenAI’s experimental access program,” Pullen wrote to VentureBeat via email. “The model performed well and we subsequently shared our findings with the OpenAI tuning team and engineering leadership. This was a real game changer for us as it convinced them to invest resources and attention in our novel techniques.”

While Cosine does not name the specific model, OpenAI recently announced the limited availability of a new GPT-4o Long Output Context model that can produce up to 64,000 output tokens, up from GPT-4o’s initial 4,000, a 16-fold increase.

The training data was crucial

“For its final training run, Genie was trained on billions of data tokens, the mix of which was chosen to make the model as competent as possible in the languages that currently matter most to our users,” Pullen wrote in Cosine’s technical report on the agent.

With its comprehensive context window and continuous improvement cycle, Genie iterates and refines its solutions until they achieve the desired outcome.

Cosine says in its blog post that it spent nearly a year compiling a dataset with a wide range of software development activities from real engineers.

“In practice, however, it is extremely difficult to obtain such data and then use it effectively because it essentially doesn’t exist,” Pullen explained in his blog post, adding, “Our data pipeline uses a combination of artifacts, static analysis, self-play, step-by-step verification, and fine-tuned AI models trained on a large amount of labeled data to forensically infer the detailed process that must have taken place to arrive at the final result. The impact of data labeling cannot be underestimated. It is difficult to obtain high-quality data from competent software developers, but the results were worth it because they provided so much insight into how developers implicitly think about approaching problems.”

In an email to VentureBeat, Pullen clarified: “We started with artifacts from SWEs doing their work, such as PRs, commits, issues from OSS repos (MIT licensed), and then ran that data through our pipeline to forensically derive the conclusions and reconstruct how people came to their conclusions. Using this proprietary dataset, we trained version 1 and then used self-play and self-improvement to go the rest of the way.”

This dataset not only represents perfect information provenance and incremental knowledge discovery, but also captures the step-by-step decision-making process of human engineers.

“By training our models on this dataset instead of just building on base models like everyone else does, we found that instead of just generating random code until something works, the model approaches problems like a human would,” Pullen explained.

Pricing

In a follow-up email, Pullen described how Genie’s pricing structure will work.

He said the product will initially be offered in two tiers:

“1. An accessible option, priced competitively with existing AI tools, around $20. This tier will have some feature and usage limitations, but will demonstrate Genie’s capabilities for individuals and small teams.

2. An enterprise-level offering with advanced features, virtually unlimited usage, and the ability to create a perfect AI colleague who is an expert on every line of code ever written in-house. This tier will be significantly more expensive, reflecting its value as a full-fledged AI engineering colleague.”

Implications and future developments

The introduction of Genie has far-reaching implications for software development teams, especially those looking to increase productivity and reduce time spent on routine tasks. With its ability to handle complex programming tasks autonomously, Genie could potentially transform the way development resources are allocated, allowing teams to focus on more strategic initiatives.

“The idea that engineering resources are no longer a constraint is a huge driver for me, especially since starting a company,” Pullen wrote. “The value of having an AI colleague who can jump into an unknown code base and solve unforeseen problems in timescales orders of magnitude faster than a human is obvious and has a huge impact on the world.”

Cosine has ambitious plans for the future development of Genie. The company intends to expand its model portfolio to include smaller models for simpler tasks and larger models for more complex challenges. In addition, Cosine plans to extend its work to open source communities by extending the context window of one of the leading open source models and pre-training it on a huge dataset.

Availability and next steps

While Genie is already rolling out to select users, broader access is still being managed.

Interested parties can request early access to try Genie in their projects by filling out a web form on the Cosine website.

Cosine remains committed to continuous improvement and plans to provide regular updates to Genie features based on customer feedback.

“SWE-Bench recently changed its submission requirements to include the entire working process of AI models, which is challenging for us because it would require us to disclose proprietary methods,” noted Pullen. “For now, we have decided to keep these internal processes confidential, but we have made Genie’s final results publicly available on GitHub for independent review.”

More about Cosine

Cosine is a human thought laboratory focused on researching and codifying human task completion, with the goal of teaching AI to mimic, excel at, and augment those tasks.

Founded in 2022 by Pullen, Sam Stenner and Yang Li, the company’s mission is to push the boundaries of AI by applying human thinking to solve complex problems, starting with software development.

Cosine has already raised $2.5 million in seed funding from Uphonest and SOMA Capital, with participation from Lakestar, Focal and others.

With a small but highly skilled team, Cosine has already made significant progress in the AI field and Genie is just the beginning.

“We firmly believe in our ability to codify human thinking for every job and every industry,” Pullen explained in his announcement blog post. “Software development is simply the most intuitive place to start, and we can’t wait to show you everything else we’re working on.”
