Using MemAlign to Improve Evaluation of Traditional Machine Learning in Genie Code

We built LLM judges to evaluate Genie Code's ML notebooks, found they disagreed with human experts, and used MemAlign to cut judge error by 74-89%.

Announced in March, Genie Code is Databricks’ autonomous AI partner purpose-built for data science and machine learning. It helps data teams run exploratory data analysis, create and validate features, train and evaluate models, and manage and optimize model deployments.

What sets Genie Code apart is its deep integration with Databricks. Genie Code understands your data in Unity Catalog, business context, and ML infrastructure like Model Serving. With this context, it can provide more accurate suggestions, take more meaningful action, and generate workflows that are better tailored to your organization.

That raises an important question: how do we ensure Genie Code uses all of this context effectively and generates outputs that follow ML best practices? For example, it should know when and how to use tools like MLflow to track model quality. Since the generated code greatly depends on the problem the customer is trying to solve, evaluating the quality of generated code is far from trivial.

In this post, we’ll walk through how we built an evaluation pipeline for Genie Code’s traditional ML capabilities, and how we used MemAlign, a new open-source alignment framework in MLflow, to close a large gap between LLM judges and human experts. The improved judges helped us identify and fix gaps in Genie Code’s ML guidance that we would have otherwise missed.

Evaluating traditional ML notebooks is among the most complex evaluation tasks because it spans code quality, ML best practices, and data-informed tailoring. To handle a task this broad and messy, we use an LLM-as-a-judge: an LLM “expert” taught by humans exactly what a good notebook looks like. We created nine judges that are prompted to evaluate the ML notebooks along nine dimensions that appear in most ML workflows:

For each dimension, we wrote scoring rubrics (reused between human raters and LLM judges) that assign a score from 1 to 3, and 0 for "not applicable":
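As a rough illustration of how scores on this scale might be compared between a judge and human raters, here is a minimal sketch. The metric (mean absolute difference), the decision to skip "not applicable" pairs, the function names, and the sample scores are all assumptions made for illustration; the post does not specify how judge error is computed.

```python
from statistics import mean

NOT_APPLICABLE = 0  # rubric value for "not applicable" on the 1-3 scale


def judge_error(human_scores, judge_scores):
    """Mean absolute difference between judge and human scores.

    Assumption: pairs where either side is "not applicable" are skipped;
    the original post does not state how such cases are handled.
    """
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be the same length")
    diffs = [
        abs(h - j)
        for h, j in zip(human_scores, judge_scores)
        if h != NOT_APPLICABLE and j != NOT_APPLICABLE
    ]
    return mean(diffs) if diffs else 0.0


def relative_reduction(error_before, error_after):
    """Fractional drop in error, e.g. 0.75 means a 75% reduction."""
    return (error_before - error_after) / error_before


# Hypothetical scores for one dimension across five notebooks.
human = [3, 2, 0, 1, 3]
judge_before = [1, 3, 0, 3, 2]  # judge before alignment (made up)
judge_after = [3, 2, 0, 2, 3]   # judge after alignment (made up)

before = judge_error(human, judge_before)
after = judge_error(human, judge_after)
print(f"error before={before}, after={after}, "
      f"reduction={relative_reduction(before, after):.0%}")
```

Reusing the same rubric for human raters and judges is what makes a direct comparison like this possible: both sides produce values on the same 0 to 3 scale, so disagreement can be measured per dimension.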
