Teaching Machines to Find the Right Reviewers
When you're dealing with hundreds of pull requests a day, the old way of doing things breaks down pretty fast. Most large codebases that use Azure DevOps for code management depend on path-based assignment rules to decide which reviewers or reviewer groups get added. It works, but only to an extent. As teams and repositories grow, those rules quickly go stale.
More importantly, solving this problem might seem simple on paper by introducing accountability (which quickly goes for a toss unless there is a monetary reward tied to an activity like code review), but finding the right reviewers when hundreds of pull requests are merged in a day is genuinely challenging.
This post walks through the problem, the ML approach, and what I learned building a data-driven alternative to configuration files.
Note: I have used AI to help write portions of this post.
The Problem
Let me explain this problem with a concrete example. Remember, I am talking about Azure DevOps in an enterprise setting (not GitHub), so some of the terms might be new, but any organization that works at scale eventually runs into the same problem.
Here's what the enterprise started with (and got stuck with): a path-based system for assigning reviewers. Touch a file in /src/messaging and you get the messaging team as reviewers. Touch something in /components/chat and you get the chat team. A simple solution to start with.
Except it doesn't actually work when you scale up. 🥲
- What about a single PR that touches five different directories?
- Someone who became an expert in a new area six months ago? Too bad, the path rules haven't been updated.
- That one senior engineer who technically owns /legacy-code but hasn't looked at it in two years? They're getting pinged (spammed) anyway.
And the worst part? The people who should be reviewing certain changes, the ones who actually have context, who recently worked on similar code, who could provide valuable feedback, never even see those PRs. Let's call this a discoverability problem.
What If We Just... Asked the Data?
The upside of scale is that you have tremendous amounts of data lying around. In this case, the enterprise had years of pull request history just sitting there: thousands of reviews, full of hidden patterns we just had to identify. What if, instead of maintaining these rigid path-based rules, I let a machine learning model figure out who the right reviewers actually are?
Not based on who owns a directory in a config file, but based on who's actually been doing the work.
So I built a machine learning recommendation system that predicts whether a pull request is relevant to a reviewer based on historical patterns.
The Core Approach
Instead of hardcoded rules, I took a classification approach. For any given PR and reviewer pair, the model predicts: is this PR relevant for this person?
The model looks at four main dimensions:
Sociality: Do you know the person who wrote this code? If you've reviewed their PRs before, there's probably a reason. Maybe you work closely together, maybe you're both experts in the same domain. That social connection matters. I also realised over time that this signal indirectly correlates with developers reporting to the same leadership, without using any management-chart data.
Context: Do you understand what this PR is trying to do? Not just "have you touched these files," but do you understand the broader problem space? Have you worked on related features recently?
Interest alignment: Does this match your review patterns? Some people gravitate toward performance work, others toward UI, others toward infrastructure. The model picks up on these patterns over time.
Expertise: Are you actually knowledgeable here? This is where we look at your contributions, your review history, your engagement with similar code. Not just whether you own a directory, but whether you really know this stuff.
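To make this concrete, here's a minimal sketch of the pair-level setup I'm describing. The feature names, the toy rows, and the choice of a gradient-boosted classifier are illustrative assumptions, not the exact production pipeline.

```python
# Minimal sketch: illustrative feature names, toy rows, and model choice;
# the real pipeline mines thousands of (PR, reviewer) pairs from history.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Each row is one (PR, reviewer) pair; the label records whether the
# reviewer actually engaged with the PR (voted, commented, approved).
training_pairs = pd.DataFrame([
    {"reviews_of_author_90d": 5, "file_overlap_90d": 0.4,
     "similar_prs_reviewed_90d": 7, "commits_in_area_180d": 12, "relevant": 1},
    {"reviews_of_author_90d": 0, "file_overlap_90d": 0.0,
     "similar_prs_reviewed_90d": 1, "commits_in_area_180d": 0, "relevant": 0},
    # ...thousands more rows mined from historical pull requests
])

X = training_pairs.drop(columns=["relevant"])
y = training_pairs["relevant"]

model = GradientBoostingClassifier()  # any solid binary classifier works here
model.fit(X, y)

# At assignment time: build the same features for every candidate reviewer
# of a new PR, score them, and surface the highest-probability matches.
```

The point is less the specific model and more the framing: every candidate reviewer becomes a row of signals about their relationship to this PR, and the classifier decides relevance.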
Handling the Time Problem
The clever bit (and this took some iteration) is how I handled time. Instead of looking at the complete git history, I used a rolling-window approach: 30 days, 90 days, 180 days.
This way, if your focus shifts, the model shifts with you. If you stop working on auth code and move to API design, you'll stop getting auth PRs and start seeing API ones. This temporal dimension helps prevent stale recommendations and ensures the model stays current with people's evolving expertise.
We also retrain the model regularly, using only recent historical data. This keeps it from getting stuck in outdated patterns.
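As a rough sketch, periodic retraining on a trailing slice of history could look something like the following; the 180-day retention window and the helper functions are assumptions for illustration, not the actual pipeline.

```python
# Sketch: retrain on a trailing slice of history so stale patterns age out.
# The 180-day retention window and the build_pair_features / fit_classifier
# helpers are hypothetical, shown only to illustrate the idea.
from datetime import datetime, timedelta, timezone

def retrain(pull_request_history, build_pair_features, fit_classifier,
            retention_days=180):
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

    # Keep only PRs completed within the retention window.
    recent_prs = [pr for pr in pull_request_history if pr["closed_at"] >= cutoff]

    # Rebuild (PR, reviewer) training pairs from the recent slice and refit.
    X, y = build_pair_features(recent_prs)
    return fit_classifier(X, y)
```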
Building Features That Actually Matter
The feature engineering was the heart of this. For each PR-reviewer pair, we extract signals across those four dimensions:
- How many times has this reviewer worked with this author?
- What's the overlap between files the reviewer has touched and files in this PR?
- How recently did the reviewer work on similar changes?
- What's the reviewer's contribution pattern in related areas?
Each signal gets computed across multiple time windows. So instead of just "has this person reviewed 10 PRs from this author," I know "they reviewed 2 in the last month, 5 in the last quarter, 8 in the last six months." That temporal granularity turns out to be really important.
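Here's a simplified sketch of what that per-pair, per-window feature extraction might look like. The event schema and signal names are hypothetical, and the real system computes many more signals than these two.

```python
# Sketch: one feature row per (PR, reviewer) pair, with each signal
# recomputed over several trailing windows. The event schema and
# signal names are assumptions, not the full production feature set.
from datetime import timedelta

WINDOW_DAYS = (30, 90, 180)

def pair_features(pr, reviewer_history, now):
    """pr: {"author": str, "files": iterable of paths}
    reviewer_history: past events like {"when": datetime, "author": str, "files": set of paths}"""
    features = {}
    pr_files = set(pr["files"])

    for days in WINDOW_DAYS:
        window = [e for e in reviewer_history
                  if now - e["when"] <= timedelta(days=days)]

        # Sociality: how often has this reviewer reviewed this author recently?
        features[f"reviews_of_author_{days}d"] = sum(
            1 for e in window if e["author"] == pr["author"])

        # Context/expertise: overlap (Jaccard) between the PR's files and
        # the files this reviewer has touched inside the window.
        touched = set().union(*(e["files"] for e in window)) if window else set()
        union = pr_files | touched
        features[f"file_overlap_{days}d"] = (
            len(pr_files & touched) / len(union) if union else 0.0)

    return features
```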
Avoiding the Reinforcement Trap
Building this wasn't just about getting the model to perform well. It was about avoiding the traps that other systems fell into.
There's a natural self-reinforcing bias in reviewer recommendation systems. If you review one type of PR once, you might keep getting assigned similar PRs, which means you'd review them, which means you'd get assigned more, and suddenly you're pigeonholed.
I tried to break that cycle by looking beyond just file-level interactions. The sociality signals introduce diversity based on who you work with, not just what files you touch. The temporal windows let your expertise profile evolve naturally. And frequent retraining means new patterns can emerge without being drowned out by old history (say, new teams forming around dedicated areas).
It's not perfect, but it's better than pretending we can hardcode expertise in a configuration file.
What I Learned Along the Way
The biggest lesson? Feature engineering matters more than model complexity.
I didn't use anything exotic, just a solid classification model. The magic was in figuring out which signals actually predicted review relevance. File overlap alone wasn't enough. Recency alone wasn't enough. But the combination of sociality, context, interest, and expertise across multiple time windows? That worked.
Second lesson: Temporal features are crucial for systems that evolve. Codebases change. People's roles change. A static model trained on all-time data gets stale fast. Building time into the feature space itself keeps the recommendations fresh.
Third: Test on realistic data. I validated this on a large corpus of historical pull requests, looking at whether the model correctly predicted actual review relationships. That ground truth testing caught issues early that would have been painful to discover in production.
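For illustration, that validation can be framed as a time-based holdout where you check how often the actual reviewers land in the model's top suggestions. The recall@k metric and the score_candidates helper below are my own simplification, not the exact validation harness.

```python
# Sketch: time-based holdout evaluation. score_candidates is a hypothetical
# helper returning (reviewer, probability) pairs for a given PR; the actual
# reviewers recorded in history serve as ground truth.

def recall_at_k(holdout_prs, score_candidates, k=3):
    hits, total = 0, 0
    for pr in holdout_prs:  # PRs newer than the training cutoff
        ranked = sorted(score_candidates(pr), key=lambda rp: rp[1], reverse=True)
        top_k = {reviewer for reviewer, _ in ranked[:k]}
        for actual in pr["actual_reviewers"]:
            total += 1
            hits += actual in top_k
    return hits / total if total else 0.0
```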
What This Approach Enables
Beyond just faster code reviews, this pattern opens up interesting possibilities:
- Reduced review fatigue: People only see PRs they're actually qualified to review
- Better knowledge distribution: Junior engineers get matched with senior reviewers who know the relevant domain
- Discovery of hidden expertise: The model can surface people who have relevant knowledge but weren't in the hardcoded reviewer lists
- Adaptive expertise tracking: As people's skills and focus areas evolve, the recommendations evolve with them
It's a shift from "who should review this based on organizational structure" to "who can actually provide valuable feedback based on demonstrated expertise."
Where This Approach Falls Short
Let me be clear about the limitations:
Cold start problems. New engineers have no review history. The model can't recommend them or suggest what they should review.
New code areas. If nobody's touched a module before, there's no expertise signal to learn from.
Bias reinforcement. The model learns from existing patterns. If your review distribution was inequitable before, it'll stay that way.
Rare expertise gets missed. That one person who knows the legacy billing system but hasn't touched it in six months? The model might not surface them when you need them.
Small teams don't need this. With 10 engineers who all know the codebase, simple round-robin works fine. This approach only makes sense at scale.
The Bigger Picture
Code review is just one place where we rely on rigid rules that don't reflect reality. The same pattern could apply to:
- Incident response (who should be paged for this type of issue?)
- Design reviews (who has relevant experience with this architecture?)
- Onboarding (who should a new engineer shadow based on their interests?)
Any time you're matching people to work based on expertise, this kind of data-driven approach beats hardcoded rules.
This approach to reviewer recommendation is something I built to solve a real scaling problem in large engineering organizations. If you're working on similar challenges or want to discuss the technical details, I'm always happy to chat about what worked, what didn't, and what I'd do differently next time.