
Building a GraphRAG System for Professor Recommendations with HelixDB

Founders HelixDB
Introduction
In this comprehensive guide, we'll build a Graph-Vector RAG (Retrieval-Augmented Generation) system using HelixDB to help students find professors based on their research interests. This system combines the power of graph databases with vector embeddings to provide intelligent, context-aware recommendations.
Our system will be able to answer queries like:
"Which professors do High-Energy Physics research?"
"What professors are working in X University?"
"What professors are working in the Computer Science department?"
"What professors are working in X University and are in the Computer Science department?"
"I like doing research in Large Language Models, can you recommend some professors at X University?"
Understanding the Dataset
We'll work with a comprehensive professor dataset containing:
Name and Title
Department(s) and University
Personal page URL
Short biography
Key Research Areas with descriptions
Lab affiliations and research focus
Here's an example professor record:
{
  "name": "James",
  "title": "Assistant Professor",
  "page": "https://james.com",
  "department": ["Computer Science"],
  "university": ["Uni X"],
  "bio": "James is an Assistant Professor whose work sits at the intersection of basketball analytics, computer vision, and large-scale machine learning. His research focuses on turning raw player-tracking video, wearable-sensor streams, and play-by-play logs into actionable insights for teams, coaches, and broadcasters. Signature projects include ShotNet, a deep learning model that predicts shot success probability in real time, and DunkGPT, a language model fine-tuned on millions of play descriptions to generate advanced scouting reports.",
  "key_research_areas": [
    {
      "area": "Computer Vision for Basketball",
      "description": "Designing CNN and Transformer architectures that track player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics."
    },
    {
      "area": "Predictive Modelling & Simulation",
      "description": "Building Monte-Carlo and sequence models that forecast possession outcomes and season performance using play-by-play and spatial data."
    },
    {
      "area": "Sports Analytics with Large Language Models",
      "description": "Leveraging LLMs to explain model outputs, auto-generate commentary, and mine historical game archives for strategic patterns."
    },
    {
      "area": "Wearable Sensor Data Mining",
      "description": "Applying time-series and graph learning techniques to inertial-measurement signals for fatigue monitoring and injury prevention."
    },
    {
      "area": "Fairness & Ethics in Sports AI",
      "description": "Studying algorithmic bias and ensuring equitable analytics across different leagues, genders, and play styles."
    }
  ],
  "labs": [
    {
      "name": "Basketball Data Science Lab",
      "research_focus": "An interdisciplinary group combining data science, biomechanics, and sport psychology to create next-generation analytics tools for basketball."
    }
  ]
}
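Before ingesting records like the one above, it can help to check that each one has the expected fields. The following is a minimal sketch: the helper name `validate_professor_record` and the choice of required fields are assumptions drawn from the example record, not part of HelixDB.

```python
# Field names and types taken from the example record; treat them as assumptions.
REQUIRED_FIELDS = {
    "name": str,
    "title": str,
    "page": str,
    "department": list,
    "university": list,
    "bio": str,
    "key_research_areas": list,
    "labs": list,
}


def validate_professor_record(record: dict) -> list[str]:
    """Return a list of problems found in a record (empty if it looks valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    # Each research area needs both an area name and a description.
    areas = record.get("key_research_areas")
    if isinstance(areas, list):
        for i, area in enumerate(areas):
            if not isinstance(area, dict) or not {"area", "description"}.issubset(area):
                problems.append(f"key_research_areas[{i}] needs 'area' and 'description'")

    # Each lab needs a name and a research focus.
    labs = record.get("labs")
    if isinstance(labs, list):
        for i, lab in enumerate(labs):
            if not isinstance(lab, dict) or not {"name", "research_focus"}.issubset(lab):
                problems.append(f"labs[{i}] needs 'name' and 'research_focus'")
    return problems
```

Records that return a non-empty list can be logged and skipped rather than failing halfway through ingestion.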
Why Graph-Vector RAG?
Traditional search systems struggle with semantic understanding and relationship traversal. By combining graph databases with vector embeddings, we get:
Graph Capabilities: Fast traversal of relationships (professor → department → university)
Vector Search: Semantic similarity matching for research areas
Hybrid Queries: Filter by exact matches (university) while finding similar research interests
At scale with 1000+ universities and millions of relationships, this hybrid approach provides both precision and semantic understanding.
System Architecture
Graph Schema Design
Our graph consists of the following components:
Nodes:
Professor: Core entity with name, title, page, and bio
ResearchArea: Research domains with research_area and description
Department: Academic departments with name
University: Institutions with name
Lab: Research labs with name and research_focus
ResearchAreaAndDescriptionEmbedding: Vector node for semantic search
Edges:
Professor → ResearchArea
Professor → Department
Professor → University
Professor → Lab
Professor → ResearchAreaAndDescriptionEmbedding
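To make the mapping from a JSON record to this graph concrete, the sketch below lists the edges a single record implies. The function name `record_to_edges` is hypothetical; the edge labels follow the schema defined in Step 2, and the embedding edge is left out because it is created from a computed vector rather than from a field.

```python
def record_to_edges(record: dict) -> list[tuple[str, str, str]]:
    """List the (edge label, target node type, target key) triples implied by one record."""
    edges = []
    for area in record.get("key_research_areas", []):
        edges.append(("HasResearchArea", "ResearchArea", area["area"]))
    for department in record.get("department", []):
        edges.append(("HasDepartment", "Department", department))
    for university in record.get("university", []):
        edges.append(("HasUniversity", "University", university))
    for lab in record.get("labs", []):
        edges.append(("HasLab", "Lab", lab["name"]))
    return edges
```

For the example professor this yields five HasResearchArea edges and one edge each for department, university, and lab.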
Setting Up HelixDB
Step 1: Initialize HelixDB Project
mkdir professor_rag_system
cd professor_rag_system
helix init
Step 2: Define the Graph Schema
Create schema.hx:
// NODES //
N::Professor {
    name: String,
    title: String,
    page: String,
    bio: String,
}
N::ResearchArea {
    research_area: String,
    description: String,
}
N::Department {
    name: String,
}
N::University {
    name: String,
}
N::Lab {
    name: String,
    research_focus: String,
}
// EDGES //
E::HasLab {
    From: Professor,
    To: Lab,
}
E::HasResearchArea {
    From: Professor,
    To: ResearchArea,
}
E::HasUniversity {
    From: Professor,
    To: University,
    Properties: {
        since: Date DEFAULT NOW,
    }
}
E::HasDepartment {
    From: Professor,
    To: Department,
    Properties: {
        since: Date DEFAULT NOW,
    }
}
// VECTOR NODES //
V::ResearchAreaAndDescriptionEmbedding {
    areas_and_descriptions: String,
}
E::HasResearchAreaAndDescriptionEmbedding {
    From: Professor,
    To: ResearchAreaAndDescriptionEmbedding,
    Properties: {
        areas_and_descriptions: String,
    }
}
Step 3: Create HelixQL Queries
Create query.hx:
// Node Creation Queries
QUERY create_professor(name: String, title: String, page: String, bio: String) =>
    professor <- AddN<Professor>({ name: name, title: title, page: page, bio: bio })
    RETURN professor
QUERY create_department(name: String) =>
    department <- AddN<Department>({ name: name })
    RETURN department
QUERY create_university(name: String) =>
    university <- AddN<University>({ name: name })
    RETURN university
QUERY create_lab(name: String, research_focus: String) =>
    lab <- AddN<Lab>({ name: name, research_focus: research_focus })
    RETURN lab
QUERY create_research_area(name: String) =>
    research_area <- AddN<ResearchArea>({ research_area: name })
    RETURN research_area
// Relationship Linking Queries
QUERY link_professor_to_department(professor_id: ID, department_id: ID) =>
    professor <- N<Professor>(professor_id)
    department <- N<Department>(department_id)
    edge <- AddE<HasDepartment>::From(professor)::To(department)
    RETURN edge
QUERY link_professor_to_university(professor_id: ID, university_id: ID) =>
    professor <- N<Professor>(professor_id)
    university <- N<University>(university_id)
    edge <- AddE<HasUniversity>::From(professor)::To(university)
    RETURN edge
QUERY link_professor_to_lab(professor_id: ID, lab_id: ID) =>
    professor <- N<Professor>(professor_id)
    lab <- N<Lab>(lab_id)
    edge <- AddE<HasLab>::From(professor)::To(lab)
    RETURN edge
QUERY link_professor_to_research_area(professor_id: ID, research_area_id: ID) =>
    professor <- N<Professor>(professor_id)
    research_area <- N<ResearchArea>(research_area_id)
    edge <- AddE<HasResearchArea>::From(professor)::To(research_area)
    RETURN edge
// Embedding Creation and Search
QUERY create_research_area_embedding(professor_id: ID, areas_and_descriptions: String, vector: [F64]) =>
    professor <- N<Professor>(professor_id)
    research_area <- AddV<ResearchAreaAndDescriptionEmbedding>(vector, { areas_and_descriptions: areas_and_descriptions })
    edge <- AddE<HasResearchAreaAndDescriptionEmbedding>::From(professor)::To(research_area)
    RETURN research_area
QUERY search_similar_professors_by_research_area_and_description(query_vector: [F64], k: I64) =>
    vecs <- SearchV<ResearchAreaAndDescriptionEmbedding>(query_vector, k)
    professors <- vecs::In<HasResearchAreaAndDescriptionEmbedding>
    RETURN professors
QUERY get_professor_research_areas_with_descriptions(professor_id: ID) =>
    research_areas <- N<Professor>(professor_id)::Out<HasResearchAreaAndDescriptionEmbedding>
    RETURN research_areas::{areas_and_descriptions}
// Search Queries
// Each query filters on Professor with EXISTS so that it returns professors,
// not the ResearchArea, University, or Department nodes reached by the traversal.
QUERY get_professor_by_research_area_name(research_area_name: String) =>
    professors <- N<Professor>::WHERE(
        EXISTS(_::Out<HasResearchArea>::WHERE(_::{research_area}::EQ(research_area_name)))
    )
    RETURN professors
QUERY get_professors_by_university_name(university_name: String) =>
    professors <- N<Professor>::WHERE(
        EXISTS(_::Out<HasUniversity>::WHERE(_::{name}::EQ(university_name)))
    )
    RETURN professors
QUERY get_professors_by_department_name(department_name: String) =>
    professors <- N<Professor>::WHERE(
        EXISTS(_::Out<HasDepartment>::WHERE(_::{name}::EQ(department_name)))
    )
    RETURN professors
QUERY get_professors_by_university_and_department_name(university_name: String, department_name: String) =>
    professors <- N<Professor>::WHERE(AND(
        EXISTS(_::Out<HasUniversity>::WHERE(_::{name}::EQ(university_name))),
        EXISTS(_::Out<HasDepartment>::WHERE(_::{name}::EQ(department_name)))
    ))
    RETURN professors
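When these queries are called from client code, each call passes the query name and a dictionary of parameters, so a key that does not match the HelixQL signature is an easy mistake to make. The sketch below checks a parameter dictionary against the signatures defined above before it is sent. The names `QUERY_PARAMS` and `param_problems` are hypothetical helpers, not part of any HelixDB client.

```python
# Parameter names copied from the QUERY signatures defined above.
QUERY_PARAMS = {
    "create_professor": {"name", "title", "page", "bio"},
    "create_department": {"name"},
    "create_university": {"name"},
    "create_lab": {"name", "research_focus"},
    "create_research_area": {"name"},
    "link_professor_to_department": {"professor_id", "department_id"},
    "link_professor_to_university": {"professor_id", "university_id"},
    "link_professor_to_lab": {"professor_id", "lab_id"},
    "link_professor_to_research_area": {"professor_id", "research_area_id"},
    "create_research_area_embedding": {"professor_id", "areas_and_descriptions", "vector"},
    "search_similar_professors_by_research_area_and_description": {"query_vector", "k"},
    "get_professor_research_areas_with_descriptions": {"professor_id"},
    "get_professor_by_research_area_name": {"research_area_name"},
    "get_professors_by_university_name": {"university_name"},
    "get_professors_by_department_name": {"department_name"},
    "get_professors_by_university_and_department_name": {"university_name", "department_name"},
}


def param_problems(query_name: str, params: dict) -> list[str]:
    """Return mismatches between params and the query signature (empty if none)."""
    if query_name not in QUERY_PARAMS:
        return [f"unknown query: {query_name}"]
    expected = QUERY_PARAMS[query_name]
    given = set(params)
    problems = [f"missing: {name}" for name in sorted(expected - given)]
    problems += [f"unexpected: {name}" for name in sorted(given - expected)]
    return problems
```

This only compares key names. It does not check value types such as ID or [F64].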
Python Implementation
Step 1: Environment Setup
# Option A: venv + pip
python -m venv venv
source venv/bin/activate
pip install helix-py sentence-transformers
# Option B: uv
uv venv
uv add helix-py sentence-transformers
Step 2: Initialize Connection and Model
import helix
from sentence_transformers import SentenceTransformer
# Initialize embedding model (Qwen 0.6B for this example)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
# Connect to HelixDB
db = helix.Client(local=True, port=6969, verbose=True)
Step 3: Create Base Nodes
# Define research areas with descriptions
research_areas = {
"Computer Vision for Basketball": "Designing CNN and Transformer architectures that track player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics.",
"Predictive Modelling & Simulation": "Building Monte-Carlo and sequence models that forecast possession outcomes and season performance using play-by-play and spatial data.",
"Sports Analytics with Large Language Models": "Leveraging LLMs to explain model outputs, auto-generate commentary, and mine historical game archives for strategic patterns.",
"Wearable Sensor Data Mining": "Applying time-series and graph learning techniques to inertial-measurement signals for fatigue monitoring and injury prevention.",
"Fairness & Ethics in Sports AI": "Studying algorithmic bias and ensuring equitable analytics across different leagues, genders, and play styles."
}
# Create research area nodes and store IDs
# (the create_research_area query expects its parameter to be called "name")
research_area_ids = {}
for research_area in research_areas:
    research_area_node = db.query("create_research_area", {"name": research_area})
    research_area_ids[research_area] = research_area_node[0]['research_area']['id']
# Create department nodes
departments = ["Computer Science", "Mathematics", "Physics", "Chemistry", "Biology"]
department_ids = {}
for department in departments:
    department_node = db.query("create_department", {"name": department})
    department_ids[department] = department_node[0]['department']['id']
# Create university nodes
universities = ["Uni X", "Uni Y", "Uni Z"]
university_ids = {}
for university in universities:
    university_node = db.query("create_university", {"name": university})
    university_ids[university] = university_node[0]['university']['id']
# Create lab nodes
labs = {
    "Basketball Data Science Lab": "An interdisciplinary group combining data science, biomechanics, and sport psychology to create next-generation analytics tools for basketball."
}
lab_ids = {}
for lab in labs:
    lab_node = db.query("create_lab", {"name": lab, "research_focus": labs[lab]})
    lab_ids[lab] = lab_node[0]['lab']['id']
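The lists above are hard-coded for the example. With a larger dataset, the same base nodes could be derived from the records themselves. Here is a rough sketch, assuming each record follows the example shape. The helper name `collect_base_nodes` is hypothetical, and when the same lab or research area appears more than once it keeps the first description it sees.

```python
def collect_base_nodes(professors: list[dict]) -> dict:
    """Gather the distinct departments, universities, labs, and research areas."""
    departments, universities = set(), set()
    labs, research_areas = {}, {}
    for professor in professors:
        departments.update(professor.get("department", []))
        universities.update(professor.get("university", []))
        for lab in professor.get("labs", []):
            # setdefault keeps the first research focus seen for a lab name
            labs.setdefault(lab["name"], lab["research_focus"])
        for area in professor.get("key_research_areas", []):
            # setdefault keeps the first description seen for an area name
            research_areas.setdefault(area["area"], area["description"])
    return {
        "departments": sorted(departments),
        "universities": sorted(universities),
        "labs": labs,
        "research_areas": research_areas,
    }
```

The returned values can then be looped over in place of the literal `departments`, `universities`, `labs`, and `research_areas` collections.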
Step 4: Ingest Professor Data
# `professors` is assumed to be a list of records shaped like the example record above
for professor in professors:
    # Create Professor Node
    professor_node = db.query("create_professor", {
        "name": professor["name"],
        "title": professor["title"],
        "page": professor["page"],
        "bio": professor["bio"]
    })
    professor_id = professor_node[0]['professor']['id']
    # Link Professor to Research Areas
    for research_area in professor["key_research_areas"]:
        if research_area['area'] in research_area_ids:
            research_area_id = research_area_ids[research_area['area']]
            db.query("link_professor_to_research_area", {
                "professor_id": professor_id,
                "research_area_id": research_area_id
            })
    # Link Professor to Departments
    for department in professor["department"]:
        if department in department_ids:
            department_id = department_ids[department]
            db.query("link_professor_to_department", {
                "professor_id": professor_id,
                "department_id": department_id
            })
    # Link Professor to Universities
    for university in professor["university"]:
        if university in university_ids:
            university_id = university_ids[university]
            db.query("link_professor_to_university", {
                "professor_id": professor_id,
                "university_id": university_id
            })
    # Link Professor to Labs
    for lab in professor["labs"]:
        if lab['name'] in lab_ids:
            lab_id = lab_ids[lab['name']]
            db.query("link_professor_to_lab", {
                "professor_id": professor_id,
                "lab_id": lab_id
            })
    # Create and store research area embeddings
    research_area_and_description = "\n".join([
        research_area['area'] + ": " + research_area['description']
        for research_area in professor['key_research_areas']
    ])
    research_area_and_description_embedding = model.encode(research_area_and_description).astype(float).tolist()
    db.query("create_research_area_embedding", {
        "professor_id": professor_id,
        "areas_and_descriptions": research_area_and_description,
        "vector": research_area_and_description_embedding
    })
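The loop above encodes one professor at a time. For larger datasets, the embedding texts could be built first and encoded in batches. The sketch below shows one way to do that. The helper names are hypothetical, and `encode` stands for any function that maps a list of strings to a sequence of vectors. `model.encode` accepts a list of sentences in this way.

```python
from typing import Callable, Iterator, Sequence


def build_embedding_text(professor: dict) -> str:
    """Join each research area and its description into one text block."""
    return "\n".join(
        f"{area['area']}: {area['description']}"
        for area in professor["key_research_areas"]
    )


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts: Sequence[str], encode: Callable, batch_size: int = 32) -> list:
    """Encode texts batch by batch and return one plain float list per text."""
    vectors = []
    for batch in chunked(texts, batch_size):
        for vector in encode(list(batch)):
            vectors.append([float(x) for x in vector])
    return vectors
```

A call such as `embed_in_batches([build_embedding_text(p) for p in professors], model.encode)` would return vectors in the same order as `professors`, ready to pass to `create_research_area_embedding`.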
Querying the System
Vector Similarity Search
Find professors with similar research interests:
query = "Find me a professor who does computer vision for basketball"
embedded_query_vector = model.encode(query).astype(float).tolist()
results = db.query("search_similar_professors_by_research_area_and_description", {
    "query_vector": embedded_query_vector,
    "k": 5
})
print(results)
Example Output:
[{'professors': [{
    'page': 'https://www.example.com',
    'label': 'Professor',
    'bio': 'James is an Assistant Professor whose work sits at the intersection of basketball analytics...',
    'name': 'James',
    'id': '...',
    'title': 'Assistant Professor'
}]}]
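To give some intuition for what the vector search returns, here is a brute-force version of a top-k similarity lookup in plain Python. It assumes the ranking is by cosine similarity, which this guide does not state, and it is only an illustration: a vector index avoids scanning every stored vector the way this does. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, candidates, k):
    """Return the names of the k candidates most similar to the query.

    `candidates` maps a name to its stored vector.
    """
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]
```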
Graph-Based Queries
1. Find professors by research area
professors_by_research_area = db.query("get_professor_by_research_area_name", {
    "research_area_name": "Computer Vision for Basketball"
})
2. Find professors by university
professors_by_university = db.query("get_professors_by_university_name", {
    "university_name": "Uni X"
})
3. Find professors by department
professors_by_department = db.query("get_professors_by_department_name", {
    "department_name": "Computer Science"
})
4. Complex filtering: University AND Department
professors_filtered = db.query("get_professors_by_university_and_department_name", {
    "university_name": "Uni X",
    "department_name": "Computer Science"
})
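The last example query from the introduction ("I like doing research in Large Language Models, can you recommend some professors at X University?") needs both kinds of lookup. One possible approach, sketched below, is to run the vector search and the university query separately and keep the professors that appear in both, preserving the similarity order. The function names are hypothetical, and the `[0]["professors"]` indexing assumes that both queries return professor nodes in the response shape shown in the example output.

```python
def filter_by_ids(ranked_professors, allowed_professors):
    """Keep ranked professors whose id also appears in allowed_professors."""
    allowed_ids = {p["id"] for p in allowed_professors}
    return [p for p in ranked_professors if p["id"] in allowed_ids]


def recommend_at_university(db, model, interest, university_name, k=20):
    """Semantic search first, then keep only professors at the given university."""
    query_vector = model.encode(interest).astype(float).tolist()
    similar = db.query(
        "search_similar_professors_by_research_area_and_description",
        {"query_vector": query_vector, "k": k},
    )[0]["professors"]
    at_university = db.query(
        "get_professors_by_university_name",
        {"university_name": university_name},
    )[0]["professors"]
    return filter_by_ids(similar, at_university)
```

Because the filter is applied after the top-k search, a larger `k` leaves more candidates to filter. Doing the filtering inside a single HelixQL query would avoid this, but that is outside the scope of this sketch.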
Retrieving Research Details
Get the full research areas and descriptions for a specific professor:
prof_research_areas = db.query("get_professor_research_areas_with_descriptions", {
    "professor_id": results[0]['professors'][0]['id']
})
print(prof_research_areas)
Integration with LLMs
You can create custom tools for LLMs to interact with your HelixDB system. Here's an example using Google's Gemini:
from google import genai
from google.genai import types
import dotenv
import os
dotenv.load_dotenv()
def search_similar_professors_by_research_area_and_description(query: str) -> dict:
    """Embed the user's query, then use the embedding to search for similar professors.

    Args:
        query (str): The user's query

    Returns:
        A list of professors who are similar to the user's query
    """
    embedded_query_vector = model.encode(query).astype(float).tolist()
    results = db.query("search_similar_professors_by_research_area_and_description", {
        "query_vector": embedded_query_vector,
        "k": 5
    })
    return results
# Configure Gemini with the custom tool
client = genai.Client()
config = types.GenerateContentConfig(
    tools=[search_similar_professors_by_research_area_and_description]
)
# Use the tool in conversation
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find me a professor who does computer vision for basketball",
    config=config,
)
print(response.text)
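With an LLM provider that does not support tool calling, the retrieved professors can instead be placed directly in the prompt. The sketch below formats professor nodes, using the field names from the example output, into a context block. The helper names and the prompt wording are illustrative assumptions.

```python
def format_professor_context(professors, max_bio_chars=300):
    """Render retrieved professor nodes as a plain-text context block."""
    lines = []
    for i, prof in enumerate(professors, start=1):
        bio = prof.get("bio", "")
        if len(bio) > max_bio_chars:
            # Truncate long biographies to keep the prompt short
            bio = bio[:max_bio_chars].rstrip() + "..."
        lines.append(f"{i}. {prof.get('name', 'Unknown')} ({prof.get('title', 'n/a')})")
        lines.append(f"   Page: {prof.get('page', 'n/a')}")
        lines.append(f"   Bio: {bio}")
    return "\n".join(lines)


def build_prompt(question, professors):
    """Combine the user's question with the retrieved professors."""
    context = format_professor_context(professors)
    return (
        "Answer the question using only the professors listed below.\n\n"
        f"Professors:\n{context}\n\n"
        f"Question: {question}"
    )
```

The string returned by `build_prompt(query, results[0]['professors'])` can be sent as the `contents` of a plain generation request.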
Best Practices
Data Ingestion: Process data in batches for large datasets
Embedding Models: Choose models based on your domain (academic text vs general)
Index Management: Create appropriate indexes for frequently queried fields
Query Optimization: Use graph traversal for structured queries, vectors for semantic search
Caching: Cache frequently accessed professor profiles and embeddings
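As an illustration of the caching point, repeated queries can reuse a previously computed embedding instead of calling the model again. The sketch below wraps an arbitrary `encode` function with `functools.lru_cache`. The name `make_cached_embedder` is hypothetical, and the cache lives only in the current process.

```python
from functools import lru_cache


def make_cached_embedder(encode, maxsize=1024):
    """Wrap `encode` so that repeated texts are embedded only once."""

    @lru_cache(maxsize=maxsize)
    def embed(text: str) -> tuple:
        # Tuples are immutable, so cached results cannot be modified by callers
        return tuple(float(x) for x in encode(text))

    return embed
```

With `embed = make_cached_embedder(model.encode)`, `list(embed(query))` gives the float list expected by the `query_vector` parameter.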
Conclusion
This Graph-Vector RAG system demonstrates the power of combining structured graph relationships with semantic vector search. By leveraging HelixDB's capabilities, we've created a system that can handle both exact matches (filtering by university/department) and semantic similarity (finding professors with related research interests).
The system is easily extensible - you can add new node types (publications, grants, collaborations), create more sophisticated embeddings (combining bio + research + publications), or integrate with various LLM providers for natural language interactions.
This approach scales well from small departmental databases to large-scale academic networks with millions of relationships, providing fast, accurate, and semantically-aware professor recommendations.