This comprehensive tutorial, authored by the founders of HelixDB, demonstrates how to build a GraphRAG (graph-vector retrieval-augmented generation) system for professor recommendations using HelixDB’s unified graph-vector database. By combining graph traversal (to model relationships such as professor-department-university) with vector embeddings (for semantic similarity across research interests), the system delivers accurate, context-aware academic recommendations. The guide covers dataset design, schema definition, the Python implementation, query examples, and best practices, and shows how to integrate with LLMs (such as Google Gemini) for natural language querying. The approach scales and extends to other domains, making it suitable for both small and large academic networks.
What is the main advantage of using HelixDB for professor recommendations?
* HelixDB enables hybrid search by combining graph traversal for structured relationships and vector embeddings for semantic similarity, resulting in more accurate and context-aware recommendations (HelixDB Blog).
How is the academic dataset structured in this system?
* The dataset includes professor details, departments, universities, labs, and research areas, all modeled as nodes and edges in a graph schema to provide rich relational context.
What types of queries can the system answer?
* The system supports semantic queries (e.g., “Find professors working on computer vision for basketball”), structured filters (by university and department), and combined queries for nuanced recommendations.
How does HelixDB integrate with LLMs and external tools?
* HelixDB can be connected to LLMs (like Google Gemini) via custom tools, enabling natural language queries to be translated into graph-vector searches for real-time recommendations.
Who authored this guide and where can I find more resources?
* The tutorial was authored by the founders of HelixDB; more documentation, SDKs, and community support are available at helix-db.com, GitHub, and their Discord.
“By leveraging HelixDB’s unified graph and vector capabilities, you can build scalable, semantically-rich recommendation systems that go far beyond keyword search.” — Founders, HelixDB
Ready to build your own intelligent academic search or recommendation engine? Explore the HelixDB documentation, try the live demo, or join the community to get started.
Building a GraphRAG System for Professor Recommendations with HelixDB
Founders, HelixDB
GraphRAG Tutorial
Introduction
In this comprehensive guide, we'll build a Graph-Vector RAG (Retrieval-Augmented Generation) system using HelixDB to help students find professors based on their research interests. This system combines the power of graph databases with vector embeddings to provide intelligent, context-aware recommendations.
Our system will be able to answer queries like:
"What professors does High-Energy Physics research?"
"What professors are working in X University?"
"What professors are working in the Computer Science department?"
"What professors are working in X University and are in the Computer Science department?"
"I like doing research in Large Language Models, can you recommend some professors at X University?"
Understanding the Dataset
We'll work with a comprehensive professor dataset containing:
Name and Title
Department(s) and University
Personal page URL
Short biography
Key Research Areas with descriptions
Lab affiliations and research focus
Here's an example professor record:
{
"name": "James",
"title": "Assistant Professor",
"page": "<a target="_blank" href="https://james.com">https://james.com</a>",
"department": ["Computer Science"],
"university": ["Uni X"],
"bio": "James is an Assistant Professor whose work sits at the intersection
of basketball analytics, computer vision, and large-scale machine learning.
His research focuses on turning raw player-tracking video, wearable-sensor streams,
and play-by-play logs into actionable insights for teams, coaches, and broadcasters.
Signature projects include ShotNet— a deep learning model that predicts shot success
probability in real time— and DunkGPT, a language model fine-tuned on millions of play descriptions
to generate advanced scouting reports.",
"key_research_areas": [
{
"area": "Computer Vision for Basketball",
"description": "Designing CNN and Transformer architectures that track
player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics."
},
{
"area": "Predictive Modelling & Simulation",
"description": "Building Monte-Carlo and sequence models that forecast
possession outcomes and season performance using play-by-play and spatial data."
},
{
"area": "Sports Analytics with Large Language Models",
"description": "Leveraging LLMs to explain model outputs, auto-generate commentary,
and mine historical game archives for strategic patterns."
},
{
"area": "Wearable Sensor Data Mining",
"description": "Applying time-series and graph learning techniques to
inertial-measurement signals for fatigue monitoring and injury prevention."
},
{
"area": "Fairness & Ethics in Sports AI",
"description": "Studying algorithmic bias and ensuring equitable analytics across different leagues,
genders, and play styles."
}
],
"labs": [
{
"name": "Basketball Data Science Lab",
"research_focus": "An interdisciplinary group combining data science, biomechanics,
and sport psychology to create next-generation analytics tools for basketball."
}
]
}
Why Graph-Vector RAG?
Traditional search systems struggle with semantic understanding and relationship traversal. By combining graph databases with vector embeddings, we get:
Graph Capabilities: Fast traversal of relationships (professor → department → university)
Vector Search: Semantic similarity matching for research areas
Hybrid Queries: Filter by exact matches (university) while finding similar research interests
At scale with 1000+ universities and millions of relationships, this hybrid approach provides both precision and semantic understanding.
System Architecture
Graph Schema Design
Our graph consists of the following components:
Nodes:
Professor: Core entity with name, title, page, bio
ResearchArea: Research domains with area and description
Department: Academic departments with name
University: Institutions with name
Lab: Research labs with name and research_focus
ResearchAreaAndDescriptionEmbedding: Vector node for semantic search
Edges:
Professor → ResearchArea
Professor → Department
Professor → University
Professor → Lab
Professor → ResearchAreaAndDescriptionEmbedding
Setting Up HelixDB
Step 1: Initialize HelixDB Project
mkdir professor_rag_system
cd professor_rag_system
helix init
Step 2: Define the Graph Schema
Create schema.hx:
// NODES //
N::Professor {
name: String,
title: String,
page: String,
bio: String,
}
N::ResearchArea {
research_area: String,
description: String,
}
N::Department {
name: String,
}
N::University {
name: String,
}
N::Lab {
name: String,
research_focus: String,
}
// EDGES //
E::HasLab {
From: Professor,
To: Lab,
}
E::HasResearchArea {
From: Professor,
To: ResearchArea,
}
E::HasUniversity {
From: Professor,
To: University,
Properties: {
since: Date DEFAULT NOW,
}
}
E::HasDepartment {
From: Professor,
To: Department,
Properties: {
since: Date DEFAULT NOW,
}
}
// VECTOR NODES //
V::ResearchAreaAndDescriptionEmbedding {
areas_and_descriptions: String,
}
E::HasResearchAreaAndDescriptionEmbedding {
From: Professor,
To: ResearchAreaAndDescriptionEmbedding,
Properties: {
areas_and_descriptions: String,
}
}
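The Python code in the next steps calls HelixDB queries by name (create_professor, link_professor_to_research_area, search_similar_professors_by_research_area_and_description, and so on). In a HelixDB project these are written in HelixQL alongside the schema (typically in queries.hx) and deployed with the Helix CLI before the client can call them. The snippet below is only a rough, illustrative sketch of a few of them; the exact HelixQL syntax and type names may vary between HelixDB versions, so treat it as a starting point and check the query documentation for your release:
// queries.hx (illustrative sketch; verify against the HelixQL docs)
QUERY create_professor (name: String, title: String, page: String, bio: String) =>
    professor <- AddN<Professor>({name: name, title: title, page: page, bio: bio})
    RETURN professor

QUERY link_professor_to_research_area (professor_id: ID, research_area_id: ID) =>
    professor <- N<Professor>(professor_id)
    research_area <- N<ResearchArea>(research_area_id)
    link <- AddE<HasResearchArea>::From(professor)::To(research_area)
    RETURN link

QUERY create_research_area_embedding (professor_id: ID, areas_and_descriptions: String, vector: [F64]) =>
    professor <- N<Professor>(professor_id)
    embedding <- AddV<ResearchAreaAndDescriptionEmbedding>(vector, {areas_and_descriptions: areas_and_descriptions})
    link <- AddE<HasResearchAreaAndDescriptionEmbedding>({areas_and_descriptions: areas_and_descriptions})::From(professor)::To(embedding)
    RETURN embedding

QUERY search_similar_professors_by_research_area_and_description (query_vector: [F64], k: I64) =>
    embeddings <- SearchV<ResearchAreaAndDescriptionEmbedding>(query_vector, k)
    professors <- embeddings::In<HasResearchAreaAndDescriptionEmbedding>
    RETURN professors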
With the schema in place, set up the Python side: connect to HelixDB and load an embedding model.
import helix
from sentence_transformers import SentenceTransformer
# Initialize embedding model (Qwen 0.6B for this example)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
# Connect to HelixDB
db = helix.Client(local=True, port=6969, verbose=True)
Step 3: Create Base Nodes
# Define research areas with descriptions
research_areas = {
"Computer Vision for Basketball": "Designing CNN and Transformer architectures that track player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics.",
"Predictive Modelling & Simulation": "Building Monte-Carlo and sequence models that forecast possession outcomes and season performance using play-by-play and spatial data.",
"Sports Analytics with Large Language Models": "Leveraging LLMs to explain model outputs, auto-generate commentary, and mine historical game archives for strategic patterns.",
"Wearable Sensor Data Mining": "Applying time-series and graph learning techniques to inertial-measurement signals for fatigue monitoring and injury prevention.",
"Fairness & Ethics in Sports AI": "Studying algorithmic bias and ensuring equitable analytics across different leagues, genders, and play styles."
}
# Create research area nodes and store IDs
research_area_ids = {}
for research_area in research_areas:
research_area_node = db.query("create_research_area", {"area": research_area})
research_area_ids[research_area] = research_area_node[0]['research_area']['id']
# Create department nodes
departments = ["Computer Science", "Mathematics", "Physics", "Chemistry", "Biology"]
department_ids = {}
for department in departments:
department_node = db.query("create_department", {"name": department})
department_ids[department] = department_node[0]['department']['id']
# Create university nodes
universities = ["Uni X", "Uni Y", "Uni Z"]
university_ids = {}
for university in universities:
university_node = db.query("create_university", {"name": university})
university_ids[university] = university_node[0]['university']['id']
# Create lab nodes
labs = {
"Basketball Data Science Lab": "An interdisciplinary group combining data science, biomechanics, and sport psychology to create next-generation analytics tools for basketball."
}
lab_ids = {}
for lab in labs:
lab_node = db.query("create_lab", {"name": lab, "research_focus": labs[lab]})
lab_ids[lab] = lab_node[0]['lab']['id']
Step 4: Ingest Professor Data
# `professors` is the full list of professor records in the format shown earlier, loaded from your dataset
for professor in professors:
# Create Professor Node
professor_node = db.query("create_professor", {
"name": professor["name"],
"title": professor["title"],
"page": professor["page"],
"bio": professor["bio"]
})
professor_id = professor_node[0]['professor']['id']
# Link Professor to Research Areas
for research_area in professor["key_research_areas"]:
if research_area['area'] in research_areas:
research_area_id = research_area_ids[research_area['area']]
db.query("link_professor_to_research_area", {
"professor_id": professor_id,
"research_area_id": research_area_id
})
# Link Professor to Departments
for department in professor["department"]:
if department in department_ids:
department_id = department_ids[department]
db.query("link_professor_to_department", {
"professor_id": professor_id,
"department_id": department_id
})
# Link Professor to Universities
for university in professor["university"]:
if university in university_ids:
university_id = university_ids[university]
db.query("link_professor_to_university", {
"professor_id": professor_id,
"university_id": university_id
})
# Link Professor to Labs
for lab in professor["labs"]:
if lab['name'] in lab_ids:
lab_id = lab_ids[lab['name']]
db.query("link_professor_to_lab", {
"professor_id": professor_id,
"lab_id": lab_id
})
# Create and store research area embeddings
research_area_and_description = "\n".join([
research_area['area'] + ": " + research_area['description']
for research_area in professor['key_research_areas']
])
research_area_and_description_embedding = model.encode(research_area_and_description).astype(float).tolist()
db.query("create_research_area_embedding", {
"professor_id": professor_id,
"areas_and_descriptions": research_area_and_description,
"vector": research_area_and_description_embedding
})
Querying the System
Vector Similarity Search
Find professors with similar research interests:
query = "Find me a professor who does computer vision for basketball"
embedded_query_vector = model.encode(query).astype(float).tolist()
results = db.query("search_similar_professors_by_research_area_and_description", {
"query_vector": embedded_query_vector,
"k": 5
})
print(results)
Example Output:
[{'professors': [{
'page': 'https://www.example.com',
'label': 'Professor',
'bio': 'James is an Assistant Professor whose work sits at the intersection of basketball analytics...',
'name': 'James',
'id': '...',
'title': 'Assistant Professor'
}]}]
Graph-Based Queries
1. Find professors by research area
professors_by_research_area = db.query("get_professor_by_research_area_name", {
"research_area_name": "Computer Vision for Basketball"
})
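2. Filter by university and department
The structured queries promised in the introduction ("professors at X University", "professors in the Computer Science department", or both) follow the same pattern of calling a named HelixQL query from Python. The query name used below is hypothetical and assumed to be defined in queries.hx like the others:
professors_at_uni_in_cs = db.query("get_professors_by_university_and_department", {
    "university_name": "Uni X",
    "department_name": "Computer Science"
})
For hybrid recommendations ("I like doing research in Large Language Models, can you recommend some professors at X University?"), run the vector search for semantic matches and then apply the structured filter, either client-side by intersecting the two result sets or directly inside the HelixQL query.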
Integrating with LLMs
You can create custom tools for LLMs to interact with your HelixDB system. Here's an example using Google's Gemini:
from google import genai
from google.genai import types
import dotenv
import os
dotenv.load_dotenv()
def search_similar_professors_by_research_area_and_description(query: str) -> dict:
"""Takes the user's query and embeds it then uses the embedded query to search for similar professors
Args:
query (str): The user's query
Returns:
A list of professors who are similar to the user's query
"""
embedded_query_vector = model.encode(query).astype(float).tolist()
results = db.query("search_similar_professors_by_research_area_and_description", {
"query_vector": embedded_query_vector,
"k": 5
})
return results
# Configure Gemini with the custom tool
client = genai.Client()  # reads GEMINI_API_KEY from the environment (loaded from .env above)
config = types.GenerateContentConfig(
tools=[search_similar_professors_by_research_area_and_description]
)
# Use the tool in conversation
response = client.models.generate_content(
model="gemini-2.5-flash",
contents="Find me a professor who does computer vision for basketball",
config=config,
)
print(response.text)
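Because the Gemini SDK accepts a list of tools, you can combine structured and semantic retrieval in one conversation. The sketch below registers a second tool that wraps the hypothetical get_professors_by_university query (assumed to exist in queries.hx, as in the filter example earlier); the model then picks whichever tool fits the user's question:
def get_professors_by_university(university_name: str) -> dict:
    """Returns professors affiliated with the given university.

    Args:
        university_name (str): Exact university name, e.g. "Uni X"
    """
    # Assumes a get_professors_by_university query is defined in queries.hx
    return db.query("get_professors_by_university", {"university_name": university_name})

# Register both tools so Gemini can choose between semantic and structured search
config = types.GenerateContentConfig(
    tools=[
        search_similar_professors_by_research_area_and_description,
        get_professors_by_university,
    ]
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="I like doing research in Large Language Models, can you recommend some professors at Uni X?",
    config=config,
)
print(response.text)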
Best Practices
Data Ingestion: Process data in batches for large datasets (see the batch-embedding sketch after this list)
Embedding Models: Choose models based on your domain (academic text vs general)
Index Management: Create appropriate indexes for frequently queried fields
Query Optimization: Use graph traversal for structured queries, vectors for semantic search
Caching: Cache frequently accessed professor profiles and embeddings
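On the batching point above: sentence-transformers can encode many texts in a single call, which is much faster than embedding one professor at a time during ingestion. A minimal sketch, assuming the `model`, `db`, and `professors` objects from the ingestion step, plus a hypothetical `professor_ids` dictionary mapping professor names to their node IDs:
# Build one research-area text per professor, then encode them all in one batched call
texts = [
    "\n".join(f"{a['area']}: {a['description']}" for a in p["key_research_areas"])
    for p in professors
]
vectors = model.encode(texts, batch_size=32)

# Insert the embeddings, reusing the same HelixDB query as before
for professor, text, vector in zip(professors, texts, vectors):
    db.query("create_research_area_embedding", {
        "professor_id": professor_ids[professor["name"]],  # hypothetical lookup table
        "areas_and_descriptions": text,
        "vector": vector.astype(float).tolist(),
    })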
Conclusion
This Graph-Vector RAG system demonstrates the power of combining structured graph relationships with semantic vector search. By leveraging HelixDB's capabilities, we've created a system that can handle both exact matches (filtering by university/department) and semantic similarity (finding professors with related research interests).
The system is easily extensible: you can add new node types (publications, grants, collaborations), create more sophisticated embeddings (combining bio + research + publications), or integrate with various LLM providers for natural language interactions.
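For example, adding publications only requires a new node and edge in schema.hx, following the same conventions as the schema above (the field names here are illustrative):
N::Publication {
    title: String,
    venue: String,
}
E::HasPublication {
    From: Professor,
    To: Publication,
}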
This approach scales well from small departmental databases to large-scale academic networks with millions of relationships, providing fast, accurate, and semantically-aware professor recommendations.