Building a GraphRAG System for Professor Recommendations with HelixDB

Introduction

In this comprehensive guide, we'll build a Graph-Vector RAG (Retrieval-Augmented Generation) system using HelixDB to help students find professors based on their research interests. This system combines the power of graph databases with vector embeddings to provide intelligent, context-aware recommendations.

Our system will be able to answer queries like:

"Which professors do High-Energy Physics research?"
"Which professors work at X University?"
"Which professors work in the Computer Science department?"
"Which professors work at X University and are in the Computer Science department?"
"I like doing research in Large Language Models, can you recommend some professors at X University?"

Understanding the Dataset

We'll work with a comprehensive professor dataset containing:

Name and Title
Department(s) and University
Personal page URL
Short biography
Key Research Areas with descriptions
Lab affiliations and research focus

Here's an example professor record:

{
    "name": "James",
    "title": "Assistant Professor",
    "page": "https://james.com",
    "department": ["Computer Science"],
    "university": ["Uni X"],
    "bio": "James is an Assistant Professor whose work sits at the intersection
     of basketball analytics, computer vision, and large-scale machine learning. 
     His research focuses on turning raw player-tracking video, wearable-sensor streams,
     and play-by-play logs into actionable insights for teams, coaches, and broadcasters. 
     Signature projects include ShotNet— a deep learning model that predicts shot success
     probability in real time— and DunkGPT, a language model fine-tuned on millions of play descriptions
     to generate advanced scouting reports.",
    "key_research_areas": [
        {
            "area": "Computer Vision for Basketball",
            "description": "Designing CNN and Transformer architectures that track
             player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics."
        },
        {
            "area": "Predictive Modelling & Simulation",
            "description": "Building Monte-Carlo and sequence models that forecast 
            possession outcomes and season performance using play-by-play and spatial data." 
        },
        {
            "area": "Sports Analytics with Large Language Models",
            "description": "Leveraging LLMs to explain model outputs, auto-generate commentary,
             and mine historical game archives for strategic patterns." 
        },
        {
            "area": "Wearable Sensor Data Mining",
            "description": "Applying time-series and graph learning techniques to
             inertial-measurement signals for fatigue monitoring and injury prevention." 
        },
        {
            "area": "Fairness & Ethics in Sports AI",
            "description": "Studying algorithmic bias and ensuring equitable analytics across different leagues,
             genders, and play styles." 
        }
    ],
    "labs": [
        {
            "name": "Basketball Data Science Lab",
            "research_focus": "An interdisciplinary group combining data science, biomechanics, 
            and sport psychology to create next-generation analytics tools for basketball."
        }
    ]
}
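Before ingesting records like the one above, it can help to check that each one has the expected fields. The following is a minimal sketch, not part of HelixDB: the helper names `validate_professor` and `load_professors`, and the assumption that the records live in a single JSON array file, are hypothetical.

```python
import json

# Top-level keys taken from the example record above.
REQUIRED_KEYS = {"name", "title", "page", "department", "university",
                 "bio", "key_research_areas", "labs"}


def validate_professor(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for area in record.get("key_research_areas", []):
        if not {"area", "description"}.issubset(area):
            problems.append(f"research area needs 'area' and 'description': {area}")
    for lab in record.get("labs", []):
        if not {"name", "research_focus"}.issubset(lab):
            problems.append(f"lab needs 'name' and 'research_focus': {lab}")
    return problems


def load_professors(path: str) -> list[dict]:
    """Load a JSON array of professor records, raising on the first invalid one."""
    with open(path, encoding="utf-8") as f:
        professors = json.load(f)
    for record in professors:
        problems = validate_professor(record)
        if problems:
            raise ValueError(f"{record.get('name', '<unnamed>')}: {problems}")
    return professors
```

The list returned by `load_professors` has the same shape as the `professors` list that the ingestion step iterates over.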

Why Graph-Vector RAG?

Traditional search systems struggle with semantic understanding and relationship traversal. By combining graph databases with vector embeddings, we get:

Graph Capabilities: Fast traversal of relationships (professor → department → university)
Vector Search: Semantic similarity matching for research areas
Hybrid Queries: Filter by exact matches (university) while finding similar research interests

At scale with 1000+ universities and millions of relationships, this hybrid approach provides both precision and semantic understanding.
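To make the hybrid idea concrete, here is a toy, in-memory sketch that does not use HelixDB at all. It first applies an exact-match filter (university) and then ranks the remaining candidates by cosine similarity. The three-dimensional vectors and the names `PROFESSORS` and `hybrid_search` are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: hypothetical professors with made-up embedding vectors.
PROFESSORS = [
    {"name": "A", "university": "Uni X", "vector": [0.9, 0.1, 0.0]},
    {"name": "B", "university": "Uni X", "vector": [0.1, 0.9, 0.0]},
    {"name": "C", "university": "Uni Y", "vector": [1.0, 0.0, 0.0]},
]


def hybrid_search(query_vector: list[float], university: str, k: int = 5) -> list[str]:
    """Exact filter on university, then rank survivors by similarity."""
    candidates = [p for p in PROFESSORS if p["university"] == university]
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["name"] for p in ranked[:k]]


print(hybrid_search([1.0, 0.0, 0.0], "Uni X"))  # ['A', 'B']
```

Professor C has the closest vector to the query but is excluded by the exact filter, which is the behavior a hybrid query is meant to provide.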

System Architecture

Graph Schema Design

Our graph consists of the following components:

Nodes:

Professor: Core entity with name, title, page, bio
ResearchArea: Research domains with research_area and description
Department: Academic departments with name
University: Institutions with name
Lab: Research labs with name and research_focus
ResearchAreaAndDescriptionEmbedding: Vector node for semantic search

Edges:

Professor → ResearchArea
Professor → Department
Professor → University
Professor → Lab
Professor → ResearchAreaAndDescriptionEmbedding

Setting Up HelixDB

Step 1: Initialize HelixDB Project

mkdir professor_rag_system
cd professor_rag_system
helix init

Step 2: Define the Graph Schema

Create schema.hx:

// NODES //
N::Professor {
    name: String,
    title: String,
    page: String,
    bio: String,
}
N::ResearchArea {
    research_area: String,
    description: String,
}
N::Department {
    name: String,
}   
N::University {
    name: String,
}
N::Lab {
    name: String,
    research_focus: String,
}
// EDGES //
E::HasLab {
    From: Professor,
    To: Lab,
}
E::HasResearchArea {
    From: Professor,
    To: ResearchArea,
}
E::HasUniversity {
    From: Professor,
    To: University,
    Properties: {
        since: Date DEFAULT NOW,
    }
}
E::HasDepartment {
    From: Professor,
    To: Department,
    Properties: {
        since: Date DEFAULT NOW,
    }
}
// VECTOR NODES //
V::ResearchAreaAndDescriptionEmbedding {
    areas_and_descriptions: String,
}
E::HasResearchAreaAndDescriptionEmbedding {
    From: Professor,
    To: ResearchAreaAndDescriptionEmbedding,
    Properties: {
        areas_and_descriptions: String,
    }
}

Step 3: Create HelixQL Queries

Create query.hx:

// Node Creation Queries
QUERY create_professor(name: String, title: String, page: String, bio: String ) =>
    professor <- AddN<Professor>({ name: name, title: title, page: page, bio: bio })
    RETURN professor
QUERY create_department(name: String) =>
    department <- AddN<Department>({ name: name })
    RETURN department
QUERY create_university(name: String) =>
    university <- AddN<University>({ name: name })
    RETURN university
QUERY create_lab(name: String, research_focus: String) =>
    lab <- AddN<Lab>({ name: name, research_focus: research_focus })
    RETURN lab
QUERY create_research_area(name: String) =>
    research_area <- AddN<ResearchArea>({ research_area: name })
    RETURN research_area
// Relationship Linking Queries
QUERY link_professor_to_department(professor_id: ID, department_id: ID) =>
    professor <- N<Professor>(professor_id)
    department <- N<Department>(department_id)
    edge <- AddE<HasDepartment>::From(professor)::To(department)
    RETURN edge
QUERY link_professor_to_university(professor_id: ID, university_id: ID) =>
    professor <- N<Professor>(professor_id)
    university <- N<University>(university_id)
    edge <- AddE<HasUniversity>::From(professor)::To(university)
    RETURN edge
QUERY link_professor_to_lab(professor_id: ID, lab_id: ID) =>
    professor <- N<Professor>(professor_id)
    lab <- N<Lab>(lab_id)
    edge <- AddE<HasLab>::From(professor)::To(lab)
    RETURN edge
QUERY link_professor_to_research_area(professor_id: ID, research_area_id: ID) =>
    professor <- N<Professor>(professor_id)
    research_area <- N<ResearchArea>(research_area_id)
    edge <- AddE<HasResearchArea>::From(professor)::To(research_area)
    RETURN edge
// Embedding Creation and Search
QUERY create_research_area_embedding(professor_id: ID, areas_and_descriptions: String, vector: [F64]) =>
    professor <- N<Professor>(professor_id)
    research_area <- AddV<ResearchAreaAndDescriptionEmbedding>(vector, { areas_and_descriptions: areas_and_descriptions })
    edge <- AddE<HasResearchAreaAndDescriptionEmbedding>::From(professor)::To(research_area)
    RETURN research_area
QUERY search_similar_professors_by_research_area_and_description(query_vector: [F64], k: I64) =>
    vecs <- SearchV<ResearchAreaAndDescriptionEmbedding>(query_vector, k)
    professors <- vecs::In<HasResearchAreaAndDescriptionEmbedding>
    RETURN professors
QUERY get_professor_research_areas_with_descriptions(professor_id: ID) =>
    research_areas <- N<Professor>(professor_id)::Out<HasResearchAreaAndDescriptionEmbedding>
    RETURN research_areas::{areas_and_descriptions}
// Search Queries
// Each query filters Professor nodes with EXISTS so that professors are returned;
// a bare ::Out<...> traversal would return the neighboring nodes instead.
QUERY get_professor_by_research_area_name(research_area_name: String) =>
    professors <- N<Professor>::WHERE(EXISTS(_::Out<HasResearchArea>::WHERE(_::{research_area}::EQ(research_area_name))))
    RETURN professors
QUERY get_professors_by_university_name(university_name: String) =>
    professors <- N<Professor>::WHERE(EXISTS(_::Out<HasUniversity>::WHERE(_::{name}::EQ(university_name))))
    RETURN professors
QUERY get_professors_by_department_name(department_name: String) =>
    professors <- N<Professor>::WHERE(EXISTS(_::Out<HasDepartment>::WHERE(_::{name}::EQ(department_name))))
    RETURN professors
QUERY get_professors_by_university_and_department_name(university_name: String, department_name: String) =>
    professors <- N<Professor>::WHERE(AND(
        EXISTS(_::Out<HasUniversity>::WHERE(_::{name}::EQ(university_name))),
        EXISTS(_::Out<HasDepartment>::WHERE(_::{name}::EQ(department_name)))
    ))
    RETURN professors

Python Implementation

Step 1: Environment Setup

# Option A: venv + pip
python -m venv venv
source venv/bin/activate
pip install helix-py sentence-transformers

# Option B: uv (run `uv init` first if the directory has no pyproject.toml)
uv venv
uv add helix-py sentence-transformers

Step 2: Initialize Connection and Model

import helix
from sentence_transformers import SentenceTransformer
# Initialize embedding model (Qwen 0.6B for this example)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
# Connect to HelixDB
db = helix.Client(local=True, port=6969, verbose=True)

Step 3: Create Base Nodes

# Define research areas with descriptions
research_areas = {
    "Computer Vision for Basketball": "Designing CNN and Transformer architectures that track player pose, ball trajectory, and court zones to quantify defensive pressure and shooting mechanics.",
    "Predictive Modelling & Simulation": "Building Monte-Carlo and sequence models that forecast possession outcomes and season performance using play-by-play and spatial data.",
    "Sports Analytics with Large Language Models": "Leveraging LLMs to explain model outputs, auto-generate commentary, and mine historical game archives for strategic patterns.",
    "Wearable Sensor Data Mining": "Applying time-series and graph learning techniques to inertial-measurement signals for fatigue monitoring and injury prevention.",
    "Fairness & Ethics in Sports AI": "Studying algorithmic bias and ensuring equitable analytics across different leagues, genders, and play styles."
}
# Create research area nodes and store IDs
research_area_ids = {}
for research_area in research_areas:
    # The create_research_area query declares its parameter as `name`
    research_area_node = db.query("create_research_area", {"name": research_area})
    research_area_ids[research_area] = research_area_node[0]['research_area']['id']
# Create department nodes
departments = ["Computer Science", "Mathematics", "Physics", "Chemistry", "Biology"]
department_ids = {}
for department in departments:
    department_node = db.query("create_department", {"name": department})
    department_ids[department] = department_node[0]['department']['id']
# Create university nodes
universities = ["Uni X", "Uni Y", "Uni Z"]
university_ids = {}
for university in universities:
    university_node = db.query("create_university", {"name": university})
    university_ids[university] = university_node[0]['university']['id']
# Create lab nodes
labs = {
    "Basketball Data Science Lab": "An interdisciplinary group combining data science, biomechanics, and sport psychology to create next-generation analytics tools for basketball."
}
lab_ids = {}
for lab in labs:
    lab_node = db.query("create_lab", {"name": lab, "research_focus": labs[lab]})
    lab_ids[lab] = lab_node[0]['lab']['id']

Step 4: Ingest Professor Data

# `professors` is a list of professor records shaped like the example record above
for professor in professors:
    # Create Professor Node
    professor_node = db.query("create_professor", {
        "name": professor["name"], 
        "title": professor["title"], 
        "page": professor["page"], 
        "bio": professor["bio"]
    })
    
    professor_id = professor_node[0]['professor']['id']
    
    # Link Professor to Research Areas
    for research_area in professor["key_research_areas"]:
        if research_area['area'] in research_areas:
            research_area_id = research_area_ids[research_area['area']]
            db.query("link_professor_to_research_area", {
                "professor_id": professor_id, 
                "research_area_id": research_area_id
            })
    
    # Link Professor to Departments
    for department in professor["department"]:
        if department in department_ids:
            department_id = department_ids[department]
            db.query("link_professor_to_department", {
                "professor_id": professor_id, 
                "department_id": department_id
            })
    
    # Link Professor to Universities
    for university in professor["university"]:
        if university in university_ids:
            university_id = university_ids[university]
            db.query("link_professor_to_university", {
                "professor_id": professor_id, 
                "university_id": university_id
            })
    
    # Link Professor to Labs
    for lab in professor["labs"]:
        if lab['name'] in lab_ids:
            lab_id = lab_ids[lab['name']]
            db.query("link_professor_to_lab", {
                "professor_id": professor_id, 
                "lab_id": lab_id
            })
    
    # Create and store research area embeddings
    research_area_and_description = "\n".join([
        research_area['area'] + ": " + research_area['description'] 
        for research_area in professor['key_research_areas']
    ])
    
    research_area_and_description_embedding = model.encode(research_area_and_description).astype(float).tolist()
    
    db.query("create_research_area_embedding", {
        "professor_id": professor_id, 
        "areas_and_descriptions": research_area_and_description, 
        "vector": research_area_and_description_embedding
    })
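Note that the loop above silently skips any department, university, or lab that was not created in Step 3. One way to avoid dropping data is a small get-or-create helper around the ID dictionaries. This is a sketch under assumptions: `get_or_create_id` is a hypothetical helper, and the `create` callable stands in for whatever call creates the node and returns its ID.

```python
from typing import Callable


def get_or_create_id(cache: dict[str, str], name: str,
                     create: Callable[[str], str]) -> str:
    """Return the cached node ID for `name`, creating the node on first sight."""
    if name not in cache:
        cache[name] = create(name)
    return cache[name]


# Demo with a stand-in create function (no database needed).
created = []


def fake_create(name: str) -> str:
    created.append(name)
    return f"id-{len(created)}"


department_ids_demo: dict[str, str] = {}
first = get_or_create_id(department_ids_demo, "Statistics", fake_create)
second = get_or_create_id(department_ids_demo, "Statistics", fake_create)
print(first == second, created)  # True ['Statistics']
```

In the ingestion loop, `create` could be a lambda that wraps the existing call, for example `lambda n: db.query("create_department", {"name": n})[0]['department']['id']`, with `department_ids` passed as the cache.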

Querying the System

Vector Similarity Search

Find professors with similar research interests:

query = "Find me a professor who does computer vision for basketball"
embedded_query_vector = model.encode(query).astype(float).tolist()
    
results = db.query("search_similar_professors_by_research_area_and_description", {
    "query_vector": embedded_query_vector, 
    "k": 5
})
print(results)

Example Output:

[{'professors': [{
    'page': 'https://www.example.com',
    'label': 'Professor', 
    'bio': 'James is an Assistant Professor whose work sits at the intersection of basketball analytics...',
    'name': 'James', 
    'id': '...', 
    'title': 'Assistant Professor'
}]}]
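The raw response is a list of dicts keyed by the query's return variable. A small formatter can turn it into readable lines before showing it to a user or passing it to an LLM. This sketch assumes the response shape shown in the example output above; `summarize_professors` is a hypothetical helper name.

```python
def summarize_professors(results: list[dict], bio_chars: int = 80) -> list[str]:
    """Turn a search response into one 'Name (Title): bio...' line per professor."""
    lines = []
    for row in results:
        for prof in row.get("professors", []):
            bio = prof.get("bio", "")
            # Truncate long biographies so each summary stays on one line.
            if len(bio) > bio_chars:
                bio = bio[:bio_chars].rstrip() + "..."
            lines.append(f"{prof['name']} ({prof['title']}): {bio}")
    return lines


sample = [{"professors": [{
    "name": "James",
    "title": "Assistant Professor",
    "bio": "Works on basketball analytics.",
    "id": "...",
}]}]
for line in summarize_professors(sample):
    print(line)  # James (Assistant Professor): Works on basketball analytics.
```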

Graph-Based Queries

1. Find professors by research area

professors_by_research_area = db.query("get_professor_by_research_area_name", {
    "research_area_name": "Computer Vision for Basketball"
})

2. Find professors by university

professors_by_university = db.query("get_professors_by_university_name", {
    "university_name": "Uni X"
})

3. Find professors by department

professors_by_department = db.query("get_professors_by_department_name", {
    "department_name": "Computer Science"
})

4. Complex filtering: University AND Department

professors_filtered = db.query("get_professors_by_university_and_department_name", {
    "university_name": "Uni X", 
    "department_name": "Computer Science"
})
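The last example question in the introduction ("recommend professors at X University who work on Large Language Models") mixes a semantic interest with an exact filter. One possible approach, sketched below, is to run the vector search and the university query separately and intersect the results on the client by node ID. This assumes both responses use the shape shown in the example output and that the university query returns Professor nodes; `filter_ranked_by_ids` is a hypothetical helper.

```python
def filter_ranked_by_ids(ranked_results: list[dict], allowed_results: list[dict],
                         key: str = "professors") -> list[dict]:
    """Keep vector-search hits whose id also appears in the graph-filter results.

    The order of `ranked_results` (most similar first) is preserved.
    """
    allowed_ids = {p["id"] for row in allowed_results for p in row.get(key, [])}
    return [p for row in ranked_results for p in row.get(key, [])
            if p["id"] in allowed_ids]


# Stand-in responses; in practice these would come from
# db.query("search_similar_professors_by_research_area_and_description", ...)
# and db.query("get_professors_by_university_name", ...).
ranked = [{"professors": [{"id": "1", "name": "A"},
                          {"id": "2", "name": "B"},
                          {"id": "3", "name": "C"}]}]
at_uni_x = [{"professors": [{"id": "3", "name": "C"}, {"id": "1", "name": "A"}]}]

print([p["name"] for p in filter_ranked_by_ids(ranked, at_uni_x)])  # ['A', 'C']
```

Because the filter runs after the top-k vector search, a larger `k` may be needed so that enough candidates survive the intersection.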

Retrieving Research Details

Get the full research areas and descriptions for a specific professor:

prof_research_areas = db.query("get_professor_research_areas_with_descriptions", {
    "professor_id": results[0]['professors'][0]['id']
})
print(prof_research_areas)

Integration with LLMs

You can create custom tools for LLMs to interact with your HelixDB system. Here's an example using Google's Gemini:

from google import genai
from google.genai import types
import dotenv
import os
dotenv.load_dotenv()
def search_similar_professors_by_research_area_and_description(query: str) -> dict:
    """Embed the user's query and use the embedding to search for professors with similar research areas
    
    Args:
        query (str): The user's query
    
    Returns:
        A list of professors who are similar to the user's query
    """
    embedded_query_vector = model.encode(query).astype(float).tolist()
    results = db.query("search_similar_professors_by_research_area_and_description", {
        "query_vector": embedded_query_vector, 
        "k": 5
    })
    return results
# Configure Gemini with the custom tool
client = genai.Client()
config = types.GenerateContentConfig(
    tools=[search_similar_professors_by_research_area_and_description]
)
# Use the tool in conversation
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find me a professor who does computer vision for basketball",
    config=config,
)   
print(response.text)
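Graph queries can be exposed as tools in the same way as the vector search. The sketch below builds a tool function bound to a client object, which also makes it testable without a running database. The factory name `make_university_tool` is hypothetical, and the only assumption about `db` is that it has a `query(name, payload)` method like the client used throughout this guide.

```python
def make_university_tool(db):
    """Build a tool function bound to a client (any object with a .query method)."""

    def get_professors_by_university(university_name: str) -> dict:
        """Look up professors by exact university name.

        Args:
            university_name (str): Exact university name, e.g. "Uni X"

        Returns:
            A dict whose "results" key holds the raw query response
        """
        results = db.query("get_professors_by_university_name",
                           {"university_name": university_name})
        return {"results": results}

    return get_professors_by_university


class StubClient:
    """Stand-in for the HelixDB client so the sketch runs without a server."""

    def query(self, name: str, payload: dict) -> list:
        return [{"professors": [{"name": "James", "queried": payload["university_name"]}]}]


tool = make_university_tool(StubClient())
print(tool("Uni X"))
```

The returned function could then be listed next to the vector search tool, for example `tools=[search_similar_professors_by_research_area_and_description, make_university_tool(db)]`, so the model can pick exact lookup or semantic search depending on the question.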

Best Practices

Data Ingestion: Process data in batches for large datasets
Embedding Models: Choose models based on your domain (academic text vs general)
Index Management: Create appropriate indexes for frequently queried fields
Query Optimization: Use graph traversal for structured queries, vectors for semantic search
Caching: Cache frequently accessed professor profiles and embeddings
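Two of the practices above, batching and caching, can be sketched in a few lines of plain Python. The helper names `batched` and `EmbeddingCache` are hypothetical, and the `embed` callable stands in for the embedding model; nothing here is a HelixDB API.

```python
from typing import Callable, Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


class EmbeddingCache:
    """Memoize embeddings by input text so repeated queries are embedded once."""

    def __init__(self, embed: Callable[[str], list]):
        self._embed = embed
        self._store: dict[str, list] = {}

    def get(self, text: str) -> list:
        if text not in self._store:
            self._store[text] = self._embed(text)
        return self._store[text]


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]

# Stand-in embedder that records how often it is called.
calls = []
cache = EmbeddingCache(lambda text: calls.append(text) or [float(len(text))])
cache.get("computer vision")
cache.get("computer vision")
print(len(calls))  # 1
```

With sentence-transformers, `model.encode` also accepts a list of strings, so each batch of texts can be embedded in one call rather than one call per professor.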

Conclusion

This Graph-Vector RAG system demonstrates the power of combining structured graph relationships with semantic vector search. By leveraging HelixDB's capabilities, we've created a system that can handle both exact matches (filtering by university/department) and semantic similarity (finding professors with related research interests).

The system is easily extensible - you can add new node types (publications, grants, collaborations), create more sophisticated embeddings (combining bio + research + publications), or integrate with various LLM providers for natural language interactions.

This approach is designed to scale from small departmental databases to large academic networks with millions of relationships, combining exact graph filters with semantic similarity to produce relevant professor recommendations.