So, you came across Neo4j. You’ve read the book and think it’s a pretty cool graph database with an intuitive query language, and you’d like to give it a try.

You’re in my shoes.

Let’s now say that you want to do a mass dataset import and start hacking around.

Read on.

In my case, I wanted to create a simple recommendation engine (the domain doesn’t matter so much). To do that, I had to import, FAST, 20 million nodes of one-to-many, sparse matrix data. This turned out to be a more complicated (and interesting) task than originally anticipated, so it became a mini-project in itself.

My goal was to use the data with Spring Data Neo4j and make it available, preferably on a Neo4j server. It turns out you can kind of do both of these, but not out of the box with standard Spring Data Neo4j mappings. Let me explain…

Spring Data Neo4j Repositories

Neo4j repositories are kind of magic. All you have to do in your code is add annotations to your domain objects and create an interface that extends GraphRepository. With that you get lots of goodness for CRUD operations on your domain objects. A shortened snippet from the example Cineasts project:

@NodeEntity
public class Movie {

    @Indexed
    String id;
}

And repository:

package org.neo4j.cineasts.repository;

import org.neo4j.cineasts.domain.Movie;
import org.springframework.data.neo4j.repository.GraphRepository;

public interface MovieRepository extends GraphRepository&lt;Movie&gt; {
}

This would be enough to give you access to basic CRUD operations. No more implementation would be required. Cool, eh? To add more elaborate graph traversal queries, you would only need to annotate methods with your Cypher query, like this:

package org.neo4j.cineasts.repository;

import org.neo4j.cineasts.domain.Movie;
import org.springframework.data.neo4j.repository.GraphRepository;

public interface MovieRepository extends GraphRepository&lt;Movie&gt; {

    @Query( "START movie=node:Movie(id={0}) MATCH movie-[rating?:rating]->() RETURN movie, AVG(rating.stars)" )
    MovieData getMovieData( String movieId );
}

Still no implementation. ‘Automagic’, you might say.

The caveat

The caveat of using bog-standard repositories and annotations as shown above is that each operation results in its own transaction (both in embedded and server mode). This slows things down significantly! It might be OK if performance is not an issue in your environment, or if you are patient, but neither was true in my case, so I had to look for other ways to import my data.

Custom Repositories, i.e. welcome Neo4jTemplate

If you’re familiar with other Spring persistence frameworks, such as Spring JDBC, this doesn’t need much explanation. The template offers a lower-level API for CRUD operations. What you win by that is control over when a transaction is committed. That way you can batch lots of operations into a single transaction, instead of the default of one transaction per operation. As an example, let’s assume a type Cart that contains a number of Products. Below is an extract of the Cart repository:


public class ManagedTransactionCartRepository implements FastCartRepository {

	private final ManagedTransaction transaction;
	private final Neo4jTemplate neo4jTemplate;

	public ManagedTransactionCartRepository(ManagedTransaction transaction, Neo4jTemplate neo4jTemplate) {
		this.transaction = transaction;
		this.neo4jTemplate = neo4jTemplate;
	}

	@Override
	public void save(Cart cart) {
		// Ensure an open transaction, committing the current batch if it is due
		transaction.prepareTransaction();
		neo4jTemplate.save(cart);
	}
}

If you’ve read the code carefully, you will be wondering about the ManagedTransaction class. That is the component that manages the transactional context in which an operation is performed. Here’s a snippet:

package com.nextmilestone.neo4j.repository;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class ManagedTransaction {

	// Batch size is a tuning knob; pick what works for your data and heap
	private static final int OPERATIONS_PER_TRANSACTION = 10000;

	private final GraphDatabaseService graphDatabaseService;
	private Transaction transaction;
	private int operationsCount;

	public ManagedTransaction(GraphDatabaseService graphDatabaseService) {
		this.graphDatabaseService = graphDatabaseService;
	}

	public void start() {
		transaction = graphDatabaseService.beginTx();
	}

	public void stop() {
		transaction.success();
		transaction.finish();
	}

	public void prepareTransaction() {
		incrementOperationsCount();
		if (shouldCreateNewTransaction()) {
			commitAndStartNewTransaction();
		}
	}

	private void incrementOperationsCount() {
		operationsCount++;
	}

	private boolean shouldCreateNewTransaction() {
		// A new transaction is due every OPERATIONS_PER_TRANSACTION operations
		return operationsCount % OPERATIONS_PER_TRANSACTION == 0;
	}

	private void commitAndStartNewTransaction() {
		stop();
		start();
	}
}
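The commit-every-N bookkeeping above can be exercised in isolation. The sketch below is a plain-Java stand-in (my own illustration, not part of the project): the Neo4j commit is replaced by a counter, and the batch size of 3 is chosen purely to keep the example small.

```java
// Stand-in for ManagedTransaction's batching logic: counts how many times a
// commit would be triggered, instead of touching a database.
public class BatchCounter {

    private final int operationsPerTransaction;
    private int operationsCount;
    private int commits;

    public BatchCounter(int operationsPerTransaction) {
        this.operationsPerTransaction = operationsPerTransaction;
    }

    // Mirrors prepareTransaction(): bump the counter and "commit" every time
    // the batch threshold is reached.
    public void prepareTransaction() {
        operationsCount++;
        if (operationsCount % operationsPerTransaction == 0) {
            commits++; // the real class would commit and begin a new transaction here
        }
    }

    public int getCommits() {
        return commits;
    }

    public static void main(String[] args) {
        BatchCounter counter = new BatchCounter(3);
        for (int i = 0; i < 10; i++) {
            counter.prepareTransaction();
        }
        // 10 operations with a batch size of 3 trigger commits after
        // operations 3, 6 and 9; the 10th waits for the final stop().
        System.out.println(counter.getCommits()); // prints 3
    }
}
```

The modulo check means the last partial batch is only flushed by the final stop() call, which is why the import driver must bracket the whole run with start() and stop().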


An important detail: since we are not managing our relationships with standard Spring Data Neo4j mappings, the relationships also have to be created via the Neo4jTemplate. That is handled in a custom relationship repository:

public class ManagedTransactionRelationshipRepository implements FastRelationshipRepository {

	private static final String CONTAINS = "CONTAINS";

	private final ManagedTransaction transaction;
	private final Neo4jTemplate neo4jTemplate;

	public ManagedTransactionRelationshipRepository(ManagedTransaction transaction, Neo4jTemplate neo4jTemplate) {
		this.transaction = transaction;
		this.neo4jTemplate = neo4jTemplate;
	}

	@Override
	public void addContainsRelationship(Cart cart, Product product) {
		Node cartNode = neo4jTemplate.getPersistentState(cart);
		Node productNode = neo4jTemplate.getPersistentState(product);
		transaction.prepareTransaction();
		neo4jTemplate.createRelationshipBetween(cartNode, productNode, CONTAINS, null);
	}
}

That should be it. Running the project, this imports 100,000 lines of CSV (nodes and relationships) in 30 seconds against an embedded database.
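To tie the pieces together, the import driver brackets the whole CSV run with one managed transaction and pushes each line through the repositories. The sketch below is illustrative, not the project's actual driver: the class names and the two-column CSV layout (cartId,productId) are assumptions, and the repository is stubbed with a counter so the flow runs without a database.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative import driver: in the real project the loop body would call
// the ManagedTransaction-backed repositories shown above.
public class CartImporter {

    // Simplified stand-in for the article's FastCartRepository
    interface FastCartRepository {
        void save(String cartId, String productId);
    }

    // Database-free stub that just counts saves
    static class CountingRepository implements FastCartRepository {
        int saves;
        public void save(String cartId, String productId) {
            saves++;
        }
    }

    private final FastCartRepository repository;

    CartImporter(FastCartRepository repository) {
        this.repository = repository;
    }

    // Each CSV line is assumed to be "cartId,productId"; one save per line.
    void importCsv(List<String> lines) {
        // real code: transaction.start();
        for (String line : lines) {
            String[] columns = line.split(",");
            repository.save(columns[0], columns[1]);
        }
        // real code: transaction.stop();  // flushes the last partial batch
    }

    public static void main(String[] args) {
        CountingRepository repository = new CountingRepository();
        new CartImporter(repository).importCsv(
                Arrays.asList("cart1,productA", "cart1,productB", "cart2,productA"));
        System.out.println(repository.saves); // 3 lines, 3 saves
    }
}
```

The key point is that start() and stop() wrap the whole loop, while ManagedTransaction.prepareTransaction() (called inside the repository save) decides when an intermediate commit happens.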

Alternatives

Looking around, there are a couple of tools that do something similar, most notably Michael Hunger’s Batch Importer, but not quite the same thing. That tool assumes a single type of node, doesn’t add indexes for the imported nodes and bypasses the core Neo4j API, so the data would not be retrievable later by Spring Data. And although it could probably be made to work for this example, it’s not generic enough to handle heterogeneous graphs.

Conclusion

Neo4j is pretty much the coolest technology that has emerged recently in NoSQL (I also like Redis). I find Cypher incredibly intuitive for queries and I’m going to use it in my next side-project (a simple recommendation engine for artist recommendations). I’m still not sure how it will scale for production and big data, i.e. data that’s not on the same machine.

Spring Data Neo4j is still a work in progress and lots can be done, especially around the performance of server-side operations via REST. For this example I had to resort to embedded mode, but I will investigate further how to use the Neo4j REST API in batch/stream mode in a similarly clean way, with custom repositories.

You can find the full code on Github. Hat tip to Michael Hunger for the help.

@iordanis_g