noSQL: when it's too big for a dictionary

How to use and optimize RocksDB for very large files

Recently, I’ve been working on a project where I needed to store key-value pairs from very large files. In Python, the first idea would be to use a dictionary.
However, I had more than 800 million key-value pairs to store, so a dictionary just wouldn’t cut it.
Enter the world of noSQL databases, and more specifically RocksDB.
Described by its creators, originally at Facebook, as a “high performance persistent key-value store”, it behaves pretty much like a dictionary, but scales very well.

Although RocksDB is written in C++, there is fortunately a Python binding, python-rocksdb, which is even available as a conda package.

It’s relatively simple to use:

import rocksdb
# Create or open database
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))

key = "Hello"
value = "World"

# Store key-value pair
db.put(key.encode(), value.encode())

# Get value for a given key
print(db.get(key.encode()).decode())
# prints "World"

# Delete a key-value pair
db.delete(key.encode())

Because RocksDB stores all data as byte strings, one needs to use .encode() to store data and .decode() to read it back.
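To make the encoding round trip concrete, here is a minimal sketch in plain Python that does not touch the database. The helper names `encode_pair` and `decode_value` are hypothetical, and the `None` handling assumes that a lookup of a missing key returns `None` rather than raising.

```python
def encode_pair(key, value):
    """Encode a (str, str) pair as UTF-8 bytes, ready to be stored."""
    return key.encode("utf-8"), value.encode("utf-8")


def decode_value(raw):
    """Decode the bytes read back from the store.

    Assumption: a missing key is reported as None, so it is passed through.
    """
    return None if raw is None else raw.decode("utf-8")


k, v = encode_pair("Hello", "World")
print(k, v)                # b'Hello' b'World'
print(decode_value(v))     # World
print(decode_value(None))  # None
```

Wrapping the conversion in small helpers keeps the .encode()/.decode() calls in one place, which helps if the encoding ever has to change.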

Batch operation

One way to speed up RocksDB is to perform operations in batches, instead of one by one.

import rocksdb
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))

batch = rocksdb.WriteBatch()
batch.put(b"key", b"v1")
batch.put(b"key", b"v2")
batch.put(b"key", b"v3")

# Apply all the operations of the batch at once
db.write(batch)

Batch limits and solutions

But beware! For very large files, the batch can become too big for your computer's memory, and you end up swapping like crazy!

When memory overflows to swap

To prevent this, we have to restrict ourselves to batches of a reasonable size, which means making more, but smaller, batches.
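The idea of cutting one huge stream of pairs into smaller batches can be sketched independently of RocksDB. The generator below is an illustration only: `chunked` is a hypothetical helper name, and the pairs are made-up data. Each chunk it yields would correspond to one write batch.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Seven made-up key-value pairs, produced lazily
pairs = ((f"key{n}", f"value{n}") for n in range(7))
sizes = [len(chunk) for chunk in chunked(pairs, 3)]
print(sizes)  # [3, 3, 1]
```

Only one chunk is held in memory at a time, so memory use depends on the chunk size rather than on the size of the input.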

But beware! RocksDB's storage system relies on many individual .sst files, which RocksDB opens in parallel to run queries and store data. The larger the database, the more files are open. And this can lead to this kind of error:

IO error: While open a file for appending: xxxxxxx.sst: Too many open files

Indeed, there is a maximum number of files that can be open at any given time by a single process.
For example, it’s 256 by default on macOS.

Luckily, to overcome this hurdle, RocksDB has the max_open_files option, which caps the number of files it keeps open at once.
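To pick a value for this option, it can help to check the limit that actually applies to the current process. The sketch below uses the standard library `resource` module (Unix only), which reports the same number as `ulimit -n`. The helper `suggest_max_open_files` and its margin of 6 descriptors are assumptions for illustration, not a RocksDB recommendation.

```python
import resource


def suggest_max_open_files(soft_limit, reserve=6):
    """Keep `reserve` descriptors free for the input file, stdout, etc.

    Hypothetical helper: the margin is an arbitrary choice.
    """
    return max(1, soft_limit - reserve)


# (soft, hard) limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
print(f"candidate max_open_files: {suggest_max_open_files(soft)}")

# With the macOS default of 256:
print(suggest_max_open_files(256))  # 250
```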

Putting it all together, this gives us the following script:

import rocksdb
from subprocess import check_output
from tqdm import tqdm

def get_nb_lines(filename):
  """A function for getting the number of lines in a file
  Args:
    filename(str): The path to a file
  Returns:
    int: Number of lines in file
  """
  cmd = f"wc -l {filename}"
  return int(check_output(cmd, shell=True).split()[0])

OPTS = rocksdb.Options()
OPTS.create_if_missing = True
OPTS.max_open_files = 250

# Instantiating the database
db = rocksdb.DB("mybigdata.db", OPTS)

# The big file we want to store in RocksDB
key_value_file = "very_large_file.tsv"

# We get the number of key-value pairs in the file
nb_key_value_pairs = get_nb_lines(key_value_file)

# Starting our first batch
batch = rocksdb.WriteBatch()

# Key-value pair counter
i = 0

# Number of batches, the more, the less memory used
max_batches = 100

# Setting the batch size: how many key-value pairs go in each batch
batch_size = max(1, nb_key_value_pairs // max_batches)

with open(key_value_file) as bigfile:
  for line in tqdm(bigfile, total=nb_key_value_pairs):
    i += 1
    # Each line looks like: key [tab] value
    linesplit = line.split()
    key = linesplit[0]
    value = linesplit[1]
    batch.put(bytes(key, encoding='utf8'), bytes(value, encoding='utf8'))
    # If we reached the batch size, store it in RocksDB, and create a new batch
    if i % batch_size == 0:
      db.write(batch)
      batch = rocksdb.WriteBatch()
# Store the remaining key-value pairs
db.write(batch)
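
One detail worth keeping in mind: the script above splits each line on any whitespace, so a value that itself contains a space would be cut short. If that can happen in your file, a possible variant is to split on the first tab only. This is a sketch with a hypothetical helper name and made-up sample data:

```python
def parse_key_value(line):
    """Split one 'key<TAB>value' line into a (key, value) pair of bytes."""
    # Split on the first tab only, so spaces inside the value are kept
    key, value = line.rstrip("\n").split("\t", 1)
    return key.encode("utf8"), value.encode("utf8")


print(parse_key_value("sample_1\tHomo sapiens\n"))
# (b'sample_1', b'Homo sapiens')
```
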
Maxime Borry, PhD.

Bioinformatician - Postdoctoral Researcher at the Max Planck Institute for Evolutionary Anthropology