I started playing around with a variety of databases lately (future posts to come) because I wanted to scrape a bunch of JSON data, dump it into a database and start sorting and filtering the results. The JSON data is financial information with hundreds of data points (some in nested layers), and I wasn’t quite sure what information I even wanted to search for. There are gigabytes of raw JSON text being shoved into this database, and as I do exploratory data analysis, I wanted a nice place to store and query/filter/order/map the data without the overhead of creating lots of schemas.
So some of you might ask: why not use the JSON data type in MySQL/PostgreSQL? Sure, that works. But it gets tedious to query things nested in a JSON column, not to mention I want indexing to speed up my queries. I’m scraping this data from varying sources, and I’m not even certain which data is relevant or useful yet. So NoSQL seemed like a good fit here. From a performance standpoint, I’d probably be fine with SQL tables, as I’m only dealing with hundreds of GB of data. But that isn’t the point. I’m trying to quickly get data imported and start exploring it without the overhead of defining schemas when I’m unsure of the data layout.
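To make the "not sure which data is relevant yet" problem a bit more concrete, here is a minimal sketch of how one might survey the fields in a pile of nested JSON before deciding what to index. The function names `flatten` and `field_counts` and the sample documents are made up for illustration; this is plain Python, not tied to any particular database.

```python
from collections import Counter


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in nested dicts/lists."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(doc, list):
        # List positions are collapsed into "[]" so all items share one path.
        for item in doc:
            yield from flatten(item, f"{prefix}[].")
    else:
        yield prefix.rstrip("."), doc


def field_counts(docs):
    """Count in how many documents each dotted path appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update({path for path, _ in flatten(doc)})
    return counts


if __name__ == "__main__":
    # Hypothetical scraped documents with differing layouts.
    docs = [
        {"ticker": "ABC", "price": {"open": 1.0, "close": 2.0}},
        {"ticker": "XYZ", "price": {"close": 3.0}, "quotes": [{"bid": 1}, {"bid": 2}]},
    ]
    for path, n in field_counts(docs).most_common():
        print(path, n)
```

Paths that show up in nearly every document are reasonable candidates for an index; rare ones can probably wait.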
In this project, I’m keeping track of changes over time, so a time series database design is useful. I checked out Influx, but decided against it. Currently, I’m adding a column for the snapshot time.
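The snapshot-time approach can be sketched roughly like this. It is only an illustration in plain Python: the field name `snapshot_time` and the helpers `stamp` and `history` are my own placeholders, and the sample records are invented.

```python
from datetime import datetime, timezone


def stamp(doc, snapshot_time=None):
    """Return a copy of doc with a 'snapshot_time' field (ISO 8601, UTC)."""
    when = snapshot_time or datetime.now(timezone.utc)
    return {**doc, "snapshot_time": when.isoformat()}


def history(docs, key, value):
    """All snapshots where doc[key] == value, oldest first."""
    matches = [d for d in docs if d.get(key) == value]
    # Parse the ISO string back into a datetime so ordering is by time.
    return sorted(matches, key=lambda d: datetime.fromisoformat(d["snapshot_time"]))


if __name__ == "__main__":
    # Hypothetical records scraped on two different days.
    day1 = datetime(2020, 1, 1, tzinfo=timezone.utc)
    day2 = datetime(2020, 1, 2, tzinfo=timezone.utc)
    rows = [
        stamp({"ticker": "ABC", "price": 2}, day2),
        stamp({"ticker": "ABC", "price": 1}, day1),
        stamp({"ticker": "XYZ", "price": 9}, day1),
    ]
    for row in history(rows, "ticker", "ABC"):
        print(row["snapshot_time"], row["price"])
```

Each scrape writes a new stamped document instead of overwriting the old one, so the change history is just a filter plus a sort on the timestamp.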
Every story has an ending, and RethinkDB’s tale ends sadly. RethinkDB has ceased development. The simplicity of this tool is what drew me to it. As previously discussed, horizontal scaling and sharding are a breeze. The admin panel is really slick. You can run commands straight from the admin website. This would have been a really nice tool, a middle ground between SQL and NoSQL. Something I could add to the Batman utility belt. Alas, without future development, the project is likely to burn out like a red dwarf, slowly but surely.
Here are my ratings for RethinkDB. These are just arbitrary, not based on any sort of benchmarks or anything, so don’t take them too seriously.
- Getting Started: 9/10
- Adoption: 3/10
- Documentation: 7/10
- Scaling: 9/10
- Security: 6/10
- Similarity to SQL: 7/10
- Production Ready: Probably not due to development shutdown
In the future I’m going to review some of the following tools.
Python for big data
- h5py
- mpi4py
- dask: https://towardsdatascience.com/how-to-handle-large-datasets-in-python-with-pandas-and-dask-34f43a897d55
- blaze
- PyStore: https://medium.com/@aroussi/fast-data-store-for-pandas-time-series-data-using-pystore-89d9caeef4e2
- Pandas as big data: https://www.dataquest.io/blog/pandas-big-data/
- https://github.com/wiktorski/opentsdb_pandas
- Formats for saving pandas data: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
  - Plain-text CSV: a good old friend of a data scientist
  - Pickle: Python’s way to serialize things
  - MessagePack: it’s like JSON but fast and small
  - HDF5: a file format designed to store and organize large amounts of data
  - Feather: a fast, lightweight, and easy-to-use binary file format for storing data frames
  - Parquet: Apache Hadoop’s columnar storage format
- Timescale (looks interesting): https://docs.timescale.com/latest/tutorials/tutorial-hello-timescale
Other sqlless databases
- Couchbase
- ElasticSearch
- HBase
- Cassandra
- MongoDB
- DynamoDB
- neo4j
- Google Cloud Datastore
- Aerospike