A small Python library with tools to work with raw BSON data exported from MongoDB.
This code is derived from the mongo-hadoop connector: https://github.com/mongodb/mongo-hadoop/blob/master/streaming/language_support/python/pymongo_hadoop/input.py
I needed a way to read BSON dumped from MongoDB into Python for customized map/reduce scripts.
I did not want the overhead of Hadoop, and I wanted to add a fast string prematcher to expedite map/reduce.
Because the source is a NoSQL database, not all documents contain the same types of data.
As BSON stores strings as raw text, a fast string matcher can quickly bypass unwanted documents.
This can be combined with bsonsearch to perform mongo-like queries against raw BSON data stored on disk.
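To illustrate the prematch idea in isolation, here is a minimal sketch of the concept (not this library's implementation; the helper name, file name, and needle are made up). Every BSON document starts with a 4-byte little-endian length, so the raw bytes can be framed and scanned before any decoding happens:

```python
# Sketch of fast string prematching using only pymongo's bson module.
import struct
import bson  # the bson module shipped with pymongo

def prematched_docs(fh, needle):
    """Yield decoded documents whose raw bytes contain `needle`."""
    while True:
        length_bytes = fh.read(4)
        if len(length_bytes) < 4:
            break  # end of stream
        doc_len = struct.unpack("<i", length_bytes)[0]  # int32 length prefix
        raw = length_bytes + fh.read(doc_len - 4)       # whole document, encoded
        if needle in raw:            # cheap bytes scan, no parsing
            yield bson.decode(raw)   # pay for decoding only on a match

with open("example.bson", "rb") as f:  # illustrative file name
    for doc in prematched_docs(f, b"github"):
        print(doc)
```

The saving comes entirely from the fact that `needle in raw` is a plain bytes scan, far cheaper than building Python objects for every document.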
The python-bson-streaming library can be installed via the distutils setup.py script included in the root directory.
Make sure you use the `bson` package that originates from pymongo. You can install it using pip or your distribution's packaging:
```
pip install pymongo
```

DO NOT install the third-party BSON package (`pip install bson`); that library does not work.
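If in doubt, one way to verify which `bson` you have (an illustrative check, not part of this library) is to look for `decode_all`, which pymongo's `bson` provides:

```python
# The bson module bundled with pymongo exposes decode_all(); the
# incompatible third-party "bson" package does not.
import bson
assert hasattr(bson, "decode_all"), "wrong bson package -- reinstall pymongo"
```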
Then install this library from the repository root:

```
python setup.py install
```

The following example shows the start of a map/reduce-style script.
With `fast_string_prematch` set, the stream will not bother decoding records that do not contain "github" somewhere in the document as plaintext.
```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for dict_data in stream:
        ...  # process dict_data
```

Or, if you are passing data to another tool that can handle raw BSON (like bsonsearch), don't even bother decoding the BSON to a dict:
```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for raw_bson in stream:
        ...  # process raw_bson
```
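Building on the example above (and assuming, as it shows, that iterating the stream yields still-encoded document bytes), a filter script along these lines could hand matching documents to a downstream consumer; the file name and output channel are illustrative:

```python
# Hypothetical filter: stream raw BSON documents and write the matching
# ones, still encoded, to stdout for a consumer such as bsonsearch.
import sys
import gzip
from bsonstream import BSONInput

with gzip.open("example.bson.gz", "rb") as f:  # illustrative file name
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for raw_bson in stream:
        sys.stdout.buffer.write(raw_bson)  # pass raw bytes through untouched
```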
Unfortunately, I cannot make the test BSON file available. The benchmark used an 8 GB BSON file with ~2,500,000 documents of varying sizes; the gzipped BSON file compressed to 2.1 GB.
Without the fast string matcher:
```
[bauman@localhost ~]$ time ./example_map_reduce.py bson/example.bson.1.gz

real    6m55.758s
user    6m53.541s
sys     0m1.952s
```

With the fast string matcher. In this case, documents matching the fast string pattern were present in 10% of documents, resulting in time savings from not deserializing the other 90%:
```
[bauman@localhost ~]$ time ./example_fast_match_map_reduce.py bson/example.bson.1.gz

real    1m16.387s
user    1m37.455s
sys     0m17.427s
```

Required libraries
- [pymongo](https://pypi.org/project/pymongo/)