
python-bson-streaming

A small Python library with tools for working with raw BSON data exported from MongoDB.

Source

This code is derived from the mongo-hadoop connector https://github.com/mongodb/mongo-hadoop/blob/master/streaming/language_support/python/pymongo_hadoop/input.py

Background

I needed a way to read BSON dumped from mongodb into python for customized map reduce scripts.

I did not want the overhead of Hadoop, and I wanted to add a fast string prematcher to expedite map/reduce.

Because the source is a NoSQL database, not all documents contain the same types of data.
Since BSON stores string values as raw text, a fast string matcher can quickly bypass unwanted documents.

This can be combined with bsonsearch to perform mongo-like queries against raw BSON data stored on disk.

Installation

The python-bson-streaming library can be installed via the distutils setup.py script included in the root directory:

Make sure you use the bson package originating from pymongo. You can install bson from pymongo using pip or distribution packaging.

```
pip install pymongo
```

DO NOT install the third-party BSON package (pip install bson) - that library does not work with python-bson-streaming.

```
python setup.py install
```

Usage

This example shows the start of a map/reduce style script.

With fast_string_prematch set, the stream does not bother converting records that do not contain "github" somewhere in the document as plaintext.

```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for dict_data in stream:
        ...  # process dict_data
```

Or, if you are passing data to another tool that can handle raw BSON (like bsonsearch), don't even bother decoding the BSON into a dict:

```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for raw_bson in stream:
        ...  # process raw_bson
```

Benchmark

Unfortunately, I cannot make the test BSON file available.

The benchmark used an 8GB BSON file with ~2,500,000 documents of varying sizes; the gzipped BSON file compressed to 2.1GB.

Without fast string matcher

```
[bauman@localhost ~]$ time ./example_map_reduce.py bson/example.bson.1.gz

real    6m55.758s
user    6m53.541s
sys     0m1.952s
```

With fast string matcher. In this case, the fast string pattern was present in 10% of documents, resulting in time savings from not deserializing the other 90% of documents.

```
[bauman@localhost ~]$ time ./example_fast_match_map_reduce.py bson/example.bson.1.gz

real    1m16.387s
user    1m37.455s
sys     0m17.427s
```

Dependencies

Required libraries

  • pymongo

Versioning
