
python-bson-streaming

A small Python library with tools for working with raw BSON data exported from MongoDB.

Source

This code is derived from the mongo-hadoop connector https://github.com/mongodb/mongo-hadoop/blob/master/streaming/language_support/python/pymongo_hadoop/input.py

Background

I needed a way to read BSON dumped from mongodb into python for customized map reduce scripts.

I did not want the overhead of Hadoop, and I wanted to add a fast string prematcher to expedite map/reduce.

Because the source is a NoSQL database, not all documents contain the same types of data.
Since BSON stores string values as raw text, a fast string matcher can quickly bypass unwanted documents.

This can be combined with bsonsearch to perform mongo-like queries against raw BSON data stored on disk.

Installation

The python-bson-streaming library can be installed via the distutils setup.py script included in the root directory:

Make sure you use the bson package originating from pymongo. You can install bson from pymongo using pip or distribution packaging.

```
pip install pymongo
```

DO NOT install the third-party BSON package (pip install bson) - that library does not work with python-bson-streaming.

```
python setup.py install
```

Usage

This example shows the start of a map/reduce style script.

With fast_string_prematch set, the stream does not bother converting records that do not contain "github" somewhere in the document as plaintext.

```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for dict_data in stream:
        ...  # process dict_data
```

Or, if you are passing data to another tool that can handle raw BSON (like bsonsearch), don't even bother decoding the BSON into a dict:

```python
from bsonstream import BSONInput
from sys import argv
import gzip

for file in argv[1:]:
    f = None
    if "gz" not in file:
        f = open(file, 'rb')
    else:
        f = gzip.open(file, 'rb')
    stream = BSONInput(fh=f, fast_string_prematch=b"github")
    for raw_bson in stream:
        ...  # process raw_bson
```

Benchmark

Unfortunately, I cannot make the test BSON file available.

The benchmark used an 8GB BSON file with ~2,500,000 documents of varying sizes; the gzipped BSON file compressed to 2.1GB.

Without fast string matcher

```
[bauman@localhost ~]$ time ./example_map_reduce.py bson/example.bson.1.gz

real    6m55.758s
user    6m53.541s
sys     0m1.952s
```

With fast string matcher. In this case, the fast string pattern was present in 10% of documents, resulting in time savings from not deserializing the other 90% of documents.

```
[bauman@localhost ~]$ time ./example_fast_match_map_reduce.py bson/example.bson.1.gz

real    1m16.387s
user    1m37.455s
sys     0m17.427s
```

Dependencies

Required libraries

  • pymongo

Versioning
