InExtremo/tabula-java

tabula-java

Join the chat at https://gitter.im/tabulapdf/tabula-java

tabula-java is a library for extracting tables from PDF files — it is the table extraction engine that used to power Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.

(This is the new version of the extraction engine; the previous code can be found at tabula-extractor.)

© 2014-2016 Manuel Aristarán. Available under MIT License. See LICENSE.

Download

Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux, from our releases page.

Usage Examples

tabula-java provides a command line application:

    $ java -jar ./target/tabula-0.9.1-jar-with-dependencies.jar --help
    usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f <FORMAT>]
           [-g] [-h] [-i] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s <PASSWORD>]
           [-u] [-v]

    Tabula helps you extract tables from PDFs

     -a,--area <AREA>           Portion of the page to analyze
                                (top,left,bottom,right). Example: --area
                                269.875,12.75,790.5,561. Default is entire page
     -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3
     -d,--debug                 Print detected table areas instead of processing.
     -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory
     -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
     -g,--guess                 Guess the portion of the page to analyze per page.
     -h,--help                  Print this help text.
     -i,--silent                Suppress all stderr output.
     -n,--no-spreadsheet        Force PDF not to be extracted using
                                spreadsheet-style extraction (if there are ruling
                                lines separating each cell, as in a PDF of an
                                Excel spreadsheet)
     -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                                Default: -
     -p,--pages <PAGES>         Comma separated list of ranges, or all. Examples:
                                --pages 1-3,5-7, --pages 3 or --pages all.
                                Default is --pages 1
     -r,--spreadsheet           Force PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines separating
                                each cell, as in a PDF of an Excel spreadsheet)
     -s,--password <PASSWORD>   Password to decrypt document. Default is empty
     -u,--use-line-returns      Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
     -v,--version               Print version and exit.

It also includes a debugging tool; run java -cp ./target/tabula-0.9.1-jar-with-dependencies.jar technology.tabula.debug.Debug -h to see the available options.
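The flags in the help text can be combined into a single invocation. The sketch below shows one way to assemble such a command line from Java and hand it to a child process. TabulaCommand and its methods are hypothetical names used only for illustration; they are not part of tabula-java. It also assumes that the input PDF is passed as a trailing positional argument, which the usage line above does not show.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of tabula-java): builds an argument list for
// the command-line application using only flags listed in the help text.
final class TabulaCommand {

    // area is {top, left, bottom, right} as described for --area, or null
    // to analyze the entire page (the documented default).
    static List<String> build(String jarPath, String pdfPath, String pages,
                              double[] area, String format, String outFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("--pages");
        cmd.add(pages);                       // e.g. "1-3,5-7", "3" or "all"
        if (area != null) {
            if (area.length != 4) {
                throw new IllegalArgumentException("area needs top,left,bottom,right");
            }
            cmd.add("--area");
            cmd.add(area[0] + "," + area[1] + "," + area[2] + "," + area[3]);
        }
        cmd.add("--format");
        cmd.add(format);                      // CSV, TSV or JSON
        cmd.add("--outfile");
        cmd.add(outFile);
        cmd.add(pdfPath);                     // assumption: PDF is the last argument
        return cmd;
    }

    // Runs the command, forwarding the child's stdout/stderr, and returns its exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "./target/tabula-0.9.1-jar-with-dependencies.jar",
                "input.pdf",                  // placeholder file name
                "1-3,5-7",
                new double[] {269.875, 12.75, 790.5, 561},
                "CSV",
                "out.csv");
        // Print the command instead of running it, so the sketch works without the jar.
        System.out.println(String.join(" ", cmd));
    }
}
```

Calling TabulaCommand.run(cmd) instead of printing would launch the extraction, provided the jar and the PDF exist at the given paths.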

You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.

JVM start-up time accounts for much of the cost of the tabula command, so if you're extracting tables from many PDFs, you have a few options for speeding it up:

  • the drip utility
  • the Ruby, R, and Node.js bindings
  • writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java
  • waiting for us to implement an API/server-style system (it's on the roadmap)
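To get a rough sense of the per-invocation overhead mentioned above, the following sketch times the launch of a fresh JVM that does nothing but print its version. JvmStartupTimer is a hypothetical name, and java -version is only a stand-in: it gives a lower bound on start-up cost, not a measurement of tabula itself. The sketch assumes Java 9 or later (for ProcessBuilder.Redirect.DISCARD).

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper (not part of tabula-java): measures how long it takes
// to start and stop an otherwise idle JVM.
final class JvmStartupTimer {

    // Launches "<java.home>/bin/java -version" and returns elapsed wall-clock milliseconds.
    static long timeJvmLaunchMillis() {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        try {
            long start = System.nanoTime();
            Process p = new ProcessBuilder(javaBin, "-version")
                    .redirectErrorStream(true)                        // -version prints to stderr
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // Java 9+
                    .start();
            int exit = p.waitFor();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (exit != 0) {
                throw new IllegalStateException("java -version exited with " + exit);
            }
            return elapsedMillis;
        } catch (IOException e) {
            throw new IllegalStateException("could not launch " + javaBin, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the JVM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("JVM launch took ~" + timeJvmLaunchMillis() + " ms");
    }
}
```

Multiplying that figure by the number of PDFs gives a rough floor on the time spent purely on start-up when the tabula command is run once per file.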

Building from Source

Clone this repo and run:

mvn clean compile assembly:single 
