tabula-sharp is a library for extracting tables from PDF files. It is a .NET port of tabula-java.
- Supports netstandard2.0, net462, net471, net6.0, net8.0
- No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences from tabula-java:
- Uses PdfPig, and not PdfBox.
- Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
- The `NurminenDetectionAlgorithm` is replaced by `SimpleNurminenDetectionAlgorithm`, because the original requires an image management library.
- Table results might differ because of the way PdfPig builds the bounding boxes of letters.
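
The coordinate difference noted above matters when reusing area coordinates that were measured for tabula-java (top-left origin, y growing downward). A minimal sketch of the conversion follows. `CoordinateConversion` and `FromTopLeft` are hypothetical names and are not part of tabula-sharp or PdfPig, and the sketch assumes the page's lower-left corner sits at (0, 0) with no crop box offset.

```csharp
using System;

// Hypothetical helper, not part of tabula-sharp: converts a rectangle given in
// top-left-origin coordinates (y grows downward) to bottom-left-origin
// coordinates (y grows upward).
public static class CoordinateConversion
{
    // Returns (Left, Bottom, Right, Top) measured from the bottom-left corner.
    // Assumes the page's lower-left corner is at (0, 0).
    public static (double Left, double Bottom, double Right, double Top) FromTopLeft(
        double top, double left, double bottom, double right, double pageHeight)
    {
        if (bottom < top)
        {
            throw new ArgumentException("In top-left coordinates, bottom must be >= top.");
        }

        // Only the vertical axis flips; horizontal positions are unchanged.
        double newBottom = pageHeight - bottom;
        double newTop = pageHeight - top;
        return (left, newBottom, right, newTop);
    }
}

public static class Program
{
    public static void Main()
    {
        // Example page height of 792 points (US Letter).
        var area = CoordinateConversion.FromTopLeft(
            top: 100, left: 50, bottom: 300, right: 400, pageHeight: 792);

        // Bottom = 792 - 300 = 492, Top = 792 - 100 = 692.
        Console.WriteLine($"{area.Left}, {area.Bottom}, {area.Right}, {area.Top}");
    }
}
```

The converted values can then be used to build whatever rectangle type the calling code expects. For pages with a non-zero origin, the offset would need to be added as well.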
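
Stream mode, using `BasicExtractionAlgorithm` with `SimpleNurminenDetectionAlgorithm`:

```csharp
using System.Collections.Generic;
using Tabula;
using Tabula.Detectors;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
    // Extract the first page (page numbers start at 1)
    PageArea page = ObjectExtractor.Extract(document, 1);

    // Detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    IReadOnlyList<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;
}
```

Lattice mode, using `SpreadsheetExtractionAlgorithm`:

```csharp
using System.Collections.Generic;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
    // Extract the first page (page numbers start at 1)
    PageArea page = ObjectExtractor.Extract(document, 1);

    IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
    IReadOnlyList<Table> tables = ea.Extract(page);
    var table = tables[0];
    var rows = table.Rows;
}
```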
