DEV Community

Claudio Fior for Abbrevia

Posted on

Extract data from a PDF

One of our providers give us some data ad PDF and I have to produce a JSON object for further elaborations.

Demo of the report

For the textual information non problem: I used pdftotext to extract the text.

$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -'); 
Enter fullscreen mode Exit fullscreen mode

Then I used regular expressions to extract the data

 $anagrafica=array(); if(preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) { $anagrafica['denominazione']=$aDenominazione[1]; } 
Enter fullscreen mode Exit fullscreen mode

How to extract the data of the semaphores that are images without labels?

I used the linux command pdftohtml

$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml "'.$this->filePath.'" - | grep image'); $tok = strtok($rawImages,"\r\n"); while ($tok !== false) { $oImage = simplexml_load_string($tok); $images[]=$oImage; $tok = strtok("\r\n"); } 
Enter fullscreen mode Exit fullscreen mode

The output of pdftohtml in a xml document for each text box or image.

$rawImages is an array of the xml elements of the images ans I put them as SimpleXmlObjects in $images array.

Than I searched trough the array the images with 77 pixel of width and sort the by the vertical position.

The images are saved in the current directory of the script.

I queried the color of a pixel in a specific position of the image with convert command of ImageMagick library and saved the data in the JSON object.

$color = shell_exec('convert "'.$imagePath.'" -format \'%[pixel:p{100,50}]\' info:- '); switch ($color) { case 'srgb(253,78,83)': $anagrafica[$this::chekcs[$pos]]='red'; break; case 'srgb(123,196,78)': $anagrafica[$this::chekcs[$pos]]='green'; break; case 'srgb(254,211,80)': $anagrafica[$this::chekcs[$pos]]='yellow'; break; }; 
Enter fullscreen mode Exit fullscreen mode

At this point: is there an easy way to do the trick?

Top comments (0)