One of our providers gives us some data as PDF, and I have to produce a JSON object for further processing.
The textual information was no problem: I used pdftotext to extract the text.
$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -');
Then I used regular expressions to extract the data:
$anagrafica = array();
if (preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) {
    $anagrafica['denominazione'] = $aDenominazione[1];
}
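The same pattern can be applied to several labels at once. The following is only a minimal sketch: the helper `extractFields`, the sample text, and the `Indirizzo` label are hypothetical and serve only to illustrate the idea; only `Denominazione` comes from the real document.

```php
<?php
// Hypothetical helper: applies the "label, separator, value" pattern
// to every label in $labels and returns the values found.
function extractFields(string $content, array $labels): array
{
    $fields = array();
    foreach ($labels as $key => $label) {
        // Same regex as above, with the label escaped for safety.
        $pattern = '/^' . preg_quote($label, '/') . '\W*(.*)/m';
        if (preg_match($pattern, $content, $match)) {
            $fields[$key] = trim($match[1]);
        }
    }
    return $fields;
}

// Made-up sample resembling the output of pdftotext -layout.
$sample = "Denominazione:   ACME S.r.l.\nIndirizzo:       Via Roma 1\n";

$fields = extractFields($sample, array(
    'denominazione' => 'Denominazione',
    'indirizzo'     => 'Indirizzo', // hypothetical label
));

// Encode the result as the JSON object needed downstream.
$json = json_encode($fields);
```

Labels that are not found are simply left out of the result, so the caller can detect missing fields with `isset`.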
But how can I extract the data from the traffic-light indicators, which are images without labels?
I used the Linux command pdftohtml:
$images = array();
// Keep only the lines of the XML output that describe images
$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml "'.$this->filePath.'" - | grep image');
$tok = strtok($rawImages, "\r\n");
while ($tok !== false) {
    $oImage = simplexml_load_string($tok);
    $images[] = $oImage;
    $tok = strtok("\r\n");
}
The output of pdftohtml is an XML document with one element for each text box or image.
$rawImages contains the XML elements of the images, one per line, and I put them as SimpleXMLElement objects into the $images array.
Then I searched the array for the images that are 77 pixels wide and sorted them by vertical position.
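The filtering and sorting step could look roughly like this. It is a sketch under assumptions: the function name `selectIndicators` and the sample lines are hypothetical, and it assumes each image element carries `top`, `width` and `src` attributes, as in the XML produced by pdftohtml.

```php
<?php
// Hypothetical sample lines in the shape assumed for pdftohtml -xml output.
$lines = array(
    '<image top="300" left="500" width="77" height="30" src="input-1_3.png"/>',
    '<image top="50" left="20" width="200" height="80" src="input-1_1.png"/>',
    '<image top="120" left="500" width="77" height="30" src="input-1_2.png"/>',
    '<image top="210" left="500" width="77" height="30" src="input-1_4.png"/>',
);
$images = array_map('simplexml_load_string', $lines);

// Keep only the images with the given width, ordered from top to bottom.
function selectIndicators(array $images, int $width = 77): array
{
    $selected = array_filter($images, function ($image) use ($width) {
        return (int) $image['width'] === $width;
    });
    // usort also reindexes the array from 0.
    usort($selected, function ($a, $b) {
        return (int) $a['top'] <=> (int) $b['top'];
    });
    return $selected;
}

$indicators = selectIndicators($images);
```

After sorting, the position of each image in `$indicators` can be used as the index of the corresponding check.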
The images are saved in the current directory of the script.
I queried the color of a pixel at a specific position in each image with the convert command of the ImageMagick suite and saved the data in the JSON object.
$color = shell_exec('convert "'.$imagePath.'" -format \'%[pixel:p{100,50}]\' info:- ');
switch ($color) {
    case 'srgb(253,78,83)':
        $anagrafica[$this::chekcs[$pos]] = 'red';
        break;
    case 'srgb(123,196,78)':
        $anagrafica[$this::chekcs[$pos]] = 'green';
        break;
    case 'srgb(254,211,80)':
        $anagrafica[$this::chekcs[$pos]] = 'yellow';
        break;
}
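The exact string comparison breaks as soon as a color differs by a single unit. A possible variation, which is not what the code above does, is to pick the nearest reference color by squared RGB distance. In this sketch the function `classifyColor` and the tolerance of 30 are hypothetical choices; the three reference colors are the ones from the switch.

```php
<?php
// Reference colors taken from the switch statement above.
const REFERENCE_COLORS = array(
    'red'    => array(253, 78, 83),
    'green'  => array(123, 196, 78),
    'yellow' => array(254, 211, 80),
);

// Returns the name of the nearest reference color, or null when the
// string cannot be parsed or no reference is within $maxDistance.
function classifyColor(string $pixel, int $maxDistance = 30): ?string
{
    if (!preg_match('/^s?rgba?\((\d+),(\d+),(\d+)/', trim($pixel), $m)) {
        return null;
    }
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach (REFERENCE_COLORS as $name => list($r, $g, $b)) {
        // Squared Euclidean distance in RGB space.
        $distance = ((int) $m[1] - $r) ** 2
                  + ((int) $m[2] - $g) ** 2
                  + ((int) $m[3] - $b) ** 2;
        if ($distance < $bestDistance) {
            $best = $name;
            $bestDistance = $distance;
        }
    }
    return $bestDistance <= $maxDistance ** 2 ? $best : null;
}
```

With this approach, `classifyColor($color)` would replace the switch, and a `null` result signals an image that is probably not one of the expected indicators.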
At this point I wonder: is there an easier way to do this?