
Task

You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document.

Approach

Extract the Base64-encoded representation of specific elements, such as images and tables, in the document. For each of these extracted elements, decode the Base64-encoded representation of the element into its original visual representation and then show it.
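The decode step can be sketched on a single element using only the Python standard library. This is a minimal illustration, not real Unstructured output: the sample elements are hand-built stand-ins, and `decode_element_image` is a hypothetical helper name. It assumes the Base64 string sits under `image_base64` in the element's `metadata`, as in the example in the Code section.

```python
import base64
from typing import Optional


def decode_element_image(element: dict) -> Optional[bytes]:
    """Return the decoded bytes for an element, or None if it carries no image."""
    encoded = element.get("metadata", {}).get("image_base64")
    if encoded is None:
        return None
    return base64.b64decode(encoded)


# Hand-built stand-in elements (illustrative only, not real output).
fake_png_header = b"\x89PNG\r\n\x1a\n"
sample_element = {
    "type": "Image",
    "metadata": {
        "image_base64": base64.b64encode(fake_png_header).decode("ascii")
    },
}
text_element = {"type": "NarrativeText", "metadata": {}}

decoded = decode_element_image(sample_element)
skipped = decode_element_image(text_element)
```

Elements without `image_base64` are skipped rather than treated as errors, because only the element types that were requested for extraction are expected to carry the field.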

To run this example

You will need a document of a type that the extract_image_block_types argument supports. See the extract_image_block_types entry in API Parameters. This example uses a PDF file with embedded images and tables.

Code

For the Unstructured Ingest Python library, you can use the standard Python json.load function to load the contents of a JSON file that the library outputs after processing is complete. In this example, the loaded result is a list of elements, and each element is a Python dictionary.
Python
import json, base64, io
from PIL import Image

def get_image_block_types(input_json_file_path: str):
    with open(input_json_file_path, 'r') as file:
        file_elements = json.load(file)

    for element in file_elements:
        if "image_base64" in element["metadata"]:
            # Decode the Base64-encoded representation of the
            # processed "Image" or "Table" element into its original
            # visual representation, and then show it.
            image_data = base64.b64decode(element["metadata"]["image_base64"])
            image = Image.open(io.BytesIO(image_data))
            image.show()

if __name__ == "__main__":
    # Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/pdf/embedded-images-tables.pdf
    # Specify where to get the local file, relative to this .py file.
    get_image_block_types(
        input_json_file_path="local-ingest-output/embedded-images-tables.json"
    )
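On a machine without a display, image.show() may not be practical. As a hedged alternative, the sketch below writes each decoded element to a file instead. The function name `save_image_blocks` and the output directory are hypothetical. The sketch also assumes that an `image_mime_type` entry may accompany `image_base64` in the metadata. If that entry is missing or unrecognized, the file gets a `.bin` extension.

```python
import base64
import json
import mimetypes
from pathlib import Path


def save_image_blocks(input_json_file_path: str, output_dir: str) -> list:
    """Decode every element that carries image_base64 and write it to output_dir."""
    elements = json.loads(Path(input_json_file_path).read_text(encoding="utf-8"))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for index, element in enumerate(elements):
        metadata = element.get("metadata", {})
        if "image_base64" not in metadata:
            continue
        # Assumption: image_mime_type may be present; fall back to .bin if not.
        mime_type = metadata.get("image_mime_type", "")
        extension = mimetypes.guess_extension(mime_type) or ".bin"
        element_type = element.get("type", "element")
        target = out / f"{index:04d}-{element_type}{extension}"
        target.write_bytes(base64.b64decode(metadata["image_base64"]))
        written.append(target)
    return written
```

For example, calling `save_image_blocks("local-ingest-output/embedded-images-tables.json", "decoded-images")` would write one file per extracted element and return the list of paths. The index prefix keeps the files in document order.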