How to get text from layers (OCG)

I created a PDF with text blocks associated with one layer (QMARKS). When I print the layers, I can see it:

ocgs = doc.get_ocgs()
for xref in ocgs:
    obj = doc.xref_object(xref)
    print("obj:", obj, xref)

obj: <<
/Type /OCG
/Name (QMARKS)

But how do I associate it with the text inside? If I print(page.get_text("rawjson")), I don't see anything about OCG. I have another program that returns JSON from a PDF, and it puts a mark on each character that belongs to a layer:

 { "bbox":[ 161.47, 140.16, 163.61, 146.83 ], "x":161.47, "y":146.83, "md5":"516196545100a13c47de3a0b509a7581", "c":"i", "mc":[ "OC" ], }, 

My goal is to find some text in the QMARKS layer and add a link on it to some page. So how do I get the coordinates of the layer area? Then I could search for the wanted text in that area.

Here is a PDF: test.pdf (4.9 KB)

Hi @Linas - welcome and thanks for this complex issue. Agreed - we don’t get the OC info exposed in the dictionaries. So I came up with a method to try and isolate this info.

The idea is to call get_text() with everything on, then switch off the OC group, call get_text() again, and work out the difference. What you are left with should be just the info for the OC layer, I think.

Here was my code:

from pprint import pprint

import pymupdf

doc = pymupdf.open("test.pdf")
resultA = doc[0].get_text("dict")  # everything on

layer_configs = doc.layer_ui_configs()
pprint(layer_configs)

# now turn off QMARKS which has number id 0
doc.set_layer_ui_config(0, 2)
layer_configs = doc.layer_ui_configs()
pprint(layer_configs)

resultB = doc[0].get_text("dict")  # QMARKS layer off

# we want to work out the difference, i.e. resultB-resultA
def deep_dict_diff(dict1, dict2, path=""):
    """Recursively find differences in nested dictionaries"""
    differences = []
    all_keys = set(dict1.keys()) | set(dict2.keys())
    for key in all_keys:
        current_path = f"{path}.{key}" if path else key
        if key not in dict1:
            differences.append(f"Added: {current_path} = {dict2[key]}")
        elif key not in dict2:
            differences.append(f"Removed: {current_path} = {dict1[key]}")
        elif isinstance(dict1[key], dict) and isinstance(dict2[key], dict):
            # Recursively compare nested dictionaries
            differences.extend(deep_dict_diff(dict1[key], dict2[key], current_path))
        elif dict1[key] != dict2[key]:
            differences.append(f"Changed: {current_path}")
            differences.append(f" Old: {dict1[key]}")
            differences.append(f" New: {dict2[key]}")
    return differences

diffs = deep_dict_diff(resultA, resultB)
print("*********")
pprint(diffs)
print("**********")

# Figure out how to get a new dictionary which reflects just the QMARKS data
def merge_unique(dict1, dict2):
    """Merge two dicts, keeping only unique key-value pairs from each"""
    # Items only in dict1 (not in dict2 at all, or different value)
    unique_from_dict1 = {k: v for k, v in dict1.items() if k not in dict2 or dict2[k] != v}
    # Items only in dict2 (not in dict1 at all, or different value)
    unique_from_dict2 = {k: v for k, v in dict2.items() if k not in dict1 or dict1[k] != v}
    # Combine them
    result = {**unique_from_dict1, **unique_from_dict2}
    return result

merged = merge_unique(resultB, resultA)
print(merged)
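A narrower variant of the same on/off idea, as a minimal sketch only: instead of diffing whole dictionaries, flatten both extractions into spans and keep the spans that disappear once the layer is switched off. The helper names (iter_spans, span_key, spans_hidden_by_layer, extract_layer_spans) are made up for this illustration. Matching spans on text plus rounded bbox is an assumption; it presumes that hiding a layer does not move or re-split the remaining text.

```python
def iter_spans(page_dict):
    """Yield every text span from a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            yield from line["spans"]


def span_key(span, ndigits=2):
    """Identify a span by its text and rounded bbox (an assumed matching rule)."""
    return (span["text"], tuple(round(v, ndigits) for v in span["bbox"]))


def spans_hidden_by_layer(all_on, layer_off):
    """Return the spans present in `all_on` but missing from `layer_off`."""
    still_visible = {span_key(s) for s in iter_spans(layer_off)}
    return [s for s in iter_spans(all_on) if span_key(s) not in still_visible]


def extract_layer_spans(path, page_number=0, layer_number=0):
    """Extract the spans of one page that vanish when a UI layer is set OFF."""
    import pymupdf  # imported lazily so the helpers above need no PyMuPDF

    doc = pymupdf.open(path)
    all_on = doc[page_number].get_text("dict")
    doc.set_layer_ui_config(layer_number, 2)  # action 2 = set OFF
    layer_off = doc[page_number].get_text("dict")
    doc.set_layer_ui_config(layer_number, 0)  # action 0 = set ON again
    return spans_hidden_by_layer(all_on, layer_off)
```

With the sample file, extract_layer_spans("test.pdf") would be the call to try; each returned span keeps its "bbox", so the rectangles can be reused for searching or linking.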

Thanks @Jamie_Lemon, I understand the concept. It looks like over-engineering for now. I hope that in the future there will be dedicated functions to access layer data, or simply a key in the get_text() data that points to the corresponding OCG id, like 'ocg': <ocg object number>:

{'bbox': (148.71200561523438, 135.4921417236328, 275.0582275390625, 149.60919189453125),
 'ocg': <ocg object number>,
 'dir': (1.0, 0.0),
 'spans': [{'alpha': 255,
            'ascender': 1.1269999742507935,
            'bbox': (148.71200561523438, 135.4921417236328, 275.0582275390625, 149.60919189453125),
            'bidi': 0,
            'char_flags': 16,
            'color': 0,
            'descender': -0.28999999165534973,
            'flags': 4,
            'font': 'LMRoman10-Regular',
            'origin': (148.71200561523438, 146.72003173828125),
            'size': 9.962639808654785,
            'text': 'This is a Q1 mark in a layer.'}],
 'wmode': 0},

@Linas You know, I agree with you! What I gave is probably over-engineered, but nonetheless it is maybe some kind of solution.
I totally agree with you: if we could get OCG layers defined in the structured text as you outline in your sample, that would be the way to go here. @HaraldLieder What do you think about the feasibility of this one?

Content extraction and page rendering (Page.get_pixmap()) will react to the currently active OC configuration.
This means that at any time we will only get (or see) those parts of the page content which are either completely independent from Optional Content or have an OCG / OCMD that evaluates to ON.
Changing the OC configuration and then repeating extraction / rendering will always yield the appropriate output.


A side note:
Optional visibility is controlled by so-called Optional Content Group (OCG) objects or by boolean expressions of multiple OCGs. Such a boolean expression is called an Optional Content Membership Dictionary (OCMD).
OCGs and OCMDs can be assigned to text particles, images, PDF XObjects, annotations and vector graphics.
The expression “layer” usually (and imprecisely) is used for the name value of some OCG.


PyMuPDF returns the layer value for vector graphics as part of page.get_drawings().
For annotations, the oc attribute contains the xref of an OCG or OCMD … or 0 – not a layer name. Same is true for images and XObjects.
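As a small illustration of the vector-graphics case, here is a sketch that groups the paths of a page by their layer value. The names group_by_layer and drawings_by_layer are hypothetical helpers, not PyMuPDF functions. The sketch assumes that each dictionary returned by page.get_drawings() has a "rect" entry and a "layer" entry, and that the latter is empty or None for content outside any layer.

```python
from collections import defaultdict


def group_by_layer(drawings):
    """Map layer name -> list of path rectangles; "" collects unlayered paths."""
    groups = defaultdict(list)
    for path in drawings:
        layer = path.get("layer") or ""  # treat None / missing as "no layer"
        groups[layer].append(tuple(path["rect"]))
    return dict(groups)


def drawings_by_layer(path_to_pdf, page_number=0):
    """Open a PDF and group the vector graphics of one page by layer name."""
    import pymupdf  # imported lazily so group_by_layer needs no PyMuPDF

    doc = pymupdf.open(path_to_pdf)
    return group_by_layer(doc[page_number].get_drawings())
```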

It does not return this value for Page.get_text() and there currently are no such plans. We are busy here with implementing improvements that have a much higher priority.

With more effort, there are still ways to get the desired information. The method Page.get_bboxlog(layers=True) (see the Functions section of the PyMuPDF documentation) returns tuples (bboxtype, bbox, layer) in the same sequence as the corresponding display commands occur in the page's appearance.
Matching a text span against those bboxes (a containment check) should give you the desired result.

Here is a simple implementation of my above idea:

import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]

# extract bboxlog with layer information
bboxlog = [(b[0], pymupdf.Rect(b[1]), b[2]) for b in page.get_bboxlog(layers=True)]

# extract text with full metadata
blocks = page.get_text("dict", flags=pymupdf.TEXT_ACCURATE_BBOXES)["blocks"]

# make span list enriched with layer info
spans = [
    s | {"layer": [b[2] for b in bboxlog if b[1].contains(s["bbox"])]}
    for b in blocks
    for l in b["lines"]
    for s in l["spans"]
]

# print enriched span info
for s in spans:
    print(f'{s["text"]=} is in layers {s["layer"]}')

Here is the result:

s["text"]='Hello, this is normal text.' is in layers ['']
s["text"]='This is a Q1 mark in a layer.' is in layers ['QMARKS']
s["text"]='This is a Q2 mark in the same layer.' is in layers ['QMARKS']
s["text"]='1' is in layers ['']
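To come back to the original goal (linking text found in the QMARKS layer), the enriched spans could drive link creation roughly as follows. This is a sketch under assumptions: find_layer_rects and add_goto_links are illustrative names, `spans` is a list like the one built above (each span has "text", "bbox" and a "layer" list), and the link covers the whole matching span rather than only the matched words.

```python
def find_layer_rects(spans, layer, needle):
    """Return the bboxes of spans that belong to `layer` and contain `needle`."""
    return [
        tuple(s["bbox"])
        for s in spans
        if layer in s.get("layer", []) and needle in s["text"]
    ]


def add_goto_links(page, rects, target_page):
    """Insert an internal link over each rect, jumping to `target_page` (0-based)."""
    import pymupdf  # imported lazily; find_layer_rects itself needs no PyMuPDF

    for rect in rects:
        page.insert_link(
            {
                "kind": pymupdf.LINK_GOTO,
                "from": pymupdf.Rect(rect),  # clickable area on this page
                "page": target_page,
                "to": pymupdf.Point(0, 0),  # position on the target page
            }
        )
```

A call such as add_goto_links(page, find_layer_rects(spans, "QMARKS", "Q1"), target_page) followed by doc.save(...) would then write the links; the target page number is left open here because it depends on the document.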