IPIP-499: UnixFS CID Profiles #499
Let's make the fanout match the max links from files and rename the profile to `-wide`; this will make it easier to discuss in ipfs/specs#499
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in: ipfs/kubo#10774
| Thank you for kicking this off, and filling initial state. I've incorporated specific "dag width" settings for Next:
Co-authored-by: Christian Paul <info@jaller.de>
src/ipips/ipip-0499.md Outdated
| 1. UnixFS DAG layout (e.g. balanced, trickle etc...) | ||
| 1. UnixFS DAG width (max number of links per `File` node) | ||
| 1. `HAMTDirectory` bitwidth, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). | ||
| 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links |
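
The bitwidth-to-fanout relationship in the list above is plain arithmetic. A minimal sketch, where the helper name `hamt_fanout` is hypothetical and not taken from any implementation:

```python
def hamt_fanout(bitwidth: int) -> int:
    """Number of child slots per HAMT shard for a given bitwidth (2 ** bitwidth)."""
    if bitwidth < 1:
        raise ValueError("bitwidth must be a positive integer")
    return 1 << bitwidth


# Print the fanout for a few bitwidths; 8 bits is the default quoted above.
for bits in (5, 8, 10):
    print(bits, hamt_fanout(bits))
```

Under this relationship, a fanout of 1024 corresponds to a bitwidth of 10.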
| 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links | |
| 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links. We do not include details about the estimation algorithm as we do not encourage implementations to support it. |
Bit odd to discourage, when both of the most popular implementations, in Go and JS, use a size-based heuristic - #499 (comment)
Unsure how to handle this. Perhaps clarify the heuristic is implementation-specific, and when deterministic behavior is expected, a specific heuristic should be used?
I don't think we should be estimating the block size as it's trivial to calculate it exactly. Can we not just define this (and punt to the spec for the details) to make it less hand-wavey?
| 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links | |
| 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on the final size of the serialized form of the [PBNode protobuf message](https://specs.ipfs.tech/unixfs/#dag-pb-node) that represents the directory. |
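
To illustrate the point that the block size can be calculated exactly rather than estimated, here is a minimal sketch of the serialized PBNode size computed from the dag-pb protobuf layout (PBLink: `Hash` = field 1, `Name` = field 2, `Tsize` = field 3; PBNode: `Data` = field 1, `Links` = field 2). The `Link` type and all function names are hypothetical, and the sketch assumes every link carries a `Name` and a `Tsize`, as directory entries normally do.

```python
from typing import Iterable, NamedTuple


class Link(NamedTuple):
    name: str    # entry name within the directory
    cid: bytes   # binary CID of the child
    tsize: int   # cumulative size of the child DAG


def varint_len(n: int) -> int:
    """Bytes needed to encode n as an unsigned protobuf varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def bytes_field_len(payload_len: int) -> int:
    """Length-delimited field: 1-byte tag + length varint + payload."""
    return 1 + varint_len(payload_len) + payload_len


def pblink_len(link: Link) -> int:
    """Size of one PBLink body: Hash, Name, and Tsize fields."""
    return (
        bytes_field_len(len(link.cid))
        + bytes_field_len(len(link.name.encode("utf-8")))
        + 1 + varint_len(link.tsize)  # 1-byte tag + varint value
    )


def pbnode_len(links: Iterable[Link], data_len: int) -> int:
    """Serialized size of a PBNode with the given links and Data length."""
    total = sum(bytes_field_len(pblink_len(link)) for link in links)
    return total + bytes_field_len(data_len)


# Assumption: the UnixFS Data payload of a plain directory is 2 bytes.
one_entry = [Link("a.txt", bytes(36), 100)]  # 36-byte CIDv1 with sha2-256
print(pbnode_len(one_entry, data_len=2))
```

Because every term is derived from lengths that are already known when the directory is built, no estimation step is needed; whether an implementation computes it this way is up to that implementation.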
| Hey, I'd love to be able to reference this, even if it's in "draft" form, could we just merge it and continue to iterate on top of it to get it right? |
Fixed outdated references, consistent profile names, streamlined Summary and Motivation sections.
| I made a few changes/fixes, aiming to land this early next week.
Open questions:
| | ||
| As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle DAG nodes needed to verify the CID. | ||
| | ||
| ## Test fixtures |
Just noting this is (imo) a blocker.
We did not merge the UnixFS spec until we had a sensible set of fixtures that people could use as reference.
The spec may be incomplete, but a fixture will let people reverse-engineer any details, and then PR improvements to the spec.
Without fixtures for each UnixFS node type, we risk unknown unknowns silently impacting the final CID (e.g. because we did not know that someone may decide to place leaves one level sooner as an "optimization" and someone else always at the bottom, for "formal consistency")
Tracking this in ipfs/kubo#11071
Thanks!
- I will implement `kubo-*` profiles as part of 0.40 and test fixtures will be part of that work.
- Then we will be able to link to them from the spec, like we did in https://specs.ipfs.tech/unixfs/#appendix-test-vectors
Co-authored-by: Rod Vagg <rod@vagg.org>
| Just synced with @lidel. He wants to ship this with test fixtures in place (tracked in kubo/issues/11071). In the meantime, we don't anticipate changes to the profiles themselves, so you can reference this PR. |
Co-authored-by: Rod Vagg <rod@vagg.org>
| Great work, glad to see this! Couple notes/questions:
- add chunking algorithm parameter to both tables (fixed-size)
- add hidden entities row to legacy profiles table
- ensures both unixfs-2025 and legacy tables cover same parameters

- rename kubo-legacy-2015 to kubo-legacy-2025
- clarify (v0.39 default) instead of (kubo default)
- fix leaves value: dag-pb (UnixFSRawLeaves=False in legacy-cid-v0)

clarify empty directories and hidden entities handling with precise terminology based on kubo v0.39, helia, and storacha implementations:
- `included`: always in DAG, no option to exclude (kubo/helia empty dirs)
- `excluded`: never in DAG, no option to include (storacha empty dirs)
- `opt-in`: excluded by default, flag to include (all hidden entities)
- `opt-out`: included by default, flag to exclude

add terminology note to explain these terms
add "Based on" row with package/tool versions and kubo profile names
- unixfs-2025: mark threshold as TODO, prefer Helia's block size approach
- unixfs-2025: note kubo needs opt-out flag for empty directories
- legacy profiles: add estimation method to kubo profiles
- parameters section: add backticks, clarify threshold estimation methods

- add Symlinks parameter to UnixFS parameters list
- add Symlinks row to unixfs-2025 (TODO) and legacy profiles tables
- kubo: preserved, helia/storacha: followed, dasl: not specified
- add terminology for preserved/followed with UnixFS spec reference
- clarify kubo --dereference-args behavior
Quick update: I've pushed several commits addressing feedback and gaps in the document:
Resolved / Research done
- Documented symlink behavior as suggested by @icidasset
  - Only Kubo 0.39 preserves symlinks; everything else dereferences them on the fly by default, turning each symlink into the real file or directory it pointed at
- Added Chunking algorithm row to both profile tables for completeness
- Fixed kubo-legacy-2025 profile: corrected Leaves from raw to dag-pb (verified against kubo v0.39 legacy-cid-v0 profile where UnixFSRawLeaves=false)
- Documented filtering behavior with clear terminology:
  - included: always in DAG (no option to exclude)
  - excluded: always excluded (no option to include)
  - opt-in: excluded by default, flag to include (e.g., --hidden)
  - opt-out: included by default, flag to exclude
- Added Based on row with implementation versions and kubo profile names (legacy-cid-v0, test-cid-v1, test-cid-v1-wide)
- Clarified HAMTDirectory threshold estimation methods in the parameters section: link count (naive), PBNode.Links size (name + CID), or full dag-pb block size (most accurate)
- Noted that legacy table includes non-UnixFS implementations (DASL) in Summary section
- Added estimation method suffix (est:links[name+cid]) to kubo profiles in legacy table
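
The two cheaper threshold estimation methods named in the list above (link count, and `PBNode.Links` size as name + CID) can be sketched as follows. All names here are hypothetical, and both the 256 KiB threshold and the strict `>` comparison are assumptions for illustration, not values confirmed by this thread.

```python
from typing import List, Tuple

Entry = Tuple[str, bytes]  # (entry name, binary CID)

# Assumption: illustrative threshold only; real profiles define their own.
THRESHOLD_BYTES = 256 * 1024


def estimate_by_link_count(entries: List[Entry]) -> int:
    """Naive method: the number of directory entries."""
    return len(entries)


def estimate_by_links_size(entries: List[Entry]) -> int:
    """est:links[name+cid] style: sum of name bytes and CID bytes per entry."""
    return sum(len(name.encode("utf-8")) + len(cid) for name, cid in entries)


def should_shard(entries: List[Entry], threshold: int = THRESHOLD_BYTES) -> bool:
    """Switch to HAMTDirectory once the size estimate exceeds the threshold.

    Assumption: strict '>' is used here; implementations may use '>='.
    """
    return estimate_by_links_size(entries) > threshold


cid = bytes(36)  # placeholder 36-byte CID
short_names = [(f"{i:04d}", cid) for i in range(1000)]
long_names = [(f"{i:04d}" + "x" * 200, cid) for i in range(1000)]

# Same link count, very different size estimates.
print(estimate_by_link_count(short_names), estimate_by_links_size(short_names))
print(estimate_by_link_count(long_names), estimate_by_links_size(long_names))
```

The two directories have identical link counts but different size estimates, which is why the choice of method can change where the switch to `HAMTDirectory` happens, and with it the resulting CID.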
Remaining TODOs in unixfs-2025
| Parameter | Status |
|---|---|
| HAMTDirectory threshold | TODO - fix kubo: likely based on full block size estimation (Helia approach) |
| Empty directories | TODO - use kubo? needs opt-out flag + Import.* |
| Hidden entities | TODO - use kubo? needs opt-in flag + Import.* |
| Symlinks | TODO - use kubo? needs flag + Import.* for controlling whether all symlinks in the imported directory tree are preserved or dereferenced |
| Test fixtures | TODO - reuse kubo: will reuse once kubo has them for *-2025 profiles |
Other:
Implementation Plan (Kubo 0.40, ETA 2026 Q1)
To finalize this IPIP, Kubo needs to support additional Import.* configuration flags for:
- Empty directories: opt-out flag to exclude them from DAG
- Hidden files: already has `--hidden`, just need to wire it up from config
- HAMTDirectory threshold: configurable to support both legacy estimation (name + CID size) and Helia-style full block size calculation
Test fixtures will likely be included in the same Kubo PR that adds these missing features.
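
The behavior these flags would control follows the included / excluded / opt-in / opt-out terminology used in this PR. A minimal sketch of that decision logic, where the `Policy` enum and `in_dag` function are hypothetical names and do not correspond to any Kubo API or config key:

```python
from enum import Enum


class Policy(Enum):
    INCLUDED = "included"  # always in DAG, no option to exclude
    EXCLUDED = "excluded"  # never in DAG, no option to include
    OPT_IN = "opt-in"      # excluded by default, flag to include
    OPT_OUT = "opt-out"    # included by default, flag to exclude


def in_dag(policy: Policy, flag_set: bool = False) -> bool:
    """Return True if the entity ends up in the DAG under the given policy."""
    if policy is Policy.INCLUDED:
        return True
    if policy is Policy.EXCLUDED:
        return False
    if policy is Policy.OPT_IN:
        return flag_set
    return not flag_set  # Policy.OPT_OUT


# Hidden entities are opt-in: absent unless the user passes a flag.
print(in_dag(Policy.OPT_IN), in_dag(Policy.OPT_IN, flag_set=True))
```

Since each setting changes which entries are linked from a directory node, two imports of the same tree only produce the same CID when the policy and the flag state match.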
I also think we may replace two kubo-2025 and kubo-2025-wide profiles with a single one, that makes decision on what remains narrow and what is wide, but will update once Kubo changes land. (now that we have convention of doing IPIPs with profiles, we can always course-correct in `-202
recommend full serialized PBNode size, link to dag-pb spec ref: ipfs#499 (comment)
- rename to UnixFS CID Profiles
- add lidel as editor
- add thanks section with PR reviewers
Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.
This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
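
The claim that the same file can yield different CIDs depending on settings can be shown with a toy model. The sketch below is not a CID or UnixFS computation: it only hashes fixed-size chunks and then hashes the concatenated leaf digests, which is enough to show that changing the chunk size changes the root. The function names are hypothetical.

```python
import hashlib
from typing import List


def fixed_size_chunks(data: bytes, size: int) -> List[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    if size < 1:
        raise ValueError("chunk size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root_digest(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a DAG root: sha256 over the concatenated leaf digests."""
    leaves = [hashlib.sha256(c).digest() for c in fixed_size_chunks(data, chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 8  # 2048 bytes of sample content

root_small = toy_root_digest(data, chunk_size=256)
root_large = toy_root_digest(data, chunk_size=1024)

print(root_small == root_large)                          # different settings
print(root_small == toy_root_digest(data, chunk_size=256))  # same settings
```

Pinning every such parameter in a named profile is what lets two tools reproduce the same root from the same bytes.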