Prepared for the CNI Fall Meeting 2012, Washington, D.C., December 2012. Auditing Distributed Digital Preservation Networks. Micah Altman, Director of Research, MIT Libraries; Non-Resident Senior Fellow, The Brookings Institution. Jonathan Crabtree, Assistant Director of Computing and Archival Research, H.W. Odum Institute for Research in Social Science, UNC.
Collaborators* • Nancy McGovern • Tom Lipkis & the LOCKSS Team • Data-PASS Partners – ICPSR – Roper Center – NARA – Henry A. Murray Archive • Dataverse Network Team @ IQSS. Research support: thanks to the Library of Congress, the National Science Foundation, IMLS, the Sloan Foundation, the Harvard University Library, the Institute for Quantitative Social Science, and the Massachusetts Institute of Technology. * And co-conspirators
Related Work. Reprints available from: micahaltman.com • M. Altman, J. Crabtree, “Using the SafeArchive System: TRAC-Based Auditing of LOCKSS”, Proceedings of Archiving 2011, Society for Imaging Science and Technology. • Altman, M., Beecher, B., & Crabtree, J. (2009). A Prototype Platform for Policy-Based Archival Replication. Against the Grain, 21(2), 44-47.
Preview • Why? … distributed digital preservation? … audit? • SafeArchive: Automating Auditing • Theory vs. Practice – Round 0: Calibration – Round 1: Self-Audit – Round 2: Self-Compliance (almost) – Round 3: Auditing Other Networks • Lessons learned: practice & theory
Why distributed digital preservation?
Slightly Long Answer: Things Go Wrong. Failure sources pictured: Physical & Hardware; Software; Media; Insider & External Attacks; Organizational Failure; Curatorial Error.
Potential Nexuses for Preservation Failure • Technical – Media failure: storage conditions, media characteristics – Format obsolescence – Preservation infrastructure software failure – Storage infrastructure software failure – Storage infrastructure hardware failure • External Threats to Institutions – Third-party attacks – Institutional funding – Change in legal regimes • Quis custodiet ipsos custodes? – Unintentional curatorial modification – Loss of institutional knowledge & skills – Intentional curatorial de-accessioning – Change in institutional mission. Source: Reich & Rosenthal 2005
The Problem. “Preservation was once an obscure backroom operation of interest chiefly to conservators and archivists: it is now widely recognized as one of the most important elements of a functional and enduring cyberinfrastructure.” – [Unsworth et al., 2006] • Libraries, archives and museums hold digital assets they wish to preserve, many unique • Many of these assets are not replicated at all • Even when institutions keep multiple backups offsite, many single points of failure remain.
Why audit?
Short Answer: Why the heck not? “Don't believe in anything you hear, and only half of what you see” - Lou Reed. “Trust, but verify.” - Ronald Reagan
Full Answer: It's our responsibility
OAIS Model Responsibilities • Accept appropriate information from Information Producers • Obtain sufficient control of the information to ensure long-term preservation • Determine which groups should become the Designated Community (DC) able to understand the information • Ensure that the preserved information is independently understandable to the DC • Ensure that the information can be preserved against all reasonable contingencies • Ensure that the information can be disseminated as authenticated copies of the original or as traceable back to the original • Make the preserved data available to the DC
OAIS Basic Implied Trust Model • Organization is axiomatically trusted to identify designated communities • Organization is engineered with the goal of: – Collecting appropriate authentic documents – Reliably delivering authentic documents, in understandable form, at a future time • Success depends upon: – Reliability of storage systems & services: e.g., LOCKSS network, Amazon Glacier – Reliability of organizations: MetaArchive, Data-PASS, Digital Preservation Network – Document contents and properties: formats, metadata, semantics, provenance, authenticity
Enhancing Reliability through Trust Engineering
• Incentives: rewards, penalties; incentive-compatible mechanisms
• Modeling and analysis: statistical quality control & reliability estimation, threat-modeling and vulnerability assessment
• Portfolio theory: diversification (financial, legal, technical, institutional …); hedging
• Over-engineering approaches: safety margin, redundancy
• Informational approaches: transparency (release of information permitting direct evaluation of compliance); common knowledge; crypto: signatures, fingerprints, non-repudiation
• Social engineering: recognized practices; shared norms; social evidence; reduce provocations; remove excuses
• Regulatory approaches: disclosure; review; certification; audits; regulations & penalties
• Security engineering: increase effort for attacker (harden target, reduce vulnerability, increase technical/procedural controls, remove/conceal targets); increase risk to attacker (surveillance, detection, likelihood of response); reduce reward (deny benefits, disrupt markets, identify property)
Audit [aw-dit]: An independent evaluation of records and activities to assess a system of controls. Fixity mitigates risk only if used for auditing.
Functions of Storage Auditing • Detect corruption/deletion of content • Verify compliance with storage/replication policies • Prompt repair actions
Bit-Level Audit Design Choices • Audit regularity and coverage: on-demand (manual); on object access; on event; randomized sample; scheduled/comprehensive • Fixity check & comparison algorithms • Auditing scope: integrity of object; integrity of collection; integrity of network; policy compliance; public/transparent auditing • Trust model • Threat model
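To make these design choices concrete, here is a minimal sketch, in Python, of a scheduled, collection-scope fixity audit: it recomputes SHA-256 digests and compares them against a previously recorded manifest. This illustrates the general idea only and is not SafeArchive or LOCKSS code; the manifest format, paths, and report structure are assumptions.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large objects never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_collection(collection_root: str, manifest_path: str) -> dict:
    """Compare on-disk digests against a stored manifest of {relative_path: sha256}."""
    manifest = json.loads(Path(manifest_path).read_text())
    root = Path(collection_root)
    report = {"ok": 0, "missing": [], "corrupt": [], "unexpected": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected:
            report["corrupt"].append(rel_path)
        else:
            report["ok"] += 1
    known = set(manifest)
    for item in root.rglob("*"):
        if item.is_file() and str(item.relative_to(root)) not in known:
            report["unexpected"].append(str(item.relative_to(root)))
    return report

# Hypothetical usage, e.g. run nightly from cron for scheduled, comprehensive coverage:
# print(audit_collection("/preservation/collection-a", "collection-a.manifest.json"))
```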
Repair. Auditing mitigates risk only if used for repair. Key design elements: • Repair granularity • Repair trust model • Repair latency – detection to start of repair – repair duration • Repair algorithm
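As a sketch of how audit output can drive repair at object granularity, the example below plans repairs with a simple majority-agreement rule. This is an illustration only, not the LOCKSS repair protocol; the digests, replica names, and quorum threshold are assumptions.

```python
from collections import Counter
from typing import Dict, List

def plan_repairs(object_digests: Dict[str, Dict[str, str]], quorum: int = 2) -> Dict[str, dict]:
    """For each object, find the digest a majority of replicas agree on and flag dissenters.

    object_digests maps object_id -> {replica_id: digest}. Simple majority-agreement
    heuristic for illustration, not the LOCKSS polling protocol.
    """
    plan: Dict[str, dict] = {}
    for obj_id, by_replica in object_digests.items():
        winning_digest, votes = Counter(by_replica.values()).most_common(1)[0]
        if votes < quorum:
            # No trustworthy source within this network; escalate instead of repairing.
            plan[obj_id] = {"status": "no-quorum", "repair_on": [], "repair_from": []}
            continue
        dissenters: List[str] = [r for r, d in by_replica.items() if d != winning_digest]
        sources: List[str] = [r for r, d in by_replica.items() if d == winning_digest]
        plan[obj_id] = {"status": "repair" if dissenters else "ok",
                        "repair_on": dissenters, "repair_from": sources}
    return plan

# Hypothetical digests: replica "z" disagrees and would be repaired from "x" or "y".
print(plan_repairs({"study-001.xml": {"x": "aa11", "y": "aa11", "z": "ff00"}}))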
Summary of Current Automated Preservation Auditing Strategies • LOCKSS: automated; decentralized (peer-to-peer); tamper-resistant auditing & repair; for collection integrity • iRODS: automated centralized/federated auditing for collection integrity; micro-policies • DuraCloud: automated; centralized auditing; for file integrity (manual repair by DuraSpace staff available as a commercial service if using multiple cloud providers) • Digital Preservation Network: in development… • SafeArchive: automated; independent; multi-centered; auditing, repair and provisioning of existing LOCKSS storage networks; for collection integrity and high-level policy (e.g., TRAC) compliance
LOCKSS Auditing & Repair: decentralized, peer-to-peer, tamper-resistant replication & repair • Regularity: scheduled • Algorithms: bespoke, peer-reviewed, tamper-resistant • Scope: collection integrity; collection repair • Trust model: publisher is the canonical source of content; changed content is treated as new; replication peers are untrusted • Main threat models: media failure; physical failure; curatorial error; external attack; insider threats; organizational failure • Key auditing limitations: correlated software failure; lack of policy auditing and public/transparent auditing
SafeArchive Auditing & Repair: TRAC-aligned policy auditing as an overlay network • Regularity: scheduled; manual • Fixity algorithms: relies on the underlying replication system • Scope: collection integrity; network integrity; network repair; high-level (e.g., TRAC) policy auditing • Trust model: external auditor, with permissions to collect metadata/log information from the replication network; replication network is untrusted • Main threat models: software failure; policy implementation failure (curatorial error; insider threat); organizational failure; media/physical failure through the underlying replication system • Key auditing limitations: relies on the underlying replication system, (now) LOCKSS, for fixity check and repair
SafeArchive: TRAC-Based Auditing & Management of Distributed Digital Preservation. Facilitating collaborative replication and preservation with technology… • Collaborators declare explicit non-uniform resource commitments • Policy records commitments, storage network properties • Storage layer provides replication, integrity, freshness, versioning • SafeArchive software provides monitoring, auditing, and provisioning • Content is harvested through HTTP (LOCKSS) or OAI-PMH • Integration of LOCKSS, the Dataverse Network, and TRAC
SafeArchive: Schematizing Policy and Behavior. Policy: “The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies.” → Schematization → Behavior (Operationalization)
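One way to read "schematization" is that the quoted prose requirement becomes machine-checkable fields, and "operationalization" becomes a test over observed replica state. The sketch below assumes a hypothetical schema and host names; it is not the SafeArchive schema itself, and the thresholds echo the Data-PASS goal (>=3 verified replicas, >=2 regions) stated later in the deck.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CollectionPolicy:
    collection_id: str
    min_replicas: int   # e.g. >= 3 verified replicas
    min_regions: int    # e.g. >= 2 distinct regions

@dataclass
class ReplicaState:
    host: str           # hypothetical host names below
    region: str
    verified: bool      # did the latest audit confirm this copy?

def compliant(policy: CollectionPolicy, replicas: List[ReplicaState]) -> bool:
    """Operationalization: count verified copies and the distinct regions holding them."""
    verified = [r for r in replicas if r.verified]
    return (len(verified) >= policy.min_replicas
            and len({r.region for r in verified}) >= policy.min_regions)

policy = CollectionPolicy("example-collection", min_replicas=3, min_regions=2)
observed = [ReplicaState("lockss-a.example.edu", "southeast", True),
            ReplicaState("lockss-b.example.edu", "northeast", True),
            ReplicaState("lockss-c.example.edu", "midwest", True)]
print(compliant(policy, observed))  # True for this observed state
```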
Adding High-Level Policy to LOCKSS • LOCKSS (Lots of Copies Keep Stuff Safe) – Widely used in the library community – Self-contained OSS replication system, low maintenance, inexpensive – Harvests resources via web-crawling, OAI-PMH, database queries, … – Maintains copies through a secure p2p protocol – Zero trust & self-repairing • What does SafeArchive add? – Auditing – easily monitor the number of copies of content in the network – Provisioning – ensure sufficient copies and distribution – Collaboration – coordinate across partners, monitor resource commitments – Provide restoration guarantees – Integrate with the Dataverse Network digital repository
Design Requirements. SafeArchive is a targeted vertical slice of functionality through the policy stack…
• Policy driven – institutional policy creates formal replication commitments – documents and supports TRAC/ISO policies – allows asymmetric commitments – … record storage commitments – … document all TRAC criteria – … demonstrate policy compliance
• Schema-based auditing used to – … verify collection replication – … storage commitments – … size of holdings being replicated – … distribution of holdings over time
• Provide restoration guarantees – to owning archive – to replication hosts
• Limited trust – no superuser – partners trusted to hold the unencrypted content of others (reinforced with legal agreements) – at least one system trusted to read status of participating systems – at least one system to initiate new harvesting on participating systems – no deletion/modification of objects stored on another system
SafeArchive Components
SafeArchive in Action: safearchive.org
Theory vs. Practice. Round 0: Setting up the Data-PASS PLN. “Looks ok to me” - PHB Motto
THEORY (Round 0): Start → Expose content (through OAI+DDI+HTTP) and install LOCKSS (on 7 servers) → Harvest content (through OAI plugin) → Set up PLN configurations → LOCKSS magic → Done
Application: Data-PASS Partnership • Data-PASS partners collaborate to – identify and promote good archival practices, – seek out at-risk research data, – build preservation infrastructure, – and mutually safeguard collections • Data-PASS collections – 5 collections – updated ~daily – research data as content – 25,000+ studies – 600,000+ files – <10 TB – Goal: >=3 verified replicas per collection, >=2 regions
Practice (Round 0) • OAI plugin extensions required for: – non-DC metadata – large metadata – alternate authentication method – support for OAI sets – non-fatal error handling • OAI provider (Dataverse) tuning: – performance handling for delivery – performance handling for errors • PLN configuration required: – stabilization around LOCKSS versions – coordination around plugin repository – coordination around collection definition • Dataverse Network extensions: – generate LOCKSS manifest pages – license harmonization – LOCKSS export control by archive curator (Round 0 theory flowchart shown alongside)
Results (Round 0) • Remaining issues – none known • Outcomes – LOCKSS OAI plugin extensions (later integrated into LOCKSS core) – Dataverse Network performance tuning – Dataverse Network extensions
Lesson 0 • When innovating, plan for… – a substantial gap between prototype and production – multiple iterations
Theory vs. Practice. Round 1: Self-Audit. “A mere matter of implementation” - PHB Motto
THEORY (Round 1): Start → Gather information from each replica (via the LOCKSS cache manager; log errors for later investigation) → Integrate information → map network state → Compare current network state to policy → If state == policy: success; otherwise add a replica and repeat
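The loop can be read as the following schematic Python sketch. gather_status is a stub for however replica state is actually collected (in Round 1 this meant replacing the LOCKSS cache manager), and the policy map is a simplification, so treat this as an outline under those assumptions rather than the SafeArchive implementation.

```python
from typing import Dict, List

def gather_status(replica: str) -> Dict[str, bool]:
    """Stub: return {collection_id: verified?} as reported by one replica."""
    raise NotImplementedError  # in practice this required reverse-engineering daemon UIs

def audit_network(replicas: List[str], policy: Dict[str, int]) -> Dict[str, bool]:
    """One pass of the loop: gather per-replica reports, integrate them into a
    network map, then compare the map against the per-collection replica policy."""
    verified_copies: Dict[str, int] = {}
    for replica in replicas:
        try:
            for collection, verified in gather_status(replica).items():
                if verified:
                    verified_copies[collection] = verified_copies.get(collection, 0) + 1
        except Exception as err:
            print(f"log for later investigation: {replica}: {err}")
    compliance = {c: verified_copies.get(c, 0) >= needed for c, needed in policy.items()}
    for collection, ok in compliance.items():
        if not ok:
            print(f"{collection}: below policy, provision another replica and re-audit")
    return compliance
```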
Implementation www.safearchive.org
Practice (Round 1) • Gathering information required – replacing the LOCKSS cache manager – permissions – reverse-engineering UIs (with help) – network magic • Integrating information required – heuristics for lagged information – heuristics for incomplete information – heuristics for aggregated information • Comparing the map to policy required – a mere matter of implementation (Round 1 theory flowchart shown alongside)
Results (Round 1) • Outcomes – implementation of the SafeArchive reporting engine – stand-alone OSS replacement for the LOCKSS cache manager – initial audit of Data-PASS replicated collections • Problems – collections achieving policy compliance were actually incomplete (dude, where's our metadata?) – uh-oh, most collections failed policy compliance – adding replicas didn't solve it
Lesson 1: Replication agreement does not prove collection integrity. What you see: replicas X, Y, Z agree on collection A. What you are tempted to conclude: replicas X, Y, Z agree on collection A → collection A is good
What can you infer from replication agreement? The inference “replicas X, Y, Z agree on collection A → collection A is good” rests on assumptions: • harvesting did not report errors, AND • either the harvesting system is error-free, OR errors are independent per object AND there is a large number of objects in the collection. Supporting external evidence: multiple independent harvester implementations; systematic harvester testing; automated harvester log monitoring; systematic comparison per collection; automated collection restore & testing; comparison with external collection statistics
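The last kind of evidence listed (comparison with external collection statistics) is easy to illustrate: if the source archive publishes study and file counts, a shortfall shared by agreeing replicas becomes visible even though the replicas agree with one another. The numbers and replica names below are made up for illustration, not measurements from Data-PASS.

```python
# Compare object counts published by the source archive with what each replica reports.
source_counts = {"studies": 25000, "files": 600000}   # e.g. from the archive's own statistics
replica_counts = {
    "replica-1": {"studies": 25000, "files": 600000},
    "replica-2": {"studies": 25000, "files": 598212},  # replicas 2 and 3 agree with each other
    "replica-3": {"studies": 25000, "files": 598212},  # yet both fall short of the source
}

for replica, counts in replica_counts.items():
    gaps = {k: source_counts[k] - v for k, v in counts.items() if v != source_counts[k]}
    status = "matches source" if not gaps else f"short by {gaps}"
    print(f"{replica}: {status}")
```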
Lesson 2: Replication disagreement does not prove corruption. What you see: replicas X, Y disagree with Z on collection A. What you are tempted to conclude: replicas X, Y disagree with Z on collection A → collection A on host Z is bad → repair/replace collection A on host Z
What can you infer from replication failure? The inference “replicas X, Y disagree with Z on collection A → collection A on host Z is bad” rests on assumptions: • disagreement implies that the content of collection A is different on all hosts • contents of collection A should be identical on all hosts • if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ???; ???
Theory vs. Practice. Round 2: Compliance (almost). “How do you spell 'backup'? R-E-C-O-V-E-R-Y”
Lesson 3: Distributed digital preservation works … with evidence-based tuning and adjustment • Diagnostics – when the network is out of adjustment, additional information is needed to inform adjustment – worked with the LOCKSS team to gather information • Adjustments – timings (e.g., crawls, polls): understand, tune, parameterize heuristics and reporting, track trends over time – collections: change partitioning to AUs at the source, extend mapping to AUs in the plugin, extend the reporting/policy framework to group AUs • Outcomes – at the time: verified replication of all collections – currently: minor policy violations in one collection – worked with the LOCKSS team to design further instrumentation of LOCKSS
Theory vs. Practice. Round 3: Auditing Other PLNs. “In theory, theory and practice are the same – in practice, they differ.”
Application: COPPUL • Council of Prairie and Pacific University Libraries • Collections – 9 institutions – dozens of collections (journal runs; digitized member content: text, photos, images, ETDs) • Goal – ‘multiple’ verified replicas
Application: Digital Federal Depository Library Program • The Digital Federal Depository Library Program, or “USDocs”, is a private LOCKSS network that replicates key aspects of the United States Federal Depository System • Collections – dozens of institutions (24 replicating) – electronic publications – 580+ collections – <10 TB • Goal – “many” replicas, “many” regions
Application: MetaArchive • “A secure and cost-effective repository that provides for the long-term care of digital materials – not by outsourcing to other organizations, but by actively participating in the preservation of our own content.” • 50+ institutions, 22+ members • >10 TB, including audio and video content • Testing only; full auditing not yet performed…
THEORY (Round 3): Start → Gather information from each replica → Integrate information → map network state → Compare current network state to policy → If state == policy: success. Otherwise: have collection sizes and polling intervals been adjusted? If no, adjust them; if yes, add a replica; then repeat
Here's where things get even more complicated…
Practice (Round 3). Lesson 4: Trust, but continuously verify • 20-80% initial failure to confirm policy compliance • Tuning infeasible, or yielded only moderate improvement. Outcomes: • in-depth diagnostics and analysis with the LOCKSS team • adjustment of auditing algorithms: detect “islands of agreement” • adjusted expectations – focus on inferences rather than replication agreement – focus on 100% policy compliance per collection rather than 100% error-free • designed file-level diagnostic instrumentation in LOCKSS • re-analysis in progress… (Round 3 flowchart shown alongside)
What can you infer from replication failure? The inference “replicas X, Y disagree with Z on collection A → collection A on host Z is bad” rests on assumptions: • disagreement implies that the content of collection A is different on all hosts • contents of collection A should be identical on all hosts • if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ???; ???
What else could be wrong? Hypothesis 1: Disagreement is real, but doesn't matter in the long run. 1.1 Temporary differences: collections temporarily out of sync (either missing objects or different object versions) – will resolve over time (e.g., if harvest frequency << source update frequency, but harvest times across boxes vary significantly). 1.2 Permanent point-in-time collection differences that are an artefact of synchronization (e.g., if one replica always has version n-1 at the time of the poll). Hypothesis 2: Disagreement is real, but non-substantive. 2.1 Non-substantive collection differences (arising from dynamic elements in the collection that have no bearing on the substantive content). 2.1.1 Individual URLs/files that are dynamic and non-substantive (e.g., logo images, plugins, Twitter feeds, etc.) cause content changes (this is common in the GLN). 2.1.2 Dynamic content embedded in substantive content (e.g., a customized per-client header page embedded in the PDF for a journal article). 2.2 Audit summary over-simplifies → loses information. 2.2.1 Technical failure of a poll can occur when there are still sub-quorum “islands” of agreement, sufficient for policy. Hypothesis 3: Disagreement is real and matters: substantive collection differences. 3.1 Some objects are corrupt (e.g., from corruption in storage, or during transmission/harvesting). 3.2 Substantive objects persistently missing from some replicas (e.g., because of a permissions issue at the provider, technical failures during harvest, or plugin problems). 3.3 Versions of objects permanently missing. (Note that later “agreement” may signify that a later version was verified.)
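Hypothesis 2.2.1 is what the "islands of agreement" adjustment targets. A rough sketch of the idea, using hypothetical digests and ignoring how LOCKSS actually scores polls, looks like this:

```python
from collections import defaultdict
from typing import Dict, List

def islands_of_agreement(digests_by_replica: Dict[str, str]) -> List[List[str]]:
    """Group replicas by the digest they report for one collection/AU, largest island first."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for replica, digest in digests_by_replica.items():
        groups[digest].append(replica)
    return sorted(groups.values(), key=len, reverse=True)

def policy_met_despite_failed_poll(digests_by_replica: Dict[str, str], min_replicas: int) -> bool:
    """Hypothesis 2.2.1: a poll can fail technically while the largest island of
    mutually agreeing replicas still satisfies the replication policy."""
    islands = islands_of_agreement(digests_by_replica)
    return bool(islands) and len(islands[0]) >= min_replicas

# Hypothetical per-replica digests for one AU: a strict poll over all seven peers
# finds no network-wide agreement, yet four replicas agree, meeting a >=3 policy.
report = {"p1": "d1", "p2": "d1", "p3": "d1", "p4": "d1", "p5": "d2", "p6": "d2", "p7": "d3"}
print(islands_of_agreement(report))
print(policy_met_despite_failed_poll(report, min_replicas=3))
```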
What diagnostic evidence would be useful for audit-related inference? • Longitudinal fixity and modification-time collection – e.g., detect whether disagreement is persistent for a specific collection – transient collection problems suggest synchronization issues • Collection-replica fixity – detect sub-quorum-level ‘islands’ of agreement (insufficient to validate the default poll, but potentially sufficient to verify policy compliance) • File fixity, version/modification time – e.g., establish partial collection agreement (all files older than time X agree; all disagreements are versioning/modification-time differences) • Longitudinal information at the file/collection level – subset of files persistently missing – subset of files longitudinally different across re-harvesting (suggests dynamic content issues) • Universal Numeric Fingerprints/semantic signatures – remove false positives from format migration and dynamic non-substantive content • Object information – suggests dynamic objects • Manual file inspection – check dynamic objects – not scalable – difficult to do without violating the trust model
What can you infer from replication failure? The inference “replicas X, Y disagree with Z on collection A → collection A on host Z is bad” rests on assumptions: • disagreement implies that the content of collection A is different on all hosts • contents of collection A should be identical on all hosts • if some content of collection A is bad, the entire collection is bad. Alternative scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; partial agreement without quorum; non-substantive dynamic content
Lesson 5: Don't aim for 100% performance, aim for 100% compliance • 100% of replicas agree: no • 100% of collections are compliant 100% of the time: no • 100% of files agree between verified collections: maybe • 100% of policy overall: by design • 100% of bits in a file: implicitly assumed by tools, but not necessary • 100% of policy for a specific collection at a specific time: yes
Lessons Learned. “What, me worry?” - Alfred E. Neuman
Formative Lessons • Lesson 0: When innovating, plan for… – a substantial gap between prototype and production – multiple iterations • Lesson 1: Replication agreement does not prove collection integrity… confirm with external evidence of correct harvesting • Lesson 2: Replication disagreement does not prove collection corruption… use diagnostics • Lesson 3: Distributed digital preservation works… with evidence-based tuning and adjustment
Analytic Lessons • Lesson 4: All networks had substantial and unrecognized gaps → trust, but continuously verify • Lesson 5: Don't aim for 100% performance, aim for 100% compliance • Lesson 6: Many different things can go wrong in distributed systems, without easily recognizable external symptoms → distributed preservation requires distributed auditing • Lesson 7: External information on system operation and collection characteristics is important for analyzing results → transparency helps preservation
What's Next? “It's tough to make predictions, especially about the future” - Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, and others
Short Term • Complete round 4 data collection (including CLOCKSS) • Refinements of current auditing algorithms – more tunable parameters (yeah!?!) – better documentation – simple health metrics • Reports and dissemination
Longer Term • Statistical health metrics • Diagnostics • Policy decision support • Additional audit standards • Support for additional replication technology • Audit other policy sets
Bibliography (Selected) • B. Schneier, 2012. Liars and Outliers, John Wiley & Sons. • H.M. Gladney, J.L. Bennett, 2003. “What Do We Mean by Authentic?”, D-Lib Magazine 9(7/8). • K. Thompson, 1984. “Reflections on Trusting Trust”, Communications of the ACM, Vol. 27, No. 8, August 1984, pp. 761-763. • David S.H. Rosenthal, Thomas S. Robertson, Tom Lipkis, Vicky Reich, Seth Morabito, 2005. “Requirements for Digital Preservation Systems: A Bottom-Up Approach”, D-Lib Magazine, Vol. 11, No. 11, November 2005. • OAIS, Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1, Blue Book, January 2002.
Questions? SafeArchive: safearchive.org. E-mail: Micah_altman@alumni.brown.edu; Web: micahaltman.com; Twitter: @drmaltman. E-mail: Jonathan_Crabtree@unc.edu

Editor's Notes

  • #2 This work by Micah Altman (http://micahaltman.com), with the exception of images explicitly accompanied by a separate “source” reference, is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.