Materials Project computation and database infrastructure
The document describes the Materials Project computation infrastructure, which uses the Atomate framework to automatically run density functional theory simulations on over 85,000 materials in a high-throughput manner, with the results stored in a MongoDB database for users to explore and analyze in order to accelerate materials innovation. The Materials Project infrastructure aims to make it easy for researchers to generate large amounts of computational data on materials properties through standardized and scalable workflows.
1.
Materials Project computation and database infrastructure
Anubhav Jain
Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA
Presentation given to Delaware Energy Institute, 2018
Slides (already) posted to https://hackingmaterials.lbl.gov
2.
Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
3.
The Materials Project database
• Online resource of density functional theory simulation data for ~85,000 inorganic materials
• Includes band structures, elastic tensors, piezoelectric tensors, battery properties and more
• >60,000 registered users
• Free
• www.materialsproject.org
Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 11002 (2013).
4.
Many data sets are available!
M. De Jong et al. Sci. Data, 2015, 2, 150009.
Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
7.
A “black-box” view of performing a calculation
[Diagram: researcher asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!]
8.
Unfortunately, the inside of the “black box” is usually tedious and “low-level”
[Diagram: researcher asks “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work… → Results! Callouts: input file flags, SLURM format, how to fix ZPOTRF?]
☐ set up the structure coordinates
☐ write input files, double-check all the flags
☐ copy to supercomputer
☐ submit job to queue
☐ deal with supercomputer headaches
☐ monitor job
☐ fix error jobs, resubmit to queue, wait again
☐ repeat process for subsequent calculations in workflow
☐ parse output files to obtain results
☐ copy and organize results, e.g., into Excel
9.
What would be a better way?
[Diagram: researcher asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!]
10.
What would be a better way?
[Diagram: researcher asks “What is the GGA-PBE elastic tensor of GaAs?” → Results!]
Workflows to run
☐ band structure
☐ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
11.
Ideally the method should scale to millions of calculations
[Diagram: researcher requests “Start with all binary oxides, replace O->S, run several different properties” → Results!]
Workflows to run
✓ band structure
✓ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
☐ spin-orbit coupling
12.
Atomate tries to make it easy, automatic, and flexible to generate data with existing simulation packages
[Diagram: researcher requests “Run many different properties of many different materials!” → Results!]
13.
Atomate contains a library of simulation procedures
VASP-based
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• NEB
• GIBBS method
• QH thermal expansion
• AIMD
• ferroelectric
• surface adsorption
• work functions
Other
• BoltzTraP
• FEFF method
• LAMMPS MD
Mathew, K. et al., Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
14.
Each simulation procedure translates high-level instructions into a series of low-level tasks
Quickly and automatically translate PI-style (minimal) specifications, such as “What is the GGA-PBE elastic tensor of GaAs?”, into well-defined FireWorks workflows
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al., Charting the complete elastic properties of inorganic crystalline compounds, Sci. Data. 2 (2015).
15.
Atomate thus encodes and standardizes knowledge about running various kinds of simulations from domain experts
All past and present knowledge, from everyone in the group, everyone previously in the group, and our collaborators, about how to run calculations
Contributors: K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain, and others
16.
Full operation diagram
[Diagram: structure → workflow (job 1 → job 2 → job 3 → job 4) → database of all workflows → automatically submit + execute → output files + database]
18.
Crystal structure generation via pymatgen
• Pymatgen can retrieve crystal structures from the Materials Project database (MPRester class)
• It can also manipulate crystal structures
  – substitutions
  – supercell creation
  – order-disorder (shown at right)
  – interstitial finding
  – surface / slab generation
• A visual interface to many of the tools is available in Materials Project’s “Crystal Toolkit” app
Example: Order-disorder. Resolve partial or mixed occupancies into a fully ordered crystal structure (e.g., a mixed oxide-fluoride site into separate oxygen/fluorine sites)
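To make two of the listed manipulations concrete (supercell creation and substitution), here is a minimal pure-Python sketch. It deliberately does not use pymatgen; the function names `make_supercell` and `substitute`, the tuple-based site format, and the toy CsCl-type cell with a placeholder lattice parameter are all assumptions made for illustration.

```python
from itertools import product


def make_supercell(lattice, sites, scaling):
    """Repeat a unit cell along its three lattice vectors.

    lattice: three lattice vectors as a 3x3 nested list (angstroms)
    sites:   list of (species, (fa, fb, fc)) with fractional coordinates
    scaling: (na, nb, nc) integer repeat counts
    """
    na, nb, nc = scaling
    # Each lattice vector is stretched by its repeat count.
    new_lattice = [[x * n for x in vec] for vec, n in zip(lattice, scaling)]
    new_sites = []
    for i, j, k in product(range(na), range(nb), range(nc)):
        for species, (fa, fb, fc) in sites:
            # Shift into image cell (i, j, k), then rescale to the new cell.
            new_sites.append(
                (species, ((fa + i) / na, (fb + j) / nb, (fc + k) / nc))
            )
    return new_lattice, new_sites


def substitute(sites, mapping):
    """Swap species labels according to mapping, leaving positions unchanged."""
    return [(mapping.get(species, species), frac) for species, frac in sites]


# Toy CsCl-type cell; the lattice parameter is an illustrative placeholder.
lattice = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
sites = [("Cs", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))]

super_lattice, super_sites = make_supercell(lattice, sites, (2, 2, 2))
substituted = substitute(super_sites, {"Cl": "Br"})
```

A 2×2×2 supercell of the two-site cell holds 16 sites, and the substitution relabels the eight Cl sites as Br without moving them. Pymatgen's own structure objects handle these operations (plus symmetry, periodic images, and oxidation states) far more completely.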
20.
Atomate’s main goal: convert structures to workflows
Workflows consist of a series of jobs (“FireWorks”), each with multiple tasks. Atomate jobs typically (i) run a calculation and (ii) store the results in a database
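The job/task structure described on this slide can be sketched in plain Python. This is not the FireWorks or atomate API: `make_job`, `run_workflow`, the job names, and the list standing in for a database are hypothetical, chosen only to show jobs made of two tasks running in dependency order.

```python
from graphlib import TopologicalSorter


def make_job(name, database):
    """Build one job as a list of two tasks: run a calculation, then store it."""

    def run_calculation(context):
        # Placeholder for launching a simulation code and parsing its output.
        context["status"] = "completed"

    def store_results(context):
        # Placeholder for inserting a result document into a database.
        database.append({"job": name, **context})

    return [run_calculation, store_results]


def run_workflow(jobs, parents):
    """Run every job after its parents; parents maps job name -> parent names."""
    order = []
    for name in TopologicalSorter(parents).static_order():
        context = {}
        for task in jobs[name]:
            task(context)
        order.append(name)
    return order


database = []
names = ["structure optimization", "static", "band structure"]
jobs = {name: make_job(name, database) for name in names}
parents = {
    "static": ["structure optimization"],
    "band structure": ["static"],
}
order = run_workflow(jobs, parents)
```

Here the three jobs form a simple chain, so they run in the listed order and each leaves one document in `database`. A real workflow engine adds persistence, queue submission, and error handling on top of this ordering idea.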
22.
FireWorks allows you to write your workflow once and execute (almost) anywhere
• Execute workflows locally or at a supercomputing center
• Queue systems supported
  – PBS
  – SGE
  – SLURM
  – IBM LoadLeveler
  – NEWT (a REST-based API at NERSC)
  – Cobalt (Argonne LCF)
Other features
• Job provenance and automatic metadata storage
• Detect and rerun failures
• “Dynamic” workflows that change behavior based on results
• Customize job priorities
• Much more…
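As a rough illustration of two of these features together (rerunning failures and keeping provenance), the sketch below retries a failing job with adjusted parameters and records every attempt. It is not FireWorks code; `run_with_reruns`, `toy_job`, `halve_step`, and the `step` parameter are hypothetical names invented for this example.

```python
def run_with_reruns(job, params, fix, max_attempts=3):
    """Run job(params); on failure, adjust params with fix() and try again.

    Returns (result, history), where history records every attempt.
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job(params)
        except RuntimeError as error:
            history.append(
                {"attempt": attempt, "params": dict(params), "error": str(error)}
            )
            params = fix(params)  # change behavior based on the failure
        else:
            history.append(
                {"attempt": attempt, "params": dict(params), "error": None}
            )
            return result, history
    return None, history


def toy_job(params):
    # Stand-in for a calculation that only converges with a small step.
    if params["step"] > 0.1:
        raise RuntimeError("did not converge")
    return {"converged": True, "step": params["step"]}


def halve_step(params):
    return {**params, "step": params["step"] / 2}


result, history = run_with_reruns(toy_job, {"step": 0.4}, halve_step)
```

The toy job fails at steps 0.4 and 0.2 and succeeds at 0.1, so `history` holds three entries, the last with no error. Storing such a history alongside the result is one simple way to preserve how a value was actually obtained.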
The atomate database makes it easy to perform various analyses with pymatgen
[Diagram: atomate output database(s) → phase diagrams, Pourbaix diagrams, diffusivity via MD, band structure analysis]
28.
Many research groups have run tens of thousands of materials science workflows with atomate
atomate now powers the Materials Project and will be used to run hundreds of thousands of simulations in the next year (www.materialsproject.org)
Also used by:
• Persson research group, UC Berkeley
• Ong research group, UC San Diego
• Neaton research group, UC Berkeley
• Liu research group, Penn State
• Groups not developing on atomate!
  – e.g., see “Thermal expansion of quaternary nitride coatings” by Tasnadi et al.
29.
Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
30.
About a decade ago, we were using a SQL infrastructure
Main problems we ran into:
• Too static: every time we wanted to store a new kind of data, the DB master needed to “design and update” the database schema
• Too difficult for newcomers: constructing queries (joins, etc.) was hard. We actually designed a system to help people make queries, which is common
31.
Since then, we have switched to MongoDB, a “noSQL” database
Major advantages
• Very dynamic: easy to add new data types without interfering with old data types or redesigning everything. No central “database master” needed
• Easy for newcomers: easy syntax, no complex “joins”, easy to visualize results
• Easy object-relational mapping: we built our pymatgen code so that any objects (e.g., band structures, crystal structures, etc.) could be exported to a database or imported from a database easily
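The export/import idea in the last bullet can be sketched with a toy class that converts itself to and from a plain dictionary, which is the shape a document database stores. Pymatgen objects offer `as_dict()` and `from_dict()` methods for this purpose; the `BandGapRecord` class and its field values below are hypothetical placeholders, not pymatgen code or computed data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BandGapRecord:
    """Toy object standing in for a richer analysis object."""

    formula: str
    band_gap: float  # eV; placeholder value below, not a computed result
    is_direct: bool

    def as_dict(self):
        # Export to a plain dictionary that can be stored as a document.
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        # Rebuild the object from a stored document.
        return cls(**data)


record = BandGapRecord(formula="GaAs", band_gap=1.0, is_direct=True)

# Round trip through JSON to mimic writing to and reading from a database.
document = json.loads(json.dumps(record.as_dict()))
restored = BandGapRecord.from_dict(document)
```

Because the exported form is only nested dictionaries, lists, and scalars, no table schema has to be designed before a new kind of object can be stored.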
32.
How we store computed data
Data is stored in “collections”. Each collection is a set of documents that can be queried. Each document consists of nested key-value pairs (“dictionaries”) or arrays.
e.g., one can search for: {"tags": "phosphides"} to retrieve all documents tagged with “phosphides”
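The tag query on this slide can be emulated in a few lines of plain Python, which may help clarify what such a query matches. The sketch assumes that `tags` is stored as an array and mimics MongoDB's rule that an equality query on an array field matches when the array contains the value. The `matches` and `find` helpers and the sample documents are illustrative; with a real database one would pass the same query dictionary to the driver's `find` method.

```python
def matches(document, query):
    """Return True if document satisfies every key/value pair in query."""
    for key, wanted in query.items():
        value = document.get(key)
        if isinstance(value, list) and not isinstance(wanted, list):
            # Array field: match if any element equals the wanted value.
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def find(collection, query):
    """Return all documents in an in-memory collection that match query."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical documents; real entries carry many more keys.
materials = [
    {"formula": "GaP", "tags": ["phosphides", "semiconductors"]},
    {"formula": "InP", "tags": ["phosphides"]},
    {"formula": "GaAs", "tags": ["arsenides", "semiconductors"]},
]

phosphides = find(materials, {"tags": "phosphides"})
```

The query itself is just a dictionary, which is a large part of why newcomers find this style easier than composing SQL joins.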
33.
Each collection has a set of standard keys
Data is stored in “collections”. Each collection is a set of documents that can be queried.
• materials collection: each document represents a material, with keys like “formula” and “band_gap”
• tasks collection: each document represents a DFT calculation, with keys like “dir_name” and “input.parameters”
• workflows collection: each document represents a calculation workflow, with keys like “nodes” and “links”
Typically, each document within a collection will be of a uniform format, but this is not a hard requirement in MongoDB.
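A key such as “input.parameters” uses dot notation to address a value nested inside a document. The sketch below shows that lookup in plain Python, and also shows that a document missing the key is simply skipped rather than causing an error, which reflects the lack of a hard schema. The `get_nested` helper, the directory names, and the parameter values are hypothetical.

```python
def get_nested(document, dotted_key, default=None):
    """Look up a dotted key such as 'input.parameters' in nested dictionaries."""
    value = document
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value


# Hypothetical task documents; the second lacks the nested key entirely.
tasks = [
    {"dir_name": "/hypothetical/run_001", "input": {"parameters": {"ENCUT": 520}}},
    {"dir_name": "/hypothetical/run_002"},
]

parameters = [get_nested(task, "input.parameters") for task in tasks]
with_parameters = [
    task["dir_name"] for task in tasks if get_nested(task, "input.parameters")
]
```

Keeping documents in a collection mostly uniform is still worthwhile, since analysis code can then rely on the standard keys being present.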
34.
Two approaches to store data in MongoDB
1. As described previously: for each data type (a “material”, “task”, “workflow”, etc.) decide on a set of fields that describe each instance of that data type. In MongoDB, these fields can easily be changed or added to later if needed.
2. Try to create a single collection and document format that can handle any kind of materials data!
  – example 1: “PIF” file format from Citrine [1]
  – example 2: MPContribs from Materials Project [2]
[1] J. O’Mara, B. Meredig, K. Michel, Materials Data Infrastructure: A Case Study of the Citrination Platform to Examine Data Import, Storage, and Access, JOM (2016).
[2] P. Huck, D. Gunter, S. Cholia, D. Winston, A.T. N’Diaye, K. Persson, User applications driven by the community contribution framework MPContribs in the Materials Project, Concurr. Comput. Pract. Exp. 22 (2015).
Who to talk to next!
The current “Guardians of the MP infrastructure”
Funding: DOE-BES Materials Science Division
Computing: NERSC
Slides (already) posted to https://hackingmaterials.lbl.gov