This document discusses open source tools for materials informatics, including Matminer and Matscholar. Matminer is a library of descriptors for materials science data that can generate features for machine learning models. It includes over 60 featurizer classes and supports scikit-learn. Matscholar applies natural language processing to over 2 million materials science abstracts to extract keywords and enable improved literature searching. The document argues that open datasets like Matbench and automated tools like Automatminer could help lower barriers for developing machine learning models in materials science by making it easier to obtain training data and evaluate model performance.
Presentation introduction by Anubhav Jain on open source tools for materials informatics.
Discusses the challenges faced by CS majors vs. MS&E majors in learning materials science and machine learning, highlighting asymmetries in available resources.
Outlines key questions regarding the development and testing of ML models for materials science, emphasizes importance of quality datasets.
Explains the role of high-quality datasets like ImageNet for machine learning and proposes a similar dataset for materials science.
Describes the Matbench test set, detailing various target properties, data sources, and sizes, plus diversity in the benchmark suite.
Introduces Automatminer as a tool for automated ML model generation, discusses its capabilities and potential for reducing entry barriers.
Highlights accessibility of Matminer, Matbench, and Automatminer through GitHub and upcoming data on Figshare.
Discusses the use of NLP for extracting knowledge from literature and enhancing materials search engines, emphasizes user involvement.
Concludes with thoughts on supporting computer scientists in learning materials science via open-source tools and resources.
Acknowledges funding sources for Matminer and Matscholar projects.
Open Source Tools for Materials Informatics
Anubhav Jain
Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA
MRS Fall Meeting 2019
Slides (already) posted to hackingmaterials.lbl.gov
2 Staffing interdisciplinary research
I find a recurring dilemma and asymmetry in staffing materials informatics research
(diagram: Machine Learning + Materials Science = Materials Informatics)
3 Who has a tougher job to get started?
MS&E major:
• Already has background in the materials science aspects of the project
• But needs to learn the machine learning and software engineering aspects
CS major:
• Already has background in software engineering and appropriate machine learning
• But needs to learn the materials science aspects
4 Who has a tougher job to get started?
My experience is that the CS major typically has the tougher road ahead of them.
5 Who has a tougher job to get started?
My experience is that the CS major typically has the tougher road ahead of them: it is easier to pick up / self-learn random forests & neural networks than phase diagrams & crystal structures.
6 There is an asymmetry in resources available
MS&E major (learning machine learning):
• Hands-on code and examples to run and modify
• Hundreds of YouTube videos and online courses
• Code reviews from collaborators
• And the standard books, etc.
CS major (learning materials science):
• Books and research articles
• Conversations with colleagues, impromptu lectures
• Practice problems? Worked examples? Interactive code?
7 Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar – applying natural language processing to materials science information retrieval
8 How can we make it easy to develop and test ML models for composition-structure-property relationships?
• How can we quickly represent chemistry and structure as vectors?
• How do we get labeled training / test data?
• How do we know if our ML model is extraordinary?
9 How can we make it easy to develop and test ML models for composition-structure-property relationships?
How can we quickly represent chemistry and structure as vectors?
10 Matminer contains a library of descriptors for various materials science entities
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
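Below is a minimal sketch of how a matminer featurizer can be used in practice; the specific featurizer, preset name, and composition are illustrative choices, not the only options.

```python
# Sketch: generating composition-based descriptors with matminer
# (illustrative; available featurizers/presets depend on your matminer version)
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# Build a featurizer from a descriptor set described in the literature
featurizer = ElementProperty.from_preset("magpie")

# Turn a single composition into a vector of descriptors
x = featurizer.featurize(Composition("Fe2O3"))

print(featurizer.feature_labels()[:5])  # names of the first few descriptors
print(featurizer.citations())           # citations to the methodology papers
```

The same featurizer objects can also be applied to whole pandas DataFrames, which is what enables the multiprocessing and scikit-learn pipelining mentioned above.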
11 How can we make it easy to develop and test ML models for composition-structure-property relationships?
How do we get labeled training / test data?
12 What about data?
• Typically, a lot of attention is given to advanced algorithms for machine learning, e.g., deep neural networks versus standard ML
• But perhaps there is not enough emphasis on developing the appropriate data sets:
– with enough information to train ML algorithms
– with sufficient data quality
– easy enough for anyone to at least get started without specialized knowledge
13 The importance of data
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
14 What is ImageNet?
The ImageNet data set was collected and hand-labeled (e.g., via Amazon Mechanical Turk). The latest version has over 14 million hand-annotated images, organized into ~20,000 categories.
16 How data stimulates new algorithms
How can we create an ImageNet for materials science?
17 An “ImageNet” for materials science
• We want a test set that contains a diverse array of problems
– Smaller data versus larger data
– Different applications (electronic, mechanical, etc.)
– Composition-only or structure information available
– Classification or regression
• We also want a cross-validation metric that gives reliable error estimates, i.e., less dependent on specific choice of splits (see the sketch below)
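One common way to make error estimates less sensitive to a particular split is repeated k-fold cross-validation; the sketch below uses scikit-learn and synthetic data purely for illustration and is not necessarily the exact protocol used by Matbench.

```python
# Sketch: repeated k-fold CV gives split-robust error estimates
# (synthetic data; Matbench's actual evaluation protocol may differ)
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 15 different splits
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         scoring="neg_mean_absolute_error", cv=cv)

print(f"MAE: {-scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over many repeated splits reduces the chance that a reported error simply reflects a lucky (or unlucky) train/test partition.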
18 Overview of Matbench test set

Target Property                   | Data Source         | Samples | Method
Bulk Modulus                      | Materials Project   | 10,987  | DFT-GGA
Shear Modulus                     | Materials Project   | 10,987  | DFT-GGA
Band Gap                          | Materials Project   | 106,113 | DFT-GGA
Metallicity                       | Materials Project   | 106,113 | DFT-GGA
Band Gap                          | Zhuo et al. [1]     | 6,354   | Experiment
Metallicity                       | Zhuo et al. [1]     | 6,354   | Experiment
Bulk Metallic Glass formation     | Landolt-Bornstein   | 7,190   | Experiment
Refractive index                  | Materials Project   | 4,764   | DFPT-GGA
Formation Energy                  | Materials Project   | 132,752 | DFT-GGA
Perovskite Formation Energy       | Castelli et al. [2] | 18,928  | DFT-GGA
Freq. at Last Phonon PhDOS Peak   | Materials Project   | 1,296   | DFPT-GGA
Exfoliation Energy                | JARVIS-2D           | 636     | DFT-vDW-DF
Steel yield strength              | Citrine Informatics | 312     | Experiment

1. doi.org/10.1021/acs.jpclett.8b00124
2. doi.org/10.1039/C2EE22341D
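Many labeled data sets like these can already be pulled directly into Python through matminer's dataset loader; the sketch below shows the general pattern, with the dataset name chosen only for illustration (the bundled names depend on the installed matminer version, and the consolidated Matbench data on Figshare is still being finalized).

```python
# Sketch: loading a labeled training set via matminer's dataset collection
# (dataset name is illustrative; list the names your version actually ships)
from matminer.datasets import get_available_datasets, load_dataset

print(get_available_datasets())           # names of all bundled data sets

df = load_dataset("dielectric_constant")  # returns a pandas DataFrame
print(len(df), "samples")
print(df.columns.tolist())                # input columns plus target property
```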
19 Diversity of benchmark suite
• Application: mechanical, electronic, stability, optical, thermal
• Data size: <1K, 1K-10K, 10K-100K, >100K samples
• Problem type: classification, regression
• Data type: experiment (composition only), DFT (structure)
20 How can we make it easy to develop and test ML models for composition-structure-property relationships?
How do we know if our ML model is extraordinary?
21 How about a benchmark algorithm?
Automatminer is a “black box” machine learning model: give it any data set with either composition or structure inputs, and Automatminer will train an ML model (no researcher intervention).
22 Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties)
• Featurization: MagPie, SOAP, Sine Coulomb Matrix, + many, many more
• Data cleaning: dropping features with many errors, missing value imputation, one-hot encoding
• Feature reduction: PCA-based, correlation, model-based (tree)
• Model selection: uses genetic algorithms to find the best machine learning model + hyperparameters
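For context, the snippet below sketches how such a black-box pipeline is typically driven from Python; the preset name, column names, and exact call signatures are assumptions to be checked against the Automatminer documentation for the installed version, and train_df / test_df stand in for user-supplied pandas DataFrames.

```python
# Sketch: driving Automatminer's MatPipe as a "black box" baseline
# (preset, column names, and signatures are assumptions; see the docs)
from automatminer import MatPipe

# train_df / test_df: DataFrames with a "structure" or "composition" column
# plus the target property column (here assumed to be "band_gap")
pipe = MatPipe.from_preset("express")

pipe.fit(train_df, "band_gap")       # featurize, clean, reduce, AutoML search
predictions = pipe.predict(test_df)  # DataFrame including predicted values
```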
24 If we can get a well-established “benchmark”, perhaps interdisciplinary teams can start hammering on accuracy. A lower barrier to entry in the field means more ideas can be tested from more researchers.
(Chart: Matbench test set average error vs. time, today / 5 years / 10 years)
25 Matminer, Matbench, and Automatminer can all be accessed, used, and modified by anyone
Code / examples all on GitHub:
• github.com/hackingmaterials/matminer
• github.com/hackingmaterials/matminer_examples
• github.com/hackingmaterials/automatminer
Matbench data on Figshare:
• (coming soon, still finalizing)
Free support via Discourse:
• https://discuss.matsci.org
26 Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar – applying natural language processing to materials science information retrieval
27 Goal: collect and organize knowledge embedded in the materials science literature
We have extracted ~2 million abstracts of relevant scientific articles. We use natural language processing algorithms to try to extract knowledge from all this data.
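As one generic illustration of the kind of NLP that can be run over a large corpus of abstracts, the sketch below trains word embeddings with gensim on tokenized text; this is illustrative only and not a description of the actual Matscholar pipeline, and the corpus, tokenization, and query term are placeholders.

```python
# Sketch: learning word embeddings from tokenized abstracts with gensim (>= 4.0)
# (placeholder corpus; not the actual Matscholar processing pipeline)
from gensim.models import Word2Vec

# In practice `abstracts` would be the ~2M tokenized abstracts
abstracts = [
    ["thermoelectric", "figure", "of", "merit", "zt", "of", "bi2te3"],
    ["band", "gap", "of", "the", "perovskite", "was", "measured"],
]

model = Word2Vec(sentences=abstracts, vector_size=100, window=5,
                 min_count=1, workers=4)

# Terms that appear in similar contexts end up close together in the vector space
print(model.wv.most_similar("thermoelectric", topn=3))
```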
31 Using a web site as a “gateway” into the algorithms (frontend + backend)
• How do we get more people benefiting from this work and involved in improving it?
• One solution: expose an easy-to-use web frontend, with links to all the backend codes in case people want to dive further
– New tools like Plotly Dash make this easier than ever (a minimal sketch follows below)
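Purely to illustrate how lightweight such a gateway can be, here is a hypothetical Plotly Dash app (assumes Dash >= 2; the search callback is a stub standing in for calls to the real backend codes).

```python
# Sketch: a minimal Dash "gateway" frontend (hypothetical; backend call is stubbed)
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.H3("Materials literature search (demo)"),
    dcc.Input(id="query", type="text", placeholder="e.g., thermoelectric"),
    html.Div(id="results"),
])

@app.callback(Output("results", "children"), Input("query", "value"))
def update_results(query):
    if not query:
        return "Type a query above."
    # A real app would call the backend NLP / search code here
    return f"(stub) results for: {query}"

if __name__ == "__main__":
    app.run(debug=True)  # older Dash versions use app.run_server(...)
```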
36 Concluding thoughts
• We need more resources to help computer scientists learn about materials science topics through hands-on examples and interactive demos
• Some things that can help:
– Open-source implementations of materials science methods
– Interactive examples (e.g., Jupyter)
– Documentation and support(!)
– Labeled data sets
– Front-ends for easy exploration
37 Funding acknowledgements
Slides (already) posted to hackingmaterials.lbl.gov
• Matminer – U.S. Department of Energy, Materials Science Division
• Matscholar – Toyota Research Institute