Open Source Tools for Materials Informatics Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA MRS Fall Meeting 2019 Slides (already) posted to hackingmaterials.lbl.gov
Staffing interdisciplinary research: I find a recurring dilemma and asymmetry in staffing materials informatics research. [Figure labels: Machine learning, Materials Science, Materials Informatics]
3 Who has a tougher job to get started?
MS&E major:
• Already has background in the materials science aspects of the project
• But needs to learn the machine learning and software engineering aspects
CS major:
• Already has background in software engineering and appropriate machine learning
• But needs to learn the materials science aspects
4 Who has a tougher job to get started? MS&E major or CS major: my experience is that the CS major typically has the tougher road ahead of them.
5 Who has a tougher job to get started? MS&E major or CS major: my experience is that the CS major typically has the tougher road ahead of them. It is easier to pick up or self-learn random forests and neural networks than phase diagrams and crystal structures.
6 There is an asymmetry in resources available
MS&E major:
• Hands-on code and examples to run and modify
• Hundreds of YouTube videos and online courses
• Code reviews from collaborators
• And the standard books, etc.
CS major:
• Books and research articles
• Conversations with colleagues, impromptu lectures
• Practice problems? Worked examples? Interactive code?
7 Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar: applying natural language processing to materials science information retrieval
8 How can we make it easy to develop and test ML models for composition-structure-property relationships? How can we quickly represent chemistry and structure as vectors? How do we get labeled training/test data? How do we know if our ML model is extraordinary?
9 How can we make it easy to develop and test ML models for composition-structure-property relationships? How can we quickly represent chemistry and structure as vectors?
10 Matminer contains a library of descriptors for various materials science entities
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
Featurizers:
• are compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
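The construct-then-featurize pattern on this slide can be illustrated without matminer installed. The following is a minimal sketch, not matminer's implementation: the `MeanAtomicNumber` class, the small `ATOMIC_NUMBERS` table, and the dict-based composition input are hypothetical stand-ins chosen for illustration. Only the method names `featurize`, `feature_labels`, and `citations` are meant to echo the featurizer interface described above.

```python
from typing import Dict, List

# Hypothetical lookup table covering only the elements used below.
ATOMIC_NUMBERS: Dict[str, int] = {"O": 8, "Na": 11, "Cl": 17, "Fe": 26}


class MeanAtomicNumber:
    """Toy featurizer: amount-weighted mean atomic number of a composition."""

    def featurize(self, composition: Dict[str, float]) -> List[float]:
        # composition maps element symbol -> amount, e.g. {"Fe": 2, "O": 3}
        total = sum(composition.values())
        weighted = sum(ATOMIC_NUMBERS[el] * amt for el, amt in composition.items())
        return [weighted / total]

    def feature_labels(self) -> List[str]:
        # One label per value returned by featurize()
        return ["mean atomic number"]

    def citations(self) -> List[str]:
        # A real featurizer would list its methodology papers here.
        return []


feat = MeanAtomicNumber()
y = feat.featurize({"Na": 1, "Cl": 1})
row = dict(zip(feat.feature_labels(), y))  # label -> value for one material
```

Returning a list of values alongside a parallel list of labels keeps each featurizer's output easy to assemble into a table, one row per material.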
11 How can we make it easy to develop and test ML models for composition-structure-property relationships? How do we get labeled training/test data?
12 What about data?
• Typically, a lot of attention is given to advanced algorithms for machine learning
  – e.g., deep neural networks versus standard ML
• But perhaps there is not enough emphasis on developing the appropriate data sets
  – with enough information to train ML algorithms
  – with sufficient data quality
  – easy enough for anyone to at least get started without specialized knowledge
13 The importance of data https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
14 What is ImageNet? The ImageNet data set was collected and hand-labeled (e.g., via Amazon Mechanical Turk). The latest version has over 14 million hand-annotated images, organized into ~20,000 categories.
15 How data stimulates new algorithms
16 How data stimulates new algorithms: How can we create an ImageNet for materials science?
17 An “ImageNet” for materials science
• We want a test set that contains a diverse array of problems
  – Smaller data versus larger data
  – Different applications (electronic, mechanical, etc.)
  – Composition-only or structure information available
  – Classification or regression
• We also want a cross-validation metric that gives reliable error estimates
  – i.e., less dependent on specific choice of splits
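One way to make an error estimate less dependent on a particular split is to fix the folds with a seed and average the error over all of them. The sketch below illustrates that idea in plain Python. The function names, the 5-fold default, the predict-the-training-mean "model", and the toy targets are assumptions for illustration, not the Matbench protocol.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def kfold_indices(n: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; the seed fixes the shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        splits.append((train, test))
    return splits


def cv_mae_mean_predictor(y: Sequence[float], k: int = 5, seed: int = 0) -> float:
    """MAE of a predict-the-training-mean baseline, averaged over all folds."""
    fold_errors = []
    for train, test in kfold_indices(len(y), k, seed):
        prediction = mean(y[j] for j in train)
        fold_errors.append(mean(abs(y[j] - prediction) for j in test))
    return mean(fold_errors)


targets = [float(v) for v in range(20)]  # toy target values
score = cv_mae_mean_predictor(targets)
```

Because every sample lands in exactly one test fold and the seed is fixed, two groups running this on the same data would score against identical splits.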
18 Overview of Matbench test set

Target Property | Data Source | Samples | Method
Bulk Modulus | Materials Project | 10,987 | DFT-GGA
Shear Modulus | Materials Project | 10,987 | DFT-GGA
Band Gap | Materials Project | 106,113 | DFT-GGA
Metallicity | Materials Project | 106,113 | DFT-GGA
Band Gap | Zhuo et al. [1] | 6,354 | Experiment
Metallicity | Zhuo et al. [1] | 6,354 | Experiment
Bulk Metallic Glass formation | Landolt-Börnstein | 7,190 | Experiment
Refractive index | Materials Project | 4,764 | DFPT-GGA
Formation Energy | Materials Project | 132,752 | DFT-GGA
Perovskite Formation Energy | Castelli et al. [2] | 18,928 | DFT-GGA
Freq. at Last Phonon PhDOS Peak | Materials Project | 1,296 | DFPT-GGA
Exfoliation Energy | JARVIS-2D | 636 | DFT-vDW-DF
Steel yield strength | Citrine Informatics | 312 | Experiment

[1] doi.org/10.1021/acs.jpclett.8b00124
[2] doi.org/10.1039/C2EE22341D
19 Diversity of benchmark suite [Figure: breakdown of the benchmark tasks by application (mechanical, electronic, stability, optical, thermal), data size (<1K, 1K-10K, 10K-100K, >100K), problem type (classification, regression), and data type (experiment, composition only; DFT, structure)]
20 How can we make it easy to develop and test ML models for composition-structure-property relationships? How do we know if our ML model is extraordinary?
21 How about a benchmark algorithm? Automatminer is a “black box” machine learning model. Give it any data set with either composition or structure inputs, and automatminer will train an ML model (no researcher intervention).
22 Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties)
Featurizer: MagPie, SOAP, Sine Coulomb Matrix + many, many more
• Dropping features with many errors • Missing value imputation • One-hot encoding
• PCA-based • Correlation • Model-based (tree)
Uses genetic algorithms to find the best machine learning model + hyperparameters
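Two of the cleaning steps listed above, dropping features with many errors and imputing missing values, can be sketched on a small table. This is an illustrative plain-Python version under stated assumptions: the `clean` function, the use of `None` to mark a failed featurization, the 50% drop threshold, and mean imputation are hypothetical choices, not Automatminer's actual code or defaults.

```python
from statistics import mean
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]


def clean(table: Table, max_missing_frac: float = 0.5) -> Dict[str, List[float]]:
    """Drop columns with too many missing values, then mean-impute the rest."""
    cleaned: Dict[str, List[float]] = {}
    for name, column in table.items():
        present = [v for v in column if v is not None]
        missing_frac = 1 - len(present) / len(column)
        if missing_frac > max_missing_frac:
            continue  # this feature failed too often to be useful
        fill = mean(present)
        cleaned[name] = [fill if v is None else v for v in column]
    return cleaned


# Toy feature table (values are made up); None marks a failed featurization.
features: Table = {
    "mean atomic number": [14.0, None, 15.0, 13.0],  # one gap: imputed
    "ewald energy": [None, None, None, -3.2],        # mostly gaps: dropped
}
result = clean(features)
```

Dropping sparse columns before imputing avoids filling a feature almost entirely with a single made-up value.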
23 Can actually do apples-to-apples competition between algorithms
24 If we can get a well-established “benchmark”, perhaps interdisciplinary teams can start hammering on accuracy. A lower barrier to entry in the field means more ideas can be tested from more researchers. [Figure: Matbench test set average error over time (today, 5 years, 10 years)]
25 Matminer, matbench, and automatminer can all be accessed, used, and modified by anyone
Code / Examples all on GitHub
• github.com/hackingmaterials/matminer
• github.com/hackingmaterials/matminer_examples
• github.com/hackingmaterials/automatminer
Matbench data on Figshare
• (coming soon, still finalizing)
Free support via Discourse
• https://discuss.matsci.org
26 Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar: applying natural language processing to materials science information retrieval
27 Goal: collect and organize knowledge embedded in the materials science literature. We have extracted ~2 million abstracts of relevant scientific articles. We use natural language processing algorithms to try to extract knowledge from all this data.
28 We’ve developed algorithms to automatically tag keywords in the abstracts
29 Application: a revised materials search engine Auto-generated summaries of materials based on text mining
30 Application: materials compositions of interest … A search for thermoelectrics that do not have Pb or Bi
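A query like the one on this slide, materials that do not contain certain elements, comes down to parsing element symbols out of each formula and filtering on them. A minimal sketch follows. The regular expression, the `elements_in` and `exclude_elements` helpers, and the example formulas are assumptions for illustration, not Matscholar's search code.

```python
import re
from typing import Iterable, List, Set

# An element symbol is one uppercase letter optionally followed by one lowercase letter.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def elements_in(formula: str) -> Set[str]:
    """Return the set of element symbols appearing in a simple formula string."""
    return set(ELEMENT_PATTERN.findall(formula))


def exclude_elements(formulas: Iterable[str], banned: Set[str]) -> List[str]:
    """Keep only formulas that contain none of the banned elements."""
    return [f for f in formulas if not (elements_in(f) & banned)]


candidates = ["PbTe", "Bi2Te3", "SnSe", "CoSb3", "Mg2Si"]  # illustrative inputs
hits = exclude_elements(candidates, {"Pb", "Bi"})
```

Parsing whole symbols rather than matching substrings matters here: a plain substring test for "B" would wrongly match "Bi", while the symbol-level check keeps boron and bismuth distinct.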
31 Using a web site as a “gateway” into the algorithms
• How do we get more people benefitting from this work and involved in improving it?
• One solution: expose an easy-to-use web frontend, with links to all the backend codes in case people want to dive further
  – New tools like Plotly Dash make this easier than ever
[Figure labels: frontend, backend]
32 https://www.matscholar.com – demo 1
33 https://www.matscholar.com – demo 2
34 Matscholar MRS! https://matscholar-mrs.herokuapp.com
35 Hopefully these frontend demos get you interested enough to check the “About page”
36 Concluding thoughts
• We need more resources to help computer scientists learn about materials science topics through hands-on examples and interactive demos
• Some things that can help:
  – Open-source implementations of materials science methods
  – Interactive examples (e.g., Jupyter)
  – Documentation and support(!)
  – Labeled data sets
  – Front-ends for easy exploration
37 Funding acknowledgements
• Matminer: U.S. Department of Energy, Materials Science Division
• Matscholar: Toyota Research Institute
Slides (already) posted to hackingmaterials.lbl.gov
