Materials scientists developing technologies from smartphones to automobiles struggle to predict material properties, because even slight compositional differences can dramatically alter them. Machine learning can help identify empirical trends, but it needs structured data, and the experimental results reported in millions of existing research papers are difficult to extract in that form. An approach based on large language models (LLMs) now automates the conversion of this complex scientific information into usable databases.
Dr. Yukari Katsura's team at the National Institute for Materials Science has developed two tools that use LLMs to accelerate construction of the Starrydata materials property database. The research, published in Science and Technology of Advanced Materials: Methods, targets a key bottleneck in materials informatics by automating data extraction from paper PDFs. "We found that by specifying a data structure and giving instructions to an LLM, we can accurately and comprehensively extract information about figures, tables, and samples from the text of paper PDFs across a wide range of fields," explained Katsura.
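The idea of "specifying a data structure and giving instructions to an LLM" can be illustrated with a minimal Python sketch. It only assembles the instruction text and does not call any model. The field names in SAMPLE_FIELDS and the function build_extraction_prompt are hypothetical choices made for illustration; the article does not describe the actual Starrydata schema or prompts.

```python
import json

# Hypothetical field specification (an assumption for illustration);
# the real Starrydata data structure is not given in this article.
SAMPLE_FIELDS = {
    "composition": "chemical formula of the sample, as written in the paper",
    "synthesis_method": "how the sample was prepared",
    "form": "physical form, e.g. bulk, thin film, single crystal",
}


def build_extraction_prompt(fields: dict[str, str], paper_text: str) -> str:
    """Combine a field specification and pasted paper text into one instruction."""
    # An empty template shows the model the exact keys expected in its answer.
    template = {name: None for name in fields}
    lines = [
        "Extract sample information from the text below.",
        "Return only a JSON object with exactly these keys; "
        "use null when the text does not state a value.",
        "",
        "Fields:",
    ]
    lines += [f"- {name}: {description}" for name, description in fields.items()]
    lines += [
        "",
        "Output template:",
        json.dumps(template, indent=2),
        "",
        "Text:",
        paper_text.strip(),
    ]
    return "\n".join(lines)
```

Fixing the keys in advance and asking for null instead of a guess is one plausible way to keep the model's answers comparable across papers; whether the Starrydata tools work this way is not stated in the article.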
The first tool, Starrydata Auto-Suggestion for Sample Information, is already integrated into the Starrydata2 web system. When users paste text from a paper's abstract or experimental methods section, it calls OpenAI's GPT through the API and suggests candidate entries for the data fields. The second tool, Starrydata Auto-Summary GPT, is built with ChatGPT's custom GPT feature. It deconstructs an entire open-access paper PDF and summarizes every description of figures, tables, and samples as structured JSON data. The output can be viewed as easy-to-read tables in a web browser, which greatly speeds up data collection work.
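To make the step from structured JSON to a readable table concrete, here is a minimal Python sketch. The JSON layout in EXAMPLE_OUTPUT (a "samples" list with "sample_id", "composition", and "form" keys) and its values are invented for illustration and are not real Starrydata output; the helper names to_rows and render_table are likewise hypothetical. The actual tool displays tables in a web browser, whereas this sketch prints plain text.

```python
import json

# Invented example in an assumed layout; not actual Starrydata output.
EXAMPLE_OUTPUT = """
{
  "samples": [
    {"sample_id": "S1", "composition": "Bi2Te3", "form": "bulk"},
    {"sample_id": "S2", "composition": "Bi2Te2.7Se0.3"}
  ]
}
"""


def to_rows(raw_json: str, section: str, columns: list[str]) -> list[list[str]]:
    """Parse the JSON text and pull the chosen columns out of one section."""
    records = json.loads(raw_json).get(section, [])
    # Missing keys become empty cells instead of raising an error.
    return [[str(record.get(col, "")) for col in columns] for record in records]


def render_table(columns: list[str], rows: list[list[str]]) -> str:
    """Format the rows as a fixed-width plain-text table."""
    widths = [
        max([len(col)] + [len(row[i]) for row in rows])
        for i, col in enumerate(columns)
    ]

    def fmt(cells: list[str]) -> str:
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return " | ".join(padded).rstrip()

    separator = "-+-".join("-" * width for width in widths)
    return "\n".join([fmt(columns), separator] + [fmt(row) for row in rows])


if __name__ == "__main__":
    cols = ["sample_id", "composition", "form"]
    print(render_table(cols, to_rows(EXAMPLE_OUTPUT, "samples", cols)))
```

Treating a missing key as an empty cell reflects the fact that papers do not report every field for every sample, so a table view has to tolerate gaps.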
The approach has limitations. Some publishers restrict the use of AI on paper PDFs, so the team is focusing initially on open-access papers. In addition, LLMs cannot reliably extract data from graph images, so data collectors still need a separately developed semi-automated tool for that task. Despite these constraints, the automation is a significant advance. "A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," noted Katsura.
The implications extend beyond efficiency gains to the way materials research is done. With large-scale datasets built from the existing literature, researchers can draw inspiration from comprehensive overviews of the data and use machine learning to predict properties from empirical trends. The approach moves materials science toward a future in which experimental data from all fields can be shared digitally and analyzed from integrated perspectives. Starrydata currently focuses on specific areas such as thermoelectric materials and magnets, and as an open dataset it is already being used by leading researchers worldwide for new materials development.
The team's work establishes paper data collection as a recognized form of research within the scientific community and raises awareness of what large-scale aggregation of experimental data can offer. It changes how materials property information is curated and used, and could accelerate innovation across the many technology sectors that depend on advanced functional materials.



