Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Published in KDD 2026

Recommended citation: Jiatong Li*, Junxian Li*, Yunqing Liu, Dongzhan Zhou, and Qing Li. (2026). Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation. KDD 2026. https://arxiv.org/pdf/2412.14642

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings: they measure an LLM's ability to retrieve a single, pre-defined answer rather than its creative potential to generate diverse yet equally valid molecular candidates. To address this gap, we propose Speak-to-Structure (S2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs, such as GPT-4o and Claude-3.5, on S2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are available on GitHub (https://github.com/phenixace/S2-TOMG-Bench) and Hugging Face Datasets (https://huggingface.co/datasets/phenixace/S2-TOMG-Bench).

Project homepage: here

Code: here

Dataset (Hugging Face): here

Download paper here