S²-Bench: Speak-to-Structure (TOMG)

Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Jiatong Li*, Junxian Li*, Weida Wang, Yunqing Liu, Changmeng Zheng, Xiaoyong Wei, Dongzhan Zhou, Qing Li
*Equal Contribution

Abstract

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S²-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S²-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S²-Bench. Our comprehensive evaluation of 30 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery.

Dataset: the full benchmark is available at phenixace/S2-TOMG-Bench (45k samples); a mini version, phenixace/S2-TOMG-Bench-mini (4.5k samples), is provided for fast experimentation.
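Assuming both repositories follow the standard Hugging Face Hub layout, a minimal loading sketch looks like the following; the split and column names are assumptions, so check the dataset cards for the actual structure:

# A minimal loading sketch using the Hugging Face datasets library.
# Split/column names are assumptions; consult the dataset cards of
# phenixace/S2-TOMG-Bench and phenixace/S2-TOMG-Bench-mini.
from datasets import load_dataset

full = load_dataset("phenixace/S2-TOMG-Bench")       # full benchmark, 45k samples
mini = load_dataset("phenixace/S2-TOMG-Bench-mini")  # mini variant, 4.5k samples

print(full)  # inspect the available splits and columns before iterating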

Leaderboard

Rank  Model                                #Parameters (B)  SR (%)  WSR (%)
1     Llama3.1-8B (OpenMolIns-xlarge)      8                58.79   39.33
2     Claude-3.5                           –                51.10   35.92
3     Gemini-1.5-pro                       –                52.25   34.80
4     GPT-4-turbo                          –                50.74   34.23
5     GPT-4o                               –                49.08   32.29
6     Claude-3                             –                46.14   30.47
7     Llama3.1-8B (OpenMolIns-large)       8                43.1    27.22
8     Galactica-125M (OpenMolIns-xlarge)   0.125            44.48   25.73
9     Llama3-70B-Instruct (Int4)           70               38.54   23.93
10    Galactica-125M (OpenMolIns-large)    0.125            39.28   23.42
11    Galactica-125M (OpenMolIns-medium)   0.125            34.54   19.89
12    GPT-3.5-turbo                        –                28.93   18.58
13    Galactica-125M (OpenMolIns-small)    0.125            24.17   15.18
14    Gemma3-12B                           12               26.28   15.00
15    Deepseek-R1-distill-Qwen-7B          7                25.07   14.61
16    Llama3.1-8B-Instruct                 8                26.26   14.09
17    Llama3-8B-Instruct                   8                26.40   13.75
18    chatglm-9B                           9                18.50   13.13(7)
19    Galactica-125M (OpenMolIns-light)    0.125            20.95   13.13(6)
20    ChemDFM-v1.5-8B                      8                18.24   12.07
21    ChemLLM-20B                          20               16.23   9.76
22    Llama3.2-1B (OpenMolIns-large)       1                14.11   8.10
23    yi-1.5-9B                            9                14.10   7.32
24    Mistral-7B-Instruct-v0.2             7                11.17   4.81
25    BioT5-base                           0.25             24.19   4.21
26    MolT5-large                          0.78             23.11   2.89
27    Llama3.2-1B-Instruct                 1                3.95    1.99
28    MolT5-base                           0.25             11.11   1.30(0)
29    MolT5-small                          0.08             11.55   1.29(9)
30    Qwen2-7B-Instruct                    7                0.18    0.15

Models are ranked by WSR. Dashes mark proprietary models whose parameter counts are undisclosed. A parenthesized digit, as in 13.13(7), shows a third decimal used only to break ties between otherwise identical WSR values.
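For readers reimplementing the scoring, the sketch below shows one way SR and WSR can relate: SR counts a prompt as solved when the generated molecule passes the task's check, while WSR additionally scales each success by a quality score in [0, 1] (e.g., similarity to the source molecule for MolEdit/MolOpt). Both the weighting scheme and the RDKit validity check here are assumptions for illustration; defer to the official evaluation code for the exact definitions.

# Illustrative-only metric sketch; the WSR weighting below is an
# assumption, not the benchmark's official implementation.
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    # A generation can only count as a success if it parses as a molecule.
    return Chem.MolFromSmiles(smiles) is not None

def success_rate(successes: list[bool]) -> float:
    # SR (%): fraction of prompts whose generation passes the task check.
    return 100.0 * sum(successes) / len(successes)

def weighted_success_rate(successes: list[bool], quality: list[float]) -> float:
    # WSR (%): each success contributes its quality score instead of 1.
    return 100.0 * sum(q for ok, q in zip(successes, quality) if ok) / len(successes)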

Submit Your Model

If your model achieves strong performance on the benchmark and you want to update the leaderboard, please send your results (including raw prediction files) via the contact given on the project page. We will verify and update the leaderboard accordingly.
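Purely as an illustration, raw predictions could be bundled as one JSONL record per prompt; the file name and field names below are hypothetical, not a required submission schema:

# Hypothetical packaging of raw prediction files; "task", "index", and
# "output" are illustrative field names, not an official format.
import json

predictions = [
    {"task": "MolEdit", "index": 0, "output": "CCO"},
    {"task": "MolCustom", "index": 0, "output": "c1ccccc1O"},
]

with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")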

BibTeX

@article{li2024speak,
  title={Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation},
  author={Li, Jiatong and Li, Junxian and Wang, Weida and Liu, Yunqing and Zheng, Changmeng and Wei, Xiaoyong and Zhou, Dongzhan and Li, Qing},
  journal={arXiv preprint arXiv:2412.14642},
  year={2024}
}