Material Data Processing Method and Apparatus, Medium, Electronic Device and Program Product

Opportunity

The demand for high-performance new materials is rapidly increasing across various industries, highlighting an urgent need for efficient methods to explore and discover novel materials. Traditionally, material screening relies on key physicochemical property parameters such as melting point, boiling point, viscosity, and dielectric constant. However, the vast landscape of potential materials presents a significant challenge; only a minuscule fraction of possible compounds have been synthesized and characterized. Conventional experimental measurement methods are inherently slow, labor-intensive, and costly, making them entirely impractical for the large-scale, high-throughput screening required to navigate this immense chemical space. This inefficiency creates a major bottleneck in materials research and development, prolonging discovery cycles and increasing costs. Furthermore, even when computational approaches are employed, researchers face the dual challenge of manually selecting the most relevant material descriptors (quantitative features representing material properties) and choosing the optimal predictive model from many possibilities. This reliance on expert intuition and trial-and-error often leads to suboptimal descriptor sets and models, resulting in increased prediction errors, poor model generalizability, and ultimately, unreliable virtual screening outcomes. There is a clear and pressing opportunity for a systematic, automated, and data-driven methodology that can efficiently identify the most predictive descriptor combinations and pair them with the best-performing machine learning models to accelerate accurate material property prediction.

Technology

This patent discloses a systematic, automated material data processing method that integrates descriptor space optimization with machine learning model selection to predict target material properties accurately. The core innovation is an iterative, three-stage optimization cycle that searches for the best pairing between a subset of physicochemical descriptors and a predictive model.
The process begins by defining an initial, potentially large, descriptor space relevant to a target property (e.g., boiling point). These descriptors are pre-processed through normalization, removal of zero-variance features, and elimination of highly correlated duplicates to ensure data quality. A first machine learning model (e.g., decision tree regression) is then selected to guide the search for an optimal descriptor subspace. Correlation coefficients (e.g., Pearson) are calculated between each descriptor and the target property, and the descriptors are sorted by that correlation.
The optimization cycle consists of three phases:
1) Forward Descriptor Space Construction: descriptors are added sequentially in correlation-rank order, and only those that reduce the model's prediction error (evaluated via robust methods such as ten-fold cross-validation) are retained.
2) Reverse Descriptor Space Reduction: each descriptor in the current optimal set is tentatively removed, and the removal is made permanent if it further lowers the prediction error, thus pruning redundant features.
3) Hyperparameter Tuning (exemplified as Gaussian Process Regression tuning): the model's internal parameters are optimized based on the current descriptor subspace.
This cycle repeats until the descriptor set and model hyperparameters converge, yielding a first optimal subspace with minimal prediction error for the initial model.
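As a rough illustration of the steps described above, the following Python sketch covers pre-processing, Pearson ranking, forward construction, and reverse reduction on toy data. It is a minimal sketch under stated assumptions, not the patented implementation: a small k-nearest-neighbour regressor stands in for the first model (the source's example is decision tree regression), mean absolute error is the assumed error metric, the 0.95 correlation cutoff is an assumed value, and all function and descriptor names are hypothetical. The hyperparameter tuning phase and the repetition of the cycle until convergence are omitted for brevity.

```python
import math
import random
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def preprocess(columns, corr_cutoff=0.95):
    """Standardize descriptors; drop zero-variance and near-duplicate ones."""
    kept = {}
    for name, values in columns.items():
        sd = pstdev(values)
        if sd == 0:  # zero-variance descriptor carries no information
            continue
        mu = mean(values)
        z = [(v - mu) / sd for v in values]
        # corr_cutoff is an assumed threshold, not a value from the source
        if any(abs(pearson(z, other)) > corr_cutoff for other in kept.values()):
            continue
        kept[name] = z
    return kept


def rank_by_correlation(columns, target):
    """Descriptor names sorted by |Pearson r| with the target, strongest first."""
    return sorted(columns, key=lambda n: abs(pearson(columns[n], target)), reverse=True)


def cv_error(columns, names, target, folds=10, k=3):
    """Mean absolute error of a stand-in k-NN regressor under k-fold cross-validation."""
    n = len(target)
    rows = [[columns[name][i] for name in names] for i in range(n)]
    errors = []
    for fold in range(folds):
        train = [i for i in range(n) if i % folds != fold]
        test = [i for i in range(n) if i % folds == fold]
        for i in test:
            nearest = sorted(train, key=lambda j: math.dist(rows[j], rows[i]))[:k]
            errors.append(abs(mean(target[j] for j in nearest) - target[i]))
    return mean(errors)


def forward_construct(columns, ranked, target):
    """Add descriptors in ranked order, keeping each only if the CV error drops."""
    selected, best = [], math.inf
    for name in ranked:
        err = cv_error(columns, selected + [name], target)
        if err < best:
            selected, best = selected + [name], err
    return selected, best


def reverse_reduce(columns, selected, target, best):
    """Tentatively drop each descriptor; keep the removal if the CV error drops."""
    for name in list(selected):
        trial = [n for n in selected if n != name]
        if not trial:  # never remove the last remaining descriptor
            break
        err = cv_error(columns, trial, target)
        if err < best:
            selected, best = trial, err
    return selected, best


# Toy data: one informative descriptor plus a duplicate, a constant, and noise.
random.seed(0)
signal = [random.random() for _ in range(40)]
raw = {
    "signal": signal,
    "signal_copy": [2 * s + 1 for s in signal],  # perfectly correlated duplicate
    "const": [1.0] * 40,                         # zero variance
    "noise": [random.random() for _ in range(40)],
}
target = [3 * s for s in signal]

cols = preprocess(raw)
ranked = rank_by_correlation(cols, target)
selected, err = forward_construct(cols, ranked, target)
selected, err = reverse_reduce(cols, selected, target, err)
print(selected, round(err, 4))
```

The sketch uses only the standard library so that it runs as written. In practice the k-NN predictor inside `cv_error` would be replaced by whichever first model is chosen, and a third step would tune that model's hyperparameters before the cycle repeats.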
The technology further enhances accuracy by allowing a subsequent search across a collection of machine learning models (e.g., linear regression, random forest, XGBoost, neural networks) using this optimized descriptor subspace to identify a second, superior model with even lower error. The entire workflow is automated, removing subjective human bias from descriptor and model selection.
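The subsequent model search can be sketched in the same spirit: every candidate model is scored by cross-validation on the same fixed descriptor subspace, and the one with the lowest error is kept. This is a hedged illustration only. The three candidates here (a mean baseline, k-nearest neighbours, and a linear regression fitted by gradient descent) are small stand-ins written in plain Python, not the models the source lists (linear regression, random forest, XGBoost, neural networks), and the names `cv_mae` and `select_best_model`, the mean-absolute-error metric, and the toy data are assumptions.

```python
import math
import random
from statistics import mean


def fit_mean(X, y):
    """Baseline: always predict the training mean."""
    mu = mean(y)
    return lambda row: mu


def fit_knn(X, y, k=3):
    """k-nearest-neighbour regression."""
    def predict(row):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], row))[:k]
        return mean(y[i] for i in nearest)
    return predict


def fit_linear(X, y, lr=0.1, epochs=500):
    """Linear regression fitted by batch gradient descent (assumes scaled inputs)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        residuals = [b + sum(wj * xj for wj, xj in zip(w, row)) - t
                     for row, t in zip(X, y)]
        b -= lr * sum(residuals) / n
        w = [wj - lr * sum(r * row[j] for r, row in zip(residuals, X)) / n
             for j, wj in enumerate(w)]
    return lambda row: b + sum(wj * xj for wj, xj in zip(w, row))


def cv_mae(fit, X, y, folds=10):
    """Mean absolute error under k-fold cross-validation."""
    errors = []
    for fold in range(folds):
        train = [i for i in range(len(y)) if i % folds != fold]
        test = [i for i in range(len(y)) if i % folds == fold]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors += [abs(predict(X[i]) - y[i]) for i in test]
    return mean(errors)


def select_best_model(models, X, y):
    """Score every candidate on the same descriptor subspace; return the winner."""
    scores = {name: cv_mae(fit, X, y) for name, fit in models.items()}
    return min(scores, key=scores.get), scores


# Toy optimized subspace: two scaled descriptors and a linear target.
random.seed(0)
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(40)]
y = [2 * a - b for a, b in X]

models = {"mean": fit_mean, "knn": fit_knn, "linear": fit_linear}
best_name, scores = select_best_model(models, X, y)
print(best_name, scores)
```

Because each candidate is just a function that fits on training data and returns a predictor, further models can be added to the `models` dictionary without changing the selection logic.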

Advantages

High Efficiency and Automation: Eliminates the need for manual, experience-based selection of descriptors and models, enabling rapid, large-scale screening of material libraries.
Enhanced Prediction Accuracy: The iterative optimization cycle systematically identifies the descriptor subset that minimizes prediction error for a given model, leading to more reliable property predictions.
Improved Model Generalizability: By reducing redundant and irrelevant descriptors, the method mitigates overfitting, resulting in models that perform better on unseen data.
Systematic Optimization: Integrates descriptor selection, model choice, and hyperparameter tuning into a cohesive, data-driven pipeline.
Broad Applicability and Robustness: The framework is not limited to specific properties or descriptor types, offering strong potential for transfer across different material prediction tasks.
Resource Efficiency: Reduces computational cost and time compared to brute-force or unguided exploration of the combined descriptor-model space.

Applications

Accelerated Materials Discovery: High-throughput virtual screening for novel materials with desired properties in fields like pharmaceuticals, catalysts, polymers, and energy storage.
Property Prediction Platform: Building accurate predictive models for various material properties (thermal, optical, mechanical, electronic) based on computational descriptors.
Descriptor Relevance Analysis: Identifying the key physicochemical features most influential for a specific material behavior, providing insights for material design.
Integration with Materials Databases: Serving as an analytical engine for large materials informatics platforms to predict properties of unsynthesized compounds.
Optimization of Material Formulations: Assisting in the design of composites or alloys by predicting properties from constituent descriptors.
Educational and Research Tool: Providing a standardized methodology for academic and industrial researchers working in computational materials science.