Designing lipid nanoparticles using a transformer-based neural network

August 16, 2025


COMET details

This section describes the model architecture and training algorithms of COMET. Pseudocode for inference is provided in Algorithm S1.

COMET model architecture

Lipid molecular structures are encoded into high-dimensional vectors (molecular embeddings), while scalar compositional features are encoded using a Gaussian-based encoder⁵³. Continuous formulation-wide parameters (for example, N/P ratio and volumetric mix ratio) are encoded with Gaussian layers; categorical inputs use one-hot embeddings.

The transformer uses a [CLS] token to aggregate input features across multiple attention layers. For multitask learning, each cell type is assigned a separate [CLS] token and prediction head, enabling task-specific outputs while sharing LNP-level representation learning.

Molecular encoder

COMET is compatible with various molecular encoders; here we use Uni-Mol¹¹, pretrained to recover masked atom types and corrupted three-dimensional coordinates. It provides strong property prediction performance and is used with default hyperparameters (from https://github.com/dptech-corp/Uni-Mol/tree/main/unimol). Pretrained weights are frozen during COMET training. Each compound is encoded into a 512-dimensional vector using atom types and coordinates.

Lipid molar percentages are encoded into 128-dimensional vectors using a shared Gaussian layer. Each component is further assigned a 128-dimensional one-hot embedding (\(z_{k}^{\mathrm{type}}\)) to distinguish lipid classes. These are concatenated and projected through a two-layer MLP into a 256-dimensional component representation.

N/P ratio and volumetric ratio

N/P ratio is encoded using a separate 256-dimensional Gaussian layer (z_N/P). Aqueous/organic ratios, treated as categorical variables, are one-hot encoded (z_phase) with 256 dimensions.

CLS token and prediction head

Each cell type uses a learned [CLS] token (z_CLS) of size 256. These tokens aggregate component and formulation-wide token representations across N_block transformer layers via attention⁵⁴. Final predictions are made by passing the [CLS] token through a two-layer MLP (MLP_predict).

Transformer blocks

Each block follows a Pre-LayerNorm structure⁵⁵ composed of layernorm → self-attention → MLP with residual connections.

Training details

The model is trained with a binary ranking objective⁵⁶ where, given a pair of LNP samples, the model learns to predict a larger efficacy score for the LNP that has the higher efficacy label value than the other LNP:

$$\mathcal{L}_{\mathrm{ranking}} = -\log\left(\sigma\left(f_{\theta}(x_{\mathrm{h}}) - f_{\theta}(x_{\mathrm{l}})\right)\right)$$

(1)

where x_h and x_l are high- and low-efficacy LNPs, σ is the sigmoid function and f_θ is COMET’s scoring function. Training uses a batch size of 64 (2,016 pairwise comparisons per batch).
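
As a minimal sketch of equation (1), assuming scores are plain floats and that every pair in a batch with distinct labels is compared, the objective and the pair count for a batch of 64 could be written as follows. The function names `pairwise_ranking_loss` and `batch_pairs` and the pairing scheme are illustrative assumptions, not taken from the COMET code.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(score_high: float, score_low: float) -> float:
    """-log(sigmoid(f(x_h) - f(x_l))), evaluated as softplus(-diff)."""
    diff = score_high - score_low
    # softplus(-diff) = log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def batch_pairs(labels):
    """Yield (high_index, low_index) for every pair with different labels."""
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            yield i, j
        elif labels[j] > labels[i]:
            yield j, i

# Number of unordered pairs in a batch of 64 samples: 64 * 63 / 2.
n_pairs = math.comb(64, 2)
```

The count `n_pairs` matches the 2,016 pairwise comparisons per batch stated above; pairs with tied labels, if any, would be skipped by this sketch.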

Conflict-averse gradient descent

Conflict-averse gradient descent (CAGrad)³³ mitigates conflicting gradients in multitask settings. We apply CAGrad with a coefficient of 0.2 to stabilize training across tasks.

Noise augmentation

To address noise in the experimental data, especially in the fluid handling process, we augment the molar percentage with Gaussian noise proportional to its value, where the standard deviation of the noise is 10% of the actual molar percentage.
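
The augmentation can be illustrated with a short sketch. The function name `augment_molar_percentages`, the clipping at zero and the absence of any renormalization are assumptions made here for illustration; the text specifies only that the noise is Gaussian with a standard deviation equal to 10% of each molar percentage.

```python
import random

def augment_molar_percentages(percentages, rel_std=0.10, rng=None):
    """Add zero-mean Gaussian noise whose std is rel_std times each value."""
    rng = rng or random.Random()
    noisy = [p + rng.gauss(0.0, rel_std * p) for p in percentages]
    # Assumption: clip at zero so no component gets a negative percentage.
    return [max(0.0, p) for p in noisy]

# Illustrative four-component composition (values are placeholders).
example = augment_molar_percentages([50.0, 38.5, 10.0, 1.5], rng=random.Random(0))
```

Because the noise scales with the value, minor components such as the PEG lipid are perturbed by far less in absolute terms than the ionizable lipid.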

Label margin

From the label values, we can tell not only which LNP is better than another but also by how much. To train the model to learn this additional information, we include a margin term⁵⁷ in the binary ranking objective:

$$\mathcal{L}_{\mathrm{ranking}} = -\log\left(\mathrm{Sigmoid}\left(f_{\theta}(x_{\mathrm{h}}) - f_{\theta}(x_{\mathrm{l}}) - \lambda_{\mathrm{margin}}\left(y_{\mathrm{h}} - y_{\mathrm{l}}\right)\right)\right)$$

(2)

where y_h and y_l are the (efficacy) label values of the more efficacious and less efficacious LNP, respectively, and λ_margin controls how much this objective dominates the training. We use λ_margin = 0.01 in our experiments.
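
A minimal sketch of equation (2) follows, assuming scalar scores and labels. The function name `margin_ranking_loss` is hypothetical, and the default `lam=0.01` mirrors the value of λ_margin reported above.

```python
import math

def margin_ranking_loss(score_high, score_low, label_high, label_low, lam=0.01):
    """-log(sigmoid((f(x_h) - f(x_l)) - lam * (y_h - y_l)))."""
    z = (score_high - score_low) - lam * (label_high - label_low)
    # -log(sigmoid(z)) = softplus(-z), written to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With `lam=0` the function reduces to equation (1). A positive margin raises the loss for a given score gap, so pairs with a larger label difference need a larger predicted gap to reach the same loss.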

Ensembling

For in silico evaluation (Fig. 3e–l), the ensemble is formed by N_model models trained with the same hyperparameters and dataset (train/valid/test split) but with weights initialized with different random seeds. For the ensemble deployed to infer virtual LNPs, five different train (80%)/valid (20%) splits are made in a fivefold manner and each model in the ensemble is trained on a different fold. To ensure that ensembled scores are not biased towards models with high variance, the predicted scores from each model are normalized by making their scores for the LANCE LNPs fit a normal distribution with mean 0 and standard deviation 1 before ensembling. More specifically, for each model, this is done by inferring the predicted scores on all the LANCE LNPs and using the mean (mean_i) and standard deviation (std_i) of the LANCE LNPs’ scores to compute the normalized scores (\(y_{i}^{\prime\,\mathrm{normalized}}\)) through

$$y_{i}^{\prime\,\mathrm{normalized}} = \frac{y_{i}^{\prime} - \mathrm{mean}_{i}}{\mathrm{std}_{i}},\quad i \sim \{1,\ldots,N_{\mathrm{model}}\}$$

(3)

The final ensemble score is the mean of all models’ normalized scores:

$$y^{\prime\,\mathrm{ensemble}} = \frac{1}{N_{\mathrm{model}}} \sum\limits_{i}^{N_{\mathrm{model}}} y_{i}^{\prime\,\mathrm{normalized}}$$

(4)
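
Equations (3) and (4) can be sketched with the Python standard library. The function names and the use of the population standard deviation are assumptions; the text does not say which estimator is used.

```python
from statistics import fmean, pstdev

def normalize_scores(scores, lance_scores):
    """Z-score one model's scores using that model's LANCE mean and std."""
    mean_i = fmean(lance_scores)
    std_i = pstdev(lance_scores)  # assumption: population standard deviation
    return [(s - mean_i) / std_i for s in scores]

def ensemble_scores(per_model_scores, per_model_lance_scores):
    """Average the normalized scores of all models for each candidate LNP."""
    normalized = [
        normalize_scores(scores, lance)
        for scores, lance in zip(per_model_scores, per_model_lance_scores)
    ]
    n_model = len(normalized)
    return [sum(column) / n_model for column in zip(*normalized)]
```

For example, if one model outputs scores ten times larger than another on both the LANCE LNPs and the candidates, the two models contribute identically after normalization, which is the bias the procedure is meant to remove.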

COMET is implemented in PyTorch and trained with NVIDIA V100 GPUs.

k-Nearest neighbours and random forest model details

The k-nearest neighbours and random forest models are implemented with the scikit-learn (https://scikit-learn.org/) package, with default hyperparameters. More specifically, the k-nearest neighbours model uses n = 5 nearest neighbours while the random forest model uses n = 100 estimators (trees).

LANCE dataset details

LANCE comprises four parts spanning orthogonal LNP design dimensions: lipid component identities, molar percentages, synthesis parameters (for example, N/P and aqueous/organic volumetric ratios) and high-resolution molar sweeps.

Seven ionizable lipids, three sterols, two helper lipids and two PEG lipids were used (Supplementary Table 14), reflecting the focus of current research¹²,⁴². To study molar % effects, we designed 13 lipid ratios by varying one lipid class at a time from a reference BASE ratio (Fig. 1d), based on ref. ²⁰. For instance, ratios I1–I4 modify ionizable lipid %, C1–C3 alter cholesterol (compensated by helper lipid), and P1–P3 alter PEG lipid %, while the rest modify multiple components (Supplementary Table 13).

Part 1 (lipid selection)

To examine lipid identity effects, we generated 84 combinations from all permutations of 7 ionizable lipids, 3 sterols, 2 helper lipids and 2 PEG lipids. Paired with 13 molar ratios, this results in 1,092 possible LNPs; 1,066 were tested. After removing 91 overlapping with part 2, this part yielded 975 unique LNPs.
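
The counts in this paragraph can be checked by direct enumeration. The component and ratio names below are placeholders introduced for illustration; the actual lipids and ratios are listed in the Supplementary Tables.

```python
from itertools import product

# Placeholder names standing in for the real lipids and molar ratios.
ionizable = [f"ionizable_{i}" for i in range(1, 8)]
sterols = [f"sterol_{i}" for i in range(1, 4)]
helpers = [f"helper_{i}" for i in range(1, 3)]
peg_lipids = [f"peg_{i}" for i in range(1, 3)]
molar_ratios = [f"ratio_{i}" for i in range(1, 14)]

# One lipid from each class, then each lipid set at each molar ratio.
lipid_sets = list(product(ionizable, sterols, helpers, peg_lipids))
possible_lnps = list(product(lipid_sets, molar_ratios))

n_tested = 1066        # reported number of LNPs actually tested
n_overlap_part2 = 91   # reported overlap with part 2
n_unique_part1 = n_tested - n_overlap_part2
```

Here `len(lipid_sets)` is 7 × 3 × 2 × 2 = 84 and `len(possible_lnps)` is 84 × 13 = 1,092, in agreement with the figures above.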

Part 2 (ionizable lipid synergy)

Following studies suggesting synergy from dual-ionizable lipid formulations⁵⁸, we created LNPs with 60:40 molar splits across all ionizable lipid pairs, distributed across 13 lipid ratios. This yielded 637 additional LNPs.

Part 3 (key synthesis parameters)

To explore synthesis effects, we introduced variation in ionizable lipid/RNA weight ratios (10:1, 15:1 and 20:1) and aqueous/organic phase ratios (1:1 and 3:1). Weight ratios were adjusted by molar mass to maintain equal molar %. These parameters were later converted to N/P ratios for model input. This part consists of 924 LNPs.

Part 4 (molar percentage sweeps)

To study finer-grained molar % effects, we created 24 evenly spaced intervals from 10% to 80% for ionizable lipid, cholesterol and helper lipid, producing 492 LNPs across 3 focused sweeps.

Formulation ratios

Single-ionizable LNPs span 18 unique N/P ratios, derived from 3 ionizable lipid/RNA weight ratios and 7 ionizable lipids. Dual-ionizable formulations add 63 more, totalling 81 N/P ratios. In molar terms, 13 base lipid ratios and 72 sweep ratios (24 per lipid class) result in 85 total molar compositions.

LNP synthesis

LNPs were synthesized by mixing lipid–ethanol and mRNA–citrate buffer phases, followed by incubation at 4 °C for 10 min. Automated handling was performed on the Tecan Fluent platform. For animal studies, LNPs were mixed, incubated on ice for 10 min and dialysed overnight at 4 °C in PBS (Slide-A-Lyzer, ThermoFisher).

Materials

FLuc mRNA (L-7202, Trilink); lipids (Cayman Chemicals, Avanti); luciferase assay (Steady-Glo, E2550) and Agilent BioTek plate reader for readout. alamarBlue was used for viability assays.

Data processing

Each 96-well plate included a ‘standard’ LNP. Raw luminescence values were normalized to the standard and averaged across four replicates (two biological, two technical). Mean values were log-transformed and min–max normalized to [0, 1].
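
The processing steps can be sketched as below. The function names, the averaging of the standard wells, the base-10 logarithm and the toy readings are all assumptions made for illustration; the text gives only the order of operations.

```python
import math
from statistics import fmean

def process_plate(raw_by_lnp, raw_standard):
    """Return the log of the replicate-averaged, standard-normalized signal."""
    standard = fmean(raw_standard)  # assumption: standard wells are averaged
    return {
        name: math.log10(fmean([v / standard for v in replicates]))
        for name, replicates in raw_by_lnp.items()
    }

def min_max(values_by_lnp):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values_by_lnp.values()), max(values_by_lnp.values())
    return {name: (v - lo) / (hi - lo) for name, v in values_by_lnp.items()}

# Toy luminescence readings, four replicates per LNP (made-up numbers).
raw = {"A": [100, 120, 90, 110], "B": [1000, 900, 1100, 1000], "C": [10, 12, 9, 9]}
labels = min_max(process_plate(raw, raw_standard=[500, 500]))
```

The choice of logarithm base does not affect the final labels, because changing the base rescales all log values by the same constant, which the min–max step cancels.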

We have represented several key features of the LANCE dataset in Fig. 2. Below, we explain how these key features were extracted from LANCE. For Fig. 2a, part 1 formulations were selected. For each of the four ionizable lipids (ALC-0315, DLin-MC3-DMA, C12-200 and CKK-E12), we had 156 formulations containing 2 helper lipids, 3 sterol lipids and 2 PEG lipids (that is, 2 × 2 × 3 = 12 combinations) at 13 molar ratios (12 × 13 = 156 formulations).

For Fig. 2b,c, part 3 formulations containing one ionizable lipid, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The ionizable lipid to mRNA molar ratio was 10,162. The aqueous to organic volume ratio was varied. For Fig. 2d, part 3 formulations containing one ionizable lipid, DOPE, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The organic to aqueous volume ratio was held at 1:3.

Figure 2e was generated from part 2 data. Only formulations containing DOPE, cholesterol and C14-PEG were used for the graph. The name of the first ionizable lipid was listed as the title of the graph and the second ionizable lipid name was the row name. The total molar content of the ionizable lipids was the column name. The molar ratio of ionizable lipid 1/ionizable lipid 2 was 1.5. The molar ratio of DOPE/cholesterol was 0.34. The molar % of C14-PEG was 2.5%. The molar ratio of ionizable lipid/mRNA was 10,162. The full library was used to construct Fig. 2f. We calculated the normalized transfection efficacy for the 30th and 70th percentile formulations in B16-F10 and DC2.4 cells. These values were as follows: 70th percentile, B16-F10 = 0.43887; 30th percentile, B16-F10 = 0.24315; 70th percentile, DC2.4 = 0.64623; 30th percentile, DC2.4 = 0.30946. Formulations above and below these values in the respective cell lines were selected and are plotted in Fig. 2f.

In vitro validation details

The LNPs are named according to the groups to which they belong. A summary of the prefixes used here is given in Supplementary Table 16.

Clinically approved LNP baselines

The recipes for the three clinical LNP baselines are based on the literature³⁹, and the LNPs were synthesized at an aqueous/organic volumetric ratio of 3:1, following what is commonly used in previous work.

Top LANCE LNP hits baselines

To find strong and reliable LNP baselines from LANCE, we randomly selected 10 LNP formulations from the 90th percentile for each cell line and screened them again with the respective cell line to check for reproducibility. Among these ten formulations, the three LNPs with normalized efficacy values closest to their original LANCE efficacy label values were selected as LANCE baseline LNPs.

Exploratory LNP library

To span a vast formulation space, the virtual library was generated by enumerating through possible LNP features such as lipid choices, their molar percentages and key synthesis parameters such as N/P ratios and aqueous/organic volumetric ratios, according to Supplementary Table 15. To find LNPs that are different from the hits in the LANCE dataset, formulations within a 10% L1 distance lipid molar percentage neighbourhood of any of the top 10% most efficacious LANCE hits were excluded. After this step, the exploratory library contained 27,354,600 and 34,539,960 formulations for DC2.4 and B16-F10, respectively. An ensemble of five COMET models predicted efficacy in both cell lines. The top 0.1% highest-scoring LNPs were selected (34,529 B16-F10 and 27,354 DC2.4).
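
The neighbourhood exclusion can be sketched as a filter on L1 distance. This sketch assumes that each formulation is a vector of molar percentages in a fixed component order, that the 10% radius is measured in percentage points and that formulations exactly on the boundary are excluded; the function names and example vectors are illustrative only.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two molar-percentage vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def exclude_near_hits(candidates, top_hits, radius=10.0):
    """Keep candidates farther than `radius` percentage points from every hit."""
    return [
        c for c in candidates
        if all(l1_distance(c, hit) > radius for hit in top_hits)
    ]

# Made-up compositions: ionizable, sterol, helper, PEG (each sums to 100).
top_hits = [[50.0, 38.5, 10.0, 1.5]]
candidates = [[52.0, 36.5, 10.0, 1.5], [35.0, 46.5, 16.0, 2.5]]
kept = exclude_near_hits(candidates, top_hits)
```

In this toy case the first candidate lies 4 percentage points from the hit and is removed, while the second lies 30 points away and is kept.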

The next step removes formulations based on uncertainty in the COMET prediction. We capture the level of uncertainty by first computing the standard deviation (σ) of the models’ predictions (\(y_{i}^{\prime\,\mathrm{normalized}}\) in equation (3)) across the ensemble. We then scale the standard deviation by dividing it by a non-negative predicted efficacy term to get a relative uncertainty value (u_rel):

$$u_{\mathrm{rel}} = \frac{\sigma}{\hat{y}^{\mathrm{ensemble}}},\quad \hat{y}^{\mathrm{ensemble}} = y^{\prime\,\mathrm{ensemble}} - y^{\prime\,\mathrm{ensemble,min,LANCE}}$$

(5)

where \(y^{\prime\,\mathrm{ensemble,min,LANCE}}\) is the minimum ensemble score among the LANCE LNPs. Any formulations with a negative \(\hat{y}^{\mathrm{ensemble}}\) term were dropped. Supplementary Fig. 13 shows the distribution of this relative uncertainty value. Formulations with the largest 50% of relative uncertainty values were removed, leaving 17,269 B16-F10 and 13,677 DC2.4 formulations.
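
A minimal sketch of equation (5) and the subsequent filtering step is shown below. The function names, the population standard deviation and the handling of a zero denominator are assumptions not specified in the text.

```python
from statistics import fmean, pstdev

def relative_uncertainty(normalized_scores, min_lance_ensemble):
    """Return (u_rel, ensemble score), or None if the shifted score is not positive."""
    ensemble = fmean(normalized_scores)
    shifted = ensemble - min_lance_ensemble
    # Negative terms are dropped per the text; zero is also dropped here
    # (assumption) to avoid division by zero.
    if shifted <= 0:
        return None
    sigma = pstdev(normalized_scores)  # assumption: population std
    return sigma / shifted, ensemble

def keep_most_certain(candidates, min_lance_ensemble, keep_fraction=0.5):
    """Drop invalid candidates, then keep the lowest-u_rel fraction."""
    scored = []
    for name, scores in candidates.items():
        result = relative_uncertainty(scores, min_lance_ensemble)
        if result is not None:
            scored.append((result[0], name))
    scored.sort()
    return [name for _, name in scored[: int(len(scored) * keep_fraction)]]
```

Dividing by the shifted ensemble score means that, for the same spread across models, a formulation with a higher predicted efficacy receives a lower relative uncertainty.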

To promote chemical diversity, K-means clustering (on 14-dimensional vectors encoding lipid molar percentages) grouped these candidates into 10 clusters. Clustering was repeated 1,000 times to stabilize assignments. The highest-scoring formulation in each cluster was selected, resulting in ten diverse in silico hits per cell line (Supplementary Tables 17 and 18).
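
The final selection step can be sketched as follows, assuming cluster assignments have already been computed by a K-means routine (not shown). The function name `best_per_cluster` and its arguments are hypothetical.

```python
def best_per_cluster(names, cluster_ids, scores):
    """Return {cluster_id: name} of the highest-scoring member of each cluster."""
    best = {}
    for name, cluster, score in zip(names, cluster_ids, scores):
        # Keep the first member seen, then replace it only on a strictly higher score.
        if cluster not in best or score > best[cluster][0]:
            best[cluster] = (score, name)
    return {cluster: name for cluster, (score, name) in best.items()}
```

With 10 clusters, this returns one formulation per cluster, which corresponds to the ten hits per cell line described above.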

Lead optimization LNP library

For each cell type, three top LANCE hits (from the ‘Top LANCE LNP hits baselines’ section) were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage distance, (2) substituting at least one lipid (6 ionizable lipids, 2 cholesterols, 1 helper and 1 PEG) and (3) altering the N/P ratio.

To generate three diverse candidates per lead, we segmented the neighbourhood into three zones: (1) molar % segment (within 20% L1, no lipid changes), (2) substitute-lipid segment (within 20% L1, but with at least one different lipid) and (3) N/P ratio segment (differing N/P ratio). From each zone, the top predicted LNP was selected (Fig. 3d, right). This yielded three optimized LNPs per lead. The virtual library size ranged from 1.5 million (single-ionizable lipid) to 9 million (dual-ionizable lipid) candidates. The sixfold increase in dual-ionizable lipid cases arises from combinatorial enumeration: each minor ionizable lipid was paired with six major ones. By contrast, single-ionizable lipid compositions require no pairing. The final selected formulations for validation are listed in Supplementary Tables 19 and 20.

PBAE synthesis

The compositions and molar ratios of amines, diacrylates and branching agents are listed in Supplementary Table 21. PBAE polymers were synthesized from combinations of the amines, diacrylates and branching agents. Briefly, the full weight of diacrylate and branching agent was added to a 20 ml glass vial. Then, the solvent (dimethylformamide) was added to the reaction mixture. Next, the reaction vials were placed on a hotplate at 90 °C. After 24 h, the vials were removed from the hotplate and cooled to room temperature. The amines were added to the reaction vial, which was placed back on the hotplate at 90 °C, and the reaction was allowed to proceed for 48 h. Finally, the vials were removed from the hotplate and allowed to cool to room temperature. Then, the reaction mixture was added (drop-by-drop) into a beaker containing ~150 ml ice-cold diethyl ether (~10× excess volume). The collected samples were transferred to 50 ml tubes and centrifuged at 1,000 × g for 3 min to pellet the polymer. The supernatant was then removed and the pellet was dissolved in the minimum possible volume of dimethylformamide. This purification step was repeated three times. Final polymers were dried under vacuum and their solubility was tested in ethanol.

Representing PBAEs in COMET

PBAEs were represented as a combination of their diacrylate–amine repeating unit and branching agent, each with unique component-type embeddings. The repeating unit was treated as a fifth molar component type alongside lipids, with its molar concentration estimated from polymer weight and molecular weight. Total molar percentages of PBAE and lipids sum to 100%. Inference proceeds as in lipid-only LNPs (‘COMET details’ section).
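
One way to read this conversion is sketched below, under the assumption that the moles of repeating unit are obtained by dividing the polymer mass by the molecular weight of the repeating unit. The function name and the dictionary keys used in the example are hypothetical.

```python
def molar_percentages(masses_mg, molecular_weights):
    """Convert component masses to molar percentages that sum to 100."""
    # moles (in mmol) = mass (mg) / molecular weight (g/mol)
    moles = {k: masses_mg[k] / molecular_weights[k] for k in masses_mg}
    total = sum(moles.values())
    return {k: 100.0 * n / total for k, n in moles.items()}
```

For instance, 1 mg of a repeating unit with molecular weight 500 and 2 mg of a lipid with molecular weight 1,000 contain equal moles, so each would be assigned 50 mol%.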

COMET PBAE LNP lead optimization hits

Two top-performing PBAE LNPs per cell type were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage neighbourhood, (2) substituting lipids (6 ionizable lipids, 2 sterols, 1 helper and 1 PEG) and (3) converting to dual-ionizable compositions. To select three diverse candidates, we defined three non-overlapping segments: one within the 20% L1 distance with the same lipid choices, one with at least one different lipid compound and one with a dual-ionizable lipid configuration. The top predicted LNP from each segment was selected (Fig. 3d, right). Final hits are detailed in Supplementary Tables 22 and 23.

Human IL-15 screening

The IL-15 mRNA is synthesized via in vitro transcription with a HiScribe T7 mRNA kit with CleanCap Reagent AG (E2080S) from New England Biolabs, with 5-methoxy-UTP (N-1093) from Trilink. The LNP transfection is done at an mRNA concentration of 0.25 µg ml⁻¹ in the 96-well plate format. The human IL-15 expression level is measured with the Human IL-15 Uncoated ELISA (88-7620) procured from Invitrogen, after 16 h of incubation of HepG2 cells with LNP. Raw efficacy data are normalized, similarly to the bioluminescence data mentioned above, before being used as the dataset for machine learning experiments. A random 20% of this dataset is held out as the test set, while the rest is used as the train and validation sets.

Lyophilization of LNPs

The LNPs are synthesized in a tris buffer (5 mM tris buffer, pH 8). After synthesis, the LNP formulations are frozen at −80 °C for 2 h before undergoing the following lyophilization process: equilibrate at −40 °C for 2 h, in atmosphere → −40 °C for 21 h, in vacuum → 25 °C for 2 h, in vacuum. A Labconco FreeZone 6 l with a Stoppering Tray Dryer was used for lyophilization.

Degradation in the efficacy

Post-lyophilization efficacy values were computed and normalized similarly to the LANCE label values (‘Data processing’ section). The degradation of efficacy owing to lyophilization was calculated by subtracting the post-lyophilization efficacy values from the LANCE B16-F10 values.

Animal experiments

Animal experiments for this study were approved by the Massachusetts Institute of Technology Institutional Animal Care and Use Committee and were consistent with local, state and federal regulations as applicable. Female C57BL/6J mice (000664, The Jackson Laboratory) were used in the experiments. For imaging, d-luciferin (LUCK-1G, Gold Biotechnology) solubilized in PBS was administered via intraperitoneal injection and the mice were imaged using an IVIS imaging system (PerkinElmer).

t-SNE visualization

We selected the COMET model most correlated (Spearman) with ensemble scores across a random virtual LNP subset. LNP features for t-SNE were the final [CLS] token representations. To ensure even distribution across ionizable types, dual-ionizable lipid LNPs were treated as a distinct class, and 1,250 LNPs per class (8 classes in total) were randomly sampled (10,000 in total).

Integrated gradients implementation

To execute integrated gradients (IG) with COMET’s multimodal inputs, we adapted the Captum library. IG computes attribution by integrating gradients along a path from reference to input. Feature attributions were computed per LNP, baseline-subtracted and averaged across each group. Non-PBAE LANCE LNPs were used as the baseline. Attribution scores were normalized (max = 1) and averaged across ensemble models.
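
The IG definition can be illustrated without Captum on a toy function. The sketch below is not the adapted Captum code used for COMET; it assumes a straight-line path from baseline to input, a midpoint Riemann sum and a hand-written gradient function, and all names are illustrative.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of IG along the straight-line path."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * path-averaged gradient.
    return [g * (xi - b) for g, xi, b in zip(avg_grad, x, baseline)]

# Toy model f(v) = sum(v_i^2), whose gradient is 2 * v.
f = lambda v: sum(t * t for t in v)
grad_f = lambda v: [2.0 * t for t in v]

x, baseline = [3.0, -1.0], [1.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
```

For this toy function the attributions sum to f(x) − f(baseline), the completeness property of IG, which offers a convenient sanity check for any implementation.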

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.