PXR challenge #1: Exploring the data

Whenever you start a new project, it is always good to familiarize yourself with the data. In the previous post, I briefly checked the overlap of compound structures between the datasets, but barely looked at chemical structures or activity patterns. In this post, we will get an overview of the chemical space across the four datasets and explore in detail the activity distribution. The main questions to answer in this analysis are:

How diverse is the chemical space?
How widespread is PXR induction within the chemical space?
What is the SAR character of the data (is it discontinuous)?
Are there broad chemical series that can be extracted from the data?

A note about the code

My plan was to generate one marimo notebook for each post and make each notebook standalone. However, as I kept expanding the scope of the analysis, it became increasingly cumbersome to work within one really large notebook. So I decided to split the code into five notebooks as detailed below. On the one hand, I think this makes the notebooks easier to follow. On the other hand, it means some duplicated code as I wanted to keep the notebooks portable. In a normal project, I would just have made a folder with common scripts and functions to be called within the notebooks. For each notebook, I provide a link to the Python file in my pxr_challenge repository as well as HTML exports of the notebooks that allow some exploration of the data tables and plots.

Data preprocessing

The first notebook 1a (html) reads the latest HuggingFace datasets (those updated on April 9th) and generates a set of processed datasets. The first derived dataset contains all unique compounds across the single dose screen, dose response screen, counter screen and test dataset. I make sure to normalize chemical structures using InChIKeys (somewhat against the advice of the OpenADMET organizers who recommend organizing data by Molecule Name or OCNT Batch ID) so each row represents one chemical structure. In this step I keep only those columns that I might potentially use for filtering and modeling in subsequent notebooks.

From the single dose, I keep the median log₂FC value and whether it is considered a hit at the given concentration (a hit is defined by median log₂FC > 1 and FDR-BH < 0.05). I keep these two properties for all four concentrations present in the single dose dataset, even though the 1 μM and 100 μM concentrations have much fewer data points than 30 μM and 10 μM. For the dose response and counter screen data I keep only the pEC50 and "Emax_estimate (log₂FC vs. baseline)" columns. The test set compounds obviously only have IDs and chemical structure.

Once I had a file with all unique chemical structures across the four datasets, I computed all potential matched molecular pairs (MMPs) using mmpdb. MMPs are two molecules that share a large common substructure and differ only at a single point (either central or terminal). Because of that, they are a very intuitive way to identify similar pairs of compounds. Both the activity dataset and the MMP dataset are used in the following notebooks.

Chemical Space Exploration

In notebook 1b (html), I provide an overview of the chemical space. This can be done in many different ways and generally depends on a molecular representation or fingerprint and a similarity metric. Two types of visualizations of chemical space are demonstrated:

chemical embeddings
similarity networks

UMAP and t-SNE embeddings

Chemical embeddings are scatterplots where each point is a molecule and its X and Y position depends on its chemical structure. This can be done using dimensionality reduction techniques like t-SNE or UMAP on a given fingerprint. I decided to use an ECFP6 fingerprint as a basis for the t-SNE and UMAP plots. Often the coloring scheme in these visualizations is based on the value of some biological property like activity. However, in our examples the coloring is based on dataset.

UMAP embedding of ECFP6 fingerprints coloured by dataset — UMAP embedding of ECFP6 fingerprints, coloured by dataset: single-dose (gray), dose-response (light blue), counter-screen (dark blue), and test set (dark green).

The UMAP reveals a single, broadly dispersed cloud with no strong cluster separation. The absence of tight, well-separated clusters suggests high chemical diversity without dominant scaffold families. Dose-response compounds (light blue) and counter-screen compounds (dark blue) are distributed throughout the same space as the single-dose compounds (gray). This could suggest that PXR induction is fairly well distributed across the chemical space, but no pEC50 values are shown here. The test set (dark green) is similarly distributed across the embedding. While they tend to appear as small clusters, these clusters are distributed widely and suggest that test compounds are also structurally diverse.

t-SNE embedding of ECFP6 fingerprints coloured by dataset — t-SNE embedding of ECFP6 fingerprints, coloured by dataset.

The t-SNE embedding shows very similar patterns to the UMAP plot. t-SNE plots tend to be more space-filling than UMAP plots at default parameters, but because there is no high-level structure to the diversity of the dataset they both end up very diffuse.

I also show in the last cell of the notebook an interactive plot of the chemical space colored based on pEC50 values on the dose response screen, with test compounds highlighted in red. Sadly, the HTML notebook exported from marimo is not able to show the chemical structure of the hovered data point. But if you run the notebook yourself, you will be able to easily get an idea of the chemical structures present in the different datasets.

The interactive plot helps to show that PXR induction at μM level is widely distributed across the chemical space. There are clusters of mostly low-activity compounds in several sections of the plot. The coloring scheme also highlights better than the other plots that test compounds (in red) are found all over the chemical space.

Similarity networks

Similarity networks are graphs where each node is a molecule and edges connect "similar" molecules. How we define similarity can be very different. Here I use MMPs, although I filter the MMPs calculated by mmpdb so that the common core in an MMP is larger than the variable section. In datasets that are less diverse, I would be more strict when filtering MMPs, but the number of MMPs here is small enough even with this loose filter.

Once you have your graph, you use a layout algorithm like Fruchterman-Reingold to calculate the X and Y positions of each node. The idea is to pull connected nodes closer together and push unconnected ones apart.

Similarity network built from matched molecular pairs (MMPs). Nodes are compounds; edges connect MMP pairs. Test-set compounds are shown in green.

The MMP network is fragmented: one large hub cluster, several medium clusters, and many isolated pairs. The large, diffuse grid-like region should represent compounds with single-substituent changes at common positions, like systematic analogues sharing a common core. However in this case it seems to be an artifact, as compounds in the largest connected component are small molecules with one or a small number of small rings. Test-set compounds (green) appear in several clusters alongside training compounds, showing many similarity relationships between training and test set.

pEC50 landscape across chemical space

UMAP of dose-response compounds coloured by pEC50, recomputed on this subset to maximize structural resolution.

The UMAP of dose-response compounds colored by pEC50 (recomputed on this subset to maximize structural resolution) shows that high-potency compounds (pEC50 > 6) are scattered throughout chemical space rather than concentrated in one region. Multiple chemotypes contribute to activity; global structural similarity to known actives is a poor proxy for potency.

Activity Cliffs

Activity cliffs (ACs) are pairs of structurally similar compounds with large differences in biological activity. They are the most information-dense SAR signal in the dataset and the hardest prediction targets. Datasets with a large number of activity cliffs are called SAR discontinuous, and they are generally harder to model than datasets with a low number of activity cliffs. The main complexity of AC analysis is how to define them — we need to separate similar compound pairs from dissimilar ones. Here I compute ACs based on MMPs and different fingerprint representations. The code can be found in notebook 1c (html).

MMP-based cliffs

Applying |ΔpEC50| ≥ 2 to the 1,503 MMP pairs within the dose-response set identifies 60 activity cliff pairs, just 4% of MMP-connected compounds. The number of MMP cliffs is very low, but it is not surprising considering the low number of MMPs we have in the dataset.

Fingerprint-based cliffs and fingerprint dependence

Computing similarity based on chemical fingerprints has a long history in chemoinformatics. Similar to the discussion in the chemical embedding section, the choice of fingerprint and similarity metric has a large impact on the analysis. It is important to be aware that the distribution of similarity values will differ from fingerprint to fingerprint, even when the differences between them (e.g. folding ECFP4 to 1k or 4k bits) are small.

Pairwise similarity distributions for dose-response compounds across four fingerprint types — Pairwise Tanimoto similarity distributions for dose-response compounds across four fingerprint types.

Based on my experience and the literature, I frequently use rule-of-thumb thresholds to define similar pairs: for MACCS it's Tanimoto ≥ 0.8 and for ECFP4 it's Tanimoto ≥ 0.4. The number of ACs calculated from each fingerprint varies, but it generally hovers around 10% of similar pairs. However, while the number is similar the exact pairs identified are often different.

Venn diagram of activity cliff overlap between MACCS and ECFP4 fingerprints — Overlap of activity cliff pairs identified by MACCS and ECFP4 fingerprints.

The Venn diagram of MACCS- and ECFP4-defined cliff pairs reveals that the two fingerprints identify largely non-overlapping sets of activity cliffs. Only a small fraction of pairs (~10 out of cumulatively > 500 cliffs) are identified as cliffs by both criteria. These consensus cliffs are a way to generalize the AC analysis across different representations.

Train–test similarity comparison

In this section (notebook 1d — html) I look in more detail at the similarity patterns between training and test set. I want to test how distinct test compounds are from training compounds. For that I compared the similarity value distribution within the training and test set, and between compounds of the train and test set.

ECFP4 Tanimoto similarity distributions comparing within-train, within-test, and cross-set pairs — ECFP4 Tanimoto similarity distributions: within-train (gray), within-test (green), and cross-set train–test (blue). All three peak near Tanimoto = 0.12–0.15.

The three distributions are overlaid: within-train, within-test, and train-vs-test (cross-set) ECFP4 Tanimoto similarities. All three are very similar, peaking near Tanimoto = 0.12–0.15. The test set seems indistinguishable from the training set in structural diversity and coverage. If the test set was structurally different from the train we would expect the cross-set similarity distribution to shift left.

The next analysis looks at the similarity value distribution between the training and test set to extract, for each test set compound, its nearest neighbor (NN) in the training set. The majority of the test compounds have a NN in the train set that can be considered highly similar (ECFP4 Tanimoto similarity > 0.4).

As the final analysis I recreate a plot shown in the example notebook for the PXR activity submission, comparing dose response pEC50 to counter screen pEC50 values. Based on this plot, training compounds can be divided into:

Not tested in counter screen
Non-selective: ΔpEC50_{DR − counter} ≤ 1.5
Selective: ΔpEC50_{DR − counter} > 1.5
Hit: selective and pEC50_DR ≥ 6

Based on these definitions, I looked at what type of training compounds were the NN of the test set compounds.

Bar chart of counter-screen classification of nearest training neighbours for each test compound — Counter-screen classification of the nearest training-set neighbour for each test compound.

The majority of test compounds have NNs that are classified as selective or hits. I was surprised at the data as I had expected the hits to be more predominant. This may reflect that the test set was assembled from a broader pool, focusing more on exploration than exploitation of highly active and selective compounds. On the other hand, this could also be an artifact: an analog of a hit compound may have been selected from a commercially available set whose nearest neighbor within our dataset is different. This issue can often appear when you retrospectively assess decisions on compound sets using only the output. Without additional data, it is not possible to be sure what the basis was for each test set compound to be chosen.

Scaffold Analysis

In the final section (notebook 1e — html) I look in more detail at the structural variety of the dataset. I performed a scaffold decomposition similar to the one done in a scaffold tree. I was interested not in the topology of the nodes in a scaffold tree, but in the frequency of occurrence of Bemis-Murcko scaffolds, linked ring systems and individual ring systems, which together I call substructures for lack of a better word. I used this data to analyze and visualize the structural overlap between the training and test set. In this type of analysis, I am always interested to find large substructures (in terms of number of heavy atoms) that are present in a large number of compounds, as they can provide data-derived chemical series to cluster and classify the data.

The plot comparing substructures between the training and test set finally killed a misconception I had about the test dataset. As I had seen them called analog sets, my mind went directly to analog series. I was expecting the test set to be based on one or a small number of unique scaffolds, using a small set of hit compounds as reference. That is not the case. Although the test set molecules are highly similar to at least one compound in the training set, they are themselves fairly diverse.

Conclusion

So, to conclude, let's go back to the questions I originally posed:

How diverse is the chemical space? Very diverse. Some parts were not surprising (considering the single dose dataset came from diverse libraries) but I didn't expect the test set to be as diverse as it is. I think that will make prediction harder.
How widespread is PXR induction within the chemical space? The pEC50 values in the training set are generally lower (values between 2 to 7) than you often see in ChEMBL, but this is expected as we're looking at unoptimized molecules. I was surprised that μM activity was found pretty much over the whole chemical space.
What is the SAR character of the data (is it discontinuous)? I would say mildly discontinuous. We did find activity cliffs, but their propensity across similar pairs of compounds (4–12%) is not massive.
Are there broad chemical series that can be extracted from the data? No, I was not able to find substructures or scaffolds that would help organize the chemical space into a small number of chemical series.

Overall, I expect the modeling to be challenging, and I heard similar sentiments from other participants in the Discord chat. The next step will be to explore data splits, train baseline and single-task models, and generate my first submission — making sure to have at least some predictions ready for the mid-challenge unblinding of part of the test set.