Curated Datasets
Dataset Name | Total # of Compounds | Data Type | Dataset Source | Dataset Description |
ABC-Transporters (ATP-binding cassette) | 8644 | Transporters: • P-glycoprotein (P-gp; also known as MDR1) • Breast cancer resistance protein (BCRP) • Multidrug resistance-associated proteins (MRP1, MRP2) Activity Types: • Substrates: Pgp_substrate, BCRP_substrate, MRP1_substrate, MRP2_substrate • Inhibitors: Pgp_inhibitor, BCRP_inhibitor, MRP1_inhibitor, MRP2_inhibitor Labels: • 1 (substrate/inhibitor), 0 (non-substrate/non-inhibitor) | Daood et al. (Manuscript Submitted) | Over 24,000 ChEMBL bioactivity records related to human ABC transporters (P-gp, BCRP, MRP1, MRP2) were curated, with each record manually reviewed against its cited literature source, noting details such as cell lines, substrates, substrate concentrations, and positive controls where available. The curated records were combined with data collected from PubChem and Metrabase. |
BBB (Blood Brain Barrier) | 438 | logBB | Wang et al. | Compounds with experimental logBB values were compiled and curated using ChemAxon and CASE Ultra tools. |
BCRP (Breast Cancer Resistance Protein) | 395 | µM (evidence of inhibition at 10 µM) | Sedykh et al.; Zhao et al. | The BCRP dataset was curated for experimental consistency and structural quality, and filtered to include only reliable binary classification labels for substrates and inhibitors. |
Bioavailability | 1141 | oral bioavailability (%F) | Kim et al. | Compiled across public and literature sources. Chemical structures were standardized, and %F values were harmonized to resolve discrepancies. |
BSEP (Bile Salt Export Pump) | 725 | µM (evidence of inhibition at 100 µM) | Zhao et al. | Collected from publicly available experimental data. Structures were curated and standardized to ensure consistency, and the dataset includes binary labels. |
Cancer (Human Oral Carcinogenicity) | 342 | Binary, 0=Non-Carcinogen; 1=Carcinogen | Chung et al. | 342 unique organic compounds from the EPA’s IRIS database, labeled as carcinogenic or noncarcinogenic based on oral slope factor (OSF), a quantitative measure of oral cancer risk. |
Cosmetics | 4129 | Activity_Cosmetics (All chemicals fall under the cosmetic category) | Chung et al. | Collected from the COSMOS Cosmetics Inventory knowledge base. |
DART (Developmental and Reproductive Toxicity) | 1280 | Oral Developmental, Inhalation Maternal, ToxRefDB Maternal | Ciallella et al. | Collected from the U.S. EPA’s in vivo prenatal developmental toxicity studies in rats and rabbits with oral or inhalation exposure. |
Drugbank | 8055 | Activity_Drugbank (All chemicals fall under the drug category) | Chung et al. | Collected from the DrugBank database. |
Embryotox | 764 | Binary Labels (1:Safe; 0: Teratogen) | Aljarf et al. | Collected from FDA drug labeling data and literature annotations for known teratogens. Drugs with strong evidence of teratogenicity were classified as positives. Non-teratogenic drugs were chosen from non-reproductive risk categories to avoid mislabeling. |
Estrogen | 2103 | Agonist, Antagonist, Binding, and Uterotrophic class | Ciallella et al. | Collected from the Tox21 screening program using high-throughput in vitro assays that assess estrogen receptor (ER) activation and inhibition. |
FM (Fathead Minnow) | 675 | -log10 of Conc. (µmol/L) | Klopman et al. | Collected from standardized 96-hour LC₅₀ test data for Pimephales promelas (fathead minnow), sourced from the EPA’s ECOTOX database and additional public toxicology resources. |
Hepatotoxicity | 5177 | Several classification endpoints for hepatotoxicity are provided at standard or dose-based thresholds. Activity is the main endpoint. Endpoints labeled H-* represent Human data and PC-* represent Pre-Clinical rat data. The abbreviations used include: CC – clinical chemistry, HC – hepatocellular injury, HT – hepatotoxicity, HB – hepatobiliary injury, and MF – morphological findings. | Mulliner et al. | Compiled from multiple public toxicology databases, including the U.S. FDA’s Liver Toxicity Knowledge Base (LTKB), EMEA, LiverTox, and published scientific literature, with a focus on liver toxicity endpoints in both humans and animals. |
High Production Volume | 1672 | Activity_HPV (All chemicals fall under the HPV category) | Chung et al. | Collected from the U.S. EPA HPV Challenge Program's chemical database. |
Httk_ADME_Parameters | 1449 | Human CLint (intrinsic clearance) in µL/min/10^6 cells, Human Fu (fraction unbound), Rat CLint (intrinsic clearance) in µL/min/10^6 cells, and Rat Fu (fraction unbound) | High-Throughput Toxicokinetics | The HTTK dataset, developed by the U.S. EPA, contains high-throughput toxicokinetic data and models covering pharmaceuticals and environmental chemicals. It includes in vitro measurements like plasma protein binding and hepatic clearance rates, as well as species-specific physiological data such as tissue volumes and blood flow rates. |
LD50 (Rat oral) | 7332 | log10 mol/kg-bw | Zhu et al. | Used publicly available rat oral acute toxicity data, with LD₅₀ values classified into toxicity categories (e.g., high, moderate, low) according to Globally Harmonized System (GHS) thresholds. |
MDR1 (Multidrug Resistance 1 transporter) | 1585 | µM (evidence of inhibition at 10 µM) | Sedykh et al. | Collected from the Intestinal Transporter Database using high-confidence experimental data sourced from literature and public databases. |
Natural Products | 2479 | Activity_NP (All chemicals fall under the natural product category) | Chung et al. | Curated from the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP). |
Pesticides | 1009 | Activity_Pesticides (All chemicals fall under the pesticide category) | Chung et al. | Collected from literature and public databases including the U.S. EPA CompTox Chemicals Dashboard. |
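
Two of the endpoints listed above depend on unit handling that is easy to get wrong: the FM dataset reports -log10 of a concentration in µmol/L, and the LD50 dataset description mentions GHS toxicity categories. The following is a minimal sketch of both calculations, not the curators' actual pipeline. The function names are hypothetical, the molecular weight is assumed to be supplied by the user, and the category cutoffs are the standard GHS acute oral toxicity bounds in mg/kg body weight; the datasets themselves may have been processed differently.

```python
import math

# Standard GHS acute oral toxicity upper bounds (mg/kg body weight).
# Assumption: the dataset's own category scheme may differ from this.
GHS_ORAL_UPPER_BOUNDS_MG_PER_KG = [
    (1, 5.0),
    (2, 50.0),
    (3, 300.0),
    (4, 2000.0),
    (5, 5000.0),
]


def neg_log10_umol_per_l(conc_mg_per_l, mol_weight_g_per_mol):
    """Convert a concentration in mg/L to -log10(µmol/L).

    Hypothetical helper: mg/L divided by g/mol gives mmol/L,
    and multiplying by 1000 gives µmol/L.
    """
    if conc_mg_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("concentration and molecular weight must be positive")
    umol_per_l = conc_mg_per_l / mol_weight_g_per_mol * 1000.0
    return -math.log10(umol_per_l)


def ghs_oral_category(ld50_mg_per_kg):
    """Return the GHS acute oral category (1 to 5), or None above 5000 mg/kg."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be positive")
    for category, upper_bound in GHS_ORAL_UPPER_BOUNDS_MG_PER_KG:
        if ld50_mg_per_kg <= upper_bound:
            return category
    return None  # not classified under GHS acute oral toxicity


if __name__ == "__main__":
    # Illustrative inputs only, not values taken from the datasets.
    print(neg_log10_umol_per_l(1.0, 100.0))
    print(ghs_oral_category(250.0))
```

Note that the LD50 dataset stores values in log10 mol/kg-bw, so applying the category function to those values would first require converting back to mg/kg using each compound's molecular weight.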