Curated Datasets

Dataset Name | Total # of Compounds | Data Type | Dataset Source | Dataset Description
BBB (Blood Brain Barrier) | 438 | logBB | Wang et al. | Compounds with experimental logBB values were compiled and curated using ChemAxon and CASE Ultra tools.
BCRP (Breast Cancer Resistance Protein) | 395 | µM (evidence of inhibition at 10 µM) | Sedykh et al.; Zhao et al. | The BCRP dataset was curated for experimental consistency and structural quality, and filtered to include only reliable binary classification labels for substrates and inhibitors.
Bioavailability | 1159 | oral bioavailability (%F) | Kim et al. | Compiled from public and literature sources. Chemical structures were standardized, and %F values were harmonized to resolve discrepancies.
BSEP (Bile Salt Export Pump) | 725 | µM (evidence of inhibition at 100 µM) | Zhao et al. | Collected from publicly available experimental data. Structures were curated and standardized to ensure consistency, and the dataset includes binary labels.
Cancer (Human Oral Carcinogenicity) | 342 | Binary, 0=Non-Carcinogen; 1=Carcinogen | Chung et al. | 342 unique organic compounds from the EPA’s IRIS database, labeled as carcinogenic or noncarcinogenic based on oral slope factor (OSF), a quantitative measure of oral cancer risk.
Cosmetics | 4129 | --- | Chung et al. | Cosmetics dataset collected from the COSMOS Cosmetics Inventory knowledge base.
DART (Developmental and Reproductive Toxicity) | 1452 | Oral Developmental, Inhalation Maternal, ToxRefDB Maternal | Ciallella et al. | Collected from the U.S. EPA’s in vivo prenatal developmental toxicity studies in rats and rabbits, based on oral or inhalation studies.
Drugbank | 8055 | --- | Chung et al. | Collected from the DrugBank database.
Embryotox | 766 | Binary Labels (1: Safe; 0: Teratogen) | Aljarf et al. | Collected from FDA drug labeling data and literature annotations for known teratogens. Drugs with strong evidence of teratogenicity were classified as positives. Non-teratogenic drugs were chosen from non-reproductive risk categories to avoid mislabeling.
Estrogen | 2144 | Agonist, Antagonist, Binding, and Uterotrophic class | Ciallella et al. | Collected from the Tox21 screening program using high-throughput in vitro assays that assess estrogen receptor (ER) activation and inhibition.
FM (Fathead Minnow) | 675 | -log10 of Conc. (µmol/L) | Klopman et al. | Collected from standardized 96-hour LC₅₀ test data for Pimephales promelas (fathead minnow), sourced from the EPA’s ECOTOX database and additional public toxicology resources.
Hepatotoxicity | 7502 | Several classification endpoints for hepatotoxicity at standard or dose-based thresholds | Mulliner et al. | Compiled from multiple public toxicology databases, including the U.S. FDA’s Liver Toxicity Knowledge Base (LTKB), EMEA, LiverTox, and published scientific literature, with a focus on liver toxicity endpoints in both humans and animals.
High Production Volume | 1672 | --- | Chung et al. | Collected from the U.S. EPA HPV Challenge Program's chemical database.
Httk_ADME_Parameters | 1610 | --- | High-Throughput Toxicokinetics | The HTTK dataset, developed by the U.S. EPA, contains high-throughput toxicokinetic data and models covering pharmaceuticals and environmental chemicals. It includes in vitro measurements like plasma protein binding and hepatic clearance rates, as well as species-specific physiological data such as tissue volumes and blood flow rates.
LD50 (Rat oral) | 7332 | log10 mol/kg-bw | Zhu et al. | Compiled from publicly available rat oral acute toxicity data, with LD₅₀ values classified into toxicity categories (e.g., high, moderate, low) according to Globally Harmonized System (GHS) thresholds.
MDR1 (Multidrug Resistance 1 transporter) | 1585 | µM (evidence of inhibition at 10 µM) | Sedykh et al. | Collected from the Intestinal Transporter Database using high-confidence experimental data sourced from literature and public databases.
Natural Products | 2479 | --- | Chung et al. | Curated from the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP).
Pesticides | 1009 | --- | Chung et al. | Collected from literature and public databases, including the U.S. EPA CompTox Chemistry Dashboard.