Curated Datasets

| Dataset Name | Total # of Compounds | Data Type | Dataset Source | Dataset Description |
|---|---|---|---|---|
| ABC-Transporters (ATP-binding cassette) | 8644 | Transporters: P-glycoprotein (P-gp; also known as MDR1); breast cancer resistance protein (BCRP); multidrug resistance-associated protein (MRP1, MRP2). Activity types: substrates (Pgp_substrate, BCRP_substrate, MRP1_substrate, MRP2_substrate); inhibitors (Pgp_inhibitor, BCRP_inhibitor, MRP1_inhibitor, MRP2_inhibitor). Labels: 1 (substrate/inhibitor), 0 (non-substrate/non-inhibitor) | Daood et al. (manuscript submitted) | Curated from over 24,000 ChEMBL bioactivity records related to human ABC transporters (P-gp, BCRP, MRP1, MRP2). Each record was manually reviewed against its cited literature source, noting details such as cell lines, substrates, substrate concentrations, and positive controls where available. The ChEMBL data were combined with data collected from PubChem and Metrabase. |
| BBB (Blood Brain Barrier) | 438 | logBB | Wang et al. | Compounds with experimental logBB values were compiled and curated using ChemAxon and CASE Ultra tools. |
| BCRP (Breast Cancer Resistance Protein) | 395 | µM (evidence of inhibition at 10 µM) | Sedykh et al.; Zhao et al. | The BCRP dataset was curated for experimental consistency and structural quality, and filtered to include only reliable binary classification labels for substrates and inhibitors. |
| Bioavailability | 1141 | Oral bioavailability (%F) | Kim et al. | Compiled from public and literature sources. Chemical structures were standardized, and %F values were harmonized to resolve discrepancies. |
| BSEP (Bile Salt Export Pump) | 725 | µM (evidence of inhibition at 100 µM) | Zhao et al. | Collected from publicly available experimental data. Structures were curated and standardized to ensure consistency, and the dataset includes binary labels. |
| Cancer (Human Oral Carcinogenicity) | 342 | Binary: 0 = Non-Carcinogen; 1 = Carcinogen | Chung et al. | 342 unique organic compounds from the EPA's IRIS database, labeled as carcinogenic or noncarcinogenic based on oral slope factor (OSF), a quantitative measure of oral cancer risk. |
| Cosmetics | 4129 | Activity_Cosmetics (all chemicals fall under the cosmetic category) | Chung et al. | Collected from the COSMOS Cosmetics Inventory knowledge base. |
| DART (Developmental and Reproductive Toxicity) | 1280 | Oral Developmental, Inhalation Maternal, ToxRefDB Maternal | Ciallella et al. | Collected from the U.S. EPA's in vivo prenatal developmental toxicity studies in rats and rabbits, using oral or inhalation exposure. |
| Drugbank | 8055 | Activity_Drugbank (all chemicals fall under the drug category) | Chung et al. | Collected from the DrugBank database. |
| Embryotox | 764 | Binary labels (1: Safe; 0: Teratogen) | Aljarf et al. | Collected from FDA drug labeling data and literature annotations for known teratogens. Drugs with strong evidence of teratogenicity were classified as positives. Non-teratogenic drugs were chosen from non-reproductive risk categories to avoid mislabeling. |
| Estrogen | 2103 | Agonist, Antagonist, Binding, and Uterotrophic class | Ciallella et al. | Collected from the Tox21 screening program using high-throughput in vitro assays that assess estrogen receptor (ER) activation and inhibition. |
| FM (Fathead Minnow) | 675 | -log10 of concentration (µmol/L) | Klopman et al. | Collected from standardized 96-hour LC₅₀ test data for Pimephales promelas (fathead minnow), sourced from the EPA's ECOTOX database and additional public toxicology resources. |
| Hepatotoxicity | 5177 | Several classification endpoints for hepatotoxicity are provided at standard or dose-based thresholds. Activity is the main endpoint. Endpoints labeled H-* represent human data and PC-* represent pre-clinical rat data. Abbreviations: CC = clinical chemistry, HC = hepatocellular injury, HT = hepatotoxicity, HB = hepatobiliary injury, MF = morphological findings. | Mulliner et al. | Compiled from multiple public toxicology databases, including the U.S. FDA's Liver Toxicity Knowledge Base (LTKB), EMEA, LiverTox, and published scientific literature, with a focus on liver toxicity endpoints in both humans and animals. |
| High Production Volume | 1672 | Activity_HPV (all chemicals fall under the HPV category) | Chung et al. | Collected from the U.S. EPA HPV Challenge Program's chemical database. |
| Httk_ADME_Parameters | 1449 | Human CLint (intrinsic clearance) in µL/min/10^6 cells, Human Fu (fraction unbound), Rat CLint (intrinsic clearance) in µL/min/10^6 cells, and Rat Fu (fraction unbound) | High-Throughput Toxicokinetics | The HTTK dataset, developed by the U.S. EPA, contains high-throughput toxicokinetic data and models covering pharmaceuticals and environmental chemicals. It includes in vitro measurements such as plasma protein binding and hepatic clearance rates, as well as species-specific physiological data such as tissue volumes and blood flow rates. |
| LD50 (Rat oral) | 7332 | log10 mol/kg-bw | Zhu et al. | Compiled from publicly available rat oral acute toxicity data, with LD₅₀ values classified into toxicity categories (e.g., high, moderate, low) according to Globally Harmonized System (GHS) thresholds. |
| MDR1 (Multidrug Resistance 1 transporter) | 1585 | µM (evidence of inhibition at 10 µM) | Sedykh et al. | Collected from the Intestinal Transporter Database using high-confidence experimental data sourced from literature and public databases. |
| Natural Products | 2479 | Activity_NP (all chemicals fall under the natural product category) | Chung et al. | Curated from the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP). |
| Pesticides | 1009 | Activity_Pesticides (all chemicals fall under the pesticide category) | Chung et al. | Collected from literature and public databases, including the U.S. EPA CompTox Chemistry Dashboard. |
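
Two of the datasets above report values on a log scale: FM uses -log10 of a concentration in µmol/L, and LD50 uses log10 mol/kg-bw. The following is a minimal sketch of how raw measurements could be converted to those units. The function names are hypothetical, and the assumption that the LD50 input is in mg/kg-bw together with a molecular weight in g/mol is ours; the sign conventions follow the table as written and should be checked against the datasets themselves.

```python
import math


def neg_log10_umol_per_l(conc_umol_per_l: float) -> float:
    """Return -log10 of a concentration given in µmol/L (the FM data type)."""
    if conc_umol_per_l <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(conc_umol_per_l)


def log10_mol_per_kg(ld50_mg_per_kg: float, mol_weight_g_per_mol: float) -> float:
    """Convert an LD50 in mg/kg-bw to log10(mol/kg-bw) (the LD50 data type).

    Assumption: the raw LD50 is reported in mg/kg body weight.
    """
    if ld50_mg_per_kg <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("LD50 and molecular weight must be positive")
    # mg/kg -> g/kg -> mol/kg
    mol_per_kg = (ld50_mg_per_kg / 1000.0) / mol_weight_g_per_mol
    return math.log10(mol_per_kg)
```

For example, under these assumptions an LC₅₀ of 100 µmol/L corresponds to -2 on the FM scale, and an LD₅₀ of 200 mg/kg-bw for a compound weighing 200 g/mol corresponds to -3 on the LD50 scale.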