Developer Portal

INCIDB Technical Documentation

A guide to INCIDB's data architecture, regulatory data sources, and analytical workflows in Python with PyArrow, pandas, and DuckDB.

1. Production Scale Metrics

INCIDB unifies high-precision chemical composition data across four core tables:

19,847 Commercial Formulations: Covering prestige skincare lines, K-Beauty functional cosmetics, and OTC clinical topicals.
57,181 Canonical INCI Compounds: Standardized according to International Nomenclature Cosmetic Ingredient conventions.
5,994 Global Brands: Enriched with ethical certifications (cruelty-free and vegan indicators).
330,088 Relational Junction Mappings: Preserving each ingredient's position in the package label's concentration-ordered listing (`position_index`).
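
The junction table ties the other three together, so a small sketch may help. The example below uses Python's standard-library `sqlite3` purely as a dependency-free stand-in (INCIDB itself ships Parquet files, queried with DuckDB in section 5). Table and key names follow the Parquet file names used later in this guide; the `inci_name` column, the toy rows, and a 1-based `position_index` are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for the four core tables (toy rows, partly hypothetical columns).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE brands (brand_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    brand_id INTEGER REFERENCES brands(brand_id),
    name TEXT
);
CREATE TABLE ingredients (ingredient_id INTEGER PRIMARY KEY, inci_name TEXT);
CREATE TABLE product_ingredients (
    product_id INTEGER REFERENCES products(product_id),
    ingredient_id INTEGER REFERENCES ingredients(ingredient_id),
    position_index INTEGER,
    PRIMARY KEY (product_id, position_index)
);
INSERT INTO brands VALUES (1, 'Example Brand');
INSERT INTO products VALUES (10, 1, 'Example Toner');
INSERT INTO ingredients VALUES (100, 'Aqua'), (101, 'Glycerin'), (102, 'Salicylic Acid');
-- Rows are inserted out of order on purpose; position_index carries the label order.
INSERT INTO product_ingredients VALUES (10, 102, 3), (10, 100, 1), (10, 101, 2);
""")


def label_order(con, product_id):
    """Return a product's ingredient names in package-label order."""
    rows = con.execute(
        """
        SELECT i.inci_name
        FROM product_ingredients AS pi
        JOIN ingredients AS i ON i.ingredient_id = pi.ingredient_id
        WHERE pi.product_id = ?
        ORDER BY pi.position_index
        """,
        (product_id,),
    ).fetchall()
    return [name for (name,) in rows]


print(label_order(con, 10))  # ['Aqua', 'Glycerin', 'Salicylic Acid']
```

Sorting on `position_index` rather than on insertion order is the point of the sketch: the junction rows themselves carry no inherent ordering.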

2. 8-Source Regulatory & Clinical Architecture

Every ingredient profile is cross-referenced against 8 international regulatory and clinical sources, including:

EU Commission CosIng Registry: Official restriction thresholds and CAS Registry Numbers (for example, `69-72-7` for Salicylic Acid).
US FDA MoCRA Allergen Registry: Flags fragrance allergens linked to contact dermatitis that require mandatory disclosure (for example, *Linalool, Limonene, Geraniol*).
EWG Skin Deep Toxicology: Granular hazard sub-scores, including carcinogenicity alerts (`cancer_hazard_flag = 1`) and endocrine disruption flags (`endocrine_hazard_flag = 1`).
CIR Scientific Safety Verdicts: Expert Panel safety conclusions (*"Safe as used"*, *"Safe with qualifications"*).
NLM DailyMed Clinical OTC: Formulations harvested from US National Library of Medicine topical dermatological labels.
K-Beauty MFDS Standards: Functional-cosmetic standards for whitening, anti-wrinkle, and barrier repair claims from South Korea's Ministry of Food and Drug Safety (MFDS).
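
CAS Registry Numbers such as `69-72-7` end in a check digit, which makes it possible to catch transcription errors before joining against external registries. The following is a minimal sketch of that standard check-digit rule; the function name `is_valid_cas` is hypothetical, and this guide does not state whether INCIDB runs this validation itself.

```python
import re

# CAS format: 2 to 7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(is_valid_cas("69-72-7"))  # True  (2*1 + 7*2 + 9*3 + 6*4 = 67, and 67 % 10 == 7)
print(is_valid_cas("69-72-8"))  # False (check digit does not match)
```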

3. Audit & Verification Results

Every database release is audited for one-to-one parity between file lines and records, for string sanitization, and for referential integrity:

Zero Linebreak Corruption: Unescaped carriage returns (`\r`) and line feeds (`\n`) are stripped from string fields so that every record occupies exactly one line (`wc -l` matches the record count).
82% Parquet Compression Ratio: Compressed with the Snappy/zstd codecs via PyArrow (the Python library for Apache Arrow) for rapid cloud ingestion.
Full Foreign Key Integrity: All 330,088 junction records were audited, and 100% of their foreign keys resolve to an existing parent record.
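
The linebreak and foreign-key audits can be approximated with the standard library alone. The sketch below is illustrative rather than INCIDB's actual audit code: the helper names, the toy rows, and the assumption of a delimited-text export with one header line are all hypothetical.

```python
import csv
import io
import re

LINEBREAKS = re.compile(r"[\r\n]+")


def sanitize(value):
    """Collapse embedded CR/LF runs in strings to one space; pass other types through."""
    return LINEBREAKS.sub(" ", value).strip() if isinstance(value, str) else value


def csv_line_count(rows, header):
    """Write rows as CSV in memory and count physical lines, as `wc -l` would."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().count("\n")


def orphaned_keys(child_rows, key_index, parent_ids):
    """Return foreign-key values in child_rows that have no matching parent id."""
    return sorted({row[key_index] for row in child_rows} - set(parent_ids))


header = ("ingredient_id", "inci_name")
ingredients = [(100, "Aqua"), (101, "Glycerin\r\n(vegetable)")]  # second row has an embedded linebreak
junction = [(10, 100, 1), (10, 101, 2), (10, 999, 3)]  # 999 has no parent row

clean = [tuple(sanitize(v) for v in row) for row in ingredients]
raw_lines = csv_line_count(ingredients, header)
clean_lines = csv_line_count(clean, header)

print(raw_lines, clean_lines)  # 4 3
print(orphaned_keys(junction, 1, [row[0] for row in ingredients]))  # [999]
```

Before sanitization the two records span three data lines, so the line count drifts from the record count; afterwards the file has one header line plus exactly one line per record. An empty list from `orphaned_keys` would correspond to full referential integrity.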

4. Quick Start — Python PyArrow & Pandas

Load the Parquet exports directly into memory and filter on the toxicological flags:

import pyarrow.parquet as pq
import pandas as pd  # not referenced directly, but to_pandas() requires pandas to be installed

# Load formulations and canonical ingredients
products = pq.read_table('data/exports/parquet/products.parquet').to_pandas()
ingredients = pq.read_table('data/exports/parquet/ingredients.parquet').to_pandas()

# Keep ingredients flagged as carcinogenic or endocrine-disrupting
is_high_risk = (ingredients['cancer_hazard_flag'] == 1) | (ingredients['endocrine_hazard_flag'] == 1)
high_risk = ingredients[is_high_risk]
print(f"Flagged {len(high_risk):,} of {len(ingredients):,} ingredients; {len(products):,} formulations loaded.")

5. Quick Start — DuckDB In-Memory SQL

Execute analytical joins directly over flat Parquet files without database servers:

import duckdb

query = """
SELECT 
    b.name AS brand,
    p.name AS product,
    COUNT(pi.ingredient_id) AS total_ingredients,
    SUM(i.is_common_allergen) AS mocra_allergens
FROM 'data/exports/parquet/products.parquet' p
JOIN 'data/exports/parquet/brands.parquet' b ON p.brand_id = b.brand_id
JOIN 'data/exports/parquet/product_ingredients.parquet' pi ON p.product_id = pi.product_id
JOIN 'data/exports/parquet/ingredients.parquet' i ON pi.ingredient_id = i.ingredient_id
GROUP BY b.name, p.name
HAVING SUM(i.is_common_allergen) > 0
ORDER BY mocra_allergens DESC
LIMIT 5;
"""

print(duckdb.query(query).to_df())