Comprehensive guide to data architecture, regulatory integrations, and analytical integration with Python PyArrow and DuckDB.
INCIDB unifies high-precision chemical composition data across four core tables:
Every ingredient profile is cross-referenced against 8 international regulatory monographs:
All database releases pass rigorous 1-to-1 physical parity and string sanitization audits:
Load high-performance Parquet datasets directly into memory for filtering toxicological flags:
import pyarrow.parquet as pq
import pandas as pd
# Load formulations and canonical ingredients
products = pq.read_table('data/exports/parquet/products.parquet').to_pandas()
ingredients = pq.read_table('data/exports/parquet/ingredients.parquet').to_pandas()
# Filter endocrine disruptors and carcinogenic flagged ingredients
high_risk = ingredients[(ingredients['cancer_hazard_flag'] == 1) | (ingredients['endocrine_hazard_flag'] == 1)]
print(f"Flagged {len(high_risk)} toxicological hazard compounds across {len(products):,} formulations.")
Execute analytical joins directly over flat Parquet files without database servers:
import duckdb
query = """
SELECT
b.name AS brand,
p.name AS product,
COUNT(pi.ingredient_id) AS total_ingredients,
SUM(i.is_common_allergen) AS mocra_allergens
FROM 'data/exports/parquet/products.parquet' p
JOIN 'data/exports/parquet/brands.parquet' b ON p.brand_id = b.brand_id
JOIN 'data/exports/parquet/product_ingredients.parquet' pi ON p.product_id = pi.product_id
JOIN 'data/exports/parquet/ingredients.parquet' i ON pi.ingredient_id = i.ingredient_id
GROUP BY b.name, p.name
HAVING SUM(i.is_common_allergen) > 0
ORDER BY mocra_allergens DESC
LIMIT 5;
"""
print(duckdb.query(query).to_df())