The Balance Sheet Test
Does your synthetic data add up? Literally?
A single command to audit any synthetic HNWI dataset for mathematical integrity. Open a file, pick a record, and check whether Assets − Liabilities = Net Worth. If it fails on even one record, you have a data quality problem that will propagate through every model you train on it.
Most synthetic financial datasets fail this test.
pip install -r requirements.txt  # no dependencies, pure Python
python balance_sheet_test.py your_data.csv
────────────────────────────────────────────────────────────
BALANCE SHEET TEST - Data Quality Report
────────────────────────────────────────────────────────────
Dataset: 100 records
────────────────────────────────────────────────────────────
1. Net Worth Identity [PASS] 100/100 records
2. Asset Decomposition [PASS] 100/100 records
3. Narrative Consistency [PASS] 100/100 records
4. Geographic Coherence [PASS] 17/17 records
5. Distribution Realism [PASS] Pareto-like (skew: 3.68, top-20%: 63.3%)
────────────────────────────────────────────────────────────
Overall: 5/5 checks passed - EXCELLENT
────────────────────────────────────────────────────────────
That's Sovereign Forger data. Run it on yours.
The Five Checks
1. Net Worth Identity
Total Assets − Total Liabilities = Net Worth
The fundamental accounting identity. If a single record fails, the dataset's financial fields were generated independently rather than constrained by algebraic relationships. Your model learns that inconsistency is normal.
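As a rough illustration of this check (not the tool's actual implementation), the identity can be tested per record as below. The function name, the one-cent tolerance, and the sample records are assumptions made for this sketch.

```python
from decimal import Decimal

def net_worth_identity_holds(total_assets, total_liabilities, net_worth, tolerance="0.01"):
    """Return True when assets - liabilities equals net worth within `tolerance` dollars."""
    # Decimal avoids binary floating-point drift on large dollar amounts.
    expected = Decimal(str(total_assets)) - Decimal(str(total_liabilities))
    return abs(expected - Decimal(str(net_worth))) <= Decimal(tolerance)

# Hypothetical records: the second one is off by $900,000.
records = [
    {"total_assets": 130_000_000, "total_liabilities": 12_500_000, "net_worth": 117_500_000},
    {"total_assets": 48_200_000, "total_liabilities": 3_100_000, "net_worth": 46_000_000},
]

# Indices of records that break the identity.
failures = [i for i, r in enumerate(records) if not net_worth_identity_holds(**r)]
print(failures)
```

A tolerance of one cent accepts rounding in the last digit while still rejecting any real disagreement between the three fields.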
2. Asset Decomposition
Property Value + Core Equity + Cash Liquidity = Total Assets
Many datasets include detailed sub-fields that look granular but don't sum to the stated total. The appearance of detail without the substance.
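A minimal sketch of the decomposition check, assuming the three sub-fields named above are the only components of total assets. The function name and the sample record are hypothetical, not taken from the tool.

```python
def decomposition_gap(record, parts=("property_value", "core_equity", "cash_liquidity")):
    """Return total_assets minus the sum of its sub-fields (0 means they add up)."""
    return record["total_assets"] - sum(record[name] for name in parts)

# Hypothetical record whose sub-fields look detailed but fall $500,000 short.
record = {
    "total_assets": 50_000_000,
    "property_value": 12_000_000,
    "core_equity": 30_000_000,
    "cash_liquidity": 7_500_000,
}

gap = decomposition_gap(record)
print(gap)
```

Returning the gap rather than a boolean makes it easy to report how far off a failing record is.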
3. Narrative Consistency
Dollar amounts mentioned in biographical or asset narrative text are cross-checked against the structured numerical fields. If the narrative says "$130M in assets" but the field says $127M, the text was generated without awareness of the numbers.
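One way such a cross-check could work is sketched below. The regular expression, the K/M/B suffix handling, and the 0.5% rounding tolerance are all assumptions for illustration. Spelled-out magnitudes such as "$5 million" are not handled here.

```python
import re

_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}
# Matches "$130M", "$1.2B", "$500K", or "$2,500,000".
_AMOUNT = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)([KMB])?\b", re.IGNORECASE)

def narrative_amounts(text):
    """Extract dollar amounts mentioned in free text as floats."""
    amounts = []
    for number, suffix in _AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        amounts.append(value * _SUFFIX.get(suffix.upper(), 1))
    return amounts

def narrative_matches(text, field_values, rel_tol=0.005):
    """True when every amount in `text` is within rel_tol of some structured field."""
    return all(
        any(abs(amount - value) <= rel_tol * abs(value) for value in field_values)
        for amount in narrative_amounts(text)
    )

print(narrative_matches("Holds $130M in assets.", [127_000_000]))
print(narrative_matches("Holds $130M in assets.", [130_000_000]))
```

The relative tolerance lets a narrative round a field value slightly, while the $130M versus $127M case from the text above is still rejected.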
4. Geographic Coherence
A $2M net worth in Atherton or a $500K profile in Bel Air signals that the location and wealth tier were generated independently. This check flags profiles where the residential zone doesn't match the wealth tier.
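The idea can be sketched with a lookup of minimum plausible net worth per zone. The floor values below are placeholders invented for this example, not figures used by the tool, and skipping records in unlisted zones is likewise an assumption.

```python
# Hypothetical minimum plausible net worth (USD) per residential zone.
ZONE_FLOOR_USD = {"Atherton": 10_000_000, "Bel Air": 10_000_000}

def geographic_flags(records, floors=ZONE_FLOOR_USD):
    """Return (checked, flagged): records in a known zone, and those below its floor."""
    checked = [r for r in records if r["residence_zone"] in floors]
    flagged = [r for r in checked if r["net_worth"] < floors[r["residence_zone"]]]
    return checked, flagged

# Hypothetical profiles; the last one lives in a zone with no floor and is skipped.
records = [
    {"residence_zone": "Atherton", "net_worth": 2_000_000},
    {"residence_zone": "Bel Air", "net_worth": 50_000_000},
    {"residence_zone": "Fresno", "net_worth": 1_000_000},
]

checked, flagged = geographic_flags(records)
print(len(checked), len(flagged))
```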
5. Distribution Realism
UHNWI wealth follows a Pareto distribution: most profiles sit near the lower bound, and a small number at very high net worth form a long right tail. If your dataset shows a bell curve, it was not modeled on real-world wealth patterns. The test checks skewness and top-20% wealth concentration.
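The two statistics named here can be computed in pure Python as sketched below. The population skewness formula and the rounding rule for "top 20%" are assumptions, and no pass/fail thresholds are implied.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def top_share(values, fraction=0.20):
    """Share of total wealth held by the richest `fraction` of records."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Toy example: nine profiles at 1 unit and one at 11 units.
wealth = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]
print(skewness(wealth), top_share(wealth))
```

A symmetric dataset gives a skewness near zero and a top-20% share close to 20%. A Pareto-like dataset gives a clearly positive skew and a much larger share.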
Usage
# Basic test
python balance_sheet_test.py data.csv
# See which records fail and why
python balance_sheet_test.py data.csv --verbose
# Machine-readable output for CI pipelines
python balance_sheet_test.py data.csv --format json
CSV Column Mapping
The tool auto-detects common column naming conventions:
| Field | Accepted column names |
|---|---|
| Net Worth | net_worth_usd, net_worth, networth |
| Total Assets | total_assets, assets |
| Total Liabilities | total_liabilities, liabilities |
| Property Value | property_value, property, real_estate_value |
| Core Equity | core_equity, equity, equity_holdings |
| Cash Liquidity | cash_liquidity, cash, liquid_assets |
| Narrative | narrative_bio, bio, assets_composition |
| Location | residence_city, residence_zone |
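Auto-detection of this kind could be implemented as a first-match lookup over the aliases in the table. The sketch below is an assumption about the approach, not the tool's code. The logical field keys and the case-insensitive matching are invented for illustration.

```python
import csv
import io

# Accepted column names per logical field, in priority order (from the table above).
ALIASES = {
    "net_worth": ["net_worth_usd", "net_worth", "networth"],
    "total_assets": ["total_assets", "assets"],
    "total_liabilities": ["total_liabilities", "liabilities"],
    "property_value": ["property_value", "property", "real_estate_value"],
    "core_equity": ["core_equity", "equity", "equity_holdings"],
    "cash_liquidity": ["cash_liquidity", "cash", "liquid_assets"],
    "narrative": ["narrative_bio", "bio", "assets_composition"],
    "location": ["residence_city", "residence_zone"],
}

def resolve_columns(header, aliases=ALIASES):
    """Map each logical field to the first accepted name present in the CSV header."""
    present = {name.strip().lower(): name for name in header}
    mapping = {}
    for field, candidates in aliases.items():
        for candidate in candidates:
            if candidate in present:
                mapping[field] = present[candidate]
                break
    return mapping

sample = io.StringIO("networth,assets,liabilities,bio\n100,120,20,text\n")
reader = csv.DictReader(sample)
mapping = resolve_columns(reader.fieldnames)
print(mapping)
```

Fields with no matching column are simply absent from the mapping, so a caller can decide to skip the checks that depend on them.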
Why This Exists
Most synthetic data pipelines (GANs, VAEs, LLMs) generate fields probabilistically. Each value is sampled from a distribution with some correlation modeling to keep things "loosely coherent."
Loosely coherent is not algebraically exact. At scale, small error rates compound: a 2% failure rate on 10,000 records means 200 broken balance sheets polluting your training data.
The alternative is born-synthetic data: built from mathematical constraints first, enriched by AI second. Net worth is computed from a Pareto distribution. Assets and liabilities are derived from constrained splits. Sub-components are allocated proportionally. The numbers are locked before any language model touches the profile.
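The constraint-first ordering described above can be sketched as follows. This is an illustration under stated assumptions, not Sovereign Forger's generator: the Pareto shape `alpha`, the $30M floor, the liability range, and the weight ranges are all hypothetical.

```python
import random

def make_profile(rng, alpha=1.5, floor_usd=30_000_000):
    """Build one profile whose financial fields satisfy both identities exactly."""
    # 1. Net worth: Pareto draw (paretovariate returns values >= 1) scaled by a floor.
    net_worth = round(floor_usd * rng.paretovariate(alpha))
    # 2. Liabilities as a bounded fraction of net worth; assets follow from the identity.
    total_liabilities = round(net_worth * rng.uniform(0.02, 0.25))
    total_assets = net_worth + total_liabilities
    # 3. Proportional split; the last component absorbs rounding so the parts sum exactly.
    weights = [rng.random() + 0.1 for _ in range(3)]
    total_weight = sum(weights)
    property_value = round(total_assets * weights[0] / total_weight)
    core_equity = round(total_assets * weights[1] / total_weight)
    cash_liquidity = total_assets - property_value - core_equity
    return {
        "net_worth": net_worth,
        "total_assets": total_assets,
        "total_liabilities": total_liabilities,
        "property_value": property_value,
        "core_equity": core_equity,
        "cash_liquidity": cash_liquidity,
    }

rng = random.Random(42)  # seeded for reproducibility
profile = make_profile(rng)
print(profile)
```

Because every field is an integer derived from the ones before it, the identities hold by construction rather than by chance. Any narrative text would be generated afterwards from these locked values.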
The AI adds biography, profession, philanthropy. It never touches the numbers.
This is how Sovereign Forger builds UHNWI data. The Balance Sheet Test is how we prove it works, and how you can verify any dataset, including ours.
Try It On Real Data
We publish a free sample of 100 born-synthetic Silicon Valley UHNWI profiles. Every record passes all five checks: 100 out of 100.
Run the test. If the math works, consider what 10,000 records with the same integrity could do for your product.
License
MIT: use it, fork it, test your vendors with it.
Built by Sovereign Forger: born-synthetic UHNWI data with mathematical integrity.