Common Mistakes

A correctly structured data warehouse is critical for BPP's identity resolution, AI model training, and marketing activation. Below are the most frequent mistakes teams make, with explanations and how to avoid them.

1. Missing or incorrect timestamps in event tables

Issue: an event table lacks a DATETIME field, or uses a creation/update date that doesn't represent when the event actually occurred.

Wrong — updated_at is overwritten, so history is lost:

deal_id | stage      | updated_at
D123    | Closed Won | 2024-09-15

Correct — one immutable row per state change:

deal_id | crm_contact_id | hem           | old_stage     | new_stage  | event_timestamp
D123    | C456           | abc123hash... | Proposal Sent | Closed Won | 2024-09-15 11:42:00

Always ensure 1 row = 1 immutable event with a UTC timestamp, and each event links to a user (via crm_contact_id, hem, or another stable ID).

2. Missing user identifiers

Issue: events carry only an entity ID (deal_id, order_id) and no user identifier, which blocks identity resolution.

Fix: always include at least one user identifier (hem, hphone, crm_contact_id, cookie_id) alongside entity IDs.

3. Inconsistent ID handling across tables

Issue: the same identifier appears under different column names or with inconsistent transformations (email_hashed vs hem; hashed in one table, cleartext in another).

Fix: standardize identifier meaning across tables and apply consistent hashing rules. BPP supports clear PII only if flagged at ingestion — it will then lowercase, trim, normalize, and SHA-256 it automatically. See Identifiers & Hashing.

4. Incorrect hashing of PII

Issue: hashes don't match the ad platforms because preprocessing wasn't applied — email hashed without lowercasing, Gmail dots kept (john.doe@gmail.com vs johndoe@gmail.com), phone hashed without E.164 normalization.

Fix:

Emails: lowercase, trim, remove Gmail dots and +tags, then SHA-256.
Phones: remove spaces/symbols, ensure +countrycode (E.164), then SHA-256.
Validate by comparing test hashes against Google/Meta documentation — or let BPP hash by flagging the column as PII.

5. Overwriting event states instead of tracking history

Issue: tables overwrite the current state instead of logging changes (orders updated pending → completed; deals Proposal → Closed Won).

Fix: create status history tables (order_status_history, deal_stage_history). Immutable logs are essential for Action Prediction, pcLTV, and RFM. Always include user identifiers on every row.

6. Using high-cardinality or dirty categoricals

Issue: fields like job_title or industry have thousands of free-text variations, so models can't learn effectively.

Fix: normalize into controlled categories (Manager, Director, VP); use enrichment providers (Clearbit, ZoomInfo, Apollo); store both the raw value and a normalized version.

7. Mixing user- and event-grain data

Issue: putting both user attributes and event history in the same table (e.g. a customers table with last_pageview_url or last_order_date).

Fix:

User tables = 1 row per user, stable attributes.
Event tables = 1 row per event, timestamped.
If you need behaviour at user level, pre-aggregate it via ETL or the Fields Builder (total_orders, avg_basket_value).

8. Missing business-critical fields

Issue: event tables lack essential metrics — orders without a monetary value, subscriptions without a status, web events without a page_url.

Fix:

Orders → transaction_id, monetary, currency.
Subscriptions → plan_type, mrr/arr, status.
Web events → absolute page_url, device, country, gclid/gbraid/wbraid.

9. Wrong timezone handling

Issue: event timestamps stored in local timezone, causing mismatches with ad platforms and ETL.

Fix: always store UTC timestamps. Keep the original timezone as a separate field only if the business requires it.

10. Missing or wrong use of transaction IDs

Issue: orders without a transaction_id, so reconciliation with ad platforms fails.

Fix: transaction_id is optional for BPP models but strongly recommended for ad integrations (Enhanced Conversions, Conversion Adjustments, value-based bidding).

11. Poor web event tracking

Issue: only page titles stored (Home, Checkout), making topic classification impossible; no attribution parameters (utm, gclid).

Fix: always store absolute URLs, include attribution params (UTM, click IDs), and add device/browser/country for feature engineering.

12. Mixing contact vs account grain in B2B

Issue: account-level attributes in contact tables, or vice versa.

Fix:

Contact tables = individuals (emails, phones, job roles).
Add firmographics at contact level via ETL enrichment (company_size, industry).
Keep consistent joins (crm_contact_id → crm_account_id).

13. Over-aggregating data too early

Issue: only providing aggregated monthly revenue or total logins.

Fix: always provide atomic events (each order, each login). Aggregations can then be derived as features (total_orders_90d, avg_session_duration).

Issue: missing or inconsistent consent tracking.

Fix: store consent_marketing and consent_timestamp, and respect local privacy requirements (GDPR, CCPA).

Summary

Avoiding these mistakes lets you:

Ensure accurate model training (Action Prediction, pcLTV, Interest, RFM).
Maximize match rates in ad platforms (correctly hashed PII + click IDs).
Maintain data integrity across CRM, web, product, and transactions.
Build a warehouse that supports both analytics and activation.

1. Missing or incorrect timestamps in event tables​

2. Missing user identifiers​

3. Inconsistent ID handling across tables​

4. Incorrect hashing of PII​

5. Overwriting event states instead of tracking history​

6. Using high-cardinality or dirty categoricals​

7. Mixing user- and event-grain data​

8. Missing business-critical fields​

9. Wrong timezone handling​

10. Missing or wrong use of transaction IDs​

11. Poor web event tracking​

12. Mixing contact vs account grain in B2B​

13. Over-aggregating data too early​

14. Ignoring consent & privacy​

Summary​

1. Missing or incorrect timestamps in event tables

2. Missing user identifiers

3. Inconsistent ID handling across tables

4. Incorrect hashing of PII

5. Overwriting event states instead of tracking history

6. Using high-cardinality or dirty categoricals

7. Mixing user- and event-grain data

8. Missing business-critical fields

9. Wrong timezone handling

10. Missing or wrong use of transaction IDs

11. Poor web event tracking

12. Mixing contact vs account grain in B2B

13. Over-aggregating data too early

14. Ignoring consent & privacy

Summary