V2 Schema & Bloom Filter Deduplication
Implemented V2 Schema Architecture for scaling to 500M+ domains:
- Math-indexed keys: Base-37 encoding of second-level domains (SLDs), with a-z=1-26, 0-9=27-36, hyphen=37
- Prefix range queries: Find all domains starting with "app" using index range scans
- Trigram search: Fast substring matching via GIN indexes on encoded arrays
- Dual key strategy: NUMERIC for ranges, BIGINT hash for JOINs
- Imported 1.07M SLDs and 1.15M domains from the CrUX top-million list
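
To illustrate how the encoded keys can support prefix range scans and trigram lookups, here is a minimal Python sketch. It is not the actual schema code. The character mapping is taken from the list above; everything else is an assumption made for illustration: the names encode_sld, prefix_range and trigram_codes are hypothetical, keys are left-aligned and zero-padded to the 63-character DNS label limit, and because the mapping starts at 1 the sketch reserves 0 for padding and therefore uses radix 38 rather than 37.

```python
from typing import List, Tuple

# Mapping from the notes: a-z=1..26, 0-9=27..36, hyphen=37.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

RADIX = 38    # assumption: 0 is reserved as padding, so digits 1..37 need radix 38
MAX_LEN = 63  # maximum length of a DNS label


def encode_sld(sld: str) -> int:
    """Encode an SLD as a left-aligned, zero-padded integer key.

    Padding on the right makes numeric order agree with prefix order,
    which is what allows a prefix query to become a range scan.
    """
    if not 1 <= len(sld) <= MAX_LEN:
        raise ValueError("SLD length must be between 1 and 63")
    key = 0
    for ch in sld.lower():
        key = key * RADIX + CODE[ch]  # KeyError on characters outside the alphabet
    return key * RADIX ** (MAX_LEN - len(sld))


def prefix_range(prefix: str) -> Tuple[int, int]:
    """Return the half-open key range [lo, hi) covering every SLD with this prefix."""
    lo = encode_sld(prefix)
    hi = lo + RADIX ** (MAX_LEN - len(prefix))
    return lo, hi


def trigram_codes(sld: str) -> List[int]:
    """Return the sorted, de-duplicated integer codes of all 3-character windows."""
    codes = [CODE[ch] for ch in sld.lower()]
    windows = zip(codes, codes[1:], codes[2:])
    return sorted({(a * RADIX + b) * RADIX + c for a, b, c in windows})


lo, hi = prefix_range("app")
print(lo <= encode_sld("apple") < hi)
print(set(trigram_codes("app")) <= set(trigram_codes("snapple")))
```

In a database, the (lo, hi) pair would map to a condition such as key >= lo AND key < hi on the NUMERIC column, and the trigram codes would be stored as an integer array for a GIN containment check. Trigram containment only yields candidates, so a substring match still needs to be confirmed against the original string.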
Built Bloom Filter Deduplication Layer for fast SLD ingestion:
- 184k inserts/sec and 276k lookups/sec without database queries
- 200M capacity with ~0.5% false positive rate, zero false negatives
- Time segmentation: bloom_ever (global) + bloom_YYYY (yearly)
- xxhash64 hashing with 10 hash functions
- Initialized with 4.67M existing SLDs from sld_keywords table
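
The deduplication layer described above can be sketched roughly as follows. This is a hedged illustration, not the production code: BLAKE2b from the Python standard library stands in for xxhash64 so the example has no third-party dependency, the 10 bit positions are derived by double hashing (an assumption, since the notes do not say how the 10 hash functions are obtained), and the names bloom_fpr, bits_for, BloomFilter and SegmentedBloom are hypothetical. The sizing helpers use the standard false positive estimate p = (1 - e^(-kn/m))^k.

```python
import hashlib
import math


def bloom_fpr(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Standard estimate of the false positive rate after n_items insertions."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def bits_for(n_items: int, k_hashes: int, target_fpr: float) -> int:
    """Smallest bit count that meets target_fpr for a fixed number of hashes."""
    per_hash = 1.0 - target_fpr ** (1.0 / k_hashes)
    return math.ceil(-k_hashes * n_items / math.log(per_hash))


class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Stand-in for xxhash64: split one 128-bit BLAKE2b digest into two
        # 64-bit values and combine them as h1 + i*h2 (double hashing).
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


class SegmentedBloom:
    """One global filter (bloom_ever) plus one filter per year (bloom_YYYY)."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.ever = BloomFilter(m_bits, k_hashes)
        self.by_year = {}

    def add(self, sld: str, year: int) -> bool:
        """Record sld and return True if it was possibly seen before."""
        seen = sld in self.ever
        self.ever.add(sld)
        if year not in self.by_year:
            self.by_year[year] = BloomFilter(self.m, self.k)
        self.by_year[year].add(sld)
        return seen


# Sizing for the figures quoted above: 200M items, 10 hashes, ~0.5% target.
m = bits_for(200_000_000, 10, 0.005)
print(m, bloom_fpr(200_000_000, m, 10))
```

Under this formula, 200M items with 10 hash functions and a 0.5% target come out at roughly 2.25 billion bits; the real filter may be sized differently. A True result from add only means "possibly seen", so it can be followed by a database check, while a False result is definitive and lets the ingest path skip the query entirely.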
DNS scanning was validated against CrUX .com domains using Google Public DNS (8.8.8.8), with an 82% resolution rate.