Our Data Methodology

Every insight we deliver is backed by data. This document explains our methodology — how we collect data, detect patterns, and generate narratives that bookmakers can trust.

Data Collection

Data Sources

BidCanvas aggregates data from multiple sources to build a comprehensive view of each match:

Primary Sources

Source Type	Data Collected	Update Frequency
Prediction Markets	Polymarket, Kalshi probabilities, volume, wallet activity	Real-time (WebSocket)
Sportsbook Odds	Line movements, opening/closing odds, market consensus	Every 60 seconds
Match Statistics	Goals, cards, corners, shots, xG, player stats	Post-match + live
Team/Player Data	Form, injuries, transfers, lineup history	Daily
Referee Data	Card tendencies, foul counts, penalty rates	Post-match

Data Validation

Raw data goes through validation before entering our analysis pipeline:

Cross-source verification — Match results verified across 2+ sources
Outlier detection — Statistical anomalies flagged for manual review
Completeness checks — Matches with missing critical data excluded
Timestamp normalization — All data converted to UTC
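
As a rough illustration of three of these checks, here is a minimal Python sketch. The field names in REQUIRED_FIELDS, the function names, and the shape of the input records are assumptions made for the example, not a description of the production pipeline. Outlier detection is left out because it depends on the statistic being checked.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Optional

# Hypothetical list of fields treated as critical for a match record.
REQUIRED_FIELDS = ("home_team", "away_team", "kickoff", "score")


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return parsed.astimezone(timezone.utc)


def is_complete(record: dict) -> bool:
    """Completeness check: every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def verified_score(reports: dict, min_sources: int = 2) -> Optional[str]:
    """Return the score that at least `min_sources` sources agree on.

    `reports` maps a source name to the score it reported. Returns None when
    no score reaches the threshold or when two scores tie for the lead.
    """
    counts = Counter(reports.values()).most_common(2)
    if not counts:
        return None
    top_score, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    if top_count < min_sources or tied:
        return None
    return top_score
```

A record would enter the analysis pipeline only if is_complete returns True and verified_score returns a score rather than None.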

Pattern Detection

We use statistical methods to identify patterns that have predictive value. Not all correlations are patterns — we apply strict criteria.

Minimum Requirements

For a pattern to be published:

Sample size: Minimum 50 instances
Statistical significance: p < 0.05
Effect size: Meaningful deviation from baseline
Temporal stability: Pattern holds across multiple seasons

A pattern needs at least 50 instances, significance at p < 0.05, and out-of-sample validation before it is published. Patterns built on smaller samples are discarded.
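
The sample-size, significance, and effect-size requirements could be checked along the following lines. This is a sketch under stated assumptions: it uses an exact two-sided binomial test against a known baseline rate, and the 0.05 minimum effect size is a placeholder, since the threshold for a "meaningful deviation" is not specified here. Temporal stability and out-of-sample validation would need separate checks.

```python
from math import comb


def binomial_p_value(hits: int, n: int, baseline: float) -> float:
    """Exact two-sided binomial test of `hits` successes in `n` trials."""
    pmf = [comb(n, k) * baseline**k * (1 - baseline) ** (n - k) for k in range(n + 1)]
    observed = pmf[hits]
    # Sum the probability of every outcome at least as unlikely as the observed
    # one; the small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def meets_minimums(hits, n, baseline, min_n=50, alpha=0.05, min_effect=0.05):
    """Apply the sample-size, effect-size, and significance gates.

    `min_effect` is an assumed placeholder for "meaningful deviation".
    """
    if n < min_n:
        return False
    effect = abs(hits / n - baseline)
    return effect >= min_effect and binomial_p_value(hits, n, baseline) < alpha
```

For example, 40 hits in 60 instances against a 50% baseline passes all three gates, while 20 hits in 30 instances is rejected on sample size alone.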

Pattern Categories

Category	Example	Detection Method
Referee Tendencies	Anthony Taylor averages 4.2 cards/match	Historical averaging + confidence intervals
Team Form Patterns	Teams on 4-win streak win 53.6% of next match	Conditional probability analysis
League Baselines	Bundesliga BTTS rate: 59.3%	Aggregate statistics by league
Probability Regression	Markets at 70%+ regress toward 60%	Time-series analysis
Sharp Money Signals	3+ sharp wallets aligned = 64% accuracy	Wallet performance tracking
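
To make "conditional probability analysis" concrete, the sketch below counts how often a team wins the match that follows a run of straight wins. The function name and the "W"/"D"/"L" encoding are assumptions for illustration. It treats any run of at least `streak` wins as qualifying, which may differ from how streaks are defined in the published figures.

```python
def win_rate_after_streak(results, streak=4):
    """Return (wins, opportunities) for matches that follow `streak` straight wins.

    `results` is one team's chronological results, each "W", "D" or "L".
    """
    wins = 0
    opportunities = 0
    for i in range(streak, len(results)):
        # The match at index i counts only if the previous `streak` matches were all wins.
        if all(r == "W" for r in results[i - streak:i]):
            opportunities += 1
            wins += int(results[i] == "W")
    return wins, opportunities
```

Summing wins and opportunities across teams and seasons gives the conditional win rate, and the opportunities total is the sample size that the 50-instance minimum applies to.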

Avoiding False Patterns

Sports data is noisy. Many apparent patterns are random variation. We guard against false positives through:

Multiple testing correction — Bonferroni adjustment when testing many hypotheses
Out-of-sample validation — Patterns discovered on training data must hold on test data
Logical review — Patterns must have plausible causal mechanism
Regular re-validation — Published patterns re-tested quarterly
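
The Bonferroni adjustment divides the significance level by the number of hypotheses tested, so each individual test has to clear a stricter bar. A minimal sketch, with hypothetical hypothesis names and p-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each hypothesis as significant at the adjusted level alpha / m.

    `p_values` maps a hypothesis name to its unadjusted p-value, and m is the
    number of hypotheses tested together.
    """
    m = len(p_values)
    if m == 0:
        return {}
    threshold = alpha / m
    return {name: p < threshold for name, p in p_values.items()}
```

With four hypotheses the per-test threshold becomes 0.0125, so a candidate pattern with p = 0.02 that would pass on its own is no longer treated as significant.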

AI Narrative Generation

Raw statistics are hard to consume. We use AI to transform data into readable narratives.

The Process

# Simplified narrative generation flow

1. Data Assembly
   - Match context (teams, league, competition stage)
   - Relevant patterns (referee, form, H2H)
   - Current odds and market movements
   - Sharp wallet positions

2. Pattern Ranking
   - Score each pattern by relevance to this match
   - Filter to top 3-5 most relevant insights
   - Ensure diversity (don't repeat similar points)

3. Narrative Generation
   - LLM (Claude) generates human-readable text
   - Structured prompts ensure consistency
   - Facts grounded in source data

4. Quality Control
   - Automated fact-checking against source data
   - Confidence scoring based on pattern strength
   - Human review for high-stakes outputs
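
Steps 2 and 4 of the flow above can be sketched in Python as follows. The Pattern class, the relevance score, the one-per-category cap used as a stand-in for diversity, and the number-matching fact check are all assumptions made for illustration. The actual ranking and fact-checking logic is not described in this document.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    category: str     # e.g. "referee", "form", "h2h"
    text: str
    relevance: float  # hypothetical relevance score for this match


def rank_patterns(patterns, limit=5, per_category=1):
    """Keep the highest-relevance patterns, capping how many share a category."""
    chosen = []
    used = {}
    for p in sorted(patterns, key=lambda p: p.relevance, reverse=True):
        if used.get(p.category, 0) >= per_category:
            continue  # diversity: skip patterns from an already-covered category
        chosen.append(p)
        used[p.category] = used.get(p.category, 0) + 1
        if len(chosen) == limit:
            break
    return chosen


def unsupported_numbers(narrative, source_values):
    """Return numeric tokens in the narrative that are absent from the source data.

    `source_values` is a set of numbers as strings, e.g. {"4.2", "59.3"}.
    """
    found = re.findall(r"\d+(?:\.\d+)?", narrative)
    return [tok for tok in found if tok not in source_values]
```

A narrative for which unsupported_numbers returns a non-empty list would be held back for regeneration or human review rather than published.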

What AI Does (and Doesn't) Do

AI Role	Human Role
Summarize complex statistics in natural language	Validate pattern detection methodology
Combine multiple data points into coherent narrative	Set minimum thresholds for pattern inclusion
Generate bet slip suggestions with reasoning	Review edge cases and unusual outputs
Adapt tone and detail level to context	Define business rules and constraints

Key Principle: AI explains patterns; it doesn't invent them. Every fact in a narrative traces back to validated source data.

Every narrative passes through four stages before it reaches your players: data assembly, pattern ranking, AI generation, and quality control, which includes automated fact-checking against source data.

Continuous Improvement

Our methodology evolves based on results:

Feedback Loops

Pattern accuracy tracking — Monitor hit rate of published patterns over time
Bet slip performance — Track suggested bets against actual outcomes
Client feedback — Incorporate operator insights on what bettors find valuable
New data sources — Continuously evaluate emerging data providers

Deprecation Policy

Patterns that no longer meet our criteria are deprecated:

If the pattern fails to reach statistical significance for 2 consecutive quarters
If sample size becomes too small (e.g., referee retires)
If underlying conditions change (rule changes, team restructuring)
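
The first two conditions lend themselves to an automated check, sketched below. It assumes each quarterly re-validation produces a p-value and reuses the 50-instance minimum as the "too small" threshold, which is an assumption. Changes in underlying conditions are a judgment call and are not modeled.

```python
def should_deprecate(quarterly_p_values, sample_size, min_n=50, alpha=0.05, max_misses=2):
    """Return True when a published pattern should be deprecated.

    `quarterly_p_values` lists the p-value from each quarterly re-validation,
    oldest first. A pattern is deprecated when its sample falls below `min_n`
    or when the most recent `max_misses` quarters all fail the alpha threshold.
    """
    if sample_size < min_n:
        return True
    recent = quarterly_p_values[-max_misses:]
    return len(recent) == max_misses and all(p >= alpha for p in recent)
```

A single non-significant quarter does not trigger deprecation on its own; the pattern has to miss twice in a row.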

Transparency

Every insight we deliver includes:

Sample size — How many instances the pattern is based on
Time period — When the data was collected
Confidence level — Statistical significance and effect size
Source attribution — Where the underlying data came from
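
One way to picture this metadata is as a small record attached to each insight. The class and field names below are hypothetical and do not describe an actual BidCanvas API or payload format.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class InsightMetadata:
    sample_size: int     # how many instances the pattern is based on
    period_start: date   # first day of the data collection window
    period_end: date     # last day of the data collection window
    p_value: float       # statistical significance
    effect_size: float   # deviation from baseline
    sources: tuple       # names of the underlying data sources

    def __post_init__(self):
        # Reject records that would make the transparency fields meaningless.
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")
        if self.period_end < self.period_start:
            raise ValueError("period_end precedes period_start")
        if not self.sources:
            raise ValueError("at least one source is required")
```

Calling asdict on an instance yields a plain dictionary that can be serialized alongside the narrative text.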

We believe transparency builds trust. Bookmakers can evaluate our methodology and decide how to weight our insights alongside their own analysis.
