Batching

Process large datasets by splitting them into configurable batches with deduplication, filtering, and automatic result aggregation.


How Batching Works

When a workflow step receives an array of items (leads, investors, URLs), batching splits it into smaller chunks, processes each batch independently, then aggregates the results. Smaller batches help avoid timeouts, keep requests within API rate limits, and allow batches to run in parallel.

The pipeline: Item Resolution → Batch Creation → Parallel Execution → Aggregation.
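
The split, process, and merge flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: make_batches, run_batched, process_batch, and max_workers are hypothetical names, and a thread pool stands in for whatever mechanism actually runs batches in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List, Sequence


def make_batches(items: Sequence[Any], size: int = 25) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_batched(
    items: Sequence[Any],
    process_batch: Callable[[List[Any]], List[Any]],
    size: int = 25,
    max_workers: int = 4,  # hypothetical knob, not a documented setting
) -> List[Any]:
    """Process each batch independently, then flatten results in batch order."""
    batches = list(make_batches(items, size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input batches.
        per_batch = list(pool.map(process_batch, batches))
    return [result for batch in per_batch for result in batch]
```

With 60 items and the default size of 25, make_batches yields batches of 25, 25, and 10 items.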


Configuration

Enable batching on any step by adding a batchConfig object:

{
  "batchConfig": {
    "enabled": true,
    "size": 25,
    "sourceVariable": "leads",
    "idField": "email",
    "qualifiedOnly": false,
    "maxItems": 500
  }
}

Field            Type      Description
size             number    Items per batch (default: 25)
sourceVariable   string    Variable name containing the array to batch
idField          string    Field used for deduplication (e.g., "email")
qualifiedOnly    boolean   Filter to only items that passed prior scoring
maxItems         number    Maximum total items to process
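
For readers who load step definitions programmatically, the sketch below shows one way to read a batchConfig object into a typed structure. Only the default size of 25 comes from the field list above. The BatchConfig class, the parse_batch_config helper, the treatment of a missing enabled flag as false, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BatchConfig:
    source_variable: str
    size: int = 25                     # documented default
    id_field: Optional[str] = None     # no deduplication when unset
    qualified_only: bool = False
    max_items: Optional[int] = None    # assumption: unset means no cap


def parse_batch_config(step: dict) -> Optional[BatchConfig]:
    """Return a BatchConfig, or None when batching is absent or disabled."""
    raw = step.get("batchConfig")
    if not raw or not raw.get("enabled", False):
        return None
    if "sourceVariable" not in raw:
        # Assumption: the source array must be named explicitly.
        raise ValueError("batchConfig.sourceVariable is required")
    size = raw.get("size", 25)
    if isinstance(size, bool) or not isinstance(size, int) or size < 1:
        raise ValueError("batchConfig.size must be a positive integer")
    return BatchConfig(
        source_variable=raw["sourceVariable"],
        size=size,
        id_field=raw.get("idField"),
        qualified_only=bool(raw.get("qualifiedOnly", False)),
        max_items=raw.get("maxItems"),
    )
```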

Deduplication

When an idField is set, batching deduplicates items before processing. If duplicates exist, the item with the highest score is kept. This prevents re-processing the same lead or entity across batches.
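
A minimal sketch of the keep-the-highest-score rule, assuming every item is a dictionary that contains the idField and that the score lives in a field named "score" (a hypothetical name; the source does not say where scores are stored). Tie-breaking is also an assumption: on equal scores the first item seen is kept.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_id(
    items: Iterable[dict],
    id_field: str,
    score_field: str = "score",  # hypothetical field name
) -> List[dict]:
    """Keep one item per id value, preferring the highest score."""
    best: Dict[Any, dict] = {}
    for item in items:
        key = item[id_field]
        current = best.get(key)
        # Strict comparison: on a tie, the first item seen wins.
        if current is None or item.get(score_field, 0) > current.get(score_field, 0):
            best[key] = item
    # Dicts preserve insertion order, so output follows first appearance.
    return list(best.values())
```

For example, given two entries for "a@x.com" with scores 4 and 9, only the entry scored 9 survives, in the position where "a@x.com" first appeared.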


Result Aggregation

After all batches complete, results are merged and metrics are computed automatically:

  • Totals — total items processed, passed, failed
  • Averages — mean score, median score
  • Tier counts — distribution across score tiers (high, medium, low)
  • Score distribution — histogram of scores across all items
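
The metrics listed above could be computed along these lines. This is a sketch under assumptions: each result is taken to be a dictionary with a "status" of "passed" or "failed" and an optional numeric "score" (both hypothetical field names), the tier cut-offs (8 and 5) are borrowed from the investor example on this page, and the histogram uses whole-number buckets.

```python
import math
from collections import Counter
from statistics import mean, median
from typing import List


def tier_for(score: float) -> str:
    """Assumed cut-offs: 8 and above is high, 5 and above is medium."""
    if score >= 8:
        return "high"
    if score >= 5:
        return "medium"
    return "low"


def aggregate(results: List[dict]) -> dict:
    """Merge per-item results into summary metrics."""
    scores = [r["score"] for r in results if r.get("score") is not None]
    statuses = Counter(r.get("status") for r in results)
    tiers = Counter(tier_for(s) for s in scores)
    histogram = Counter(math.floor(s) for s in scores)
    return {
        "total": len(results),
        "passed": statuses["passed"],
        "failed": statuses["failed"],
        "mean_score": mean(scores) if scores else None,
        "median_score": median(scores) if scores else None,
        # Always report all three tiers so empty runs show zero counts.
        "tiers": {name: tiers[name] for name in ("high", "medium", "low")},
        "histogram": dict(sorted(histogram.items())),
    }
```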

Example: Scoring 2,000 Investors

An investor matching workflow processes a database of 2,000 investors with batching:

Step: "Score Investors"
  batchConfig:
    size: 50
    sourceVariable: "investors"
    idField: "investor_id"
    maxItems: 2000

Execution:
  Deduplication removed 23 duplicates, leaving 1,977 investors
  40 batches: 39 of 50 investors, 1 of 27
  Processing time: 4m 12s

Aggregated results:
  Total scored:    1,977
  High tier (8+):    312  (15.8%)
  Medium tier (5-7): 891  (45.1%)
  Low tier (<5):     774  (39.2%)
  Mean score:       5.7
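
The batch count in this example follows from the deduplicated total rather than the raw input size. The short check below reproduces that arithmetic from the figures reported above; the variable names are illustrative only.

```python
import math

total_items = 2000
duplicates_removed = 23
batch_size = 50

# Deduplication runs before batch creation, so batches cover unique items only.
unique_items = total_items - duplicates_removed
batch_count = math.ceil(unique_items / batch_size)
last_batch_size = unique_items - (batch_count - 1) * batch_size

# The tier counts should account for every scored item.
tier_counts = {"high": 312, "medium": 891, "low": 774}
tiers_cover_all = sum(tier_counts.values()) == unique_items

print(unique_items, batch_count, last_batch_size, tiers_cover_all)
```

The 1,977 unique investors fill 39 full batches of 50, and the remaining 27 go into a final partial batch, for 40 batches in total.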

Edge Cases

  • Empty arrays — Batching completes immediately with zero-count metrics
  • Failed batches — Individual batch failures don't halt the entire run; failed items are reported in aggregated results
  • Over maxItems — Items beyond the limit are silently dropped before batch creation
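
The three edge cases above can be illustrated together. This sketch makes several assumptions that the source does not confirm: items past maxItems are cut from the end of the input, every item in a failed batch is reported as failed, and batches run sequentially for simplicity. run_with_failures and process_batch are hypothetical names.

```python
from typing import Any, Callable, List, Optional


def run_with_failures(
    items: List[Any],
    process_batch: Callable[[List[Any]], List[Any]],
    size: int = 25,
    max_items: Optional[int] = None,
) -> dict:
    """Run batches one by one, recording failures instead of stopping."""
    if max_items is not None:
        # Assumption: the first max_items entries are kept, the rest dropped.
        items = items[:max_items]
    results: List[Any] = []
    failed: List[dict] = []
    for start in range(0, len(items), size):
        batch = items[start:start + size]
        try:
            results.extend(process_batch(batch))
        except Exception as exc:
            # One bad batch does not halt the run; its items are reported.
            failed.extend({"item": item, "error": str(exc)} for item in batch)
    return {"total": len(items), "results": results, "failed": failed}
```

An empty input skips the loop entirely and returns zero counts, which matches the first bullet.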

Next Steps