Implementing pipeline components

This chapter describes how to extend Veil by implementing new pipeline components, with a special focus on entity detectors. You will see what input each stage receives, what output it produces, and how it fits into the overall flow.

Shared Types and Data

Document (input to almost all components): see veil.core.document.Document, used throughout veil.pipeline. Key fields: text, doc_id, ground_truth.
Span (entity unit): see veil.core.span.Span, referenced from veil.pipeline. Fields: start, end, entity_type, id, replacement, confidence.
MaskResult (output of the masker and of the pipeline): see veil.core.mask_result.MaskResult, used in veil.masker. Fields: masked_text, entities, evaluation.

These classes are passed from one pipeline stage to the next and define the standard inputs and outputs of each stage.
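To make the field lists above concrete, here is a minimal sketch of the three types as plain dataclasses. These are stand-ins, not the Veil classes: the field names come from the list above, while the type annotations, the default values, and the end-exclusive offset convention are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """Stand-in for veil.core.span.Span (types and defaults are assumed)."""
    start: int
    end: int  # assumed to be exclusive, like a Python slice
    entity_type: Any
    id: Optional[int] = None
    replacement: Optional[str] = None
    confidence: Optional[float] = None


@dataclass
class Document:
    """Stand-in for veil.core.document.Document."""
    text: str
    doc_id: str
    ground_truth: Optional[List[Span]] = None


@dataclass
class MaskResult:
    """Stand-in for veil.core.mask_result.MaskResult."""
    masked_text: str
    entities: List[Span] = field(default_factory=list)
    evaluation: Optional[Dict[str, Any]] = None


doc = Document(text="Call Ana at 555-0100.", doc_id="doc-1")
span = Span(start=5, end=8, entity_type="NAME", confidence=0.9)
surface = doc.text[span.start:span.end]  # the text covered by the span
```

Check the real class definitions before relying on any default or on the offset convention.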

Flow and Interfaces by Component

At a high level, the pipeline (see Pipeline.process in veil.pipeline) executes the following steps, given an input Document:

Entity detectors → One list of Spans per detector.
Entity resolvers (optional) → Adjust or merge Spans and can assign consistent ids.
Overlap resolution → Final selection of Spans according to priorities and hierarchy.
Masking → MaskResult with masked_text and entities.
Evaluation (optional) → Metrics computed against ground_truth, if available.
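The order of the stages above can be sketched with plain functions. This is a simplified illustration, not Pipeline.process: spans are reduced to (start, end, entity_type) tuples, the per-detector bookkeeping is collapsed into one list, evaluation is left out, and every function name here is hypothetical.

```python
from typing import List, Tuple

SpanT = Tuple[int, int, str]  # (start, end, entity_type), a simplified span


def run_stages(text, detectors, resolvers, select, mask):
    spans: List[SpanT] = []
    for detect in detectors:         # 1) detectors, in configured order
        spans.extend(detect(text))
    for resolve in resolvers:        # 2) optional resolvers
        spans = resolve(text, spans)
    final = select(spans)            # 3) overlap resolution
    return mask(text, final), final  # 4) masking


def find_ana(text: str) -> List[SpanT]:
    """Toy detector: finds the first occurrence of the literal 'Ana'."""
    i = text.find("Ana")
    return [(i, i + 3, "NAME")] if i >= 0 else []


def mask_spans(text: str, spans: List[SpanT]) -> str:
    """Toy masker: replaces spans right to left so offsets stay valid."""
    for start, end, etype in sorted(spans, reverse=True):
        text = text[:start] + f"[{etype}]" + text[end:]
    return text


# Two detectors report the same span; the toy selector removes the duplicate.
masked, entities = run_stages(
    "Hi Ana", [find_ana, find_ana], [], lambda s: sorted(set(s)), mask_spans
)
```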

1) Entity Detectors (main extension)

Base class to inherit from: veil.core.base_entity_detector.BaseEntityDetector in veil.entity_detectors.
Associated config: subclasses of veil.config.entity_detectors.BaseEntityDetectorConfig in veil.config.
Registry: veil.entity_detectors.registry.EntityDetectorRegistry in veil.entity_detectors maps EntityDetectorType → concrete class.

Implementation contract:

Define ENTITY_TYPES as a Set[EntityTypeBase], or as a specific EntityTypeBase subclass (an enum), listing the types the detector supports. Example type classes: RegexEntityType, GlinerEntityType (see veil.entity_detectors).
Implement detect_entities(self, doc: Document) -> List[Span] returning spans with start, end, entity_type, and, if applicable, confidence.
Read detector-specific parameters from the config passed to the constructor (a subclass of BaseEntityDetectorConfig). Fields common to all detectors: priority (per canonical type) and hierarchy_position (global precedence). See veil.config.

Inputs and outputs:

Input: Document (doc.text, doc.doc_id).
Output: List[Span] detected by the detector.
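As an illustration of this contract, the sketch below implements detect_entities with a regular expression. It runs without Veil installed, so Span, Document, and the entity type are local stand-ins, and the detector does not inherit from BaseEntityDetector. The email pattern and the fixed confidence of 1.0 are assumptions chosen for the example.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DemoEntityType(Enum):  # hypothetical; real types subclass EntityTypeBase
    EMAIL = "email"


@dataclass
class Span:  # stand-in for veil.core.span.Span
    start: int
    end: int
    entity_type: DemoEntityType
    confidence: Optional[float] = None


@dataclass
class Document:  # stand-in for veil.core.document.Document
    text: str
    doc_id: str


class EmailDetector:
    """Hypothetical detector following the detect_entities contract."""

    ENTITY_TYPES = {DemoEntityType.EMAIL}
    # Deliberately simple pattern; not a full email grammar.
    _PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect_entities(self, doc: Document) -> List[Span]:
        # One Span per regex match, using the match offsets as start/end.
        return [
            Span(m.start(), m.end(), DemoEntityType.EMAIL, confidence=1.0)
            for m in self._PATTERN.finditer(doc.text)
        ]


doc = Document(text="Write to ana@example.com today.", doc_id="d1")
spans = EmailDetector().detect_entities(doc)
```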

Automatic integration into the pipeline:

The pipeline instantiates detectors from the entity_detectors list in PipelineConfig using the registry. The order in the list defines the execution order. See Pipeline.__init__ and Pipeline.process in veil.pipeline.

Quick steps to add a new detector

Define the entity types (subclass of EntityTypeBase) and, if applicable, their aliases().
Implement the detector class inheriting from BaseEntityDetector, define ENTITY_TYPES and detect_entities().
Create the Config class inheriting from BaseEntityDetectorConfig, add your parameters and get_type().
Register the class in EntityDetectorRegistry under an EntityDetectorType.
Add the block to your PipelineConfig.entity_detectors YAML with type: and your parameters.
Adjust priority by type and hierarchy_position for the desired overlap behavior.

Important note on polymorphic types (type):

In addition to registering the class, import your new configuration class in veil/config/__init__.py to expose it to the configuration system, and if you create a new detector module/package, expose it also in veil/entity_detectors/__init__.py. This allows selecting it by type from YAML/CLI.
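The import requirement is easier to see with a toy version of type-keyed dispatch. The sketch below is not Veil's configuration system; the register_config decorator, the DetectorType enum, and build_config are hypothetical names. It only shows the general pattern: a class becomes selectable by type once the module that defines (and registers) it has been imported.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Type


class DetectorType(Enum):  # hypothetical stand-in for EntityDetectorType
    REGEX = "regex"
    MY_DETECTOR = "my_detector"


_CONFIGS: Dict[DetectorType, Type] = {}


def register_config(cls):
    """Index a config class by the value its get_type() returns."""
    _CONFIGS[cls.get_type()] = cls
    return cls


@register_config  # runs when this module is imported
@dataclass(frozen=True)
class MyDetectorConfig:
    my_threshold: float = 0.5

    @classmethod
    def get_type(cls) -> DetectorType:
        return DetectorType.MY_DETECTOR


def build_config(block: Dict[str, Any]):
    """Turn one parsed YAML block into the config class named by `type`."""
    params = dict(block)
    kind = DetectorType(params.pop("type"))
    if kind not in _CONFIGS:
        raise KeyError(f"no config registered for {kind.value!r}")
    return _CONFIGS[kind](**params)


cfg = build_config({"type": "my_detector", "my_threshold": 0.6})
```

In this toy version, asking for type: regex fails with a KeyError because no module registering that type was ever imported, which is the kind of failure the note above helps avoid.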

Best practices for detectors:

Fill in confidence when a natural score exists; it helps to break ties in overlaps.
Maintain ENTITY_TYPES and EntityTypeBase.aliases() if you need to normalize alternative names (the pipeline uses EntityTypeBase.global_alias_map() for canonical names).
Use priority by type and hierarchy_position to guide selection in conflicts (see overlap section).

Complete example (skeleton)

# 1) Types
from veil.core.base_entity_type import EntityTypeBase
class MyEntityType(EntityTypeBase):
    NAME = 1

# 2) Detector
from typing import List, Set
from veil.core.base_entity_detector import BaseEntityDetector
from veil.core.document import Document
from veil.core.span import Span
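class MyDetector(BaseEntityDetector[MyEntityType]):
    # Entity types this detector can emit
    ENTITY_TYPES: Set[MyEntityType] = {MyEntityType.NAME}

    def __init__(self, config):
        # Pass the detector config (a BaseEntityDetectorConfig subclass) to the base class
        super().__init__(config)

    def detect_entities(self, doc: Document) -> List[Span]:
        # Placeholder: return one Span per entity found in doc.text
        return []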

# 3) Config
from dataclasses import field
from veil.config.core.frozen_dataclass import frozen_dataclass
from veil.config.entity_detectors import BaseEntityDetectorConfig
from veil.core.enums.entity_detector_type import EntityDetectorType
@frozen_dataclass
class MyDetectorConfig(BaseEntityDetectorConfig):
    my_threshold: float = field(default=0.5)
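    @classmethod
    def get_type(cls):
        # GLINER is reused here only to keep the skeleton short; prefer a new
        # EntityDetectorType member for a real detector
        return EntityDetectorType.GLINER

# 4) Registry
from veil.entity_detectors.registry import EntityDetectorRegistry
# Use the same EntityDetectorType that MyDetectorConfig.get_type() returns
EntityDetectorRegistry.register(EntityDetectorType.GLINER, MyDetector)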

# 5) (polymorphic) Expose Config and detector in __init__.py
# - Add `from .entity_detectors import MyDetectorConfig` in `veil/config/__init__.py`
# - Add `from .my_detector import MyDetector` in `veil/entity_detectors/__init__.py` (if applicable)

And in YAML:

entity_detectors:
  - type: gliner  # or your new type
    my_threshold: 0.6
    priority:
      NAME: 0
    hierarchy_position: 0

2) Entity Resolvers (optional)

Practical interface: any class with resolve(self, doc: Document, spans: List[Span], entity_cache: Optional[Dict[str, Dict[int, Set[str]]]] = None) -> List[Span] works (see EmbeddingsEntityResolver in veil.entity_resolvers).
Config: inherit from BaseEntityResolverConfig and implement get_type().
Input: the document and the combined list of spans detected so far (accumulated from all detectors); optionally an entity_cache keyed by type.
Output: a list of spans, potentially with ids assigned or with deduplications/merges applied.

Simple example: assign sequential IDs by type

from typing import Dict, List, Optional, Set
from veil.core.document import Document
from veil.core.span import Span

class SimpleIdResolver:
    def __init__(self, config):
        self.config = config

    def resolve(self, doc: Document, spans: List[Span], entity_cache: Optional[Dict[str, Dict[int, Set[str]]]] = None) -> List[Span]:
        next_id_per_type: Dict[str, int] = {}
        out: List[Span] = []
        for s in spans:
            et = getattr(s.entity_type, "name", "")
            if getattr(s, "id", None) is None:
                nid = next_id_per_type.get(et, 1)
                next_id_per_type[et] = nid + 1
                out.append(Span(start=s.start, end=s.end, entity_type=s.entity_type, id=nid, replacement=s.replacement, confidence=s.confidence))
            else:
                out.append(s)
        return out
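A resolver can also use the document text, for example so that repeated mentions of the same surface string share an id. The following is a minimal sketch of that idea under stated assumptions: Span is a local stand-in with a string entity_type, matching is case-insensitive on the exact surface text, and the function name assign_ids_by_text is hypothetical. It is not how EmbeddingsEntityResolver works.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Span:  # stand-in for veil.core.span.Span, reduced to four fields
    start: int
    end: int
    entity_type: str
    id: Optional[int] = None


def assign_ids_by_text(text: str, spans: List[Span]) -> List[Span]:
    seen: Dict[Tuple[str, str], int] = {}  # (type, surface) -> id
    next_id: Dict[str, int] = {}           # per-type counter
    out: List[Span] = []
    for s in spans:
        if s.id is not None:  # keep ids assigned by earlier stages
            out.append(s)
            continue
        key = (s.entity_type, text[s.start:s.end].casefold())
        if key not in seen:
            next_id[s.entity_type] = next_id.get(s.entity_type, 0) + 1
            seen[key] = next_id[s.entity_type]
        out.append(replace(s, id=seen[key]))
    return out


text = "Ana met Bob. ANA left."
spans = [Span(0, 3, "NAME"), Span(8, 11, "NAME"), Span(13, 16, "NAME")]
ids = [s.id for s in assign_ids_by_text(text, spans)]  # "Ana" and "ANA" share an id
```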

3) Overlap Resolution (OverlapResolver)

Class: veil.overlap_resolver.OverlapResolver, used from veil.pipeline; configuration in veil.config.
Inputs: for each detector, its list of spans, a priority map by canonical type, and a global hierarchy_position. The iou_threshold parameter is set in OverlapResolverConfig.
Output: List[Span] selected, resolving conflicts between spans of the same type.

It is not usually necessary to implement a new one; adjust priority and hierarchy_position from the detectors’ configuration.
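To give an intuition for how priority, hierarchy_position, and iou_threshold can interact, here is a greedy selection sketch. It is not the OverlapResolver algorithm. The ranking order (hierarchy_position first, then priority, then confidence), the convention that lower numbers win, and the Candidate class are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:  # hypothetical: a span plus its detector's settings
    start: int
    end: int
    entity_type: str
    hierarchy_position: int  # assumed: lower value wins
    priority: int            # assumed: lower value wins
    confidence: float = 0.0


def iou(a: Candidate, b: Candidate) -> float:
    """Intersection over union of two character ranges."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union else 0.0


def select(cands: List[Candidate], iou_threshold: float = 0.5) -> List[Candidate]:
    ranked = sorted(
        cands, key=lambda c: (c.hierarchy_position, c.priority, -c.confidence)
    )
    kept: List[Candidate] = []
    for c in ranked:
        # Keep a candidate only if it does not conflict with a better one.
        if all(iou(c, k) < iou_threshold for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.start)


a = Candidate(0, 10, "NAME", hierarchy_position=1, priority=0, confidence=0.9)
b = Candidate(2, 10, "NAME", hierarchy_position=0, priority=0, confidence=0.6)
c = Candidate(20, 25, "NAME", hierarchy_position=1, priority=0, confidence=0.5)
chosen = select([a, b, c])  # b beats a on hierarchy_position; c has no conflict
```

In this sketch confidence only breaks ties between candidates whose hierarchy_position and priority are equal.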

4) Masker

Class: veil.masker.Masker in veil.masker. It receives a Document and spans and returns a MaskResult.
Mask types: controlled by MaskerConfig.method (see veil.config).

You will generally not need to extend it to add detectors; its interface is stable.
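For orientation, the core of a masking step can be sketched as a right-to-left replacement over the selected spans. This is an illustration only: the "[TYPE_id]" label format and the mask_text function are hypothetical, the real output depends on MaskerConfig.method, and the sketch assumes the spans no longer overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:  # stand-in for veil.core.span.Span
    start: int
    end: int
    entity_type: str
    id: Optional[int] = None
    replacement: Optional[str] = None


def mask_text(text: str, spans: List[Span]) -> str:
    # Replace from the end of the text backwards so that the offsets of
    # spans not yet processed remain valid.
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        label = s.entity_type if s.id is None else f"{s.entity_type}_{s.id}"
        token = s.replacement if s.replacement is not None else f"[{label}]"
        text = text[:s.start] + token + text[s.end:]
    return text


masked = mask_text(
    "Ana called Bob.",
    [Span(0, 3, "NAME", id=1), Span(11, 14, "NAME", replacement="<person>")],
)
```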

5) Evaluation (Evaluator)

Orchestration class: veil.evaluator.Evaluator in veil.pipeline; configuration in veil.config.
Main input: document, mask_result (final spans), component_spans (spans per detector), a map of supported types per component, and metric_store.
Output: a dictionary with metric variants (exact, iou@THRESH, etc.) which is attached to MaskResult.evaluation.

For most extensions (new detectors), you do not need to touch the evaluation.
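When interpreting the results, it can help to see how an exact variant and an IoU-based variant differ. The sketch below computes precision, recall, and F1 for both over simplified (start, end, entity_type) tuples. It is not the Evaluator implementation: the greedy one-to-one matching, the score function, and the result keys are assumptions that only loosely follow the variant names mentioned above.

```python
from typing import Dict, List, Optional, Tuple

SpanT = Tuple[int, int, str]  # (start, end, entity_type)


def iou(a: SpanT, b: SpanT) -> float:
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def score(pred: List[SpanT], gold: List[SpanT],
          threshold: Optional[float] = None) -> Dict[str, float]:
    """threshold=None requires exact boundaries; otherwise IoU >= threshold."""
    def match(p: SpanT, g: SpanT) -> bool:
        if p[2] != g[2]:
            return False
        return p[:2] == g[:2] if threshold is None else iou(p, g) >= threshold

    unmatched = list(gold)
    tp = 0
    for p in pred:
        # Greedy matching: each gold span can be matched at most once.
        hit = next((g for g in unmatched if match(p, g)), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [(0, 3, "NAME"), (11, 14, "NAME")]
pred = [(0, 3, "NAME"), (10, 14, "NAME")]  # second span starts one char early
metrics = {"exact": score(pred, gold), "iou@0.5": score(pred, gold, 0.5)}
```

Here the second prediction counts as a miss under exact matching but as a hit under iou@0.5, since its IoU with the gold span is 0.75.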