Implementing pipeline components

This chapter describes how to extend Veil by implementing new pipeline components, with a special focus on entity detectors. You will see what input each stage receives, what output it produces, and how it fits into the overall flow.

Shared Types and Data

  • Document (input to almost all components): see veil.core.document.Document, used throughout veil.pipeline. Key fields: text, doc_id, ground_truth.

  • Span (entity unit): see veil.core.span.Span referenced from veil.pipeline. Fields: start, end, entity_type, id, replacement, confidence.

  • MaskResult (output of the masker and the pipeline): see veil.core.mask_result.MaskResult in veil.masker. Fields: masked_text, entities, evaluation.

These classes are propagated through the pipeline stages and define the standard inputs/outputs.
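To make the field lists above concrete, here is a minimal sketch of the three shapes as plain dataclasses. These are stand-ins written for illustration, not the real veil classes: the field names come from the list above, while the field types, the defaults, the half-open offset convention, and DemoEntityType are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class DemoEntityType(Enum):
    """Hypothetical entity type used only in this illustration."""
    EMAIL = "EMAIL"


@dataclass
class Span:
    start: int                          # offset of the first character (assumed inclusive)
    end: int                            # offset past the last character (assumed exclusive)
    entity_type: Enum
    id: Optional[int] = None            # typically filled in by a resolver
    replacement: Optional[str] = None   # typically filled in by the masker
    confidence: Optional[float] = None  # set only when the detector has a score


@dataclass
class Document:
    text: str
    doc_id: str
    ground_truth: Optional[List[Span]] = None  # assumed: reference spans, if any


@dataclass
class MaskResult:
    masked_text: str
    entities: List[Span] = field(default_factory=list)
    evaluation: Optional[Dict[str, Any]] = None


doc = Document(text="Write to ana@example.com", doc_id="doc-1")
span = Span(start=9, end=24, entity_type=DemoEntityType.EMAIL, confidence=0.9)
result = MaskResult(masked_text="Write to [EMAIL]", entities=[span])
```

With this convention, doc.text[span.start:span.end] recovers the detected text, which is a convenient sanity check when writing a detector.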

Flow and Interfaces by Component

At a high level, the pipeline (see Pipeline.process in veil.pipeline) executes the following steps, given an input Document:

  1. Entity detectors → One list of Spans per detector.

  2. Entity resolvers (optional) → Adjust or merge Spans and optionally assign consistent ids.

  3. Overlap resolution → Final selection of Spans according to priorities and hierarchy.

  4. Masking → MaskResult with masked_text and entities.

  5. Evaluation (optional) → Metrics with ground_truth if available.
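The order of the five steps can be sketched with plain functions. This is a simplified illustration, not the code of Pipeline.process: spans are reduced to (start, end, label) tuples, all detector output is merged into one list before resolution and selection, and every name below (run_stages, find_digits, star_mask) is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

SimpleSpan = Tuple[int, int, str]  # (start, end, label): sketch-only stand-in for Span


def run_stages(
    text: str,
    detectors: List[Callable[[str], List[SimpleSpan]]],
    resolvers: List[Callable[[str, List[SimpleSpan]], List[SimpleSpan]]],
    select: Callable[[List[SimpleSpan]], List[SimpleSpan]],
    mask: Callable[[str, List[SimpleSpan]], str],
    evaluate: Optional[Callable[[List[SimpleSpan]], Dict[str, float]]] = None,
) -> Dict[str, object]:
    spans: List[SimpleSpan] = []
    for detect in detectors:          # 1. detectors run in list order
        spans.extend(detect(text))
    for resolve in resolvers:         # 2. optional resolvers
        spans = resolve(text, spans)
    final = select(spans)             # 3. overlap resolution
    masked = mask(text, final)        # 4. masking
    metrics = evaluate(final) if evaluate else None  # 5. optional evaluation
    return {"masked_text": masked, "entities": final, "evaluation": metrics}


def find_digits(text: str) -> List[SimpleSpan]:
    """Toy detector: one span per digit character."""
    return [(i, i + 1, "DIGIT") for i, ch in enumerate(text) if ch.isdigit()]


def star_mask(text: str, spans: List[SimpleSpan]) -> str:
    """Toy masker: overwrite each span with asterisks of the same length."""
    chars = list(text)
    for start, end, _ in spans:
        chars[start:end] = "*" * (end - start)
    return "".join(chars)


out = run_stages(
    "room 42",
    detectors=[find_digits],
    resolvers=[],
    select=lambda spans: sorted(set(spans)),
    mask=star_mask,
)
```

The returned dictionary mirrors the three MaskResult fields so that the correspondence with the real output type stays visible.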

1) Entity Detectors (main extension)

  • Base class to inherit from: veil.core.base_entity_detector.BaseEntityDetector in veil.entity_detectors.

  • Associated config: subclasses of veil.config.entity_detectors.BaseEntityDetectorConfig in veil.config.

  • Registry: veil.entity_detectors.registry.EntityDetectorRegistry in veil.entity_detectors maps EntityDetectorType → concrete class.

Implementation contract:

  • Define ENTITY_TYPES: Set[EntityTypeBase] or a specific EntityTypeBase (enum) with the types supported by the detector. Examples of types: RegexEntityType, GlinerEntityType (see veil.entity_detectors).

  • Implement detect_entities(self, doc: Document) -> List[Span] returning spans with start, end, entity_type, and, if applicable, confidence.

  • Read the detector's own parameters from the config passed to the constructor (a subclass of BaseEntityDetectorConfig). Common fields available on every config: priority (per canonical type) and hierarchy_position (global precedence). See veil.config.

Inputs and outputs:

  • Input: Document (doc.text, doc.doc_id).

  • Output: List[Span] detected by the detector.
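As an illustration of the contract, the following sketch implements a regex-based e-mail detector. To keep it runnable without veil installed, Document and Span are small stand-ins and the detector does not inherit from BaseEntityDetector; ContactEntityType, EmailDetectorConfig, EmailDetector, and the pattern itself are hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class ContactEntityType(Enum):   # stands in for an EntityTypeBase subclass
    EMAIL = "EMAIL"


@dataclass
class Document:                  # stand-in for veil.core.document.Document
    text: str
    doc_id: str


@dataclass
class Span:                      # stand-in for veil.core.span.Span
    start: int
    end: int
    entity_type: Enum
    confidence: Optional[float] = None


@dataclass(frozen=True)
class EmailDetectorConfig:       # a real config would subclass BaseEntityDetectorConfig
    pattern: str = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


class EmailDetector:             # a real detector would inherit from BaseEntityDetector
    ENTITY_TYPES: Set[ContactEntityType] = {ContactEntityType.EMAIL}

    def __init__(self, config: EmailDetectorConfig):
        self.config = config
        self._regex = re.compile(config.pattern)

    def detect_entities(self, doc: Document) -> List[Span]:
        # A regex match has no natural score, so confidence is left unset.
        return [
            Span(start=m.start(), end=m.end(), entity_type=ContactEntityType.EMAIL)
            for m in self._regex.finditer(doc.text)
        ]


detector = EmailDetector(EmailDetectorConfig())
doc = Document(text="Mail ana@example.com or bob@example.org.", doc_id="d1")
spans = detector.detect_entities(doc)
found = [doc.text[s.start:s.end] for s in spans]
```

Compiling the pattern once in the constructor keeps detect_entities cheap when the same detector instance processes many documents.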

Automatic integration into the pipeline:

  • The pipeline instantiates detectors from the entity_detectors list in PipelineConfig using the registry. The order in the list defines the execution order. See Pipeline.__init__ and Pipeline.process in veil.pipeline.
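The registry lookup described above can be pictured as follows. This is a sketch of the general pattern only: DetectorKind, DetectorRegistry.create, and the two toy detectors are hypothetical names, and the real EntityDetectorRegistry may expose a different API.

```python
from enum import Enum
from typing import Dict, List, Type


class DetectorKind(Enum):            # stand-in for EntityDetectorType
    REGEX = "regex"
    KEYWORD = "keyword"


class RegexConfig:                   # stand-ins for BaseEntityDetectorConfig subclasses
    @classmethod
    def get_type(cls) -> DetectorKind:
        return DetectorKind.REGEX


class KeywordConfig:
    @classmethod
    def get_type(cls) -> DetectorKind:
        return DetectorKind.KEYWORD


class RegexDetector:
    def __init__(self, config):
        self.config = config


class KeywordDetector:
    def __init__(self, config):
        self.config = config


class DetectorRegistry:
    _classes: Dict[DetectorKind, Type] = {}

    @classmethod
    def register(cls, kind: DetectorKind, detector_cls: Type) -> None:
        cls._classes[kind] = detector_cls

    @classmethod
    def create(cls, config):         # hypothetical helper: look up the class, then build it
        return cls._classes[config.get_type()](config)


DetectorRegistry.register(DetectorKind.REGEX, RegexDetector)
DetectorRegistry.register(DetectorKind.KEYWORD, KeywordDetector)

# The order of the config list is the execution order of the detectors.
configs = [KeywordConfig(), RegexConfig()]
detectors = [DetectorRegistry.create(c) for c in configs]
names = [type(d).__name__ for d in detectors]
```

Because each config reports its own type through get_type(), the pipeline never needs to know the concrete detector classes.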

Quick steps to add a new detector

  1. Define the entity types (subclass of EntityTypeBase) and, if applicable, their aliases().

  2. Implement the detector class inheriting from BaseEntityDetector, define ENTITY_TYPES and detect_entities().

  3. Create the Config class inheriting from BaseEntityDetectorConfig, add your parameters, and implement get_type().

  4. Register the class in EntityDetectorRegistry under an EntityDetectorType.

  5. Add a block for the detector to the entity_detectors list of your PipelineConfig YAML, with type: and your parameters.

  6. Adjust priority by type and hierarchy_position for the desired overlap behavior.

Important note on polymorphic types (type):

  • In addition to registering the class, import your new configuration class in veil/config/__init__.py to expose it to the configuration system, and if you create a new detector module/package, expose it also in veil/entity_detectors/__init__.py. This allows selecting it by type from YAML/CLI.

Best practices for detectors:

  • Fill in confidence when a natural score exists; it helps to break ties in overlaps.

  • Maintain ENTITY_TYPES and EntityTypeBase.aliases() if you need to normalize alternative names (the pipeline uses EntityTypeBase.global_alias_map() for canonical names).

  • Use priority by type and hierarchy_position to guide selection in conflicts (see overlap section).
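Alias normalization can be sketched as follows. The return shape of aliases() and the way a global map is assembled are assumptions made for this illustration; build_alias_map is a hypothetical helper, not EntityTypeBase.global_alias_map().

```python
from enum import Enum
from typing import Dict, Iterable


class PersonType(Enum):              # stand-in for an EntityTypeBase subclass
    NAME = "NAME"

    @classmethod
    def aliases(cls) -> Dict[str, "PersonType"]:
        # Alternative labels that should collapse into the canonical member.
        return {"PERSON": cls.NAME, "PER": cls.NAME}


class ContactType(Enum):
    EMAIL = "EMAIL"

    @classmethod
    def aliases(cls) -> Dict[str, "ContactType"]:
        return {"E-MAIL": cls.EMAIL}


def build_alias_map(type_classes: Iterable[type]) -> Dict[str, str]:
    """Map every known label (canonical or alias) to its canonical name."""
    mapping: Dict[str, str] = {}
    for type_cls in type_classes:
        for member in type_cls:
            mapping[member.name.upper()] = member.name
        for alias, member in type_cls.aliases().items():
            mapping[alias.upper()] = member.name
    return mapping


alias_map = build_alias_map([PersonType, ContactType])
canonical = alias_map["per".upper()]   # look up a lower-case label after upper-casing it
```

Upper-casing both the keys and the lookup makes the normalization case-insensitive, which is usually what you want when labels come from different models.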

Complete example (skeleton)

# 1) Types
from veil.core.base_entity_type import EntityTypeBase


class MyEntityType(EntityTypeBase):
    NAME = 1


# 2) Detector
from typing import List, Set

from veil.core.base_entity_detector import BaseEntityDetector
from veil.core.document import Document
from veil.core.span import Span


class MyDetector(BaseEntityDetector[MyEntityType]):
    ENTITY_TYPES: Set[MyEntityType] = {MyEntityType.NAME}

    def __init__(self, config):
        super().__init__(config)

    def detect_entities(self, doc: Document) -> List[Span]:
        # Placeholder: return one Span per entity found in doc.text.
        return []


# 3) Config
from dataclasses import field

from veil.config.core.frozen_dataclass import frozen_dataclass
from veil.config.entity_detectors import BaseEntityDetectorConfig
from veil.core.enums.entity_detector_type import EntityDetectorType


@frozen_dataclass
class MyDetectorConfig(BaseEntityDetectorConfig):
    my_threshold: float = field(default=0.5)

    @classmethod
    def get_type(cls):
        # GLINER is only a placeholder; use your own EntityDetectorType member if you add one.
        return EntityDetectorType.GLINER


# 4) Registry
from veil.entity_detectors.registry import EntityDetectorRegistry

# Register under the same type that get_type() returns.
EntityDetectorRegistry.register(EntityDetectorType.GLINER, MyDetector)

# 5) (polymorphic) Expose Config and detector in __init__.py
# - Add `from .entity_detectors import MyDetectorConfig` in `veil/config/__init__.py`
# - Add `from .my_detector import MyDetector` in `veil/entity_detectors/__init__.py` (if applicable)

And in YAML:

entity_detectors:
  - type: gliner  # or your new type
    my_threshold: 0.6
    priority:
      NAME: 0
    hierarchy_position: 0

2) Entity Resolvers (optional)

  • Practical interface: any class with resolve(self, doc: Document, spans: List[Span], entity_cache: Optional[Dict[str, Dict[int, Set[str]]]] = None) -> List[Span] works (see EmbeddingsEntityResolver in veil.entity_resolvers).

  • Config: inherit from BaseEntityResolverConfig and implement get_type().

  • Input: the document and the combined list of spans detected so far (accumulated from all detectors); optionally an entity_cache keyed by type.

  • Output: the list of spans, potentially with ids assigned and duplicates removed or merged.

Simple example: assign sequential IDs by type

from typing import Dict, List, Optional, Set
from veil.core.document import Document
from veil.core.span import Span

class SimpleIdResolver:
    def __init__(self, config):
        self.config = config

    def resolve(self, doc: Document, spans: List[Span], entity_cache: Optional[Dict[str, Dict[int, Set[str]]]] = None) -> List[Span]:
        next_id_per_type: Dict[str, int] = {}
        out: List[Span] = []
        for s in spans:
            et = getattr(s.entity_type, "name", "")
            if getattr(s, "id", None) is None:
                nid = next_id_per_type.get(et, 1)
                next_id_per_type[et] = nid + 1
                out.append(
                    Span(
                        start=s.start,
                        end=s.end,
                        entity_type=s.entity_type,
                        id=nid,
                        replacement=s.replacement,
                        confidence=s.confidence,
                    )
                )
            else:
                out.append(s)
        return out
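A variation that assigns consistent ids, so that repeated mentions of the same text share one id, could look like the sketch below. It runs without veil: Document and Span are minimal stand-ins, entity_type is a plain string for brevity, and SurfaceFormIdResolver is a hypothetical name. Matching mentions by case-folded text is an assumption chosen for simplicity, not how EmbeddingsEntityResolver works.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass
class Document:                  # stand-in for veil.core.document.Document
    text: str
    doc_id: str


@dataclass
class Span:                      # stand-in for veil.core.span.Span
    start: int
    end: int
    entity_type: str             # plain string here; the real field holds an entity type enum
    id: Optional[int] = None


class SurfaceFormIdResolver:
    def resolve(self, doc: Document, spans: List[Span], entity_cache=None) -> List[Span]:
        ids: Dict[Tuple[str, str], int] = {}   # (type, folded text) -> id
        last_id: Dict[str, int] = {}           # highest id handed out per type
        out: List[Span] = []
        for s in spans:
            key = (s.entity_type, doc.text[s.start:s.end].casefold())
            if key not in ids:
                last_id[s.entity_type] = last_id.get(s.entity_type, 0) + 1
                ids[key] = last_id[s.entity_type]
            out.append(replace(s, id=ids[key]))  # copy the span instead of mutating it
        return out


doc = Document(text="Ana met Bob. ANA left.", doc_id="d1")
spans = [Span(0, 3, "NAME"), Span(8, 11, "NAME"), Span(13, 16, "NAME")]
resolved = SurfaceFormIdResolver().resolve(doc, spans)
resolved_ids = [s.id for s in resolved]
```

Both mentions of Ana receive the same id, so a masker that derives placeholders from ids would replace them with the same token.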

3) Overlap Resolution (OverlapResolver)

  • Class: veil.overlap_resolver.OverlapResolver in veil.pipeline; configuration in veil.config.

  • Inputs: for each detector, its list of spans, its priority map by canonical type, and its global hierarchy_position. The iou_threshold parameter is set in OverlapResolverConfig.

  • Output: List[Span] selected, resolving conflicts between spans of the same type.

It is not usually necessary to implement a new one; instead, adjust priority and hierarchy_position in each detector's configuration.
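To give an intuition for how these parameters interact, here is a greedy selection sketch. It is not the OverlapResolver implementation, and several rules in it are assumptions: lower hierarchy_position wins first, then lower priority value, then higher confidence; two spans of the same type conflict when their IoU exceeds iou_threshold. Candidate and select_spans are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    start: int
    end: int
    entity_type: str
    hierarchy_position: int            # taken from the detector's config
    priority: int                      # the detector's priority for this entity type
    confidence: Optional[float] = None


def iou(a: Candidate, b: Candidate) -> float:
    """Intersection over union of two character intervals."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union else 0.0


def select_spans(candidates: List[Candidate], iou_threshold: float = 0.5) -> List[Candidate]:
    # Assumed ranking: best candidates first.
    ranked = sorted(
        candidates,
        key=lambda c: (c.hierarchy_position, c.priority, -(c.confidence or 0.0)),
    )
    kept: List[Candidate] = []
    for cand in ranked:
        conflict = any(
            k.entity_type == cand.entity_type and iou(cand, k) > iou_threshold
            for k in kept
        )
        if not conflict:
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


candidates = [
    Candidate(0, 10, "NAME", hierarchy_position=1, priority=0, confidence=0.7),
    Candidate(0, 9, "NAME", hierarchy_position=0, priority=0, confidence=0.6),
    Candidate(20, 25, "EMAIL", hierarchy_position=1, priority=0),
]
selected = select_spans(candidates)
```

In this example the two NAME candidates overlap with an IoU of 0.9, so only the one from the detector with the lower hierarchy_position survives, even though its confidence is lower.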

4) Masker

  • Class: veil.masker.Masker in veil.masker. It receives a Document and spans and returns a MaskResult.

  • Mask types: controlled by MaskerConfig.method (see veil.config).

You will generally not need to extend it to add detectors; its interface is stable.
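For orientation, the core of a masking step can be sketched in a few lines. This is an illustration only: the placeholder format [TYPE_id] is invented for the example, mask_text is a hypothetical function, and the values accepted by MaskerConfig.method are not modeled. The sketch assumes the spans no longer overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:                      # stand-in for veil.core.span.Span
    start: int
    end: int
    entity_type: str             # plain string here for brevity
    id: Optional[int] = None
    replacement: Optional[str] = None


def mask_text(text: str, spans: List[Span]) -> str:
    masked = text
    # Replace from right to left so that earlier offsets stay valid.
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        if s.replacement is None:
            suffix = f"_{s.id}" if s.id is not None else ""
            s.replacement = f"[{s.entity_type}{suffix}]"  # records the mask on the span
        masked = masked[:s.start] + s.replacement + masked[s.end:]
    return masked


text = "Ana wrote to bob@example.org"
spans = [Span(0, 3, "NAME", id=1), Span(13, 28, "EMAIL")]
masked = mask_text(text, spans)
```

Processing spans from the end of the text backwards avoids having to recompute offsets after each replacement changes the string length.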

5) Evaluation (Evaluator)

  • Orchestration class: veil.evaluator.Evaluator in veil.pipeline; configuration in veil.config.

  • Main input: document, mask_result (final spans), component_spans (spans per detector), a map of supported types per component, and metric_store.

  • Output: a dictionary with metric variants (exact, iou@THRESH, etc.) which is attached to MaskResult.evaluation.

For most extensions (new detectors), you do not need to touch the evaluation.
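If you want to reason about how a new detector will be scored, the metric variants can be approximated as below. This is a simplified sketch, not the Evaluator: spans are (start, end, label) tuples, matching is greedy and one-to-one, and the score function and its output keys are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, label)


def interval_iou(a: Interval, b: Interval) -> float:
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def score(predicted: List[Interval], gold: List[Interval], iou_threshold: float) -> Dict[str, float]:
    unmatched = list(gold)
    true_positives = 0
    for p in predicted:
        for g in unmatched:
            # A prediction counts once, against one gold span of the same label.
            if p[2] == g[2] and interval_iou(p, g) >= iou_threshold:
                unmatched.remove(g)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [(0, 3, "NAME"), (13, 28, "EMAIL")]
pred = [(0, 3, "NAME"), (14, 28, "EMAIL")]  # second span starts one character late
metrics = {
    "exact": score(pred, gold, iou_threshold=1.0),
    "iou@0.5": score(pred, gold, iou_threshold=0.5),
}
```

The off-by-one EMAIL prediction fails the exact variant but passes at an IoU threshold of 0.5, which is why reporting both variants is informative.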