Label during development

Developers building GenAI applications need a way to keep track of their observations about the quality of the application's outputs. MLflow Tracing lets you add feedback or expectations directly to traces during development, giving you a quick way to record quality issues, mark successful examples, or add notes for future reference.

Prerequisites

Your application is instrumented with MLflow Tracing
You have generated traces by running your application

Add assessment labels

Assessments attach structured feedback, scores, or ground truth to traces and spans for quality evaluation and improvement in MLflow.

Databricks UI

You can add annotations (labels) directly to traces through the MLflow UI.

Note

If you are using a Databricks notebook, you can also perform these steps from the trace UI that renders inline in the notebook.


Navigate to the Traces tab in the MLflow experiment UI.
Open an individual trace.
In the trace UI, click the specific span you want to label.
- Selecting the root span attaches the feedback to the entire trace.
Expand the Assessments tab at the far right.
Fill in the form to add your feedback.
- Assessment type
  - Feedback: A subjective assessment of quality (ratings, comments)
  - Expectation: The expected output or value (what should have been produced)
- Assessment name
  - A unique name for what the feedback is about
- Data type
  - Number
  - Boolean
  - String
- Value
  - Your assessment
- Rationale
  - Optional notes about the value
Click Create to save your label.
When you return to the Traces tab, your label appears as a new column.
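
The form fields correspond closely to the keyword arguments used when logging the same label from code. The sketch below is illustrative only: `LabelForm`, `data_type`, and `sdk_kwargs` are hypothetical names that are not part of MLflow, and the mapping assumes the keyword names (`trace_id`, `name`, `value`, `rationale`) that appear in the SDK examples in this guide.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

# Python types assumed to correspond to the form's Number, Boolean, and String options.
FormValue = Union[int, float, bool, str]


@dataclass
class LabelForm:
    """Hypothetical container mirroring the fields of the assessment form."""

    assessment_type: str  # "feedback" or "expectation"
    name: str
    value: FormValue
    rationale: Optional[str] = None

    def data_type(self) -> str:
        """Return the form's data type label for the current value."""
        # bool must be checked before int because bool is a subclass of int in Python.
        if isinstance(self.value, bool):
            return "Boolean"
        if isinstance(self.value, (int, float)):
            return "Number"
        if isinstance(self.value, str):
            return "String"
        raise TypeError(f"Unsupported value type: {type(self.value).__name__}")

    def sdk_kwargs(self, trace_id: str) -> Dict[str, Any]:
        """Build keyword arguments for a logging call (assumed mapping)."""
        kwargs: Dict[str, Any] = {
            "trace_id": trace_id,
            "name": self.name,
            "value": self.value,
        }
        # Only the feedback examples in this guide pass a rationale,
        # so this sketch includes it for feedback only.
        if self.assessment_type == "feedback" and self.rationale is not None:
            kwargs["rationale"] = self.rationale
        return kwargs


form = LabelForm("feedback", "is_helpful", True, "Clear and direct answer")
print(form.data_type(), form.sdk_kwargs("<trace-id>"))
```

Keeping the Boolean check ahead of the Number check matters here, because Python would otherwise classify `True` as a number.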

MLflow SDK

You can add labels to traces programmatically using the MLflow SDK. This is useful for automated labeling based on your application logic or for batch processing of traces.

MLflow provides two APIs:

mlflow.log_feedback() - Logs feedback that evaluates your app's actual outputs or intermediate steps (for example, "Was the response good?", ratings, comments).
mlflow.log_expectation() - Logs expectations that define the desired or correct outcome (ground truth) your app should have produced.

import mlflow
from mlflow.entities.assessment import (
    AssessmentSource,
    AssessmentSourceType,
    AssessmentError,
)


@mlflow.trace
def my_app(input: str) -> str:
    return input + "_output"


# Create a sample trace to demonstrate assessment logging
my_app(input="hello")

trace_id = mlflow.get_last_active_trace_id()

# Handle case where trace_id might be None
if trace_id is None:
    raise ValueError("No active trace found. Make sure to run a traced function first.")

print(f"Using trace_id: {trace_id}")


# =============================================================================
# LOG_FEEDBACK - Evaluating actual outputs and performance
# =============================================================================

# Example 1: Human rating (integer scale)
# Use case: Domain experts rating response quality on a 1-5 scale
mlflow.log_feedback(
    trace_id=trace_id,
    name="human_rating",
    value=4,  # int - rating scale feedback
    rationale="Human evaluator rating",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="evaluator@company.com",
    ),
)

# Example 2: LLM judge score (float for precise scoring)
# Use case: Automated quality assessment using LLM-as-a-judge
mlflow.log_feedback(
    trace_id=trace_id,
    name="llm_judge_score",
    value=0.85,  # float - precise scoring from 0.0 to 1.0
    rationale="LLM judge evaluation",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o-mini",
    ),
    metadata={"temperature": "0.1", "model_version": "2024-01"},
)

# Example 3: Binary feedback (boolean for yes/no assessments)
# Use case: Simple thumbs up/down or correct/incorrect evaluations
mlflow.log_feedback(
    trace_id=trace_id,
    name="is_helpful",
    value=True,  # bool - binary assessment
    rationale="Boolean assessment of helpfulness",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@company.com",
    ),
)

# Example 4: Multi-category feedback (list for multiple classifications)
# Use case: Automated categorization or multi-label classification
mlflow.log_feedback(
    trace_id=trace_id,
    name="automated_categories",
    value=["helpful", "accurate", "concise"],  # list - multiple categories
    rationale="Automated categorization",
    source=AssessmentSource(
        source_type=AssessmentSourceType.CODE,
        source_id="classifier_v1.2",
    ),
)

# Example 5: Complex analysis with metadata (when you need structured context)
# Use case: Detailed automated analysis with multiple dimensions stored in metadata
mlflow.log_feedback(
    trace_id=trace_id,
    name="response_analysis_score",
    value=4.2,  # single score instead of dict - keeps value simple
    rationale="Analysis: 150 words, positive sentiment, includes examples, confidence 0.92",
    source=AssessmentSource(
        source_type=AssessmentSourceType.CODE,
        source_id="analyzer_v2.1",
    ),
    metadata={  # Use metadata for structured details
        "word_count": "150",
        "sentiment": "positive",
        "has_examples": "true",
        "confidence": "0.92",
    },
)

# Example 6: Error handling when evaluation fails
# Use case: Logging when automated evaluators fail due to API limits, timeouts, etc.
mlflow.log_feedback(
    trace_id=trace_id,
    name="failed_evaluation",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o",
    ),
    error=AssessmentError(  # Use error field when evaluation fails
        error_code="RATE_LIMIT_EXCEEDED",
        error_message="API rate limit exceeded during evaluation",
    ),
    metadata={"retry_count": "3", "error_timestamp": "2024-01-15T10:30:00Z"},
)

# =============================================================================
# LOG_EXPECTATION - Defining ground truth and desired outcomes
# =============================================================================

# Example 1: Simple text expectation (most common pattern)
# Use case: Defining the ideal response for factual questions
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="The capital of France is Paris.",  # Simple string - the "correct" answer
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="content_curator@example.com",
    ),
)

# Example 2: Complex structured expectation (advanced pattern)
# Use case: Defining detailed requirements for response structure and content
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response_structure",
    value={  # Complex dict - detailed specification of ideal response
        "entities": {
            "people": ["Marie Curie", "Pierre Curie"],
            "locations": ["Paris", "France"],
            "dates": ["1867", "1934"],
        },
        "key_facts": [
            "First woman to win Nobel Prize",
            "Won Nobel Prizes in Physics and Chemistry",
            "Discovered radium and polonium",
        ],
        "response_requirements": {
            "tone": "informative",
            "length_range": {"min": 100, "max": 300},
            "include_examples": True,
            "citations_required": False,
        },
    },
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="content_strategist@example.com",
    ),
    metadata={
        "content_type": "biographical_summary",
        "target_audience": "general_public",
        "fact_check_date": "2024-01-15",
    },
)

# Example 3: Multiple acceptable answers (list pattern)
# Use case: When there are several valid ways to express the same fact
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_facts",
    value=[  # List of acceptable variations of the correct answer
        "Paris is the capital of France",
        "The capital city of France is Paris",
        "France's capital is Paris",
    ],
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="qa_team@example.com",
    ),
)

View assessments in the summary
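
For batch processing, the same call can be applied across many traces with a simple rule. The following is a minimal sketch under stated assumptions, not a definitive implementation: `label_by_length`, the `records` shape (dicts with `trace_id` and `output` keys), and the `is_long_enough` label name are all hypothetical. The logging function is passed in as a parameter so the rule can be exercised without a tracking server.

```python
from typing import Any, Callable, Iterable, List, Mapping


def label_by_length(
    records: Iterable[Mapping[str, str]],
    log_feedback: Callable[..., Any],
    min_words: int = 5,
) -> List[str]:
    """Log a boolean label on each trace based on response length.

    `records` is a hypothetical iterable of dicts with "trace_id" and
    "output" keys; how they are collected is left to the application.
    """
    labeled_trace_ids = []
    for record in records:
        word_count = len(record["output"].split())
        # Uses the same keyword arguments as the log_feedback examples above.
        log_feedback(
            trace_id=record["trace_id"],
            name="is_long_enough",  # hypothetical label name
            value=word_count >= min_words,
            rationale=f"Response has {word_count} words; threshold is {min_words}",
        )
        labeled_trace_ids.append(record["trace_id"])
    return labeled_trace_ids


# Dry run with a stand-in logger that only prints what would be logged.
sample_records = [
    {"trace_id": "<trace-id-1>", "output": "The capital of France is Paris."},
    {"trace_id": "<trace-id-2>", "output": "Paris"},
]
label_by_length(sample_records, log_feedback=lambda **kwargs: print(kwargs))
```

In a real run, `mlflow.log_feedback` could be passed as the `log_feedback` argument in place of the stand-in.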

Databricks REST API

Create assessments using the Databricks REST API to log feedback and assessments on traces programmatically from any environment.

See the Databricks REST API documentation.

REST API endpoint

POST https://<workspace-host>.databricks.com/api/3.0/mlflow/traces/{trace_id}/assessments

Example request:

curl -X POST \
  "https://<workspace-host>.databricks.com/api/3.0/mlflow/traces/<trace-id>/assessments" \
  -H "Authorization: Bearer <databricks-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "assessment": {
      "assessment_name": "string",
      "create_time": "2019-08-24T14:15:22Z",
      "expectation": {
        "serialized_value": {
          "serialization_format": "string",
          "value": "string"
        },
        "value": {}
      },
      "feedback": {
        "error": {
          "error_code": "string",
          "error_message": "string",
          "stack_trace": "string"
        },
        "value": {}
      },
      "last_update_time": "2019-08-24T14:15:22Z",
      "metadata": {
        "property1": "string",
        "property2": "string"
      },
      "overrides": "string",
      "rationale": "string",
      "source": {
        "source_id": "string",
        "source_type": "HUMAN"
      },
      "span_id": "string",
      "valid": true
    }
  }'

Example response:

{
  "assessment": {
    "assessment_id": "string",
    "assessment_name": "string",
    "create_time": "2019-08-24T14:15:22Z",
    "expectation": {
      "serialized_value": {
        "serialization_format": "string",
        "value": "string"
      },
      "value": {}
    },
    "feedback": {
      "error": {
        "error_code": "string",
        "error_message": "string",
        "stack_trace": "string"
      },
      "value": {}
    },
    "last_update_time": "2019-08-24T14:15:22Z",
    "metadata": {
      "property1": "string",
      "property2": "string"
    },
    "overrides": "string",
    "rationale": "string",
    "source": {
      "source_id": "string",
      "source_type": "HUMAN"
    },
    "span_id": "string",
    "trace_id": "string",
    "valid": true
  }
}
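
The request above lists every field with placeholder values. As a hedged illustration, the sketch below builds a smaller request for a single human feedback label using only the Python standard library. `build_feedback_request` is a hypothetical helper, the argument values are placeholders, and the choice of fields is an assumption: this page does not state which fields the endpoint requires. The request is constructed but not sent.

```python
import json
import urllib.request
from typing import Any


def build_feedback_request(
    workspace_host: str,
    trace_id: str,
    token: str,
    name: str,
    value: Any,
    rationale: str,
    source_id: str,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a feedback assessment."""
    # URL pattern copied from the endpoint shown above.
    url = (
        f"https://{workspace_host}.databricks.com"
        f"/api/3.0/mlflow/traces/{trace_id}/assessments"
    )
    # Field names come from the example request; the subset is an assumption.
    payload = {
        "assessment": {
            "assessment_name": name,
            "feedback": {"value": value},
            "rationale": rationale,
            "source": {"source_id": source_id, "source_type": "HUMAN"},
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with real ones before sending.
request = build_feedback_request(
    workspace_host="<workspace-host>",
    trace_id="<trace-id>",
    token="<databricks-token>",
    name="human_rating",
    value=4,
    rationale="Human evaluator rating",
    source_id="evaluator@company.com",
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect first.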

Next steps

Continue your journey with these recommended actions and tutorials.

Collect domain expert feedback - Set up structured labeling sessions
Build evaluation datasets - Use labeled traces to create test datasets
Collect end-user feedback - Capture feedback from deployed applications

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

Labeling schemas - Learn about structured feedback collection

Last updated on 2026-06-01