Label during development

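When you build a GenAI app with MLflow Tracing, you can add feedback or expectations directly to traces. You can flag quality issues, mark successful examples, or add notes for future reference. This lets you track quality during development, before you set up formal evaluation.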

Prerequisites

Your application is instrumented with MLflow Tracing.
You have run your application to generate traces.

Add assessment labels

Assessments attach structured feedback, scores, or ground truth to traces and spans so that you can evaluate and improve quality in MLflow.

Databricks UI

You can add annotations (labels) directly to traces through the MLflow UI.

Note

If you are using a Databricks notebook, you can also perform these steps from the trace UI that renders inline in the notebook.

Human feedback

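1. Navigate to the Traces tab in the MLflow Experiment UI.
2. Open an individual trace.
3. Within the trace UI, click the specific span you want to label.
   - Selecting the root span attaches the feedback to the entire trace.
4. Expand the Assessments tab at the far right.
5. Fill in the form to add your feedback.
   - Assessment Type
     - Feedback: A subjective assessment of quality (ratings, comments)
     - Expectation: The expected output or value (what should have been produced)
   - Assessment Name
     - A unique name for what the feedback is about
   - Data Type
     - Number
     - Boolean
     - String
   - Value
     - Your assessment
   - Rationale
     - Optional notes about the value
6. Click Create to save your label.
7. When you return to the Traces tab, your label appears as a new column.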

MLflow SDK

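You can add labels to traces programmatically with the MLflow SDK. This is useful for automated labeling based on your application logic, or for batch processing of traces.

MLflow provides two APIs:

mlflow.log_feedback() - Logs feedback that evaluates your app's actual outputs or intermediate steps (for example, "Was the response good?", ratings, comments).
mlflow.log_expectation() - Logs expectations that define the desired or correct outcome (ground truth) that your app should have produced.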

import mlflow
from mlflow.entities.assessment import (
    AssessmentSource,
    AssessmentSourceType,
    AssessmentError,
)


@mlflow.trace
def my_app(input: str) -> str:
    return input + "_output"


# Create a sample trace to demonstrate assessment logging
my_app(input="hello")

trace_id = mlflow.get_last_active_trace_id()

# Handle case where trace_id might be None
if trace_id is None:
    raise ValueError("No active trace found. Make sure to run a traced function first.")

print(f"Using trace_id: {trace_id}")


# =============================================================================
# LOG_FEEDBACK - Evaluating actual outputs and performance
# =============================================================================

# Example 1: Human rating (integer scale)
# Use case: Domain experts rating response quality on a 1-5 scale
mlflow.log_feedback(
    trace_id=trace_id,
    name="human_rating",
    value=4,  # int - rating scale feedback
    rationale="Human evaluator rating",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="evaluator@company.com",
    ),
)

# Example 2: LLM judge score (float for precise scoring)
# Use case: Automated quality assessment using LLM-as-a-judge
mlflow.log_feedback(
    trace_id=trace_id,
    name="llm_judge_score",
    value=0.85,  # float - precise scoring from 0.0 to 1.0
    rationale="LLM judge evaluation",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o-mini",
    ),
    metadata={"temperature": "0.1", "model_version": "2024-01"},
)

# Example 3: Binary feedback (boolean for yes/no assessments)
# Use case: Simple thumbs up/down or correct/incorrect evaluations
mlflow.log_feedback(
    trace_id=trace_id,
    name="is_helpful",
    value=True,  # bool - binary assessment
    rationale="Boolean assessment of helpfulness",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@company.com",
    ),
)

# Example 4: Multi-category feedback (list for multiple classifications)
# Use case: Automated categorization or multi-label classification
mlflow.log_feedback(
    trace_id=trace_id,
    name="automated_categories",
    value=["helpful", "accurate", "concise"],  # list - multiple categories
    rationale="Automated categorization",
    source=AssessmentSource(
        source_type=AssessmentSourceType.CODE,
        source_id="classifier_v1.2",
    ),
)

# Example 5: Complex analysis with metadata (when you need structured context)
# Use case: Detailed automated analysis with multiple dimensions stored in metadata
mlflow.log_feedback(
    trace_id=trace_id,
    name="response_analysis_score",
    value=4.2,  # single score instead of dict - keeps value simple
    rationale="Analysis: 150 words, positive sentiment, includes examples, confidence 0.92",
    source=AssessmentSource(
        source_type=AssessmentSourceType.CODE,
        source_id="analyzer_v2.1",
    ),
    metadata={  # Use metadata for structured details
        "word_count": "150",
        "sentiment": "positive",
        "has_examples": "true",
        "confidence": "0.92",
    },
)

# Example 6: Error handling when evaluation fails
# Use case: Logging when automated evaluators fail due to API limits, timeouts, etc.
mlflow.log_feedback(
    trace_id=trace_id,
    name="failed_evaluation",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o",
    ),
    error=AssessmentError(  # Use error field when evaluation fails
        error_code="RATE_LIMIT_EXCEEDED",
        error_message="API rate limit exceeded during evaluation",
    ),
    metadata={"retry_count": "3", "error_timestamp": "2024-01-15T10:30:00Z"},
)

# =============================================================================
# LOG_EXPECTATION - Defining ground truth and desired outcomes
# =============================================================================

# Example 1: Simple text expectation (most common pattern)
# Use case: Defining the ideal response for factual questions
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="The capital of France is Paris.",  # Simple string - the "correct" answer
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="content_curator@example.com",
    ),
)

# Example 2: Complex structured expectation (advanced pattern)
# Use case: Defining detailed requirements for response structure and content
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response_structure",
    value={  # Complex dict - detailed specification of ideal response
        "entities": {
            "people": ["Marie Curie", "Pierre Curie"],
            "locations": ["Paris", "France"],
            "dates": ["1867", "1934"],
        },
        "key_facts": [
            "First woman to win Nobel Prize",
            "Won Nobel Prizes in Physics and Chemistry",
            "Discovered radium and polonium",
        ],
        "response_requirements": {
            "tone": "informative",
            "length_range": {"min": 100, "max": 300},
            "include_examples": True,
            "citations_required": False,
        },
    },
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="content_strategist@example.com",
    ),
    metadata={
        "content_type": "biographical_summary",
        "target_audience": "general_public",
        "fact_check_date": "2024-01-15",
    },
)

# Example 3: Multiple acceptable answers (list pattern)
# Use case: When there are several valid ways to express the same fact
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_facts",
    value=[  # List of acceptable variations of the correct answer
        "Paris is the capital of France",
        "The capital city of France is Paris",
        "France's capital is Paris",
    ],
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="qa_team@example.com",
    ),
)

View assessments in the summary
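
The batch processing use case mentioned at the start of this section could be sketched roughly as follows. This is a minimal illustration, not part of the MLflow API: the helper `label_conciseness`, the feedback name `is_concise`, the word limit, and the `outputs_by_trace_id` mapping are all hypothetical. The logging function is passed in as a parameter so that the rule can be dry-run without contacting a tracking server.

```python
from typing import Callable, Dict, Mapping


def label_conciseness(
    outputs_by_trace_id: Mapping[str, str],
    log_feedback: Callable[..., object],
    max_words: int = 150,
) -> Dict[str, bool]:
    """Log a boolean `is_concise` feedback per trace and return the labels.

    Hypothetical helper: `outputs_by_trace_id` maps a trace ID to the app
    output you want to judge; how you collect it is up to your application.
    """
    labels: Dict[str, bool] = {}
    for trace_id, output in outputs_by_trace_id.items():
        word_count = len(output.split())
        is_concise = word_count <= max_words
        # Same keyword arguments as the log_feedback examples above.
        log_feedback(
            trace_id=trace_id,
            name="is_concise",
            value=is_concise,
            rationale=f"{word_count} words (limit {max_words})",
        )
        labels[trace_id] = is_concise
    return labels


# Dry run with a stand-in logger that records calls instead of calling MLflow.
recorded = []
labels = label_conciseness(
    {"tr-1": "Paris is the capital of France.", "tr-2": "word " * 200},
    log_feedback=lambda **kwargs: recorded.append(kwargs),
)
print(labels)
```

To write real labels, pass `mlflow.log_feedback` as the `log_feedback` argument. If you also want to record who or what produced the label, one option is to wrap it with `functools.partial` and pin a `source=AssessmentSource(...)` value like the ones shown in the examples above.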

Databricks REST API

Use the Databricks REST API to create assessments that programmatically log feedback and expectations on traces from any environment.

See the Databricks REST API documentation.

REST API endpoint

POST https://<workspace-host>.databricks.com/api/3.0/mlflow/traces/{trace_id}/assessments

Example request:

curl -X POST \
  "https://<workspace-host>.databricks.com/api/3.0/mlflow/traces/<trace-id>/assessments" \
  -H "Authorization: Bearer <databricks-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "assessment": {
      "assessment_name": "string",
      "create_time": "2019-08-24T14:15:22Z",
      "expectation": {
        "serialized_value": {
          "serialization_format": "string",
          "value": "string"
        },
        "value": {}
      },
      "feedback": {
        "error": {
          "error_code": "string",
          "error_message": "string",
          "stack_trace": "string"
        },
        "value": {}
      },
      "last_update_time": "2019-08-24T14:15:22Z",
      "metadata": {
        "property1": "string",
        "property2": "string"
      },
      "overrides": "string",
      "rationale": "string",
      "source": {
        "source_id": "string",
        "source_type": "HUMAN"
      },
      "span_id": "string",
      "valid": true
    }
  }'

Example response:

{
  "assessment": {
    "assessment_id": "string",
    "assessment_name": "string",
    "create_time": "2019-08-24T14:15:22Z",
    "expectation": {
      "serialized_value": {
        "serialization_format": "string",
        "value": "string"
      },
      "value": {}
    },
    "feedback": {
      "error": {
        "error_code": "string",
        "error_message": "string",
        "stack_trace": "string"
      },
      "value": {}
    },
    "last_update_time": "2019-08-24T14:15:22Z",
    "metadata": {
      "property1": "string",
      "property2": "string"
    },
    "overrides": "string",
    "rationale": "string",
    "source": {
      "source_id": "string",
      "source_type": "HUMAN"
    },
    "span_id": "string",
    "trace_id": "string",
    "valid": true
  }
}
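
For environments where Python is available but the MLflow SDK is not, the request above might be assembled with the standard library alone. The sketch below is an assumption-laden illustration: the helper `build_feedback_request` is hypothetical, the workspace URL, token, and trace ID are placeholders, and it assumes the endpoint accepts a payload that carries only the feedback-related fields from the schema shown above. Check the REST API documentation for which fields are actually required.

```python
import json
import urllib.request
from typing import Optional


def build_feedback_request(
    workspace_url: str,
    token: str,
    trace_id: str,
    name: str,
    value: object,
    source_id: str,
    source_type: str = "HUMAN",
    rationale: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that logs feedback on a trace."""
    # Field names follow the request schema in the curl example above.
    assessment = {
        "assessment_name": name,
        "feedback": {"value": value},
        "source": {"source_type": source_type, "source_id": source_id},
    }
    if rationale is not None:
        assessment["rationale"] = rationale

    base = workspace_url.rstrip("/")
    url = f"{base}/api/3.0/mlflow/traces/{trace_id}/assessments"
    return urllib.request.Request(
        url,
        data=json.dumps({"assessment": assessment}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only; substitute your own.
request = build_feedback_request(
    workspace_url="https://example-workspace.databricks.com",
    token="example-token",
    trace_id="tr-123",
    name="is_helpful",
    value=True,
    source_id="reviewer@company.com",
    rationale="Answered the question directly",
)
payload = json.loads(request.data)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the payload easy to inspect or log first. To send it, pass the object to `urllib.request.urlopen(request)` and parse the JSON body of the response, which should contain the created assessment in the shape shown above.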

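Additional resources

Collect domain expert feedback - Set up structured labeling sessions
Build evaluation datasets - Create test datasets from labeled traces
Collect end-user feedback - Capture feedback from deployed applications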

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

Labeling schemas - Learn about structured feedback collection

Last updated on 2026-06-24