Query reasoning models

Important

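The new Unity AI Gateway experience is available in Beta. The new Unity AI Gateway is an enterprise control plane for managing LLM endpoints and coding agents with enhanced features. See AI governance with Unity AI Gateway.

In this article, you learn how to write query requests for foundation models that are optimized for reasoning tasks and served by Unity AI Gateway.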

Tip

Genie Code (agent mode) can perform this task for you. Try the following example prompt:

Query the databricks-claude-sonnet-4-5 model using the OpenAI client with extended thinking enabled (budget_tokens set to 10240). Send a reasoning question and print both the thinking summary and the final answer.

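The Databricks Foundation Model API provides a unified API for interacting with all foundation models, including reasoning models. Reasoning gives foundation models enhanced capabilities for handling complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.

Types of reasoning models

There are two types of models: reasoning-only and hybrid. The following table describes how each type controls reasoning and which parameters it uses.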

Reasoning model type	Details	Model examples	Parameters
Hybrid reasoning	Supports both fast, instant replies and deeper reasoning when needed.	Claude models (for example, `databricks-claude-sonnet-4-6`, `databricks-claude-sonnet-4-5`, `databricks-claude-sonnet-4`, `databricks-claude-opus-4-8`, `databricks-claude-opus-4-7`, `databricks-claude-opus-4-6`, `databricks-claude-opus-4-5`, and `databricks-claude-opus-4-1`).	To use hybrid reasoning, include the `thinking` parameter with `budget_tokens`, which controls how many tokens the model can use for internal thinking. Higher budgets can improve quality for complex tasks, but usage above 32K may vary. `budget_tokens` must be less than `max_tokens`.
Reasoning only	These models always use internal reasoning in their responses.	GPT OSS models such as `databricks-gpt-oss-120b` and `databricks-gpt-oss-20b`.	Use the following parameter in your request. `reasoning_effort`: accepts `"low"` (default), `"medium"`, or `"high"`. Higher reasoning effort can result in more thoughtful and accurate responses, but may increase latency and token usage. This parameter is accepted only by a limited set of models, including `databricks-gpt-oss-120b` and `databricks-gpt-oss-20b`.
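
The constraint in the table that `budget_tokens` must be less than `max_tokens` can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Databricks API: the helper name `thinking_request_kwargs` is hypothetical, and the returned dictionary simply mirrors the `max_tokens` and `extra_body` arguments used in the Claude example in this article.

```python
def thinking_request_kwargs(max_tokens: int, budget_tokens: int) -> dict:
    """Build keyword arguments for a hybrid-reasoning chat completions request.

    Hypothetical helper: it only enforces the documented rule that
    budget_tokens must be less than max_tokens.
    """
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")
    return {
        "max_tokens": max_tokens,
        "extra_body": {
            "thinking": {"type": "enabled", "budget_tokens": budget_tokens}
        },
    }


kwargs = thinking_request_kwargs(max_tokens=20480, budget_tokens=10240)
print(kwargs["extra_body"]["thinking"])
```

The result can be unpacked into a call such as `client.chat.completions.create(model=..., messages=..., **kwargs)`.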

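Query examples

Note

The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with the endpoint name. For a list of available foundation models and their model service and endpoint names, see Databricks-hosted foundation models available in Foundation Model APIs.

All reasoning models are accessed through the chat completions endpoint.

Claude model example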

import os
from openai import OpenAI

# Read the access token and the base URL from environment variables.
client = OpenAI(
    api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
    base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL'),
)

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

# With thinking enabled, content is a list: a reasoning block, then a text block.
msg = response.choices[0].message
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]

print("Reasoning:", reasoning)
print("Answer:", answer)

GPT-5.1

The reasoning_effort parameter for GPT-5.1 defaults to none, but you can override it in the request. Higher reasoning effort can result in more thoughtful and accurate responses, but may increase latency and token usage.

curl -X POST "https://<workspace_host>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-5-1",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "none"
  }'

GPT OSS model example

The reasoning_effort parameter accepts "low" (default), "medium", or "high". Higher reasoning effort can result in more thoughtful and accurate responses, but may increase latency and token usage.

curl -X POST "https://<workspace_host>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "high"
  }'
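
The same request body can be assembled in Python before it is posted. The following is a rough sketch under stated assumptions: the helper `gpt_oss_payload` and the constant `ALLOWED_EFFORTS` are hypothetical names, and the client-side check only reflects the three values listed for GPT OSS models in this article. It builds the JSON body and does not send it.

```python
import json

# Values this article lists for GPT OSS models.
ALLOWED_EFFORTS = ("low", "medium", "high")


def gpt_oss_payload(prompt, reasoning_effort="low",
                    model="system.ai.gpt-oss-120b", max_tokens=4096):
    """Return the chat completions request body as a dictionary."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {ALLOWED_EFFORTS}, got {reasoning_effort!r}"
        )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "reasoning_effort": reasoning_effort,
    }


# Serialize the body; pass this string as the data of the POST request.
body = json.dumps(gpt_oss_payload("Why is the sky blue?", reasoning_effort="high"), indent=2)
print(body)
```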

Gemini model example

This example uses system.ai.gemini-3-1-pro. The reasoning_effort parameter defaults to "low", but you can override it in the request, as in the following example.

curl -X POST "https://<workspace_host>/ai-gateway/mlflow/v1/chat/completions" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "system.ai.gemini-3-1-pro",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 2000,
    "stream": true,
    "reasoning_effort": "high"
  }'

The API response includes both thinking and text content blocks:

ChatCompletionMessage(
    role="assistant",
    content=[
        {
            "type": "reasoning",
            "summary": [
                {
                    "type": "summary_text",
                    "text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
                    "signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
                }
            ]
        },
        {
            "type": "text",
            "text": (
                "# Why the Sky Is Blue\n\n"
                "The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
            )
        }
    ],
    refusal=None,
    annotations=None,
    audio=None,
    function_call=None,
    tool_calls=None
)
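
Instead of indexing `content[0]` and `content[1]` directly, a client can select blocks by their `type` field, which keeps working if the order or number of blocks differs. This is a minimal sketch that assumes the content blocks are plain dictionaries shaped like the response above; `split_reasoning_and_answer` is a hypothetical helper name.

```python
def split_reasoning_and_answer(content):
    """Return (reasoning_text, answer_text) from a message's content.

    Assumes content is either a plain string or a list of dict blocks
    with the "reasoning" and "text" types shown in the response above.
    """
    if isinstance(content, str):
        # No reasoning blocks were returned, only the answer text.
        return "", content
    reasoning_parts, answer_parts = [], []
    for block in content:
        if block.get("type") == "reasoning":
            for item in block.get("summary", []):
                if item.get("type") == "summary_text":
                    reasoning_parts.append(item.get("text", ""))
        elif block.get("type") == "text":
            answer_parts.append(block.get("text", ""))
    return "\n".join(reasoning_parts), "\n".join(answer_parts)


sample_content = [
    {"type": "reasoning",
     "summary": [{"type": "summary_text", "text": "Think about scattering."}]},
    {"type": "text", "text": "Rayleigh scattering."},
]
reasoning, answer = split_reasoning_and_answer(sample_content)
print("Reasoning:", reasoning)
print("Answer:", answer)
```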

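Manage reasoning across multiple turns

This section is specific to the databricks-claude-sonnet-4-5 model.

In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.

If you don't want to pass reasoning tokens back to the model (for example, when it does not need to reason over its earlier steps), you can omit the reasoning block entirely. For example: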

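# text_content holds only the answer text of the previous assistant turn,
# without its reasoning block.
response = client.chat.completions.create(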
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

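However, if you need the model to reason over its previous reasoning process (for example, when you build an experience that surfaces intermediate reasoning), you must include the full, unmodified assistant message, including the reasoning block from the previous turn. The following shows how to continue a thread with the full assistant message: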

# Full assistant message from the previous response, including its reasoning block.
assistant_message = response.choices[0].message

response = client.chat.completions.create(
    model="system.ai.claude-sonnet-4-5",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
        assistant_message,
        {"role": "user", "content": "Can you simplify the previous answer?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
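
The two options above differ only in what is sent back as the previous assistant turn. The following sketch makes that choice explicit. It assumes content blocks are dictionaries with the `reasoning` and `text` types shown earlier, and that the endpoint accepts the unmodified content list inside a plain dictionary message; `assistant_turn` is a hypothetical helper, not part of the OpenAI client.

```python
def assistant_turn(content, keep_reasoning):
    """Build the assistant entry for the next request's messages list."""
    if keep_reasoning or isinstance(content, str):
        # Pass the content through untouched so reasoning blocks stay intact.
        return {"role": "assistant", "content": content}
    # Drop reasoning blocks and keep only the visible answer text.
    text_only = "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )
    return {"role": "assistant", "content": text_only}


previous_content = [
    {"type": "reasoning",
     "summary": [{"type": "summary_text", "text": "Consider scattering."}]},
    {"type": "text", "text": "Because of Rayleigh scattering."},
]
print(assistant_turn(previous_content, keep_reasoning=False))
```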

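Open Response API

With the Open Response API, reasoning is returned as a reasoning item in the response output. To have the model reason over its own earlier thinking in a later turn, include that reasoning item in the input of the next request and leave its encrypted_content field unchanged.

A reasoning item returned in the response output has the following shape: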

{
  "type": "reasoning",
  "id": "rs_abc123",
  "content": [{ "type": "reasoning_text", "text": "Let me work through the question..." }],
  "encrypted_content": "<opaque-provider-signature>"
}

To continue the conversation, send the previous turn's output back as input, keeping the reasoning item verbatim.

{
  "model": "databricks-claude-sonnet-4-5",
  "input": [
    { "role": "user", "content": "Why is the sky blue?" },
    {
      "type": "reasoning",
      "id": "rs_abc123",
      "content": [{ "type": "reasoning_text", "text": "Let me work through the question..." }],
      "encrypted_content": "<opaque-provider-signature>"
    },
    { "role": "assistant", "content": "The sky is blue because of Rayleigh scattering..." },
    { "role": "user", "content": "Can you explain it for a five-year-old?" }
  ]
}

The encrypted_content value carries provider-specific reasoning state. If it is dropped or modified, the model cannot reason over its previous thinking. This applies to Anthropic Claude and Google Gemini models.
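
Building the next request's input while leaving the reasoning item untouched can be sketched as follows. This is an illustration only: `next_turn_input` is a hypothetical helper, and the item shapes are copied from the JSON examples above rather than from a published schema. The reasoning item is deep-copied and never edited, so `encrypted_content` reaches the next request exactly as it was returned.

```python
import copy


def next_turn_input(previous_input, reasoning_item, assistant_text, user_text):
    """Return the input list for the next turn, keeping the reasoning item verbatim."""
    return [
        *copy.deepcopy(previous_input),
        copy.deepcopy(reasoning_item),  # copied as-is, never modified
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": user_text},
    ]


reasoning_item = {
    "type": "reasoning",
    "id": "rs_abc123",
    "content": [{"type": "reasoning_text", "text": "Let me work through the question..."}],
    "encrypted_content": "<opaque-provider-signature>",
}
next_input = next_turn_input(
    [{"role": "user", "content": "Why is the sky blue?"}],
    reasoning_item,
    "The sky is blue because of Rayleigh scattering...",
    "Can you explain it for a five-year-old?",
)
print(len(next_input))
```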

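How do reasoning models work?

Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, such as databricks-claude-sonnet-4-5, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.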

Last updated on 2026-06-29