
WFP API Race Condition: BSOD in FwpsCompleteOperation0 due to undocumented TCP Endpoint Teardown vulnerability at ALE Layers

Alex Zilberkant 0 Reputation points
2026-04-01T09:21:33.32+00:00

Hello, I'm developing a WFP callout driver and have identified a severe, reproducible race condition involving FwpsPendOperation0 and FwpsCompleteOperation0 at the Application Layer Enforcement (ALE) layers (specifically FWPS_LAYER_ALE_RESOURCE_ASSIGNMENT_V4/V6 and FWPS_LAYER_ALE_AUTH_CONNECT_V4/V6).

The Issue: When our callout pends an operation to query user-mode for a verdict, third-party actors (such as EDRs or the OS itself) can forcefully terminate the underlying process or TCP connection.

In these scenarios, the TCP stack (via TcpCleanupEndpointWorkQueueRoutine) tears down the underlying endpoint object. However, WFP does not invalidate the completionContext handle returned by FwpsPendOperation0.

When our async worker thread eventually calls FwpsCompleteOperation0(completionContext, NULL), the WFP engine (fwpkclnt.sys / netio.sys) blindly dereferences the stale context pointing to the now-freed TCP endpoint, resulting in a BSOD (typically PAGE_FAULT_IN_NONPAGED_AREA or SYSTEM_THREAD_EXCEPTION_NOT_HANDLED).

The API Gap: The WDK provides no API to validate the completionContext before calling FwpsCompleteOperation0. Furthermore, neither this race condition nor the strict requirement for driver-implemented synchronization (e.g., using flowDeleteFn to manually track context validity) is documented in the official WFP API reference.

I'm opening this ticket to request:

1. Official Confirmation: A statement from the WFP engineering team confirming whether FwpsCompleteOperation0 is fundamentally unsafe to call if the underlying endpoint has been torn down.

2. Official Mitigation: The officially supported architectural pattern to prevent this BSOD (e.g., are we strictly required to use FWPS_CALLOUT_FLOW_DELETE_NOTIFY_FN combined with Rundown Protection to block the completion call?).

3. Documentation Update: A formal Design Change Request (DCR) to update the WDK documentation for FwpsCompleteOperation0 to warn developers of this catastrophic race condition.

Windows development | Windows Driver Kit (WDK)

2 answers

  1. Taki Ly (WICLOUD CORPORATION) 690 Reputation points Microsoft External Staff Moderator
    2026-04-02T04:36:01.14+00:00

    Hello @Alex Zilberkant ,

    This Q&A channel is a community support forum, where people help each other based on experience and available documentation. Because of that, I’m not able to provide an official statement from the product team or speak on its behalf.

    If you need an official confirmation from Microsoft, or if you want to raise this directly with the product team, you can submit feedback through Windows Feedback Hub.

    Regarding the documentation update, if you would like Microsoft to consider adding the kind of warning or guidance you described for FwpsCompleteOperation0, you can use one of the following paths:

    1. Submit feedback directly from the official documentation page. On the left side of the relevant Microsoft Learn page, there is usually a “Was this page helpful?” section with Yes/No options. You can click No and then click “Suggest a fix?” to submit documentation feedback and explain the warning or guidance you think should be added.
    2. Use the Microsoft documentation GitHub repository. You can also submit documentation-related feedback through the Windows Driver documentation GitHub repository, where Microsoft provides contribution and feedback paths for documentation updates.

    Disclaimer: While this GitHub repository is hosted on a non-Microsoft website, it is maintained by Microsoft and is one of the standard channels used for documentation feedback and contributions.

    If your goal is to push for a formal documentation change request around this race condition, those are the best channels to get that feedback in front of the documentation and product teams.

    I hope this information helps. If you found my response helpful, please follow this guide to give feedback.

    Thank you.


  2. Q&A Assist
    2026-04-01T09:22:05.4566667+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    The available documentation and articles do not provide an official statement that matches the level of detail requested about this specific race condition, nor do they describe an API to validate a completionContext before calling FwpsCompleteOperation0.

    What is documented:

    1. FwpsPendOperation0 / FwpsCompleteOperation0 behavior at ALE layers
    • FwpsPendOperation0 is used to postpone a classify operation and returns a completionContext that must later be passed to FwpsCompleteOperation0 to resume processing.
    • Only initial authorizations can be postponed; a call to FwpsPendOperation0 fails if the FWP_CONDITION_FLAG_IS_REAUTHORIZE flag is set.
    • After FwpsCompleteOperation0 executes, pended connections are reauthorized. TCP connections, if allowed, complete the handshake; non‑TCP connections create state entries.
    • Any pended packet data is flushed from memory when FwpsPendOperation0 completes, and the callout is responsible for buffering/reinjecting data if needed.

    Relevant excerpts:

    • "The reauthorization is triggered immediately following the FwpsCompleteOperation0 call, and it allows the callout driver to permit or to block the flow. Only an initial authorization can be postponed. A call to FwpsPendOperation0 will fail if FWP_CONDITION_FLAG_IS_REAUTHORIZE flag is set."
    • "Pended connections are authenticated again after the FwpsCompleteOperation0 function executes. TCP connections, if allowed, are created by completing the handshake operation, but non-TCP connections just create state entries. Any pended packet data is flushed from memory when the FwpsPendOperation0 function completes…"

    These remarks describe the logical flow and reauthorization behavior but do not document:

    • Any guarantee that the underlying endpoint or flow remains valid until FwpsCompleteOperation0 is called.
    • Any API to test or invalidate a completionContext after teardown.
    2. Known issues related to WFP and TCP/IP implementation

    The available material includes a documented bug where a WFP callout using FwpsPendOperation0 can encounter corrupted metadata due to a TCP/IP implementation issue:

    • On Windows Vista / Windows Server 2008, when a WFP callout driver calls FwpsPendOperation0, the Remote Address field in the completion context can be zeroed out, and the system may crash depending on how the callout driver uses that field. This is explicitly attributed to an error in Tcpip.sys.

    This shows that:

    • There is precedent for WFP/TCPIP interaction bugs causing crashes when a callout uses completion‑time metadata that has become invalid or corrupted.
    • However, this article is specific to the Remote Address field being zeroed and does not describe endpoint teardown, TcpCleanupEndpointWorkQueueRoutine, or a general contract about completionContext validity.
    3. FwpsCompleteOperation0 contract

    The documented contract for FwpsCompleteOperation0 is minimal:

    • It takes a completionContext returned by FwpsPendOperation0 and an optional NET_BUFFER_LIST pointer.
    • It has no return value and no documented error reporting.
    • The remarks simply state that it is called when the callout is ready to resume packet processing, and that for ALE AUTH RECV/ACCEPT layers the callout must reinject the cloned packet and call FwpsCompleteOperation0.

    There is no documented:

    • Safety check API for completionContext.
    • Requirement or pattern involving FWPS_CALLOUT_FLOW_DELETE_NOTIFY_FN or rundown protection in conjunction with completionContext.
    • Statement that calling FwpsCompleteOperation0 after endpoint teardown is unsupported or unsafe.
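
    Because FwpsCompleteOperation0 returns nothing and no validation API is documented, the one thing a driver can control is its own bookkeeping: making sure that at most one code path ever takes a given completionContext. The following is a minimal user-mode sketch of that "claim once" idea in portable C11, not a pattern taken from the WFP documentation. The names pend_record, pend_record_init, and pend_record_claim are hypothetical; a kernel driver would likely use InterlockedCompareExchange instead of C11 atomics. It does not answer whether WFP itself keeps the context valid across endpoint teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical driver-side record for one pended classify operation. */
enum { PEND_ACTIVE = 0, PEND_CLAIMED = 1 };

typedef struct {
    void *completion_context; /* stand-in for the handle from FwpsPendOperation0 */
    atomic_int state;         /* PEND_ACTIVE until exactly one path claims it */
} pend_record;

static void pend_record_init(pend_record *r, void *ctx)
{
    assert(r != NULL);
    r->completion_context = ctx;
    atomic_init(&r->state, PEND_ACTIVE);
}

/*
 * Hands the stored context to exactly one caller. Every later caller
 * (for example a cleanup path racing with the async worker) gets NULL
 * and must not attempt to complete the operation.
 */
static void *pend_record_claim(pend_record *r)
{
    int expected = PEND_ACTIVE;

    assert(r != NULL);
    if (atomic_compare_exchange_strong(&r->state, &expected, PEND_CLAIMED)) {
        return r->completion_context;
    }
    return NULL;
}
```

    In this model, both the worker thread and any teardown path would call pend_record_claim first, and only the caller that receives a non-NULL value would go on to call FwpsCompleteOperation0. This guards against double completion from the driver's own code paths only.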
    4. ALE reauthorization and pending semantics

    The ALE reauthorization documentation reiterates the high‑level behavior:

    • A callout may postpone a classify operation at ALE layers using FwpsPendOperation0 and later complete it via FwpsCompleteOperation0.
    • Reauthorization is triggered immediately after FwpsCompleteOperation0 and allows the callout to permit or block the flow.
    • Only initial authorization can be postponed; FwpsPendOperation0 fails on reauthorization.

    Again, there is no additional guidance on:

    • Synchronization with process/endpoint teardown.
    • How WFP behaves if the underlying flow or endpoint has already been destroyed when FwpsCompleteOperation0 is invoked.
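
    The question mentions rundown protection as a possible way to synchronize with teardown. For readers unfamiliar with the idea, here is a small user-mode model of rundown-style gating in C11. It only illustrates the semantics (entries are refused once rundown begins, and teardown can tell when the last holder has left); it is not the kernel EX_RUNDOWN_REF implementation, and nothing in the documentation reviewed here states that this is the supported mitigation. The names rundown_gate and gate_* are hypothetical. In a driver, the corresponding kernel routines would be ExAcquireRundownProtection, ExReleaseRundownProtection, and ExWaitForRundownProtectionRelease.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit 0 marks "rundown started"; the remaining bits count holders in steps of 2. */
#define GATE_CLOSED 1u
#define GATE_HOLD   2u

typedef struct {
    atomic_uint state;
} rundown_gate;

static void gate_init(rundown_gate *g)
{
    atomic_init(&g->state, 0u);
}

/* Try to enter the protected region; fails once rundown has started. */
static bool gate_acquire(rundown_gate *g)
{
    unsigned cur = atomic_load(&g->state);

    while ((cur & GATE_CLOSED) == 0u) {
        /* On failure, cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&g->state, &cur, cur + GATE_HOLD)) {
            return true;
        }
    }
    return false;
}

/* Leave the protected region; must pair with a successful gate_acquire. */
static void gate_release(rundown_gate *g)
{
    unsigned prev = atomic_fetch_sub(&g->state, GATE_HOLD);

    assert(prev >= GATE_HOLD);
    (void)prev;
}

/* Begin rundown: from now on gate_acquire fails for new callers. */
static void gate_begin_rundown(rundown_gate *g)
{
    atomic_fetch_or(&g->state, GATE_CLOSED);
}

/* True once rundown has started and no holder remains inside. */
static bool gate_is_drained(rundown_gate *g)
{
    return atomic_load(&g->state) == GATE_CLOSED;
}
```

    In this model a completion path would wrap its work in gate_acquire and gate_release, while a teardown path would call gate_begin_rundown and then wait until gate_is_drained reports true. Whether such gating is sufficient for the crash described in the question depends on WFP internals that the documentation does not cover.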
    5. No documented official confirmation or mitigation pattern

    Within the provided material, there is:

    • No official confirmation that FwpsCompleteOperation0 is "fundamentally unsafe" if the underlying endpoint has been torn down.
    • No documented requirement to use FWPS_CALLOUT_FLOW_DELETE_NOTIFY_FN plus rundown protection to guard calls to FwpsCompleteOperation0.
    • No design note or warning in the FwpsCompleteOperation0 or FwpsPendOperation0 documentation about BSOD‑causing race conditions when the connection or endpoint is destroyed while an operation is pended.
    • No DCR or documentation change request text addressing this scenario.

    Because of this, the specific confirmations and architectural prescriptions requested (official statement on safety, mandated mitigation pattern, and a DCR for documentation) are not present in the available documentation.

