Npcap kernel buffer can be excessively fragmented on systems with many CPU cores #317

Closed
dmiller-nmap opened this issue Mar 14, 2020 · 2 comments

Comments

@dmiller-nmap
Contributor

The Npcap driver allocates a buffer for each open instance (the handle returned by PacketOpenAdapter()) that stores packets until the user program retrieves them (via PacketReceivePacket()). The buffer is split into N segments, where N is the number of logical CPUs on the system. When a packet is processed, it is placed into the segment corresponding to the processor on which the filter callback is running (KeGetCurrentProcessorNumber()). If that segment is full, the packet is dropped and not captured.
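
To make the drop behavior concrete, here is a minimal user-mode C sketch, not Npcap source; the names and sizes are hypothetical, and the segment index is passed in as a parameter where the driver would use KeGetCurrentProcessorNumber().

```c
/* Illustrative user-mode sketch only, NOT Npcap source: it models how
 * a capture buffer split into one segment per logical CPU drops
 * packets once that CPU's segment fills, even while other segments
 * sit empty. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_CPUS     8        /* N = number of logical processors */
#define SEGMENT_SIZE 4096     /* per-segment capacity in bytes    */

typedef struct {
    size_t used;              /* bytes currently stored           */
} SEGMENT;

static SEGMENT segments[NUM_CPUS];

static bool store_packet(unsigned int current_cpu, size_t packet_len)
{
    SEGMENT *seg = &segments[current_cpu % NUM_CPUS];

    if (seg->used + packet_len > SEGMENT_SIZE)
        return false;         /* this segment is full: drop the packet */

    seg->used += packet_len;  /* real code would copy the packet here  */
    return true;
}

int main(void)
{
    /* All traffic arrives on CPU 0, so only segment 0 fills; packets
     * are dropped even though 7 other segments are completely empty. */
    for (int i = 0; i < 100; i++) {
        if (!store_packet(0, 128))
            printf("dropped packet %d (segment 0 full)\n", i);
    }
    return 0;
}
```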

On systems with a large number of processors, the segments can be too small and suffer packet drops even when the buffer as a whole is largely empty. This is even more likely if Windows simply never schedules NDIS threads on some processors, leaving those buffer segments entirely empty. That is probably a common situation, since the scheduler prefers to keep threads on the processors they have run on before in order to avoid cache misses.

Additionally, Npcap currently handles processor counts greater than the bit width of KAFFINITY (64 on 64-bit Windows, 32 on 32-bit) very oddly. In at least one place it maps any system processor ID over 63 to 63, which means that if Npcap is running on a system with 128 processors in 2 processor groups, all threads running on processors 63 through 127 share the same segment of the buffer. Even if the change proposed below is not implemented, we will have to rewrite this code anyway for NDIS 6.20, which adds support for more than 64 processors.
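
For illustration, a hypothetical sketch of the clamping described above (an assumption about the behavior, not the actual driver code):

```c
/* Hypothetical sketch, not actual Npcap code: processor IDs beyond
 * the bit width of KAFFINITY all collapse to index 63, so on a
 * 128-CPU system processors 63..127 end up sharing one segment. */
#include <stdio.h>

#define KAFFINITY_BITS 64   /* bits in a KAFFINITY mask on 64-bit Windows */

static unsigned int clamp_processor_index(unsigned int proc)
{
    return (proc >= KAFFINITY_BITS) ? (KAFFINITY_BITS - 1) : proc;
}

int main(void)
{
    printf("CPU  70 -> segment %u\n", clamp_processor_index(70));   /* 63 */
    printf("CPU 127 -> segment %u\n", clamp_processor_index(127));  /* 63 */
    return 0;
}
```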

We need to document the current behavior better (issue to be opened later), but it would be better still to limit the number of segments the buffer is split into and somehow encourage threads to use them all. I haven't worked out all the details, but here are some thoughts:

  1. The limit should be user-configurable via a Packet.DLL function call, which would communicate with the driver via a new IoControl. The default value would probably be close to the number of CPU cores on a typical lower-end system, with an upper limit equal to the number of logical processors on the system.
  2. If threads on a single processor must always use the same buffer segment, we would calculate the segment number as processor_number % number_of_segments. However, I don't think this is even necessary: the threads could round-robin over the buffer segments to ensure they are all used (see the sketch after this list). The benefit of segments at that point would be to avoid waiting on a spinlock to access the buffer; that is, the chance of not having to wait on a spinlock is number_of_segments times better than with a single buffer. The current scheme probably performs better in that sense, at the cost of wasted kernel memory and a higher likelihood of dropped packets.
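
As a sketch under assumed names (not driver code), the two segment-selection strategies from item 2 could look like this, with NUM_SEGMENTS standing in for the user-configured limit proposed in item 1:

```c
/* Sketch of the two segment-selection strategies discussed above.
 * All names are invented for illustration. */
#include <stdatomic.h>
#include <stdio.h>

#define NUM_SEGMENTS 4u   /* stands in for the user-configured limit */

/* Strategy A: pin each processor to one segment. */
static unsigned int segment_by_cpu(unsigned int processor_number)
{
    return processor_number % NUM_SEGMENTS;
}

/* Strategy B: round-robin, so every segment gets used no matter
 * which CPUs the callbacks actually run on. */
static atomic_uint next_segment;

static unsigned int segment_round_robin(void)
{
    return atomic_fetch_add(&next_segment, 1u) % NUM_SEGMENTS;
}

int main(void)
{
    printf("CPU 70 maps to segment %u\n", segment_by_cpu(70));
    for (int i = 0; i < 6; i++)
        printf("round-robin pick: segment %u\n", segment_round_robin());
    return 0;
}
```
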
@dmiller-nmap
Contributor Author

The incoming fix for this issue abandons the idea of per-CPU segments, or indeed any segments at all. Instead, NDIS packet-processing callbacks will enqueue buffer write requests to be processed by a single worker thread. This speeds up processing of network data by separating the packet processing (BPF filter, etc.) from buffer access, reducing contention to a single interlocked enqueue operation.
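
A minimal sketch of that design, assuming a Treiber-style interlocked push (the names here are invented, this is not the actual driver code, and a real implementation would preserve packet order, which this simple drain does not):

```c
/* Sketch only: packet-processing callbacks do a single interlocked
 * (CAS) push of a write request, and one worker thread detaches the
 * whole list and writes the packets into the capture buffer. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct write_request {
    struct write_request *next;
    size_t len;                    /* length of the captured packet */
} WRITE_REQUEST;

static _Atomic(WRITE_REQUEST *) pending_head;

/* Called from the packet callback after the BPF filter accepts the
 * packet: contention is reduced to this one interlocked operation. */
static void enqueue_request(WRITE_REQUEST *req)
{
    WRITE_REQUEST *old = atomic_load(&pending_head);
    do {
        req->next = old;
    } while (!atomic_compare_exchange_weak(&pending_head, &old, req));
}

/* Run by the single worker thread: detach the list in one exchange,
 * then copy each packet into the buffer without blocking callbacks.
 * (This drains in reverse order; a real version would fix that.) */
static void drain_requests(void)
{
    WRITE_REQUEST *req = atomic_exchange(&pending_head, (WRITE_REQUEST *)NULL);
    while (req != NULL) {
        WRITE_REQUEST *next = req->next;
        printf("writing %zu bytes to capture buffer\n", req->len);
        free(req);
        req = next;
    }
}

int main(void)
{
    for (size_t i = 0; i < 3; i++) {
        WRITE_REQUEST *req = calloc(1, sizeof(*req));
        req->len = 64 + i;
        enqueue_request(req);
    }
    drain_requests();
    return 0;
}
```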

@dmiller-nmap
Contributor Author

For the sake of documentation: the worker-thread approach was used in Npcap 0.9991, but it was unable to keep up with network traffic, leading to excessive memory usage as the queue of packets waiting to be written to the buffer kept growing. From 0.9992 onward we instead use a queue of packet objects rather than a contiguous buffer. The queue grows up to the "buffer size" limit set by the existing API functions, and calls to Read() remove objects from the queue and write them to the user buffer.
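
For illustration only, a rough user-mode sketch (hypothetical names, not Npcap code) in the spirit of the 0.9992+ scheme: enqueueing drops a packet when the configured limit would be exceeded, and Read() copies queued objects into the caller's buffer.

```c
/* Sketch of a packet-object queue bounded by a "buffer size" limit. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct packet_obj {
    struct packet_obj *next;
    size_t len;
    unsigned char data[256];
} PACKET_OBJ;

typedef struct {
    PACKET_OBJ *head, *tail;
    size_t total_bytes;   /* bytes currently queued                   */
    size_t buffer_size;   /* limit set via the existing API functions */
} PACKET_QUEUE;

/* Enqueue a packet unless the configured limit would be exceeded. */
static int queue_packet(PACKET_QUEUE *q, const void *data, size_t len)
{
    if (len > sizeof(((PACKET_OBJ *)0)->data))
        len = sizeof(((PACKET_OBJ *)0)->data);   /* truncate for the sketch */
    if (q->total_bytes + len > q->buffer_size)
        return 0;                                /* over the limit: drop    */
    PACKET_OBJ *p = calloc(1, sizeof(*p));
    p->len = len;
    memcpy(p->data, data, len);
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
    q->total_bytes += len;
    return 1;
}

/* Read(): copy as many queued packets as fit into the user buffer. */
static size_t read_packets(PACKET_QUEUE *q, unsigned char *out, size_t out_len)
{
    size_t copied = 0;
    while (q->head && copied + q->head->len <= out_len) {
        PACKET_OBJ *p = q->head;
        memcpy(out + copied, p->data, p->len);
        copied += p->len;
        q->total_bytes -= p->len;
        q->head = p->next;
        if (!q->head) q->tail = NULL;
        free(p);
    }
    return copied;
}

int main(void)
{
    PACKET_QUEUE q = { .buffer_size = 1024 };
    unsigned char pkt[100] = {0}, out[512];
    for (int i = 0; i < 12; i++)
        if (!queue_packet(&q, pkt, sizeof(pkt)))
            printf("packet %d dropped (buffer size reached)\n", i);
    printf("Read() returned %zu bytes\n", read_packets(&q, out, sizeof(out)));
    return 0;
}
```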

@fyodor fyodor transferred this issue from nmap/nmap May 5, 2021