• j4k3@lemmy.world
    11 months ago

    Introducing EEVDF

    The “Earliest Eligible Virtual Deadline First” (EEVDF) scheduling algorithm is not new; it was described in this 1995 paper by Ion Stoica and Hussein Abdel-Wahab. Its name suggests something similar to the Earliest Deadline First algorithm used by the kernel’s deadline scheduler but, unlike that scheduler, EEVDF is not a realtime scheduler, so it works in different ways. Understanding EEVDF requires getting a handle on a few (relatively) simple concepts. …

    … For each process, EEVDF calculates the difference between the time that process should have gotten and how much it actually got; that difference is called “lag”. A process with a positive lag value has not received its fair share and should be scheduled sooner than one with a negative lag value.

    In fact, a process is deemed to be “eligible” if — and only if — its calculated lag is greater than or equal to zero; any process with a negative lag will not be eligible to run. For any ineligible process, there will be a time in the future where the time it is entitled to catches up to the time it has actually gotten and it will become eligible again; that time is deemed the “eligible time”.

    The calculation of lag is, thus, a key part of the EEVDF scheduler, and much of the patch set is dedicated to finding this value correctly. Even in the absence of the full EEVDF algorithm, a process’s lag can be used to place it fairly in the run queue; processes with higher lag should be run first in an attempt to even out lag values across the system.

    End of my little TL;DR
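
To make the lag/eligibility idea above a bit more concrete, here is a toy sketch of the arithmetic (my own illustration, nothing like the kernel's actual implementation; the weights standing in for nice-level shares are an assumption, since the quoted text only talks about equal shares):

```c
/* Toy illustration of EEVDF-style "lag", not kernel code: lag is the
 * service a task was entitled to minus the service it actually received.
 * A task is "eligible" only while lag >= 0. */
#include <stdio.h>

struct task {
    const char *name;
    double weight;      /* share of CPU this task is entitled to (assumed stand-in for nice level) */
    double received;    /* CPU time actually received so far */
};

int main(void)
{
    struct task tasks[] = {
        { "A", 1.0, 40.0 },
        { "B", 2.0, 50.0 },
        { "C", 1.0, 10.0 },
    };
    int n = sizeof(tasks) / sizeof(tasks[0]);

    double total_weight = 0, total_service = 0;
    for (int i = 0; i < n; i++) {
        total_weight += tasks[i].weight;
        total_service += tasks[i].received;
    }

    for (int i = 0; i < n; i++) {
        /* Entitled service = this task's weighted share of all CPU time handed out so far. */
        double entitled = total_service * tasks[i].weight / total_weight;
        double lag = entitled - tasks[i].received;
        printf("%s: entitled %.1f, got %.1f, lag %+.1f -> %s\n",
               tasks[i].name, entitled, tasks[i].received, lag,
               lag >= 0 ? "eligible" : "not eligible");
    }
    return 0;
}
```

With those numbers, A has over-consumed (negative lag, ineligible) while C is behind (positive lag) and would be scheduled first; B sits exactly at its entitlement.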


    So basically ‘Even more + CFS’.

    I want the opposite in a scheduler: bottleneck optimisation for AI and CAD.

    I also wish there were an easy way to discover whether my 12th-gen i7 has AVX-512 fused off or just disabled in microcode (early 12th-gen parts had the instructions, but support was unofficial and exposed by motherboard makers; Intel later fused it off entirely, according to some articles. Allegedly, all it took was running the microcode from the enterprise P-cores, which are identical to the consumer ones.) I would love to be able to set a cpuset affinity and isolation for llama.cpp (with automatic detection of whether the AVX-512 ISA is available) and compare the inference speed against my present strategy.
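
For the detection-plus-pinning half of that, here is a rough sketch of what I have in mind (my own hypothetical code, not part of llama.cpp): query CPUID leaf 7 for the AVX-512F bit and, if present, pin the process to chosen cores before launching the workload. Note that CPUID only reports what is currently enabled, so it cannot distinguish "fused off" from "hidden by microcode":

```c
/* Hypothetical sketch: check whether the running CPU currently advertises
 * AVX-512F via CPUID and, if so, pin this process to caller-chosen cores
 * before starting a compute-heavy workload. CPUID reports only what is
 * enabled right now; it cannot tell a fused-off unit from one disabled
 * by microcode. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <cpuid.h>

static int has_avx512f(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 7, subleaf 0: EBX bit 16 is AVX-512 Foundation. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 16) & 1;
}

int main(int argc, char **argv)
{
    if (!has_avx512f()) {
        fprintf(stderr, "AVX-512F not advertised by CPUID on this core\n");
        return 1;
    }

    /* Pin to the logical CPUs passed on the command line. */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 1; i < argc; i++)
        CPU_SET(atoi(argv[i]), &set);
    if (argc > 1 && sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("AVX-512F present; affinity set, ready to launch llama.cpp from here\n");
    return 0;
}
```

Usage would be something like `./avx512_pin 0 1 2 3 4 5 6 7` to pin to the first eight logical CPUs; which of those are actually the P-cores is a guess on my part and worth checking with `lscpu -e` first.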

    My understanding is that the primary bottleneck in tensor table math is the L2 to L1 cache bus, but I’m basically just parroting that info on the edge of my mental understanding.