There appears to be dedicated silicon for e.g. ADD AH, BL, see uops.info showing it having 1 uop across multiple microarchitectures (e.g. 1*p0156 being notation that it takes one uop on any port between 0/1/5/6, i.e. theoretical throughput of 4 instrs/cycle; I think the displayed 0.4 is just an artifact of it only testing 3 different destination registers despite there being a dependency on it). The newer Alder Lake actually has less throughput, but still takes only one uop.
There appears to be dedicated silicon for e.g.
ADD AH, BL, see uops.info showing it having 1 uop across multiple microarchitectures (e.g.1*p0156being notation that it takes one uop on any port between 0/1/5/6, i.e. theoretical throughput of 4 instrs/cycle; I think the displayed 0.4 is just an artifact of it only testing 3 different destination registers despite there being a dependency on it). The newer Alder Lake actually has less throughput, but still takes only one uop.