This experiment has each LLM respond to 128 or 256 prompts. AI detection generally focuses on identifying the writer of a single document, not on comparing two analogous sets of 128 documents and deciding whether the same person/tool wrote both. That is a totally different problem.
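A toy way to see why set size matters so much (a rough sketch, not the experiment's actual method): assume each document yields one noisy "style score" that is Normal(0, 1) for one source and Normal(0.3, 1) for the other. The 0.3 gap and the `accuracy` helper are made-up assumptions for illustration only. Averaging over more documents shrinks the noise while the gap stays the same.

```python
from math import sqrt
from statistics import NormalDist


def accuracy(effect_size: float, n_docs: int) -> float:
    """Chance of picking the right source from the mean of n_docs scores.

    Toy assumption: scores are Normal(0, 1) for one source and
    Normal(effect_size, 1) for the other, and we threshold the
    sample mean at the midpoint between the two.
    """
    # The standard error of the mean shrinks with the square root of the set size.
    standard_error = 1.0 / sqrt(n_docs)
    # Distance from either true mean to the midpoint threshold, in standard errors.
    z = (effect_size / 2) / standard_error
    return NormalDist().cdf(z)


# Hypothetical per-document gap of 0.3 standard deviations.
for n in (1, 128, 256):
    print(f"{n:>3} docs: {accuracy(0.3, n):.3f}")
```

Under these assumptions a single document is barely better than a coin flip, while a set of 128 or 256 is right most of the time, even though the per-document signal is identical.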
It might be because detecting whether output is AI-generated and mapping output that is already known to come from an LLM to a specific LLM or class of LLMs are different problems.
Though if the difference is this easy to see, how come AI detectors perform so badly?