Prompt processing at 12.3 t/s, inference at 10.7-11.1 t/s.
Is that still on CPU or did you get it working on GPU?
I have seen a few people recommending GLM 4.5 at lower quants, primarily for more intricate writing; it might be worth the lower speed and smaller context size for shorter texts.
Thanks for testing!
That was on GPU; CPU was 5 t/s.
I’ve also tested image processing more: a 512x512 image takes about a minute, a 1400x900 takes about 7-10 minutes, and image-to-image takes about 10 minutes.
Most of the time is spent in the encoder/decoder layers for image-to-image, and decoding is what scales the worst with image size.