@raymierussell
You've done some great work handling text! You're right that ANDing and ORing directly on display memory is slow.
I have a routine that renders the text into an offscreen buffer, then copies that buffer to the screen very quickly using the stack.
Very rough, off the top of my head:
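DI                                  ; an interrupt would push onto the screen while SP points there
LD (SAVED_SP), SP                   ; SAVED_SP: any spare 2-byte variable (name is just a placeholder)
LD SP, <end of screen area + 1>     ; PUSH writes downwards, so start past the last byte
LD HL, <last byte of offscreen buffer>
LD BC, <word count>                 ; bytes / 2, since each PUSH moves two bytes
LOOP:
LD D,(HL)                           ; high byte first (there is no LD DE,(HL) on the Z80)
DEC HL
LD E,(HL)
DEC HL
PUSH DE
DEC BC                              ; 16-bit DEC leaves the flags alone...
LD A,B
OR C                                ; ...so test BC for zero by hand
JP NZ, LOOP
LD SP,(SAVED_SP)
EI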
That way you only AND and OR the head and the tail of the string on screen, and only in the cases where it's needed.
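A minimal, untested sketch of that head/tail merge. The register choices are assumptions made for illustration, not taken from the routine described above: HL points at the screen byte, C is a mask with 1 bits where the existing pixels must survive, and B holds the text pixels already shifted into place.

```asm
; Assumed on entry (illustrative register choices):
;   HL = address of the head or tail byte on screen
;   C  = mask, 1 bits where the existing screen pixels must survive
;   B  = text pixels, already in position, 0 bits wherever C has 1 bits
LD A,(HL)       ; read the existing screen byte
AND C           ; keep only the background pixels
OR B            ; drop the text pixels into the cleared bits
LD (HL),A       ; write the merged byte back
```

Every byte between the head and the tail is fully covered by text, so it can be copied straight from the buffer with no masking.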
Another optimisation is to keep two pointers, one for each of the two characters you have to print next, each pointing to the start of that character's bits in the font table. After fetching the pixel bits, you only AND and OR between registers, and you only EX the pointers, which is also a fast swap.
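A rough, untested sketch of the two-pointer idea, under assumptions that are not from the post: glyphs are 4 pixels wide and 8 rows tall, stored one byte per row with the pixels in bits 7..4, so two characters share each output byte. HL points at the left glyph, DE at the right glyph, and HL' (the alternate register set) at the destination in the offscreen buffer. The labels COMBINE and ROW are made up for the example.

```asm
COMBINE:
LD B,8          ; 8 pixel rows per glyph (assumed font height)
ROW:
LD A,(DE)       ; right glyph row, pixels in bits 7..4
RRCA
RRCA
RRCA
RRCA            ; rotate them down into bits 3..0
AND 15          ; clear what rotated up into bits 7..4
LD C,A          ; park the right half in a register
LD A,(HL)       ; left glyph row
AND 240         ; keep bits 7..4 only
OR C            ; merge both halves, register to register
EXX             ; switch to the alternate BC/DE/HL (A is not swapped)
LD (HL),A       ; HL' = destination in the offscreen buffer
INC HL
EXX             ; back to the font pointers and the row counter
INC HL          ; next row of the left glyph
INC DE          ; next row of the right glyph
DJNZ ROW
RET
```

With this fixed width no swap is needed. For wider glyphs that straddle a byte boundary, the right-hand character carries over as the left-hand one for the next byte, which is presumably where a single `EX DE,HL` comes in.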