Really enjoyed @tonybaloney's talk at #pyconau on how to make your #LLM models faster in production.
Key takeaways: smaller models are faster, and you can shrink them through quantisation or distillation, or skip model calls entirely with semantic caching.
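For the semantic caching idea, here's a minimal sketch of my own (not from the talk): reuse a previous response when a new prompt is semantically close enough. It assumes sentence-transformers for embeddings, `call_llm` is a hypothetical stand-in for the expensive model call, and the 0.9 similarity threshold is an arbitrary choice.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model
cache = []  # list of (prompt_embedding, cached_response) pairs


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the slow/expensive LLM call you want to avoid.
    raise NotImplementedError


def cached_answer(prompt: str, threshold: float = 0.9) -> str:
    emb = encoder.encode(prompt, convert_to_tensor=True)
    # Cache hit: an earlier prompt is semantically close enough, reuse its answer.
    for cached_emb, response in cache:
        if util.cos_sim(emb, cached_emb).item() >= threshold:
            return response
    # Cache miss: pay for one LLM call, then remember the result.
    response = call_llm(prompt)
    cache.append((emb, response))
    return response
```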
Really tractable, immediately implementable 👏👏
More of this, pls