Cost-wise it does not seem very effective. 0.5 tokens/sec (the optimized setup) is about 1800 tokens an hour, during which an active 3090 plus host system draws about 200-300 watts. Running those 1800 tokens on OpenRouter at $0.40 per million tokens for Llama 3.1 (3.3 costs less) is about $0.00072. That money buys you roughly 2-3 watt-hours of electricity (in the Netherlands).
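A rough back-of-the-envelope of that comparison in Python, for anyone who wants to plug in their own numbers (the wattage midpoint and the ~EUR 0.30/kWh Dutch electricity price are assumptions, not measurements):

```python
# Back-of-the-envelope: local 3090 inference vs. OpenRouter, per hour of generation.
# All constants are illustrative assumptions, not measured values.

TOKENS_PER_SEC = 0.5        # optimized local decode speed
SYSTEM_WATTS = 250          # 3090 + host, midpoint of the 200-300 W range
EUR_PER_KWH = 0.30          # rough Dutch electricity price
API_USD_PER_MTOK = 0.40     # OpenRouter price for Llama 3.1 output tokens

tokens_per_hour = TOKENS_PER_SEC * 3600            # 1800 tokens
local_wh = SYSTEM_WATTS * 1.0                      # Wh consumed in that hour
local_cost = local_wh / 1000 * EUR_PER_KWH         # ~EUR 0.075

api_cost = tokens_per_hour / 1e6 * API_USD_PER_MTOK  # ~$0.00072

print(f"local: {tokens_per_hour:.0f} tokens, {local_wh:.0f} Wh, ~EUR {local_cost:.4f}")
print(f"api:   same tokens for ~${api_cost:.5f}")
print(f"local is ~{local_cost / api_cost:.0f}x more expensive (ignoring EUR/USD)")
```

With these numbers the local run comes out around 100x more expensive in electricity alone, before counting hardware.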
A great achievement for private inference nonetheless.
Something to consider is that input tokens have a cost too. They are typically processed much faster than output tokens, but if you have long conversations, input tokens end up being a significant part of the cost (the sketch below shows why).
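A minimal sketch of how that happens: each turn resends the whole conversation history as input, so cumulative input tokens grow roughly quadratically with turn count. The per-turn sizes and prices here are made-up illustrative numbers:

```python
# Each turn resends the full conversation history as input tokens.
# Per-turn sizes and prices are illustrative assumptions, not real quotes.

IN_USD_PER_MTOK = 0.40     # hypothetical input-token price
OUT_USD_PER_MTOK = 0.40    # hypothetical output-token price
TOKENS_PER_TURN = 500      # prompt + reply appended to history each turn

history = 0
total_in = total_out = 0
for turn in range(1, 21):
    total_in += history + TOKENS_PER_TURN   # full history re-sent as input
    total_out += TOKENS_PER_TURN // 2       # new output this turn
    history += TOKENS_PER_TURN

in_cost = total_in / 1e6 * IN_USD_PER_MTOK
out_cost = total_out / 1e6 * OUT_USD_PER_MTOK
print(f"20 turns: {total_in} input tokens (${in_cost:.4f}) "
      f"vs {total_out} output tokens (${out_cost:.4f})")
```

After 20 turns the input side is already an order of magnitude larger than the output side.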
But why not cross that bridge when you get there? By that time you might have much more optimized local infrastructure. Although I do see that people suffering through the local slowness now is what drives the development of these local options.
Why is this so damn important? Isn't it more important to end up with the best result?
I (in Norway) use a homelab with Ollama to generate a report every morning. It's slow, but it runs between 5 and 6 am, when energy prices are at their lowest, and it doesn't matter whether it takes 5 or 50 minutes.
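For anyone curious, a setup like that can be as simple as a cron job calling Ollama's local REST API. A minimal sketch, where the model name, prompt, and output path are placeholders:

```python
# morning_report.py -- run via cron, e.g.: 0 5 * * * python3 morning_report.py
# Calls the local Ollama HTTP API; model, prompt, and file path are placeholders.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1",   # whichever model is pulled locally
    "prompt": "Summarize yesterday's logs into a short morning report.",
    "stream": False,       # block until the full response is ready
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    report = json.loads(resp.read())["response"]

with open("morning_report.txt", "w") as f:
    f.write(report)  # nobody is waiting, so 5 or 50 minutes makes no difference
```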