#pair
You can assume that applies to most tokenizers currently used by LLMs. It also works out to about 4 tokens for every 3 words on average, so roughly 0.75 words per token. The ratio depends on the vocabulary size (the total number of possible tokens): with only a few hundred tokens (letters and digits, for example) the average would be a lot lower, since many tokens are needed for a single word, and with a vocabulary containing every word that exists the average would be close to 1. For ChatGPT the vocabulary size is 50k+. Also, this number applies only to English; for languages such as Japanese or Chinese the number of tokens per word is much higher.
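To make the arithmetic concrete, here is a minimal sketch in Python. It uses two toy tokenizers as stand-ins for the extremes described above (one token per character versus one token per word); these are illustrations only, not how any real LLM tokenizer works, and all function names are made up for this example. The 0.75 constant is just the English rule of thumb from above.

```python
import math

# Rule of thumb from the text: about 3 English words per 4 tokens.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Rough token estimate for a given number of English words."""
    return math.ceil(word_count / words_per_token)


def char_tokenize(text):
    """Toy tiny-vocabulary tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    """Toy huge-vocabulary tokenizer: one token per whitespace-separated word."""
    return text.split()


def words_per_token(text, tokenize):
    """Average number of words covered by one token for this text."""
    return len(text.split()) / len(tokenize(text))


sample = "the cat sat on the mat"

print(estimate_tokens(300))                    # 300 words -> about 400 tokens
print(words_per_token(sample, char_tokenize))  # well below 0.75
print(words_per_token(sample, word_tokenize))  # 1.0
```

A real subword tokenizer with a vocabulary in the tens of thousands sits between these two extremes, which is where the roughly 0.75 figure for English comes from.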