
How to calculate the number of tokens used in a piece of text

Calculating text tokens involves breaking down input text into the fundamental units a language model processes, such as words, subwords, or characters. Tools like OpenAI's `tiktoken` library or Hugging Face's tokenizers handle this automatically for their respective models.

Token counts vary significantly between different models and tokenization methods. The same word might be one token or split into multiple subword tokens. Whitespace and punctuation are included in the count, and multilingual text often consumes more tokens per word. Accessing the specific tokenizer corresponding to your target model is essential for accurate calculation, as manual counting is impractical and error-prone.

To calculate tokens: first, choose the tokenizer matching your LLM (e.g., `cl100k_base` for GPT-4). Initialize it with its designated library, then pass your input text to the tokenizer's encoding method. The method returns a list of token IDs; the list's length (`len(tokens)`) is the exact token count. This lets you estimate API usage costs precisely and stay within input/output limits.
