Chunkers
Overview
Chunkers split arbitrarily long text into chunks that stay within a maximum token length. Each chunker has a tokenizer, a max token count, and a list of default separators used to split text into TextArtifacts. Different types of chunkers provide separator lists suited to specific kinds of text:
- TextChunker: works on most texts.
- PdfChunker: works on text from PDF docs.
- MarkdownChunker: works on Markdown text.
Here is how to use a chunker:
from griptape.chunkers import TextChunker
from griptape.tokenizers import OpenAiTokenizer

TextChunker(
    # set an optional custom tokenizer
    tokenizer=OpenAiTokenizer(model="gpt-4o"),
    # optionally modify the default number of tokens
    max_tokens=100,
).chunk("long text")
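To make the roles of the tokenizer, the token limit, and the separator list more concrete, here is a minimal standalone sketch of separator-based chunking. It is not Griptape's implementation: `count_tokens`, `chunk_text`, and the separator list are hypothetical names chosen for illustration, the "tokenizer" simply counts whitespace-separated words, and the function returns plain strings rather than TextArtifacts.

```python
from typing import Callable, List, Sequence


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one whitespace-separated word = one token."""
    return len(text.split())


def chunk_text(
    text: str,
    max_tokens: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
    token_counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split text into chunks of at most max_tokens, trying coarse separators first."""
    text = text.strip()
    if not text:
        return []
    if token_counter(text) <= max_tokens:
        return [text]
    if not separators:
        # No separator left to try: return the oversized piece unchanged.
        return [text]

    separator, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(separator) if p.strip()]
    if len(pieces) < 2:
        # This separator does not divide the text; fall back to a finer one.
        return chunk_text(text, max_tokens, rest, token_counter)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Greedily pack pieces into the current chunk while it fits.
        candidate = piece if not current else current + separator + piece
        if token_counter(candidate) <= max_tokens:
            current = candidate
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if token_counter(piece) <= max_tokens:
            current = piece
        else:
            # A single piece is still too long: split it with finer separators.
            chunks.extend(chunk_text(piece, max_tokens, rest, token_counter))
    if current:
        chunks.append(current.strip())
    return chunks
```

For example, `chunk_text("one two three four five six seven eight nine ten", max_tokens=3)` packs words three at a time and leaves the remainder in a final chunk. Note that this sketch drops the separator at each chunk boundary (a trailing ". " is not kept), which a production chunker may handle differently.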
The most common use of a chunker is to split long text into smaller chunks for insertion into a vector database when doing Retrieval Augmented Generation (RAG).
See RagEngine for more information on how to use Chunkers in RAG pipelines.