# `naive_speculate.utils.tokenizer`

## Tokenizer

Wraps the tokenizer of the drafter and the verifier.

It is assumed that the drafter and the verifier share the same tokenizer.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `tokenizer` | `PreTrainedTokenizerFast` | The Hugging Face tokenizer instance. |
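A minimal construction sketch. The constructor is not documented on this page, so the single-model-name call and the model name below are assumptions, not the library's confirmed API:

```python
from naive_speculate.utils.tokenizer import Tokenizer

# Hypothetical construction: one tokenizer serves both models, since
# the drafter and verifier are assumed to share it. The constructor
# signature and model name are placeholders.
tok = Tokenizer("Qwen/Qwen3-0.6B")

# The underlying Hugging Face tokenizer is exposed as an attribute.
print(type(tok.tokenizer))  # e.g. PreTrainedTokenizerFast
```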

### `apply_chat_template(messages, *, enable_thinking=True)`

Construct prompt text from chat messages using the tokenizer's chat template.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `messages` | `list[dict[str, str]]` | List of chat messages, where each message is a dict with keys `"role"` and `"content"`. | *required* |
| `enable_thinking` | `bool` | If `True`, append `<think>` and `</think>` to the generation prompt. | `True` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | Constructed prompt text. |
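A minimal usage sketch for building a prompt, reusing the hypothetical `tok` instance from the construction example above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize speculative decoding in one sentence."},
]

# `tok` is the assumed Tokenizer instance from the earlier sketch.
prompt = tok.apply_chat_template(messages, enable_thinking=True)
print(prompt)  # chat-formatted prompt text ending with the generation prompt
```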

### `detokenize(token_ids, *, skip_special_tokens=False)`

Detokenize a batch of token ID sequences back into strings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `token_ids` | `Tensor` | Batch of token id sequences. Shape: `[batch_size, seq_len]`. | *required* |
| `skip_special_tokens` | `bool` | If `True`, special tokens will be removed from the output strings. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `list[str]` | Detokenized strings. Length: `batch_size`. |
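A short round-trip sketch, again assuming the `tok` instance from the construction example; the token id values are arbitrary placeholders and depend on the model's vocabulary:

```python
import torch

# Arbitrary token ids with shape [batch_size, seq_len]; real ids
# depend on the shared tokenizer's vocabulary.
token_ids = torch.tensor([[101, 2023, 2003, 102],
                          [101, 2178, 6251, 102]])

# `tok` is the assumed Tokenizer instance from the earlier sketch.
texts = tok.detokenize(token_ids, skip_special_tokens=True)
print(len(texts))  # batch_size strings
```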

### `tokenize(input_texts, *, return_tensors=True)`

```python
tokenize(input_texts: list[str], *, return_tensors: Literal[True] = True) -> tuple[Tensor, Tensor]
tokenize(input_texts: list[str], *, return_tensors: Literal[False]) -> BatchEncoding
tokenize(input_texts: list[str], *, return_tensors: bool) -> BatchEncoding | tuple[Tensor, Tensor]
```

Tokenize a batch of input sequences into token id sequences.

Returns either a `BatchEncoding` object or a tuple of tensors (`input_ids` and `attention_mask`), depending on the `return_tensors` flag.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_texts` | `list[str]` | List of input strings to tokenize. | *required* |
| `return_tensors` | `bool` | If `True`, return a tuple of tensors (`input_ids`, `attention_mask`); if `False`, return a `BatchEncoding`. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `BatchEncoding \| tuple[Tensor, Tensor]` | Tokenized output. |
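A sketch of both return modes, reusing the assumed `tok` instance from the construction example:

```python
input_texts = [
    "Hello world.",
    "Speculative decoding drafts tokens with a small model.",
]

# Default path: a tuple of tensors, each of shape
# [batch_size, seq_len] after padding.
input_ids, attention_mask = tok.tokenize(input_texts)

# Alternative path: the raw Hugging Face BatchEncoding.
encoding = tok.tokenize(input_texts, return_tensors=False)
```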