naive_speculate.infer.inferencer.chunkwise

Define ChunkwiseDecodeInferencer, which implements Inferencer with a chunk-wise decoding strategy.

ChunkwiseDecodeInferencer

Bases: BasicInferencer

ChunkwiseDecodeInferencer implements chunk-wise decoding to reduce device synchronization overhead.

ChunkwiseDecodeInferencer checks for the EOS token only once every decode_chunk_size decoding iterations, rather than after every generated token. This reduces the device synchronization overhead caused by frequent EOS checks.

Refer to the base class BasicInferencer for more details.

Attributes:

decode_chunk_size (int): Number of tokens to decode before checking for the EOS token.
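The trade-off behind decode_chunk_size can be put in numbers. The sketch below is illustrative only: both helper functions are hypothetical and not part of naive_speculate, and it assumes that the final chunk may be shorter than decode_chunk_size and that tokens produced after an EOS inside a chunk are discarded.

```python
import math


def eos_check_count(max_new_tokens: int, decode_chunk_size: int) -> int:
    """Upper bound on the number of EOS checks when checking once per chunk."""
    return math.ceil(max_new_tokens / decode_chunk_size)


def max_wasted_tokens(decode_chunk_size: int) -> int:
    """Worst case: EOS is the first token of a chunk and the rest is discarded."""
    return decode_chunk_size - 1


print(eos_check_count(256, 1))    # 256 checks: one per token
print(eos_check_count(256, 16))   # 16 checks: one per chunk
print(max_wasted_tokens(16))      # 15 tokens decoded past EOS at worst
```

Under these assumptions, a larger decode_chunk_size means fewer synchronization points, at the cost of potentially decoding more tokens past the EOS.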

decode(query_token_ids, kv_cache, max_new_tokens, sample_strategy)

Process query_token_ids and generate new tokens autoregressively.

Check for the EOS token once every self.decode_chunk_size generation iterations.

Refer to the interface Inferencer.decode for more details.
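The chunk-wise EOS check can be sketched in plain Python. This is a minimal illustration, not the library's implementation: chunkwise_decode, next_token, and eos_token_id are hypothetical names, the kv_cache and sample_strategy arguments of the real decode are omitted, and truncating the output at the first EOS is an assumption.

```python
from typing import Callable, List


def chunkwise_decode(
    next_token: Callable[[List[int]], int],  # hypothetical single-step decoder
    query_token_ids: List[int],
    max_new_tokens: int,
    decode_chunk_size: int,
    eos_token_id: int,
) -> List[int]:
    """Generate tokens, checking for EOS only once per chunk."""
    context = list(query_token_ids)
    generated: List[int] = []
    while len(generated) < max_new_tokens:
        # The last chunk may be shorter so the token budget is not exceeded.
        steps = min(decode_chunk_size, max_new_tokens - len(generated))
        chunk: List[int] = []
        for _ in range(steps):
            token = next_token(context)
            context.append(token)
            chunk.append(token)
        # Single EOS check per chunk; on a real device this would be the
        # only point that needs to synchronize with the host.
        if eos_token_id in chunk:
            generated.extend(chunk[: chunk.index(eos_token_id) + 1])
            break
        generated.extend(chunk)
    return generated


def counting_step(context: List[int]) -> int:
    """Toy decoder: the next token is the current context length."""
    return len(context)


print(chunkwise_decode(counting_step, [0, 0, 0], 10, 4, eos_token_id=5))
# [3, 4, 5]
```

In the toy run, the first chunk produces 3, 4, 5, 6 before EOS is checked; token 6 is decoded but then dropped, which is the extra work traded for fewer checks.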