naive_speculate.infer.inferencer.chunkwise
Define ChunkwiseDecodeInferencer, an Inferencer implementation that uses a chunk-wise decoding strategy.
ChunkwiseDecodeInferencer
Bases: BasicInferencer
ChunkwiseDecodeInferencer implements chunk-wise decoding to reduce device synchronization overhead.
Instead of checking for the EOS token after every decode iteration, it checks only once every decode_chunk_size iterations,
which reduces the device synchronization overhead caused by frequent EOS token checks.
See the base class BasicInferencer for more details.
Attributes:
| Name | Type | Description |
|---|---|---|
| decode_chunk_size | int | Number of tokens to decode before checking for EOS token. |
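To give a feel for how decode_chunk_size affects the number of EOS checks, the arithmetic can be sketched as below. This is only an illustration, not part of the library: the helper name `eos_check_count` is hypothetical, and it assumes that each EOS check costs one host-device synchronization and that a final partial chunk is still checked once.

```python
import math


def eos_check_count(num_decode_steps: int, decode_chunk_size: int) -> int:
    """Count EOS checks when checking once per chunk (hypothetical helper).

    Assumes the last chunk is checked even if it is shorter than
    decode_chunk_size.
    """
    if num_decode_steps < 0:
        raise ValueError("num_decode_steps must be non-negative")
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return math.ceil(num_decode_steps / decode_chunk_size)


# Checking after every token is the special case decode_chunk_size == 1.
per_token_checks = eos_check_count(256, 1)  # 256 checks
chunked_checks = eos_check_count(256, 8)    # 32 checks
```

The trade-off is that a larger chunk can decode up to decode_chunk_size - 1 tokens past the EOS token before the check notices it.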
decode(query_token_ids, kv_cache, max_new_tokens, sample_strategy)
Process query_token_ids, then generate new tokens autoregressively.
Check for the EOS token once every self.decode_chunk_size generation iterations.
See the Inferencer.decode interface for more details.
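The control flow described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's implementation: `chunkwise_decode`, `next_token`, and `eos_token_id` are hypothetical names, kv_cache and sample_strategy are omitted, and discarding tokens decoded past EOS within a chunk is an assumption about the behavior.

```python
from typing import Callable, List


def chunkwise_decode(
    next_token: Callable[[List[int]], int],  # hypothetical single-step model call
    query_token_ids: List[int],
    max_new_tokens: int,
    decode_chunk_size: int,
    eos_token_id: int,
) -> List[int]:
    """Generate tokens, checking for EOS only once per chunk."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")

    context = list(query_token_ids)
    generated: List[int] = []
    while len(generated) < max_new_tokens:
        # Decode one chunk without inspecting the tokens in between.
        chunk_len = min(decode_chunk_size, max_new_tokens - len(generated))
        chunk: List[int] = []
        for _ in range(chunk_len):
            token = next_token(context)
            context.append(token)
            chunk.append(token)

        # Single EOS check per chunk; in a GPU setting this is where the
        # host would synchronize with the device.
        if eos_token_id in chunk:
            # Assumption: tokens decoded after EOS in this chunk are dropped.
            generated.extend(chunk[: chunk.index(eos_token_id) + 1])
            break
        generated.extend(chunk)
    return generated


def toy_next_token(context: List[int]) -> int:
    """Toy stand-in for a model: always emit the last token plus one."""
    return context[-1] + 1


tokens = chunkwise_decode(
    toy_next_token,
    query_token_ids=[0],
    max_new_tokens=10,
    decode_chunk_size=4,
    eos_token_id=5,
)
print(tokens)  # [1, 2, 3, 4, 5]
```

In this toy run the EOS token (5) is the first token of the second chunk, so eight decode steps are executed to produce five kept tokens. That wasted work is the cost paid for checking less often.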