Conversation
ref STREAM-882

Introduce passthrough mode, so that a broker can be used to spawn tasks from any topic with any type of message format. This will make it easier to migrate existing consumers to be tasks instead, without changing data layout in prod. For more information refer to the ticket above.
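A rough sketch of what enabling the mode might look like on a consumer config (the `passthrough_*`, `namespace`, and `taskname` fields are from this diff; the surrounding struct and values are assumptions):

```rust
// Hypothetical consumer configuration; only the passthrough_* fields
// are introduced in this PR, everything else is illustrative.
let config = ConsumerConfig {
    topic: "legacy-events".to_string(),
    namespace: "legacy".to_string(),
    taskname: "process_legacy_event".to_string(),
    // Wrap raw Kafka bytes into TaskActivations instead of decoding them.
    passthrough_mode: true,
    // Explicit processing deadline for passthrough activations, in seconds.
    passthrough_processing_deadline_duration: 30,
    ..Default::default()
};
```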
```rust
/// Maps every application to its worker endpoint, both represented as strings.
pub worker_map: BTreeMap<String, String>,

/// Enable passthrough mode for consuming raw bytes from legacy topics.
```
Suggested change:

```diff
- /// Enable passthrough mode for consuming raw bytes from legacy topics.
+ /// Enable passthrough mode for consuming raw bytes from raw topics.
```
```rust
#[derive(Serialize)]
struct Params<'a> {
    args: (&'a [u8],),
    kwargs: HashMap<(), ()>,
}
```
For my own understanding, why define this struct here vs outside of the function?
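For illustration, the hoisted alternative I had in mind (a sketch; `PassthroughParams` is my name for it, not from this PR):

```rust
use std::collections::HashMap;

use serde::Serialize;

/// Same shape, hoisted to module scope so other encoders and tests can
/// reuse it; the lifetime keeps the payload borrowed rather than copied.
#[derive(Serialize)]
pub(crate) struct PassthroughParams<'a> {
    pub args: (&'a [u8],),
    pub kwargs: HashMap<(), ()>,
}
```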
```rust
/// In passthrough mode, raw Kafka message bytes are wrapped into TaskActivation.
pub passthrough_mode: bool,
```
Maybe a nit, but the naming feels backwards. Passthrough means letting things through:
- When consuming a tasks topic we expect activations and do not do anything to them until they are stored. That seems like the passthrough mode to me.
- When consuming raw topics we need to pre-process them into Activations before passing them to the rest of the pipeline.

Am I getting this wrong?
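To make the distinction concrete, here is how I picture the two paths (a sketch; all names here are mine, not from the PR):

```rust
// Illustrative only: how I read the two consumption modes.
match topic_kind {
    // Tasks topic: payload is already a TaskActivation and flows
    // through untouched until stored -- this reads as "passthrough" to me.
    TopicKind::Tasks => store(decode_activation(payload)?),
    // Raw topic: bytes must first be wrapped into a TaskActivation
    // before entering the pipeline -- pre-processing, not passthrough.
    TopicKind::Raw => store(wrap_into_activation(payload)),
}
```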
```rust
/// Processing deadline duration in seconds for passthrough activations.
pub passthrough_processing_deadline_duration: u64,
```
What would happen if we did not assign them a deadline? Today there is no explicit deadline apart from the max poll time in Kafka.
I am not saying we should set one; I'd much rather have a deadline that is explicit and that we control than let the infrastructure kill tasks in unexpected ways.
We would have to verify whether we have tasks that routinely take long periods of time in multiprocessing, or in any other setup that does not run into issues with the max poll time.
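If we want "no explicit deadline" to be representable, one option (the Option-typed field is my suggestion, not in this diff):

```rust
/// Processing deadline for passthrough activations, in seconds.
/// None deliberately defers to Kafka's max.poll.interval.ms rather than
/// letting an implicit zero be interpreted by the infrastructure.
pub passthrough_processing_deadline_duration: Option<u64>,
```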
```rust
    kwargs: HashMap::new(),
};

rmp_serde::to_vec_named(&params).map_err(|e| anyhow!("Failed to encode msgpack: {}", e))
```
What is a real-world scenario where this error would happen? Would we consider the message invalid and DLQ it? Please add a TODO to make this clear if that is the case.
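Concretely, the kind of TODO I'm asking for (wording is mine):

```rust
// TODO(STREAM-882): decide whether a msgpack encode failure marks the
// message invalid (and thus a DLQ candidate) or should crash the
// consumer; document the real-world scenario in which it can occur.
rmp_serde::to_vec_named(&params).map_err(|e| anyhow!("Failed to encode msgpack: {}", e))
```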
```rust
) -> impl Fn(Arc<OwnedMessage>) -> Result<InflightActivation, Error> {
    move |msg: Arc<OwnedMessage>| {
        let Some(payload) = msg.payload() else {
            return Err(anyhow!("Message has no payload"));
```
I think we need to be careful before erroring here:
https://github.com/getsentry/arroyo/blob/main/arroyo/backends/kafka/consumer.py#L521
Arroyo handles this by turning the null message into an empty binary string.
I doubt anybody processes those messages, but I cannot verify that either.
Would this be a DLQ scenario as well? If not, what is the value in crashing here?
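If we mirror Arroyo's behaviour, the change is small (a sketch):

```rust
// Mirror Arroyo: a null Kafka payload becomes an empty byte slice
// instead of an error that stops the consumer.
let payload = msg.payload().unwrap_or(&[]);
```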
```rust
let id = Uuid::new_v4().to_string();
let parameters_bytes = encode_passthrough_params(payload)?;
let now = Utc::now();
let received_at = prost_types::Timestamp {
```
Do we not have a task timestamp in taskbroker?
We have consumers that do something with old messages (where the broker timestamp is old), which is convenient; if we only provide the timestamp at which the task is received by the consumer, the information that the message is old is lost. Have you considered using the message timestamp here?
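A sketch of what I mean, assuming rdkafka's `Message::timestamp()`, falling back to receive time only when the broker timestamp is unavailable:

```rust
use chrono::{TimeZone, Utc};
use rdkafka::message::Message;
use rdkafka::Timestamp;

// Prefer the Kafka message timestamp so downstream code can still tell
// that a message is old; fall back to "now" only when it is missing.
let received_at_dt = match msg.timestamp() {
    Timestamp::CreateTime(ms) | Timestamp::LogAppendTime(ms) => Utc
        .timestamp_millis_opt(ms)
        .single()
        .unwrap_or_else(Utc::now),
    Timestamp::NotAvailable => Utc::now(),
};
let received_at = prost_types::Timestamp {
    seconds: received_at_dt.timestamp(),
    nanos: received_at_dt.timestamp_subsec_nanos() as i32,
};
```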
```rust
metrics::histogram!(
    "consumer.passthrough.payload_size_bytes",
    "namespace" => config.namespace.clone(),
    "taskname" => config.taskname.clone()
)
.record(payload.len() as f64);
```
Are these metrics batched or sampled? Producing metrics per message is a performance bottleneck.
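For example (a sketch; `batch` and the loop are mine): resolve the histogram handle once per poll batch so the key and label construction is not repeated for every message:

```rust
// Sketch: build the histogram handle once per batch instead of once
// per message; recording on an already-resolved handle is much cheaper.
let payload_size = metrics::histogram!(
    "consumer.passthrough.payload_size_bytes",
    "namespace" => config.namespace.clone(),
    "taskname" => config.taskname.clone()
);
for msg in &batch {
    payload_size.record(msg.payload().map_or(0, |p| p.len()) as f64);
}
```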