Response Recording and Resumption
This guide covers streaming responses with recording, resumption, and cancellation so users can reconnect after a disconnect and pick up where they left off. For a conceptual overview of how response recording and resumption works, including the gRPC service contract and multi-instance redirect behavior, see Response Recording and Resumption Concepts.
The route stays /response-resumption/ for link stability, but the feature name used in the Quarkus docs is “Response Recording and Resumption.”
Prerequisites
Starting checkpoint: This guide starts from java/quarkus/examples/doc-checkpoints/03-with-history
Make sure you’ve completed the previous guides:
- Getting Started - Basic memory service integration
- Conversation History - History recording and APIs
Streaming Responses
When a user disconnects during a streaming response (page reload, network issues), you can resume the stream from where it left off. This requires the agent to produce streaming responses, so let's update the agent to return Multi<String> instead of String.
Update the Agent.java to use streaming responses:
```java
package org.acme;

import dev.langchain4j.service.MemoryId;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
@RegisterAiService
public interface Agent {

    Multi<String> chat(@MemoryId String conversationId, String userMessage);
}
```

What changed: The chat method return type changed from String to Multi<String>. Why: Multi<String> is Mutiny's reactive stream type and tells LangChain4j to emit each token as it arrives from the LLM rather than waiting for the complete response. This is a prerequisite for streaming to clients and for response recording and resumption: the Memory Service can only replay a stream if it was produced as a stream in the first place.
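To see why the return type matters, here is a minimal, framework-free sketch of the difference a streaming shape makes (toy code, not the Mutiny API): the caller can observe partial output before the response completes, which a blocking String return can never offer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StreamingVsBlocking {

    // Toy "LLM" that hands tokens to the caller one at a time.
    static void generate(Consumer<String> onToken) {
        for (String token : List.of("Once", " upon", " a", " time.")) {
            onToken.accept(token); // the caller sees each token immediately
        }
    }

    public static void main(String[] args) {
        List<String> partials = new ArrayList<>();
        StringBuilder soFar = new StringBuilder();
        // Every intermediate state is observable as the stream progresses;
        // a blocking return would only ever yield the final element.
        generate(token -> partials.add(soFar.append(token).toString()));
        System.out.println(partials.get(0));                   // "Once"
        System.out.println(partials.get(partials.size() - 1)); // "Once upon a time."
    }
}
```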
Update the HistoryRecordingAgent.java to use streaming responses:
```java
package org.acme;

import io.github.chirino.memory.history.annotations.ConversationId;
import io.github.chirino.memory.history.annotations.RecordConversation;
import io.github.chirino.memory.history.annotations.UserMessage;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class HistoryRecordingAgent {

    private final Agent agent;

    @Inject
    public HistoryRecordingAgent(Agent agent) {
        this.agent = agent;
    }

    @RecordConversation
    public Multi<String> chat(
            @ConversationId String conversationId, @UserMessage String userMessage) {
        return agent.chat(conversationId, userMessage);
    }
}
```

What changed: The wrapper's chat method return type was updated from String to Multi<String> to match the agent. Why: The @RecordConversation interceptor needs to observe the full token stream to record the complete response only after it finishes. By passing through Multi<String>, the interceptor can transparently tap into the stream, accumulate tokens, and write the final entry to the history channel once the stream completes.
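The interceptor's accumulate-then-record behavior can be sketched with the JDK's own reactive-streams types (java.util.concurrent.Flow). This is an illustration of the idea, not the library's actual implementation: tap every token as it passes through, and write the full text only when the stream completes.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class RecordingTap {

    // Pass tokens through a publisher; a subscriber taps each one into a
    // buffer and "records" the complete response only on stream completion.
    static String recordAfterStream(List<String> tokens) {
        StringBuilder accumulated = new StringBuilder();
        StringBuilder recorded = new StringBuilder();
        CountDownLatch done = new CountDownLatch(1);
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<String>() {
                public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); }
                public void onNext(String token) { accumulated.append(token); } // tap each token
                public void onError(Throwable t) { done.countDown(); }
                public void onComplete() {
                    recorded.append(accumulated); // write the final entry once finished
                    done.countDown();
                }
            });
            tokens.forEach(publisher::submit);
        } // close() signals onComplete after all submitted tokens are delivered
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return recorded.toString();
    }

    public static void main(String[] args) {
        System.out.println(recordAfterStream(List.of("Once", " upon", " a time.")));
    }
}
```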
Update ChatResource.java to stream the responses to the client:

```java
package org.acme;

import io.smallrye.common.annotation.Blocking;
import io.smallrye.mutiny.Multi;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/chat")
public class ChatResource {

    @Inject HistoryRecordingAgent agent;

    @POST
    @Path("/{conversationId}")
    @Blocking
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.TEXT_PLAIN)
    public Multi<String> chat(
            @PathParam("conversationId") String conversationId, String userMessage) {
        return agent.chat(conversationId, userMessage);
    }
}
```

What changed: The chat method now returns Multi<String> instead of String. Why: Returning a reactive stream from a JAX-RS resource method causes Quarkus REST to write each emitted token to the HTTP client as it arrives. Because this endpoint produces text/plain, the response is streamed as an incrementally written plain-text body, so the client receives tokens as the LLM produces them instead of waiting for the full response.
Test it with curl:
```shell
curl -NsSfX POST http://localhost:9090/chat/bafd53b0-ee0d-4d5f-83e3-406fbc3506cb \
  -H "Content-Type: text/plain" \
  -H "Authorization: Bearer $(get-token)" \
  -d "Write a 4 paragraph story about a cat."
```

Example output:
Once upon a time, there was a curious little cat named Whiskers who loved to explore the garden behind her cozy cottage. Every morning, she would stretch lazily in the sunlight before venturing out to discover what new wonders the day might bring.
One particularly bright afternoon, Whiskers stumbled upon a mysterious path she had never seen before. It wound through a thicket of wildflowers and disappeared into a shady grove of ancient oak trees.
At the end of the path, she discovered a beautiful hidden pond where dragonflies danced across the water and frogs sang their evening chorus. Whiskers sat quietly by the edge, mesmerized by the peaceful scene.
As the sun began to set, painting the sky in shades of orange and pink, Whiskers made her way back home. She curled up by the fireplace, purring contentedly, already dreaming of tomorrow's adventure at her secret pond.

You should see the response streaming to your command line.
Now browse to the demo agent app at http://localhost:8080/?conversationId=bafd53b0-ee0d-4d5f-83e3-406fbc3506cb and you should see the response streaming to the browser.
Response Recording and Resumption
Create a new ResumeResource.java that exposes endpoints for checking, resuming, and canceling in-progress responses:
```java
package org.acme;

import static io.github.chirino.memory.security.SecurityHelper.bearerToken;

import io.github.chirino.memory.history.runtime.ResponseRecordingManager;
import io.github.chirino.memory.runtime.MemoryServiceProxy;
import io.quarkus.security.identity.SecurityIdentity;
import io.smallrye.common.annotation.Blocking;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import java.util.List;

@Path("/v1/conversations")
@ApplicationScoped
public class ResumeResource {

    @Inject ResponseRecordingManager recordingManager;
    @Inject SecurityIdentity securityIdentity;
    @Inject MemoryServiceProxy proxy;

    @POST
    @Produces(MediaType.APPLICATION_JSON)
    @Consumes(MediaType.APPLICATION_JSON)
    @Path("/resume-check")
    public List<String> check(List<String> conversationIds) {
        return recordingManager.check(conversationIds, bearerToken(securityIdentity));
    }

    @GET
    @Path("/{conversationId}/resume")
    @Blocking
    @Produces(MediaType.SERVER_SENT_EVENTS)
    public Multi<String> resume(@PathParam("conversationId") String conversationId) {
        String bearerToken = bearerToken(securityIdentity);
        return recordingManager.replay(conversationId, bearerToken);
    }

    @POST
    @Path("/{conversationId}/cancel")
    public Response cancelResponse(@PathParam("conversationId") String conversationId) {
        return proxy.cancelResponse(conversationId);
    }
}
```

What changed: Created ResumeResource with three endpoints: POST /resume-check to check whether a response is in progress for a list of conversation IDs, GET /{conversationId}/resume to replay the in-progress stream via SSE, and POST /{conversationId}/cancel to abort a running response. Why: When a user disconnects mid-stream (page reload, network drop), the LLM is still generating tokens on the server. The ResponseRecordingManager lets the client reconnect and receive the remainder of the response without resubmitting the message, and cancelResponse lets users abort a generation that is taking too long.
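On the client side, the intended reconnect sequence is: after a reload, POST the known conversation IDs to /resume-check, then open /resume only for the conversations still mid-response. A sketch of that decision logic follows; the ResumeApi interface and its stub are hypothetical stand-ins for the two HTTP calls, and it assumes resume-check returns the subset of IDs that still have a response in progress.

```java
import java.util.List;

public class ReconnectFlow {

    // Hypothetical stand-in for the two HTTP endpoints, for illustration only.
    interface ResumeApi {
        List<String> resumeCheck(List<String> conversationIds); // POST .../resume-check
        void resume(String conversationId);                     // GET  .../{id}/resume (SSE)
    }

    // Resume only the conversations the server reports as still in progress.
    static List<String> reconnect(ResumeApi api, List<String> knownIds) {
        List<String> inProgress = api.resumeCheck(knownIds);
        inProgress.forEach(api::resume);
        return inProgress;
    }

    public static void main(String[] args) {
        // Stubbed server: only one of the two conversations is mid-response.
        ResumeApi stub = new ResumeApi() {
            public List<String> resumeCheck(List<String> ids) {
                return List.of("bafd53b0-ee0d-4d5f-83e3-406fbc3506cb");
            }
            public void resume(String id) { /* would attach to the SSE stream */ }
        };
        List<String> resumed = reconnect(stub,
                List.of("bafd53b0-ee0d-4d5f-83e3-406fbc3506cb",
                        "11111111-2222-3333-4444-555555555555"));
        System.out.println(resumed.size()); // 1
    }
}
```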
Test it with curl:
```shell
curl -NsSfX POST http://localhost:9090/chat/bafd53b0-ee0d-4d5f-83e3-406fbc3506cb \
  -H "Content-Type: text/plain" \
  -H "Authorization: Bearer $(get-token)" \
  -d "Write a 4 paragraph story about a cat."
```
While the response is streaming, use the following command to check whether the conversation has a response in progress:
```shell
curl -sSfX POST http://localhost:9090/v1/conversations/resume-check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(get-token)" \
  -d '["bafd53b0-ee0d-4d5f-83e3-406fbc3506cb"]' | jq
```
And to resume a conversation, you can run the following command in a new terminal:
```shell
curl -NsSfX GET http://localhost:9090/v1/conversations/bafd53b0-ee0d-4d5f-83e3-406fbc3506cb/resume \
  -H "Authorization: Bearer $(get-token)"
```
You should see the response streaming to your command line.
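The resume endpoint produces Server-Sent Events, so each replayed token arrives as a data: line followed by a blank line. Here is a minimal stdlib sketch of pulling the payloads back out of an SSE body (simplified: it ignores event:, id:, retry:, and multi-line data fields):

```java
import java.util.ArrayList;
import java.util.List;

public class SseDataParser {

    // Collect the payload of every "data:" line in an SSE stream body.
    static List<String> dataPayloads(String sseBody) {
        List<String> payloads = new ArrayList<>();
        for (String line : sseBody.split("\n", -1)) {
            if (line.startsWith("data:")) {
                String payload = line.substring(5);
                // Per the SSE format, a single space after the colon is padding.
                if (payload.startsWith(" ")) {
                    payload = payload.substring(1);
                }
                payloads.add(payload);
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        String stream = "data: Once\n\ndata:  upon\n\ndata:  a time.\n\n";
        System.out.println(String.join("", dataPayloads(stream))); // Once upon a time.
    }
}
```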
Canceling a Response
To cancel a response, you can run the following command:
```shell
curl -sSfX POST http://localhost:9090/v1/conversations/bafd53b0-ee0d-4d5f-83e3-406fbc3506cb/cancel \
  -H "Authorization: Bearer $(get-token)"
```
You should see the streaming response stop.
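Cancellation acts on the same reactive pipeline: once the response is canceled, the subscription to the token stream is torn down and no further tokens are delivered. A stdlib java.util.concurrent.Flow sketch of that effect (an illustration of stream cancellation, not the Memory Service internals):

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.atomic.AtomicInteger;

public class CancelDemo {

    // Deliver tokens to a subscriber that cancels after `cancelAfter` items;
    // returns how many tokens were actually delivered.
    static int deliveredBeforeCancel(List<String> tokens, int cancelAfter) {
        AtomicInteger delivered = new AtomicInteger();
        CountDownLatch finished = new CountDownLatch(1);
        SubmissionPublisher<String> publisher = new SubmissionPublisher<>();
        publisher.subscribe(new Flow.Subscriber<String>() {
            Flow.Subscription subscription;
            public void onSubscribe(Flow.Subscription s) {
                subscription = s;
                s.request(cancelAfter); // only request what we will accept
            }
            public void onNext(String token) {
                if (delivered.incrementAndGet() == cancelAfter) {
                    subscription.cancel(); // user aborted: stop the stream
                    finished.countDown();
                }
            }
            public void onError(Throwable t) { finished.countDown(); }
            public void onComplete() { finished.countDown(); }
        });
        tokens.forEach(publisher::submit);
        publisher.close();
        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return delivered.get();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Once", " upon", " a", " time.");
        // The remaining tokens are never delivered after cancellation.
        System.out.println(deliveredBeforeCancel(tokens, 2)); // 2
    }
}
```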
Next Steps
- Conversation Sharing - Share conversations with other users.
- Dev Services - Automatic memory-service container startup for development and testing.