<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Bits & Bytes: AI,LLMs and latest tech]]></title><description><![CDATA[My thoughts and experiences as an AI engineer]]></description><link>https://thoughts.bitsnbytes.in/s/aillms-and-latest-tech</link><image><url>https://substackcdn.com/image/fetch/$s_!oZ9L!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7f649e-2ae7-41dc-a4fc-f44298a46b4c_1024x1024.png</url><title>Bits &amp; Bytes: AI,LLMs and latest tech</title><link>https://thoughts.bitsnbytes.in/s/aillms-and-latest-tech</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 05:27:34 GMT</lastBuildDate><atom:link href="https://thoughts.bitsnbytes.in/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aravind]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aravindkondamudi@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aravindkondamudi@substack.com]]></itunes:email><itunes:name><![CDATA[Aravind]]></itunes:name></itunes:owner><itunes:author><![CDATA[Aravind]]></itunes:author><googleplay:owner><![CDATA[aravindkondamudi@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aravindkondamudi@substack.com]]></googleplay:email><googleplay:author><![CDATA[Aravind]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Building your first agentic system]]></title><description><![CDATA[The While Loop is the Easy Part: 5 Lessons from building Agentic systems]]></description><link>https://thoughts.bitsnbytes.in/p/building-your-first-agentic-system</link><guid 
isPermaLink="false">https://thoughts.bitsnbytes.in/p/building-your-first-agentic-system</guid><dc:creator><![CDATA[Aravind]]></dc:creator><pubDate>Tue, 06 Jan 2026 18:20:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SLu1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SLu1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SLu1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SLu1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5727036,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/183641336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SLu1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!SLu1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4aaecc-9699-478e-bdc6-ee3712b36e5b_2816x1536.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://thoughts.bitsnbytes.in/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://thoughts.bitsnbytes.in/subscribe?"><span>Subscribe now</span></a></p><p><em>PS: Content and direction are mine, article heavily edited using AI.</em></p><p><br>2025 was called the year of agents. But until Claude Code shipped, truly agentic systems were sparse. 
I had built a multi-agent system earlier (<a href="https://thoughts.bitsnbytes.in/p/learnings-from-implementing-multi">read here</a>), but it was more orchestration than agentic.<br><br>Then it struck me: Claude Code, Codex &#8212; <em>they&#8217;re just while loops.</em><br><br>After exploring open-source tools and using Claude Code and Codex, I built a data analysis agent: upload a file, ask any question, and the agent figures out the columns it needs, runs the analysis, and returns a summary with charts. <a href="https://github.com/Aravgit/data-analyst.git">Code shared here</a>.<br><br>It worked great in development. Then it broke. I then built four or five more versions, each of which failed at a different stage before reaching a stable one.<br><br>This post covers five <strong>reliability lessons</strong> that don&#8217;t show up in tutorials &#8212; but <em>will</em> show up in your debugging logs.</p><h3><strong>TL;DR: The Agent Builder Checklist</strong></h3><ul><li><p><strong>Memory isn&#8217;t free:</strong> Runtimes (GC or Ownership) won't save you from logical leaks in session state. Track your memory usage and explicitly clear intermediate resources after every turn.</p></li><li><p><strong>Prompt for efficiency:</strong> LLMs are greedy; they will load 50 columns when they need 1. Use proper context engineering.</p></li><li><p><strong>Compact between turns:</strong> Never truncate context mid-reasoning. Track token usage and prune &#8220;dead weight&#8221; before the next turn starts.</p></li><li><p><strong>Schemas are contracts:</strong> Vague tool descriptions lead to hallucinated parameters. Use strict, explicit schemas.</p></li><li><p><strong>Set circuit breakers:</strong> Agents loop until they fail or go broke. Implement hard limits on tool calls and retries.</p></li></ul><div><hr></div><h2>1) Memory That Never Dies</h2><p><strong>The assumption:</strong> Each agent turn is stateless.
Resources clean up automatically.<br><strong>The reality:</strong> Most implementations keep session state between tool calls &#8212; intermediate results, loaded resources, computed values. That state accumulates.</p><pre><code>def execute_tool(self, tool_name, args):
    result = run_tool(tool_name, args)
    self.resources[args["name"]] = result  # accumulates forever
    return result</code></pre><p>Multiply that by 100 concurrent users running multi-step chains. Memory climbs. Garbage collection can&#8217;t help because references still exist. Eventually: OOM crashes.</p><p><strong>What happened:</strong> Data files loaded for analysis stayed in RAM after each turn. Python&#8217;s GC couldn&#8217;t reclaim them because the session held references. With concurrent users, memory grew unbounded until containers crashed.</p><p><strong>The fix:</strong> Explicit lifecycle management (clear references + close handles + enforce TTL/LRU).</p><pre><code><code>import gc

def reset_after_turn(session):
    # If your objects have a close()/cleanup() method, call it here.
    for key, obj in list(session.resources.items()):
        try:
            if hasattr(obj, "close"):
                obj.close()  # release file handles / C-extension buffers
        except Exception:
            pass  # best-effort: one failing close() must not block the rest
        finally:
            session.resources.pop(key, None)
            del obj  # drop the loop-local reference too

    session.resources.clear()
    gc.collect()  # GC is a backstop; real fix is removing session refs + closing handles
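
# --- TTL/LRU bound (an illustrative sketch; BoundedResources and the cap/TTL
# --- values below are assumptions, not part of the original implementation).
# Even when a cleanup path is missed, stale entries get evicted instead of
# accumulating until the container OOMs.
import time
from collections import OrderedDict

class BoundedResources(OrderedDict):
    def __init__(self, max_items=20, ttl_seconds=600):
        super().__init__()
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds

    def put(self, name, obj):
        self[name] = (obj, time.time())
        self.move_to_end(name)
        now = time.time()
        expired = [k for k, (_, ts) in self.items() if now - ts > self.ttl_seconds]
        for k in expired:
            self.pop(k, None)
        while len(self) > self.max_items:
            self.popitem(last=False)  # evict the least recently used entry

    def get_obj(self, name):
        obj, ts = self[name]
        self.move_to_end(name)  # refresh recency on access
        return obj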
</code></code></pre><p><strong>Key insight:</strong> GC is a backstop, not a solution. If your session holds references, memory won't be freed. This is especially dangerous when using data libraries like <strong>Pandas or NumPy</strong>&#8212;they often allocate memory in C-extensions that Python&#8217;s GC struggles to track. Explicitly drop references + close resources, and enforce TTL/LRU for bounded session state.</p><div><hr></div><h2>2) The Agent Loads Everything</h2><p><strong>The assumption:</strong> The model will make efficient choices about resource loading.</p><p><strong>The reality:</strong> LLMs optimize for correctness, not efficiency. Given &#8220;load everything&#8221; vs &#8220;load only what&#8217;s needed,&#8221; they choose safety every time.</p><p><strong>What happened:</strong> I provided three loading tools &#8212; <code>sample</code> (schema only), <code>partial</code> (specific columns), and <code>full</code> (everything). The agent consistently chose full loads even for &#8220;what&#8217;s the average of column X?&#8221; Loading 50 columns when it needed 1.</p><p><strong>The fix:</strong> It wasn&#8217;t better tools &#8212; it was a better system prompt. Don&#8217;t just provide tools; tell the agent <strong>when</strong> to use each:</p><pre><code><code>LOADING STRATEGY (follow this order):
1) get_metadata()          - Check what's available FIRST
2) load_partial(fields)    - PRIMARY: Load ONLY what's needed
3) load_full()             - RARE: Only when ALL fields are required

Examples:
- "average sales by region" -&gt; load_partial(["region", "sales"])
- "filter by date"          -&gt; load_partial(["date", "amount"])
</code></code></pre><p>The pattern: provide a clear hierarchy with concrete examples. It&#8217;s still prompt engineering in 2026.</p><div><hr></div><h2>3) Context Overflow Mid-Turn</h2><p><strong>The assumption:</strong> Context windows are large enough.</p><p><strong>The reality:</strong> In practice, I found ~100k&#8211;250k tokens to be the usable range. The industry is shifting its focus from ever-larger context windows to context engineering. Agent loops accumulate context fast &#8212; tool calls, tool results, intermediate reasoning &#8212; and it all adds up, so the context needs deliberate compaction. Hit the limit mid-turn and the user gets nothing.</p><p>A typical turn:</p><ul><li><p>user message (100 tokens)</p></li><li><p>system prompt (1000)</p></li><li><p>tool call (300)</p></li><li><p>tool result (2000)</p></li><li><p>another call (100)</p></li><li><p>another result (1500)</p></li></ul><p>After a few turns, you&#8217;re at 80% capacity. One unexpectedly large tool result (e.g., analysis output on a 1000&#215;10 table) and you hit the wall mid-turn. The model can&#8217;t complete. The user&#8217;s question goes unanswered.</p><p><strong>What happened:</strong> Our compaction logic triggered mid-turn when tokens exceeded a threshold. This interrupted the agent&#8217;s reasoning. Users asked questions and got nothing back.</p><p><strong>The fix:</strong> Compact <strong>between</strong> turns, never during. Also be deliberate about <em>what</em> you compact: messages, tool outputs, raw data dumps &#8212; anything that won&#8217;t help the next turn. The goal here is <strong>Context Precision</strong>. By compacting between turns, you ensure the agent starts every new reasoning step with a &#8220;clean&#8221; but relevant history, rather than hitting a wall halfway through a critical computation.</p><pre><code><code>def run_agent_turn(session, message):
    # Check BEFORE the turn, not during
    if session.total_tokens &gt; (MAX_TOKENS * 0.80):
        compact_history(session)  # summarize / prune / store large artifacts out-of-band
    return execute_turn(session, message)
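
# One possible compact_history (a sketch; the message structure -- dicts with
# "role"/"content" -- plus KEEP_LAST_MESSAGES and the token estimate below
# are assumptions, not the original implementation):
KEEP_LAST_MESSAGES = 6  # keep the most recent turns verbatim

def estimate_tokens(messages):
    # crude estimate: roughly 4 characters per token (assumption)
    return sum(len(m["content"]) for m in messages) // 4

def compact_history(session):
    head = session.messages[:-KEEP_LAST_MESSAGES]
    tail = session.messages[-KEEP_LAST_MESSAGES:]
    for msg in head:
        # Old tool results are dead weight; their conclusions already live
        # in later assistant messages.
        if msg["role"] == "tool":
            msg["content"] = "[tool result pruned between turns]"
    session.messages = head + tail
    session.total_tokens = estimate_tokens(session.messages)  # recount after pruning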
</code></code></pre><p>Don&#8217;t punish the user for token limits. Track usage and proactively compact before hitting the ceiling (unless your KPI is token usage rather than outcomes; Claude Code, I am talking to you).</p><div><hr></div><h2>4) Tool Schemas Are Your API Contract</h2><p><strong>The assumption:</strong> Tool descriptions are just documentation for the model.</p><p><strong>The reality:</strong> Tool schemas are the contract between the model <em>and</em> your tool runner. The model uses them to decide what to call and how. Your runner uses them to validate and execute. Vague schemas = wrong calls = runtime errors on both sides.</p><pre><code><code>{
  "name": "process_data",
  "description": "Process the data",
  "parameters": {
    "data": { "type": "string" },
    "options": { "type": "object" }
  }
}
</code></code></pre><p>What data? What format? What options? The model guesses &#8212; and guesses wrong.</p><p><strong>The fix:</strong> Whether you&#8217;re using JSON Schema, a simplified tool schema format, or Pydantic type-safe models, the point is the same: make the schema strict and explicit:</p><pre><code><code>{
  "name": "load_dataset_columns",
  "description": "Load specific columns from a dataset. Returns a DataFrame.",
  "parameters": {
    "dataset_name": {
      "type": "string",
      "description": "Name from get_dataset_info()"
    },
    "columns": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Column names to load"
    },
    "df_name": {
      "type": "string",
      "description": "Variable name for the DataFrame in REPL"
    }
  },
  "required": ["dataset_name", "columns", "df_name"]
}
</code></code></pre><p><strong>Key Insights</strong>:</p><ul><li><p>prefer explicit enums over free-form strings</p></li><li><p>describe relationships between tools (&#8220;use output from X&#8221;)</p></li><li><p>document return values (shape + keys)</p></li><li><p>mark required fields clearly</p></li></ul><p>The model can only be as precise as your schema allows.</p><div><hr></div><h2>5) Runaway Loops Will Burn Your Budget</h2><p><strong>The assumption:</strong> The agent will finish in a reasonable number of steps.</p><p><strong>The reality:</strong> Without limits, agents can loop indefinitely. A confused model retries the same failing tool. A complex query spawns endless sub-tasks. Each iteration costs tokens, time, and money.</p><p><strong>What happened:</strong> An edge case caused the agent to repeatedly call the same tool with slightly different parameters, expecting different results.</p><p><strong>The fix:</strong> Hard limits at multiple levels: max tool calls per turn, max retries per tool, and a timeout per turn.</p><pre><code><code>MAX_TOOL_CALLS_PER_TURN = 10
MAX_RETRIES_SAME_TOOL = 3

def run_agent_turn(session, message):
    tool_calls = 0
    tool_counts = {}
    done = False

    while not done:
        response = model.generate(...)
        done = not response.has_tool_calls  # a plain answer ends the loop

        if response.has_tool_calls:
            for call in response.tool_calls:
                tool_counts[call.name] = tool_counts.get(call.name, 0) + 1
                if tool_counts[call.name] &gt; MAX_RETRIES_SAME_TOOL:
                    return {"error": "Tool retry limit exceeded"}

            tool_calls += len(response.tool_calls)
            if tool_calls &gt; MAX_TOOL_CALLS_PER_TURN:
                return {"error": "Tool call limit exceeded"}
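
# Per-call timeout wrapper (a sketch; run_tool_with_timeout and the default
# limit are assumptions -- the loop above doesn't show how tools execute).
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ToolTimeout

def run_tool_with_timeout(tool_fn, args, timeout_s=30.0):
    # NOTE: Python threads can't be force-killed; on timeout the worker keeps
    # running in the background. Prefer a process pool for CPU-heavy tools.
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(tool_fn, **args).result(timeout=timeout_s)
    except ToolTimeout:
        return {"error": f"Tool call exceeded {timeout_s}s timeout"}
    finally:
        pool.shutdown(wait=False)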
</code></code></pre><p>When limits hit, fail gracefully with a clear message. The user can retry with a narrower question, or you can ask a clarifying question. Also add per-tool timeouts, exponential backoff, and a circuit breaker when the same error repeats.</p><div><hr></div><h2>Why Not Use Agentic Libraries?</h2><p>You might ask: why build from primitives instead of using LangChain, CrewAI, or similar frameworks?</p><p>Honestly, I chose not to. These libraries are still maturing, and the tools I admired &#8212; Claude Code, Codex &#8212; were built on raw foundations, not abstractions. I wanted to understand the while loop, not hide it.</p><p>That said, frameworks have their place. If you need to ship fast and your use case fits their patterns, use them. But if you&#8217;re debugging a memory leak at 2:30 AM (yes, I hit the Claude limits too &#128517;), you&#8217;ll want to know what&#8217;s actually happening under the hood.</p><div><hr></div><h2>Conclusion</h2><p>The biggest mistake is thinking an agent is just &#8220;an LLM in a while loop with tools.&#8221; It&#8217;s not. It&#8217;s a system with:</p><ul><li><p><strong>State management</strong> (sessions, resources, context)</p></li><li><p><strong>Resource constraints</strong> (memory, tokens, time)</p></li><li><p><strong>Failure modes</strong> (partial, silent, cascading)</p></li><li><p><strong>Concurrency concerns</strong> (multiple users, multiple turns)</p></li></ul><p>The while loop is the easy part. 
Everything around it &#8212; context management, resource efficiency, error handling, prompt engineering &#8212; that&#8217;s where the work lives.</p><p>Build agents like you&#8217;d build any production system: explicit resource management, defensive error handling, clear contracts, and guardrails everywhere.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://thoughts.bitsnbytes.in/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://thoughts.bitsnbytes.in/subscribe?"><span>Subscribe now</span></a></p><p>If you&#8217;re interested in the intersection of LLM orchestration and reliable AI engineering to solve analytics and other interesting problems, subscribe.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://thoughts.bitsnbytes.in/p/building-your-first-agentic-system/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://thoughts.bitsnbytes.in/p/building-your-first-agentic-system/comments"><span>Leave a comment</span></a></p><p><br></p>]]></content:encoded></item><item><title><![CDATA[Does "Thinking" Always Help an LLM?]]></title><description><![CDATA[High reasoning wins more often, especially on structured, instruction-heavy tasks, but it can stumble on exact code/SQL/schema work.]]></description><link>https://thoughts.bitsnbytes.in/p/does-thinking-help-an-llm</link><guid isPermaLink="false">https://thoughts.bitsnbytes.in/p/does-thinking-help-an-llm</guid><dc:creator><![CDATA[Aravind]]></dc:creator><pubDate>Mon, 01 Dec 2025 17:26:49 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!wgGu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wgGu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wgGu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wgGu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1654665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/180418977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wgGu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wgGu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c513fc-e8f7-4dc8-b056-eac915bf0990_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>TL;DR</strong></p><ul><li><p>I ran ~100 hard, instruction-heavy tasks on the same base model (gpt-5.1) with three reasoning settings: none, low, and high. The high-reasoning variant had the highest average score and was picked best about half the time.</p></li><li><p>The biggest gains are on structured, multi-step, format-constrained tasks (analysis plans, comms, structured writing, accessible frontend). 
On precise code, SQL, and schema tasks, extra reasoning sometimes adds small syntax or logic errors.</p></li><li><p>Use high reasoning for complex, instruction-heavy prompts, and always pair it with tests or validation when the output must be executable (code, SQL, schemas).</p></li><li><p>All data, prompts, and code are shared at the end of the article</p></li></ul><h2>Why I Ran This</h2><p>I wanted to understand a simple question:</p><p><strong>If I only change the reasoning setting on a model, does quality actually improve, and where?</strong></p><p>Using the same base model (gpt-5.1), I varied only <code>reasoning.effort</code>:</p><ul><li><p><code>model_1</code>: reasoning none</p></li><li><p><code>model_2</code>: reasoning low</p></li><li><p><code>model_3</code>: reasoning high</p></li></ul><p>I then checked:</p><ol><li><p>Does higher reasoning effort improve instruction following and correctness?</p></li><li><p>Is the benefit uniform across task types, or concentrated in a few?</p></li></ol><p>This was partly inspired by the Apple paper &#8220;The Illusion of Thinking&#8221; and by watching ChatGPT &#8220;think&#8221; at length on prompts where it did not seem to help.</p><h2>Task Setup in Brief</h2><p>I asked gpt-5.1 with high reasoning to generate all tasks, with instructions to make them:</p><ul><li><p>Instruction-heavy, multi-step, and non-trivial</p></li><li><p>Focused on realistic domains like SQL and analytics, Python jobs, frontend components, communications, docs and schemas, and planning or writing</p></li><li><p>Strict about format, for example &#8220;return only code blocks, JSON, or ordered sections&#8221;</p></li><li><p>Full of edge cases and explicit constraints, and to avoid simple or one-line tasks</p></li></ul><p>I deduped by <code>(category, task_name, prompt)</code> and ended up with 98 tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank"
href="https://substackcdn.com/image/fetch/$s_!3bHT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3bHT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3bHT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77111,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/180418977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3bHT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!3bHT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F059f71d8-8cc7-4b30-93e5-df0d2b9de429_2400x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How I Evaluated</h2><p>For each task:</p><ol><li><p><strong>Answer generation</strong><br>I called gpt-5.1 three times with the same prompt, once with each reasoning setting, and stored the outputs as <code>model_1</code>, <code>model_2</code>, and <code>model_3</code>.</p></li><li><p><strong>Scoring</strong><br>Another gpt-5.1 instance (a different LLM could be used as the grader here), acting as a strict grader, scored each answer from 0 to 10.<br>Scores were penalized for:</p><ul><li><p>Wrong format (for example, extra text when the prompt asked for &#8220;JSON only&#8221;)</p></li><li><p>Broken ordering or missing required sections</p></li><li><p>Clear logical or syntax errors</p></li></ul></li><li><p><strong>Best pick</strong><br>The grader then saw all three answers and had to pick <strong>one</strong> best model, no ties. 
This gives a &#8220;win rate&#8221; per model in addition to average scores.</p></li></ol><p>I also broke down results by broad category (sql, python, frontend, analysis, communication, schema, docs, planning, writing, and so on).</p><h2>Results</h2><p>Across all 98 tasks:</p><ul><li><p><strong>Average scores</strong></p><ul><li><p>model_1 (none): 7.52</p></li><li><p>model_2 (low): 8.19</p></li><li><p>model_3 (high): 9.09</p></li></ul></li><li><p><strong>Best model picks</strong></p><ul><li><p>model_1: 22.4% of tasks</p></li><li><p>model_2: 28.6% of tasks</p></li><li><p>model_3: 49.0% of tasks</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G61i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G61i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 424w, https://substackcdn.com/image/fetch/$s_!G61i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 848w, https://substackcdn.com/image/fetch/$s_!G61i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!G61i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!G61i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png" width="438" height="328.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:438,&quot;bytes&quot;:37286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/180418977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G61i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 424w, https://substackcdn.com/image/fetch/$s_!G61i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 848w, https://substackcdn.com/image/fetch/$s_!G61i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!G61i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F624adc8e-11d0-4652-851f-80e10aabd197_1600x1200.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>model_3 wins most often, but the other two still win a meaningful fraction of tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6tDJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6tDJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 424w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 848w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6tDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png" width="518" height="388.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:518,&quot;bytes&quot;:32557,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/180418977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6tDJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 424w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 848w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!6tDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e3064-9a99-4b67-9a62-48138e9c8bde_1600x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Where High Reasoning Helps Most</h2><p>Looking by category, in this experiment, high reasoning shines where tasks are <strong>multi-step and format constrained</strong>:</p><ul><li><p>Analysis plans and analytics writeups</p></li><li><p>Communications with strict sections (for example, incident updates or executive summaries)</p></li><li><p>Structured planning and writing tasks</p></li><li><p>Accessible frontend components that combine behavior, structure, and constraints</p></li></ul><p>In these cases, model_3 often:</p><ul><li><p>Uses all required sections and ordering</p></li><li><p>Respects &#8220;only output X format&#8221; instructions</p></li><li><p>Covers more edge cases or reasoning steps that the other variants miss</p></li></ul><p>The category score lift plot makes this visible:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!levV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!levV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 424w, 
https://substackcdn.com/image/fetch/$s_!levV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!levV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!levV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!levV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png" width="570" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:570,&quot;bytes&quot;:91644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/180418977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!levV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!levV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!levV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!levV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9264caae-1b46-4a7e-90b2-138af8762761_2000x2000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Bars to the right are categories where high reasoning clearly adds value. For example, analysis, communication, planning, resume, and philosophy all see a strong lift.</p><h2>Where High Reasoning Falls Short</h2><p>High reasoning did <strong>not</strong> win everywhere.</p><p>Several categories demand <strong>exactness</strong>:</p><ul><li><p>SQL queries</p></li><li><p>JSON schemas</p></li><li><p>Some Python snippets</p></li></ul><p>There are cases where model_3 produced a rich, well-explained answer with one small syntax or logic error, while model_2 or model_1 produced something simpler but correct enough to win under the rubric.</p><p>Some patterns:</p><ul><li><p>In SQL-related tasks, model_2 actually won slightly more tasks than model_3, even though model_3 had the best average score.</p></li><li><p>In schema-related tasks, wins were roughly split between model_2 and model_3.</p></li><li><p>In a couple of writing tasks that demanded a very strict format, model_1 won because it simply followed the format while the higher-reasoning variants tried to be more creative.</p></li></ul><p>The lesson is simple:</p><blockquote><p>Extra &#8220;thinking&#8221; can still introduce small mistakes on deterministic tasks where a missing comma or a wrong join breaks the whole thing.</p></blockquote><p></p><p><strong>Limitations of this exercise</strong>: I used the same model family (gpt-5.1) to grade, took only one sample per model per task, and had 98 tasks with some thin categories, so results are directional and do not account for cost or latency.</p><p>Results and code are posted here: <a 
href="https://github.com/Aravgit/reasoning_test">Reasoning Test</a>]]></content:encoded></item><item><title><![CDATA[Learnings from implementing Multi-Agent Systems & Sequential Workflows]]></title><description><![CDATA[Part of my learning experience to understand agentic systems and how Deep-Research works]]></description><link>https://thoughts.bitsnbytes.in/p/learnings-from-implementing-multi</link><guid isPermaLink="false">https://thoughts.bitsnbytes.in/p/learnings-from-implementing-multi</guid><dc:creator><![CDATA[Aravind]]></dc:creator><pubDate>Fri, 29 Aug 2025 13:25:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rH2m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I wanted to replicate the <strong>deep research</strong> pipeline to learn more about agents for use cases like automating market research reports. 
Instead of relying on existing frameworks like Agents SDK, I decided to implement agents myself to understand the concepts better and also have control over state and storage.</p><h3>My Initial Multi-Agent Approach</h3><p>I built a single <strong>master orchestrator agent</strong> that analyzed incoming messages and routed them to specialized agents for:</p><ul><li><p>Web search</p></li><li><p>Tool calls</p></li><li><p>Analysis</p></li><li><p>Summary</p></li><li><p>Chat</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rH2m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rH2m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 424w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 848w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 1272w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rH2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png" width="1456" height="1233" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1233,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/172168157?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rH2m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 424w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 848w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 1272w, https://substackcdn.com/image/fetch/$s_!rH2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae28c325-1d51-402f-aaa3-e19a90df1b68_3840x3251.png 
1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This worked at first, but revealed some issues as the context grew:</p><ul><li><p>After 5&#8211;10 messages, the orchestrator started sending queries to the chat agent instead of the analysis agent, likely due to memory handoff issues. 
</p></li><li><p>Tool call accuracy dropped to 50&#8211;60% after 10 messages.</p></li><li><p>It would call the web search tool 3 times for the same question.</p></li><li><p>The orchestrator struggled to keep track of what it was doing, a common limitation of simple master-agent setups without structured memory.</p></li><li><p>Despite extensive prompt engineering, the fundamental issues persisted.</p></li></ul><p></p><h3>Moving to Sequential Workflows</h3><p>To overcome these issues, I rebuilt the system with a <strong>deterministic, stage-based workflow</strong>:</p><ol><li><p>Clarification check - asks a clarifying question after assessing the user request</p></li><li><p>Topic generation - generates topics to analyze based on the user query</p></li><li><p>Question generation - generates 5&#8211;10 questions related to each topic for the user query</p></li><li><p>Answer research - a web-search-enabled LLM request answers each question</p></li><li><p>Analysis on data - an LLM request analyzes data uploaded by the user in the context of the query</p></li><li><p>Summary generation - a final summary covering all topics</p></li><li><p>Chat - conversation over the summary and data generated during the analysis</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZudF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZudF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZudF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 848w, https://substackcdn.com/image/fetch/$s_!ZudF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!ZudF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZudF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png" width="1456" height="2155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2155,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381071,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thoughts.bitsnbytes.in/i/172168157?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ZudF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 424w, https://substackcdn.com/image/fetch/$s_!ZudF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 848w, https://substackcdn.com/image/fetch/$s_!ZudF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!ZudF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc77db24-0dd9-4db2-94c0-30f912b5aa8b_2594x3840.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Benefits of this architecture</strong>:</p><ul><li><p>Each stage is independent and resumable (if stage 3 fails, retry just that stage), with proper handoff and state.</p></li><li><p>A defined scope reduced hallucinations: each LLM call has one specific job.</p></li><li><p>Deterministic tool use: whether a tool is called is defined by the stage, so there is no agent confusion.</p></li></ul><h3>Key learnings for complex workflows</h3><ul><li><p>No method is absolutely perfect: for well-defined workflows, let code enforce orchestration, retries, and state, while agents handle judgment tasks like clarification, research, and synthesis.</p></li><li><p>For reliability, make tool calls stage-driven, not agent-decided. Yes, this means unnecessary web searches when entering the "research" stage, but the trade-off is worth it for predictable, reliable behavior.</p></li><li><p>Agents work better when each stage's scope is narrow and state is saved, so errors are contained and retries affect only that step, not the whole workflow.</p></li></ul><p><em>Next target: implementing the same deep-research workflow using the Agents SDK to explore structured memory, state persistence, and more reliable task handoff.</em></p><div><hr></div><p>Interested in reading more content on AI systems and the startup journey? 
Subscribe to <a href="https://thoughts.bitsnbytes.in/">thoughts.bitsnbytes.in</a></p>]]></content:encoded></item><item><title><![CDATA[Speech Unleashed: Talk to AI in any language]]></title><description><![CDATA[How to build a voice-based custom LLM pipeline that works for any language and gives more accurate responses with fewer hallucinations]]></description><link>https://thoughts.bitsnbytes.in/p/building-jarvis-a-speech-to-speech-ai-agent-1dcd8f144075</link><guid isPermaLink="false">https://thoughts.bitsnbytes.in/p/building-jarvis-a-speech-to-speech-ai-agent-1dcd8f144075</guid><dc:creator><![CDATA[Aravind]]></dc:creator><pubDate>Tue, 26 Mar 2024 05:32:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3823dffc-7c0c-4c07-8851-399334a90ee5_800x457.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lj73!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lj73!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Lj73!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Lj73!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Lj73!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lj73!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lj73!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Lj73!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Lj73!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Lj73!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6da84f-7fff-4a93-8e4e-7e6bc26ad771_800x457.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Image generated by Dall-E by OpenAI with a prompt from the&nbsp;author</figcaption></figure></div><h4><em>How to build a voice-based</em> <em>custom LLM pipeline that works for any language and gives more accurate responses with fewer hallucinations</em></h4><p>When I watched the <a href="https://www.hume.ai/">Hume AI</a> demo, I tested it for Indian languages,it got the emotion levels correct, but the output was gibberish in a few cases. I then tested the same with ChatGPT using voice mode, it worked for some questions and did not for some (eg: Fig 1).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5qOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5qOG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 424w, https://substackcdn.com/image/fetch/$s_!5qOG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 848w, https://substackcdn.com/image/fetch/$s_!5qOG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5qOG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5qOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5qOG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 424w, https://substackcdn.com/image/fetch/$s_!5qOG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 848w, https://substackcdn.com/image/fetch/$s_!5qOG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5qOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fadb5ea-092b-4335-9f71-c99f718792b4_691x557.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"><strong>Fig 1: ChatGPT responses in Hindi using its voice&nbsp;mode</strong></figcaption></figure></div><p>I then tested translating the speech into English, and the responses were <a href="https://chat.openai.com/share/e/38d9d440-cad7-4cba-aac0-6fb99306137f">more accurate, with fewer hallucinations</a>. To automate this process, I started building a pipeline that translates any language to English and answers the user's question in the same source language. This pipeline can be used in a voice-based RAG system, where user questions are queried against a custom knowledge base.</p><p>This article explains the main elements and code required to build this pipeline. Most of the components used here are open-source models (except for the GPT-3.5 Turbo model used as the LLM), making it easy to replicate. 
This article assumes you are familiar with large language models (LLMs), Retrieval-Augmented Generation (RAG), RAG frameworks (like Llama-index), and vector databases (like Qdrant), and how they work.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rmVa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rmVa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 424w, https://substackcdn.com/image/fetch/$s_!rmVa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 848w, https://substackcdn.com/image/fetch/$s_!rmVa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 1272w, https://substackcdn.com/image/fetch/$s_!rmVa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rmVa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image showing flow of the tool, explaining how voice recorded is translated using Meta Seamless M4T and then LLM answers the user question, and responds back in users source language&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image showing flow of the tool, explaining how voice recorded is translated using Meta Seamless M4T and then LLM answers the user question, and responds back in users source language" title="Image showing flow of the tool, explaining how voice recorded is translated using Meta Seamless M4T and then LLM answers the user question, and responds back in users source language" srcset="https://substackcdn.com/image/fetch/$s_!rmVa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 424w, https://substackcdn.com/image/fetch/$s_!rmVa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 848w, https://substackcdn.com/image/fetch/$s_!rmVa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rmVa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdf95385-90bc-48a0-8d83-ffe0b87a4c67_800x368.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption"><strong>Fig 2: Answer voice questions using LLMs and convert back to&nbsp;speech</strong></figcaption></figure></div><p>TL;DR: The input is a speech recording, which is sent to the Seamless M4T model for accurate translation to English. Once translated, it is sent to an LLM that answers the question. The response is converted to audio in the source language and played back to the user. The pipeline has three steps.</p><ul><li><p>Step 1: Pre-process the recording (resample it to the rate the model expects) so speech translation works. Send the request to translate the recording using the M4T models and identify the source language using the Whisper models.</p></li><li><p>Step 2: Query the knowledge base (a vector database) with the translation to retrieve the required information, then send the text to the LLM to answer the question.</p></li><li><p>Step 3: Translate the answer to the source language and play the audio output to the user.</p></li></ul><p>Let's start building&#8230;</p><h4>Translating the speech using Seamless M4T&nbsp;models</h4><p>The first element of the pipeline is translating the question into English. This is done because LLM response accuracy is higher when the language is English. To translate the speech into English, we use Meta&#8217;s <a href="https://colab.research.google.com/drive/1gIyG8AUtc0en-aIF2JbJQMo7tRV404W5">Seamless M4T</a>.</p><p>The first step is to convert the sample rate to 16,000 Hz using the librosa library, because Seamless M4T models are trained on 16,000 Hz audio.</p><pre><code>import librosa
import soundfile as sf  # needed for sf.write below

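# Note: adjust_speed_librosa below changes playback speed; the resampling to
# 16 kHz described in the text can be done directly in librosa.load. A minimal
# sketch (my addition, not from the article's repo; assumes the soundfile
# package is installed):
import librosa
import soundfile as sf

def resample_to_16k(audio_path, output_path, target_sr=16000):
    # librosa resamples while loading when an explicit sr is passed
    y, sr = librosa.load(audio_path, sr=target_sr)
    sf.write(output_path, y, target_sr)
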
def adjust_speed_librosa(audio_path, output_path, speed_factor):
    y, sr = librosa.load(audio_path, sr=None)
    y_fast = librosa.effects.time_stretch(y, rate=speed_factor)
    sf.write(output_path, y_fast, sr)</code></pre><p>Then we send the resampled recording for translation. We set up a Python application using FastAPI (script m4t_app.py, shared at the end of the article) to host the M4T model on a GPU; this application processes our requests.</p><p>Seamless M4T is multimodal: it can convert text to speech or text, and likewise speech to speech or text. The source language is required as input for text translations; for speech translations it is not. For this article, we will focus on speech as input.</p><pre><code>def send_file_for_translation(input_data, mode, tgt_lang, src_lang):
    if mode == "t2st" or mode == "t2tt":  # text to speech translation or text to text translation
        data = {
            'input_data': input_data,  # this is the text input
            'mode': mode,
            'tgt_lang': tgt_lang,
            'src_lang': src_lang
        }
        response = requests.post(API_URL, data=data)

    else:  # mode == "s2tt" for speech to text translation
        with open(input_data, 'rb') as f:  # here input_data is the file path
            files = {'file': (input_data, f, 'audio/wav')}
            data = {
                'mode': mode,
                'tgt_lang': tgt_lang,
                'src_lang': src_lang
            }
            response = requests.post(API_URL, files=files, data=data)
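    # Fail fast on HTTP errors before parsing the body (my addition, not part
    # of the original pipeline; assumes the m4t_app service returns JSON only
    # on success):
    response.raise_for_status()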
    response_json = response.json()  # Parse the JSON response
    if mode in ["s2tt", "t2tt"]:
        return response_json.get('translated_text', None)
    elif mode == "t2st":
        return response_json.get('audio_link', None)</code></pre><p>Behind the scenes, when the request is sent, the Seamless M4T model first translates the input. In later steps of the process, we convert the output to speech, which requires the source language. Seamless M4T does not have built-in <a href="https://github.com/facebookresearch/seamless_communication/issues/206">source language detection</a>, so we use OpenAI&#8217;s Whisper model to get the source language (why we don't use Whisper for both translation and language detection is a story for another time). Both the translated content and the source language are sent back in the response.</p><pre><code>async def translate_audio(file: UploadFile = File(None),
                          input_data: str = Form(None), 
                          mode: str = Form(...),
                          tgt_lang: str = Form(...),
                          src_lang: str = Form(None),
                         model:str = Form(None)):
    
    if mode == "s2tt" and file:  # speech to text translation and file is provided
        temp_file = _save_temp_file(await file.read())
        input_path = temp_file
    elif mode == "t2st" and input_data:  # text to speech translation and text is provided
        input_path = input_data
    elif mode == "t2tt" and input_data:
        input_path = input_data
    else:
        raise ValueError("Invalid mode or input not provided!")

    # Set up arguments for m4t_predict
    output_filename = "output_t2st.wav" 
    if mode in ['t2st','s2tt']:
        if mode == "t2st":
            output_filename = "output_t2st.wav" 
        elif mode == "s2tt":
            output_filename = f"output_{file.filename}"
        args = Namespace(
            input=input_path,
            task=mode,
            tgt_lang=tgt_lang,
            src_lang=src_lang,
            output_path=os.path.join(UPLOAD_DIR, output_filename),
            model_name="seamlessM4T_large",
            vocoder_name="vocoder_36langs",
            ngram_filtering=False)
    else:
        args = Namespace(
            input=input_path,
            task=mode,
            tgt_lang=tgt_lang,
            src_lang=src_lang,
            model_name="seamlessM4T_large",
            vocoder_name="vocoder_36langs",
            ngram_filtering=False)


    # Call the prediction function
    translated_text = m4t_predict(args)

    if (mode == "s2tt"):
        model_w = whisper.load_model("large")
        
        audio = whisper.load_audio(input_path)
        audio = whisper.pad_or_trim(audio)

        # make log-Mel spectrogram and move to the same device as the model
        mel = whisper.log_mel_spectrogram(audio).to(model_w.device)
        # detect the spoken language
        _, probs = model_w.detect_language(mel)
        print(f"Detected language: {max(probs, key=probs.get)}")
        src_lang = max(probs, key=probs.get)
        if file:
            os.remove(temp_file)
        
        return {"translated_text": translated_text,
                'src_lang' :src_lang} 
    elif mode == 't2tt':
        if file:
            os.remove(temp_file)
        return {"translated_text": translated_text} 
        # Use translated_text
    else:  # mode == "t2st"
        print(f"./{os.path.basename(args.output_path)}")
        return {"audio_link": f"./{os.path.basename(args.output_path)}"}</code></pre><h4><strong>Use RAG to add information to answer the translated question</strong></h4><p>Once we have the translated text, the next step is to send the question to an LLM along with the required information. This can be achieved by setting up a RAG system using frameworks like Llama-index, LangChain, or even the OpenAI Assistants API. A Retrieval-Augmented Generation (RAG) system combines LLMs with a private knowledge base and retrieves the information relevant to a specific input/question. I have used <a href="https://docs.llamaindex.ai/en/stable/">Llama-index</a> with the <a href="https://qdrant.tech/documentation/quick-start/">Qdrant</a> vector database, adding text data related to a business. The translated question is sent as text input. Llama-index computes the cosine similarity of the vectors, ranks them, and sends the top-k nodes to the LLM to answer the question.</p><pre><code>def get_llm_response(question):
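    """Query the Llama-index RAG pipeline and return the LLM's answer as text.

    Note: this targets older llama-index and openai releases (hence
    ServiceContext and LLMPredictor); pin the versions from the requirements
    file. `hostname` and `SYSTEM_MESSAGE` are assumed to be defined elsewhere.
    """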
    count = 3
    mode = 'compact'
    async_mode=False
    llm_predictor = LLMPredictor(AzureChatOpenAI (deployment_name='gpt-35-turbo',model='gpt-3.5-turbo', 
                                             temperature=0,max_tokens=256,
                                             openai_api_key=openai.api_key,openai_api_base=openai.api_base,
                                             openai_api_type=openai.api_type,openai_api_version='2023-05-15',
                                             ))
    embeddings = LangchainEmbedding(OpenAIEmbeddings( deployment="text-embedding-ada-002",model="text-embedding-ada-002",
                                                    openai_api_base=openai.api_base,openai_api_key=openai.api_key,
                                                    openai_api_type=openai.api_type,openai_api_version=openai.api_version),
                                    embed_batch_size=1,)    
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor,embed_model=embeddings)

    client = qdrant_client.QdrantClient(url=f"{hostname}:6333")
    vector_store = QdrantVectorStore(client=client, collection_name=f"collection_name")
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store,service_context=service_context)
    response_synthesizer = get_response_synthesizer(response_mode=mode, use_async=async_mode,service_context=service_context, text_qa_template = SYSTEM_MESSAGE)
    query_engine = index.as_query_engine(response_synthesizer=response_synthesizer, similarity_top_k=count)
    response = query_engine.query(question)
    return response.response</code></pre><h4>Convert the text to speech using Google's text-to-speech synthesizer</h4><p>The last part of the pipeline is converting the LLM's text answer to speech. The English answer from the previous step is sent to the translation model to convert it into source-language text (another way to achieve this is to instruct the LLM to answer in the source language). We can skip this step if the source language is English.</p><p>We use Google&#8217;s gTTS library, which converts text to speech when the language is specified. We use the source language detected by OpenAI's Whisper model and convert the text into speech, which is played to the user.</p><pre><code>from gtts import gTTS
import IPython.display as ipd

def text_to_audio(text, lang=gtts_lang_input):
    """Converts given text to audio and plays it."""
    tts = gTTS(text=text, lang=lang, slow=False)
    tts.save("output_audio.mp3")
    return ipd.Audio("output_audio.mp3")</code></pre><h4>Things to consider while deploying this&nbsp;pipeline</h4><p>We now have a complete pipeline where users can ask a question and we can answer it in the same source language. A few things to consider in this pipeline:</p><ul><li><p>Both Seamless M4T and Whisper models are available in different sizes. A model's accuracy and processing time change with the size selected, so experimentation is required for each use case.</p></li><li><p>gTTS was used instead of M4T for text-to-speech because M4T's speech output required post-processing; gTTS's output was clear and needed no further processing.</p></li><li><p>Seamless M4T is multimodal: it supports S2ST, S2TT, T2ST, and T2TT, so we need to specify the task type for the model. It can translate 100 languages and it's an open-source model.</p></li><li><p>This article is based on the Seamless M4T models released in Aug 2023. Since I drafted this article (and did not publish it for a long time), Meta has released the V2 model, which has similar functionality but improved accuracy.</p></li><li><p>The code is based on older versions of llama-index and OpenAI. Both packages have had major breaking changes since then. I've included the versions in the requirements file.</p></li></ul><p>I've shared all the code used here in this <a href="https://github.com/Aravgit/Speech_toLLM_to_Speech">git repo</a>.</p>]]></content:encoded></item></channel></rss>