CodeFlow: Visualizing Complicated Code

Community Article Published June 12, 2026

I created CodeFlow (https://huggingface.co/spaces/build-small-hackathon/CodeFlow). This project comes from a simple concept: take code, make a cool flowchart.

This blog will document my approaches, failures, learning moments, and other relevant interesting things that happened in this Hackathon.

Before I begin, I would like to note that this writing is not AI-generated. It is rather distasteful to use AI on a written blog post meant for conveying what one learned and the details on what one built.

With that in mind, let's get into it.

What Is This Thing?

I intended on creating an app that just takes some code in and creates a flowchart. You'd paste in some code, let a model do its thing, and then read the resulting code as a flowchart all on the CPU. Why on the CPU? Well, mainly because I didn't realize we had access to ZeroGPU until I'd already created a functional CPU version of this app.

I have a couple realistic use-cases in mind, including a personal one:

AI-generated code. We can't just trust the LLM 24/7. At some point, we have to review the AI's code, whether because it may contain an error or because we fear it hallucinating. But how do we review code that we have only had a short prompt's worth of thinking about? Flowchart it!
Open-source code. Maintainers are hard-pressed to deal with legions of pull requests (PRs). How can a maintainer reliably inspect new code in a PR before merging it? Why, they could flowchart it!
Understanding codebases. I was inspecting CleanRL the other day to understand its logic for implementing PPO and how I could create GRPO from that implementation. Safe to say, I was in over my head, and I really could not understand how collecting rollouts worked. So, I flowcharted it!

I was also chasing a couple badges. Mainly, Off-The-Grid, Off-Brand, Agentic Tracing, Llama-CPP, and Field Notes.

The Concept

I settled on having an LLM act as a transpiler. A code model should be used over a general one for the best results. I used llama.cpp for inference-mode efficiency.

I installed llama-cpp-python with prebuilt wheels to avoid timeouts from the C++ compilation of llama-cpp. I initially used Qwen2.5-Coder-7B-Instruct-GGUF (https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF), but later changed the model to a more powerful one.

A very simple version of this app would consist of:

Use HF Hub + llama-cpp to download and create the LLM locally.
Create a generate_mermaid function that runs a user's input through the LLM. The system prompt is also in this function.
Create a basic Gradio interface using gr.Blocks() and gr.Code().
Create a function to display the Mermaid.js code that the LLM outputs.

I've essentially constructed that, but a bit more complicated: a special UI, animations for creating the graph, example inputs to the model, and the ability to download the flowchart after generation.

More on this in "The First Working Architecture?..."

The First Working Architecture?...

The plan was described above. Here's how I created the first working architecture.

First, using llama-cpp and HF Hub, I downloaded the model. A core roadblock was that llama-cpp-python needs C++ compilation, causing a build timeout on Spaces. To resolve this, I used pre-built wheels.

An alternative solution would be to make your own wheel using GitHub Action, but that would take a while.

But how did I decide on what model to use? This is actually pretty interesting. I had three models I was eyeing: Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-Nemo-12B-Instruct.

Here are metrics on those models.

Metric / Benchmark	Qwen2.5-Coder-7B-Instruct	Llama 3.1 8B Instruct	Mistral-Nemo 12B Instruct
HumanEval (Python code generation)	88.4%	72.6%	77.4%
MBPP (Multi-language coding tasks)	83.5%	71.4%	73.2%
LiveCodeBench (Real-world logic & reasoning)	37.6%	25.1%	28.4%
Context Window	128K	128K	128K
GGUF Size (Q4_K_M)	~4.7 GB	~4.8 GB	~7.5 GB

As seen, Qwen beats its rivals whilst being smaller. Thus, I chose Qwen for the model.

Some more analysis than just that is required, of course:

Llama-3.1-8B is very conversational and good at following prose, but it fails relative to Qwen for complex coding.
Mistral-Nemo-12B is similar, but its large file size is a bigger issue.
Qwen2.5-Coder-7B-Instruct is trained with data focusing on syntax, logic, and layout. Additionally, it excels with reading nested conditional chains, complex iterations, and try-catch blocks.

I set temperature of the model to 0.1. Not strictly deterministic, but close to it. I want the model to have some variance in its interpretation of code. After all, if the user could not understand the code, there surely must be a complicating factor to the code that may require varying approaches.

The next thing is the generate_flowchart() function. I used llm create_chat_completion() in llama-cpp to create an interface for the model's response.

I also wrote the system prompt. I structured it as:

Role
Persona
Context/Goal
Strict Constraints
Banned vocabulary (we want strictly code output)
Response Workflow
Few-Shot Examples

I also added a reasoning channel. The model does parsing in a <thinking> tag before emitting the diagram. Reasoning gets stripped by regex before returning the mermaid code.

The full system prompt can be seen in app.py in the Space.

While testing, what I found is that the failure is not syntax or logic. Rather, it is formatting. Much of the system prompt is intended to fight the formatting problem using things like constraints, banned vocabulary, few-shot examples, and a reasoning channel. A potential error that I haven't yet encountered is a truncated generation not closing the <thinking> tag, and thus leaking the reasoning into the code and erroring when trying to generate the graph. This would happen if max_tokens was hit mid-thinking, and, at the expense of inference time, could be simply resolved by increasing the max_tokens value (make sure it does not increase to over the n_ctx values though).

Next, I created a custom frontend. I used gr.Server() instead of gr.Blocks() for this. My way of using gr.Server() was by creating the app HTML and then mounting it to the server. Thus, I set app = gr.Server(title="Code-to-Flowchart Generator")

However, I caught an error. As it turns out, gr.Server() is a FastAPI subclass, not a variant on gr.Blocks()! Thus, I needed to register my functions with the app in order to use them. Specifically, I used @app.api(...) to register my generate_flowchart() function.

Once I created the HTML, the first working architecture was complete.

Or so I thought.

Debugging the First Working Architecture

To debug, I first tested the architecture on a couple code snippets.

Basic if/else

def check_status(val):
    if val > 10:
        return "Active"
    else:
        return "Inactive"

Loop & accumulation

def sum_positives(items):
    total = 0
    for x in items:
        if x > 0:
            total += x
    return total

Nested branches & early returns

def grade(score):
    if score < 0 or score > 100:
        raise ValueError("out of range")
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    return "F"

While loop

def find_first_even(nums):
    i = 0
    while i < len(nums):
        if nums[i] % 2 != 0:
            i += 1
            continue
        if nums[i] == 0:
            break
        return nums[i]
    return None

Try/except/finally

def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        result = None
    finally:
        print("done")
    return result

Multi-function & nested loop

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def primes_up_to(limit):
    result = []
    for n in range(limit):
        if is_prime(n):
            result.append(n)
    return result

I also temporarily added print statements into the generate_flowchart() function:

content = response["choices"][0]["message"]["content"]
print("FINISH REASON:", response["choices"][0].get("finish_reason"))
print("RAW CONTENT >>>", repr(content))
cleaned = re.sub(r'<thinking>.*?</thinking>', '', content, flags=re.DOTALL)
print("CLEANED >>>", repr(cleaned.strip()))

On the first test, nothing happened. Nothing was flowcharted and no error propagated. Why?

Well, I think it's because I missed a return type annotation. Forgot to add -> str to the generate_flowchart() function.

I really didn't think this would matter that much, but it broke the entire function of the app in the first place.

Basically,@app.api infers the output from the function's return type annotation. If there was no return type annotation, the output registered as nothing. So, the model ran and created Mermaid, but Gradio discarded it because the output isn't passed to the frontend for displaying it.

And then I got another problem! Instead of a great flowchart, I was shown a Mermaid error. The model output was perfect, but the renderer threw: Parse error on line 3: ...C[Return "Active"] ... Expecting 'SQE' ... got 'STR'`.

Why? Well, the model turned the code return "Active"

into the node label C[Return "Active"]

Mermaid can't parse normal text and a quoted string inside brackets. It read "Active" as a string token (also known as STR). Any input code with a string literal gets this problem. This problem isn't really a structural problem with the model so as it is an issue with sanitizing the model's outputs.

So, I implemented a couple fixes:

Regex to the rescue! It's rather complicated, but regex can solve most of these issues. More explanation on this for the later issues with node labels.
Prompting. I edited the few shot prompts and added "no double quotes in labels".

Ok, now the full end-to-end works on CPU. Let's go! That was the 6th of June, so you can imagine the rest of this project took a while because UI design took a while to find testers who would give me good feedback and because I had ideas for a couple extensions to the basic app.

Model Upgrade & A Roadmap (kind of)

I had a couple ideas for improving the first architecture:

Use a bigger model. We have up to 32B parameters: why not use them?
Enable inference on ZeroGPU. This is when I figured out we could use ZeroGPU.
Create a cool UI.

My Roadmap Was Wrong

Ok, so let's do all three points of the roadmap above.

I decided to swap the model to Qwen-3-Coder-30B-A3B (https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) using llama-cpp. The Mixture of Experts (MoE) layer has ~3B params active per token. That actually means this model would be stronger than the old Qwen-2.5-Coder-7B, but also faster because less parameters are active per token during inference (I think: I'm no expert in this). And a stronger model is a better app, hopefully.

I used a dynamically quantized version of this model (UD-Q3_K_XL — unsloth dynamic quant) to preserve quality in the MoE layer while also reducing overall model size.

However, I ran into an issue. I was considering using vLLM, but that requires a GPU. But the wheel I used for getting llama-cpp-python was for the CPU! Switching to ZeroGPU without a CUDA build for the llama-cpp-python would be a no-op because we'd still run inference on the CPU either way.

Additionally, ZeroGPU is built around Gradio Blocks, but this app's custom gr.Server() FastAPI subclass may not be able to perfectly integrate with ZeroGPU.

That was my thinking, and then I discovered some more, quite interesting information. vLLM and ZeroGPU just don't work together at all! vLLM calls torch.cuda.set_device at model load, but ZeroGPU mandates that CUDA must not be initialized in the main process. So, they're exclusive.

That same thing happens with llama.cpp and ZeroGPU. A CUDA-built llama.cpp also initializes CUDA in the main process, so ZeroGPU won't boot and it'll just be a no-op.

The solution to this is constructing the llm Llama object inside @spaces.GPU, but this is secretly a disastrous solution because it would reload the full model into VRAM every request, which causes inference time to increase dramatically.

Also, the 3.3B MoE of the new Qwen-3 makes CPU inference viable because of how little of the Qwen-3 model is being used per token. That could be faster than the previous dense 7B Qwen-2.5, while also being a much stronger coder.

Ok, with the upgraded model and the same device (CPU), let's run the tests on the model again.

First error I found was cross-request contamination. Allow me to elaborate:

If I submit code X, and then I submit code Y as a fresh submission, my flowchart of code Y is contaminated by code X and code X's flowchart. This isn't an issue with conversation history. Rather, the llm object keeps the llama.cpp's KV cache across requests.

The fix is pretty simple. We just add llm.reset() before each create_chat_completion() call.

However, this also trades off with latency by wiping the cache each run... And this horribly implicates testing. What if a success of a previous test was because of the code preceding it? Yikes.

My second error was parentheses and operators in node labels. Test four failed because the model gave C[i < len(nums)] as a node label. ( can't be inside a node label. The fix? Pretty similar to the past issue with node labels:

Change the system prompt. Added "paraphrase conditions into plain words — never put raw code, operators, quotes, or parentheses inside labels (write Index in bounds?, not i < len(nums))"
Regex! The sanitization function that runs through model outputs now wraps each label's body in double quotes. They're read as literal text to Mermaid, so any parentheses and other operators parse and don't cause an error.

And here came another error. Test 4 (again) created an error where:

The model put an array subscript into a label.
The sanitizer doubled-down, giving D[nums["i"] % 2 != 0] which hits the [ and errors.

Wrapping the label in quotes via regex can't work if a label constains brackets by construction. We have to keep code out of the labels from the source (the model).

So, I extended the system prompt to explicitly forbid including brackets/subscripts, and I added a paraphrase example. I also updated the relevant few-shot prompts.

Test again! Failed. As it turns out, we need regex. Changing the prompt alone didn't work because the model did correctly convert the operator expression into prose, but then (for some reason) it added the operator expression directly after the prose, causing an error. So, I updated the regex again to identify each node label and render the [ within a node label as a result of code (such as nums[i]) to be rendered as literal text and not a node within a node (which would error).

I won't bore you with more errors — the vast majority of them are syntax errors. But I had my agent compile a massive markdown file that details every step of this project and my resolution to various errors. Let me know if you want to see that.

All 6 test snippets now work. Let's go! This is a massive milestone.

UI

Ok, UI time. When I approached UI, I had to think a lot about how UI should be created. Here are my conclusions: When designing UI, it's crucial to not make it overly complicated. I imagine a judge reviewing my Space: if there are too many buttons, not enough instructions, and things look ugly, then I don't imagine I'd do that well.

Conversely, the best UIs are clean, polished, and simple with lots of free space. Filling the free space would complicate things too much.

The first UI I came up with utilized a cyan and violet color schema with a progress bar for the generation of the flowchart and a light and dark mode.

The second UI made it a bit brighter. My "cyan" color had drifted very close to pure darkness. I also increased the chart size to prevent charts from being tiny as the code's logic got more complex, and I added a custom color for the scroll bar.

The third UI I made added a "Scroll to view full chart" pill when the user hadn't scrolled to the bottom yet.

The fourth UI I made added new labels to the input boxes: the first box is labeled source code and allows you to choose the language. The second box is the flowchart box, where the flowchart is displayed.

This went on for a while. After some more thinking, this is the UI I settled on:

What is it? Well, it utilizes a light-green-ish color schema. There is a nice instructions box in the top left corner that is meant to stand out with its dark green highlight. There is an arrow to indicate inputs -> outputs. There's also small buttons that allow you to download the PNG/SVG of the flowchart and the Mermaid code.

Oh, and did I mention the animations? They're so cool!

Basically, once the model is finished processing, the flowchart will draw itself starting from the start node.

Connections will spread from the start node, eventually leading to more nodes with their own connections that spread. It's very mesmerizing at a large scale, and slow enough such that the user can experience and admire the animation while also not being held up for too long.

Also, the chart automatically scrolls down so the entire animation is within view. How cool is that!?

I also added CodeMirror 6 to make it so the input for the source code feels like an actual code editor.

Lastly, I added the ability to move your cursor over a node and have that node's logic in the code be highlighted. In other words, you can hover over parts of the flowchart to see the corresponding section of your code."

Meeting Off-The-Grid

So I might've messed up somewhere in the process of creating the UI. Specifically, the inference was local but the frontend pulled from a Content Delivery Network (CDN) at runtime for CodeMirror, jsDelivr, and Google Fonts.

How did I resolve this? Well, I vendored (meaning: taking the assets and copy pasting them into my repo) all the assets. It wasn't too big, just about ~4 MB and definitely worth it for staying off the grid.

Agent Traces

You can view them at https://huggingface.co/datasets/build-small-hackathon/codeflow-agent-traces. All I did was add a button for downloading the agent traces, and then ran these ten inputs and downloaded the traces.

Two-branch conditional

def grade(score):
    if score >= 60:
        return "pass"
    else:
        return "fail"

Elif chain

def classify(temp):
    if temp < 0:
        return "freezing"
    elif temp < 15:
        return "cold"
    elif temp < 25:
        return "mild"
    else:
        return "hot"

For loop with accumulator

def sum_positives(nums):
    total = 0
    for n in nums:
        if n > 0:
            total += n
    return total

While loop with early break

def first_even(nums):
    i = 0
    while i < len(nums):
        if nums[i] % 2 == 0:
            return nums[i]
        i += 1
    return None

Nested loops

def find_target(grid, target):
    for row in grid:
        for cell in row:
            if cell == target:
                return True
    return False

Recursion

def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

Try/except/finally

def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        return None
    finally:
        print("done")
    return result

Guard clauses

def withdraw(account, amount):
    if amount <= 0:
        return "invalid amount"
    if account.frozen:
        return "account frozen"
    if amount > account.balance:
        return "insufficient funds"
    account.balance -= amount
    return "ok"

Retry loop (JavaScript)

function fetchWithRetry(url, maxTries) {
  let attempt = 0;
  while (attempt < maxTries) {
    if (tryFetch(url)) {
      return "success";
    }
    attempt += 1;
  }
  return "gave up";
}

State-machine switch (JavaScript)

function nextState(state, event) {
  switch (state) {
    case "idle":
      return event === "start" ? "running" : "idle";
    case "running":
      if (event === "stop") return "idle";
      return "running";
    default:
      return "idle";
  }
}

What I Learned

I learned a lot, mainly the following:

One symptom does not mean one cause. I can change something to fix a bug but that doesn't guarantee success. If anything, my trouble with node label formatting just shows that it could create further errors.
Verify APIs and syntax/formatting guidelines before using them. Otherwise, I'll get into another "you missed the return annotation"-like problem.
I learned a lot about quantization, vLLM, and Gradio.

That's all from me. I spent hours writing and reviewing my work to create this post. Thank you for reading this!