
My Adventures in Vibe Coding Tools — The Pain of Versioning (Part 2)


I’m a tech person—not the type who grinds out code all day, but not the type who never touches it either. For me, coding is pure interest; I’m not looking to make a full living out of it, just looking to find some joy in the process of exploration. As my handle suggests, my writing style will be a "raw log"—because I believe the exploration of technology is always a journey. If anyone cares about the process, a slightly refined "exploration log" is often more valuable than a polished summary.

I’m Hurting—So What’s the Move?

After getting burned by versioning issues in that interview and the struggles I mentioned before, I knew I had to change something. Having built a few small AI apps, I knew the solution: I needed to feed the AI the information it was missing—specifically, the latest documentation and code samples. But how do I present that to the AI? My first thought was MCP (Model Context Protocol). If the AI realizes it doesn't know something during the Vibe Coding process, it should be able to go out, find the relevant new features or code samples, and then code against those examples. MCP is essentially the interface that lets the AI call external tools at runtime.
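
For anyone who hasn't touched MCP: an MCP server is just a small process that exposes named tools the model can call at runtime. As a minimal sketch, here is roughly what one looks like using the official Python SDK's FastMCP helper (the tool body is a placeholder; I hadn't actually built anything at this point):

# pip install mcp  (the official Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-search")

@mcp.tool()
def search_docs(query: str) -> str:
    """Return relevant documentation snippets for a query.
    (Placeholder body; the real lookup comes later in this post.)"""
    return f"No index built yet for: {query}"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so an AI IDE can call it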

That led to two questions:

  1. Where do I find these features and code samples?

  2. Once I have this info, how do I serve it via MCP?

I consulted Gemini, and here is the verdict: currently, there isn't a one-stop shop, but there are ways to do it if you split it into two steps. For the first question—latest features and samples—they usually exist in the library’s docs or auto-generated documentation. But scraping them is the real hurdle. Let’s talk about that.

Scraping the Documentation

Gemini suggested two main paths: SaaS-based scrapers or local scraping libraries, both of which can convert pages into Markdown. I tested two:

  1. The SaaS Route: Firecrawl
  • The free tier exists, but it’s limited to about 500 pages a month.

  • It supports real-time AI crawling, but that’s an extra cost.

  • It supports "MCP + Real-time AI Crawl," but doesn't seem to support "MCP + Pre-scraped content" out of the box.

  • Firecrawl is open-source and can be self-hosted via Docker, but the build process for Docker Compose is painfully slow and full of errors (I’m on an aging Mac that can't be updated anymore—I'm not going to torture myself). I gave up on this after one try.

  2. The Local Route: Crawl4ai
  • This is a Python library that supports various crawling methods and filtering. It works great and, most importantly, it’s free. We’ll look at the results in a bit.

  • It's very easy to use: I just had the AI write a call script, gave it a root directory for the docs and a prefix filter, and I was good to go (a minimal example follows right after this list).
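
To show how little code the simplest case takes (fetching one page and converting it to Markdown, no link following), here is a minimal sketch; the URL is just an example, and the full Colab script I actually used comes further down:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Fetch a single page and convert it to Markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.langchain.com/oss/javascript/langchain/models")
        print(result.markdown[:500])  # preview the converted Markdown

asyncio.run(main())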

The AI told me that Cursor's built-in scraping is likely powered by Firecrawl, so the hosted SaaS version is probably top-tier. I tried using it to crawl some LangChain docs, and the quality was solid. But the credits vanish fast—without paying, it's basically a non-starter for large doc sets.
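
For reference, a single-page scrape through the hosted API looks roughly like this. This is a sketch based on the v1 firecrawl-py SDK; method names and the return shape may differ in newer releases:

# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")  # your Firecrawl API key

# Request one docs page converted to Markdown
result = app.scrape_url(
    "https://docs.langchain.com/oss/javascript/langchain/models",
    params={"formats": ["markdown"]},
)
print(result)  # in the v1 SDK this is a dict-like result with a "markdown" field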

When I moved to Crawl4ai, I had to decide where to run it. Documentation is usually hosted globally, and scraping requires a lot of back-and-forth communication. Plus, network latency (and the Great Firewall if you're in certain regions) can be a pain. After some thought, I found the perfect spot: Google Colab.

Some people might ask why I'm so obsessed with Google products. The truth is, Google provides a lot of low-cost (often free) and open tools for developers to play with. At its core, Google Colab is just an online Jupyter Notebook running on a temporary VM. It's meant for ML/AI and data science, but you can absolutely use it as a standard, cloud-based Jupyter environment. A pro-tip: Colab gives you free GPU time, so you can even use it for light LLM fine-tuning or small-scale BERT pre-training. Here, by mounting Google Drive in the notebook, I could have Colab do the scraping and write the results straight to Drive. With a reasonable sleep interval between requests, I scraped thousands of pages without a single IP ban.

Here is the Jupyter script I used to scrape the Kubernetes JavaScript SDK documentation (I’m hosting a mirror of it on a github.io page):

# Install crawl4ai
!pip install crawl4ai 
# Initialize crawl4ai - note: you might need to restart the Colab runtime after first run
!crawl4ai-setup 

from google.colab import drive 
drive.mount('/content/drive') # Mount Google Drive

import asyncio
import os
import re
from urllib.parse import urlparse
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# ==================== Config ====================
# 1. Target & Scope
BASE_URL = "https://zhibinyang.github.io/kubernetes-client-node-docs-v1.4-unofficial/modules/index.html"
PREFIX = "https://zhibinyang.github.io/kubernetes-client-node-docs-v1.4-unofficial/"

# 2. Extension Filter (Set to None or [] to crawl all pages under the prefix)
ALLOWED_EXTENSIONS = ['.html', '.htm']

# 3. Storage Settings
OUTPUT_DIR = "/content/drive/MyDrive/Docs/Kubernetes-Node"
TRACKER_FILE = os.path.join(OUTPUT_DIR, "crawled_urls.txt")

# 4. Crawling Preferences
SLEEP_TIME = 1.0  # Seconds between requests
# ===============================================

os.makedirs(OUTPUT_DIR, exist_ok=True)

def clean_url(url):
    """Remove anchors and query params to ensure uniqueness"""
    return url.split('#')[0].split('?')[0].rstrip('/')

def should_crawl(url, prefix, processed_set, allowed_exts):
    """Check if a URL meets the crawling criteria"""
    cleaned = clean_url(url)

    # Basic check: Not processed and matches prefix
    if cleaned in processed_set or not cleaned.startswith(prefix):
        return False

    # Extension check: Must match if a list is provided
    if allowed_exts:
        path = urlparse(cleaned).path
        # Handle index cases: paths ending in / usually correspond to index.html
        if path.endswith('/') or not os.path.basename(path):
            return True
        return any(path.lower().endswith(ext.lower()) for ext in allowed_exts)

    return True

def get_file_name(url):
    """Naming logic: Use the last two parts of the path"""
    path = urlparse(clean_url(url)).path.strip('/')
    # Strip known extensions to save as .md
    if ALLOWED_EXTENSIONS:
        for ext in ALLOWED_EXTENSIONS:
            path = re.sub(re.escape(ext) + r'$', '', path, flags=re.IGNORECASE)

    parts = [p for p in path.split('/') if p]
    if len(parts) >= 2:
        name = f"{parts[-2]}-{parts[-1]}"
    elif len(parts) == 1:
        name = parts[0]
    else:
        name = "index"

    return re.sub(r'[^\w\-]', '_', name) + ".md"

async def universal_crawler():
    # 1. Load progress
    processed_urls = set()
    if os.path.exists(TRACKER_FILE):
        with open(TRACKER_FILE, 'r') as f:
            processed_urls = set(line.strip() for line in f if line.strip())

    # 2. Crawler Config
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=100,
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        queue = [BASE_URL]

        while queue:
            current_raw_url = queue.pop(0)
            current_url = clean_url(current_raw_url)

            # Re-check (prevent duplicates in queue)
            if current_url in processed_urls:
                continue

            print(f"🚀 Processing: {current_url}")

            result = await crawler.arun(url=current_url, config=run_config)

            if result.success:
                # A. Save file
                file_name = get_file_name(current_url)
                file_path = os.path.join(OUTPUT_DIR, file_name)
                with open(file_path, "w", encoding="utf-8") as f:
                    f.write(result.markdown)

                # B. Update tracker
                processed_urls.add(current_url)
                with open(TRACKER_FILE, "a") as f:
                    f.write(current_url + "\n")

                # C. Find new links
                for link in result.links.get("internal", []):
                    link_url = clean_url(link['href'])
                    if should_crawl(link_url, PREFIX, processed_urls, ALLOWED_EXTENSIONS):
                        queue.append(link_url)
            else:
                print(f"❌ Failed: {current_url} - {result.error_message}")

            await asyncio.sleep(SLEEP_TIME)

# Execute (top-level await works directly in a Colab/Jupyter cell)
await universal_crawler()

One limitation of Colab is that it requires your browser to stay open and your connection to be active to keep the VM alive. For a task like this, keeping it open all day isn't a huge deal. But if your network drops for more than 30 minutes, the resources are released and your script stops. My script accounts for this by tracking progress, so you can pick up where you left off.

The Kubernetes docs were my "Phase 2" scrape. "Phase 1" was the entire documentation for LangChain.js and LangGraph.js—about 400 pages. While the Kubernetes script was running, I was already testing the search effectiveness of the LangChain data. I didn’t even wait for the K8s scrape to finish; I was too excited to see how it worked with MCP.

Here is a look at one of the scraped files:

# Tools
Tools extend what agents can do—letting them fetch real-time data, execute code...
## Create tools
### Basic tool definition
The simplest way to create a tool is by importing the `tool` function...

Now that I've scraped it, how do I use it?

I spent a lot of effort getting this data, so I needed an MCP service to serve it up immediately. Following the principle of "Don't reinvent the wheel," I looked for the simplest implementation. That's when I turned back to a piece of software I've been using for a while: Obsidian.

Obsidian is a free Markdown note-taking app with a massive plugin ecosystem. I previously used a plugin called "Copilot" which was great—it uses a built-in vector database combined with Embedding Model APIs (like OpenAI) to provide a full RAG (Retrieval-Augmented Generation) experience for your local notes. Since it's a note-taking app, Markdown rendering is built-in.

If there were a plugin that could link a vector database and an Embedding model while providing an MCP interface, I’d have the best of both worlds: managing docs in Obsidian (reviewing and editing them) while letting my AI IDE query them. I found a plugin called MCP Server (not in the official store; you have to clone it from GitHub and build it manually).

GitHub: Minhao-Zhang/obsidian-mcp-server

This plugin has all the hallmarks of a proper RAG system: configurable Embedding models, chunking, overlap settings, similarity thresholds, and top-k retrieval. It seemed perfect. However, I later realized a major flaw: it doesn't support incremental updates. If you add one new document, you have to rebuild the entire vector database. If you have a lot of docs, your API costs are going to climb fast.
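
To make the chunking and overlap settings concrete, and to show why the lack of incremental updates hurts, here is a rough plain-Python illustration (my own sketch, not the plugin's code): fixed-size chunking with overlap is what slices each note before embedding, and a per-file content hash is the kind of bookkeeping you would need in order to re-embed only the files that changed.

import hashlib

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap (illustrative only)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def file_fingerprint(path: str) -> str:
    """Hash a file's content; re-embed only files whose hash has changed."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()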

After getting it running, I tested it using the official MCP Inspector:

npx @modelcontextprotocol/inspector

Using the SSE mode, I plugged in the SSE address from the Obsidian plugin and connected. I could then list the tools and use simple_vector_search to query my docs.
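
The Inspector is the quickest way to poke at it, but the same check can be scripted. Here is a sketch using the official MCP Python SDK's SSE client; the URL and the argument name for the query are my assumptions, so check what the plugin actually reports and exposes:

# pip install mcp
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

SSE_URL = "http://localhost:3000/sse"  # replace with the address the Obsidian plugin shows

async def main():
    async with sse_client(SSE_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])
            # Tool name comes from the plugin; the argument key is a guess
            result = await session.call_tool(
                "simple_vector_search",
                {"query": "how to define a tool for an agent"},
            )
            print(result.content)

asyncio.run(main())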

Does it actually work?

I had scraped the docs with Crawl4ai and built the vector DB in Obsidian, and I was dying to see how this improved the Vibe Coding experience. But the Inspector tests revealed a problem immediately.

Looking at the LangChain Markdown example from earlier, most documentation pages start with a huge introduction/navigation section and end with a long list of related links. These sections are packed with keywords. If you use standard chunking (say, 1000 or 2000 characters), these headers and footers become their own chunks.

From a vector DB perspective, these "noise" chunks are packed with more relevant-looking keywords than the actual code blocks. Plus, the intros often mention features from other pages. As a result, when you search via the MCP interface, the top results—sorted by similarity—are often just the intro/navigation chunks from five different pages, rather than the code implementation you actually need.
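
If you want to see this concretely, a quick-and-dirty check over the scraped files makes the point. The heuristic and the folder path below are mine, purely illustrative: measure how much of each fixed-size chunk is Markdown link syntax.

import re
from pathlib import Path

DOCS_DIR = Path("Docs/LangChain")  # hypothetical folder of scraped Markdown files
CHUNK_SIZE = 1000

def link_density(chunk: str) -> float:
    """Fraction of characters taken up by Markdown links [text](url)."""
    link_chars = sum(len(m.group(0)) for m in re.finditer(r"\[[^\]]*\]\([^)]*\)", chunk))
    return link_chars / max(len(chunk), 1)

for path in DOCS_DIR.glob("*.md"):
    text = path.read_text(encoding="utf-8")
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    noisy = sum(1 for c in chunks if link_density(c) > 0.5)
    print(f"{path.name}: {noisy}/{len(chunks)} chunks are mostly navigation links")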

You can see it yourself if you look at the raw documentation pages: https://docs.langchain.com/oss/javascript/langchain/models

From this angle, after all this work to get the pipeline running, the results were... underwhelming. So, is there anything we can do to save our Vibe Coding setup?

There is, though whether it works is a story for next time. Stay tuned.