Building a .NET 10 Benchmark Harness


    Part 2 of 5 in the Local LLM Bench series.

    The PowerShell script from part one did its job. It surfaced the think-mode problem, sorted out which models could call tools, and gave me rough latency numbers. But it could not tell me whether the code models wrote was actually correct — I was reading output and deciding it looked fine, which is not the same thing as running it.

    What I needed was a harness that ran models against defined tasks, verified the outputs mechanically, and produced a repeatable score. I’m a C# developer. .NET 10 was already on the machine. The choice was not a choice.

    Architecture

    The project is a .NET 10 console application. The core pieces are:

    OllamaRunner is a thin HTTP wrapper around Ollama’s /api/generate and /api/chat endpoints. Every request goes out with temperature=0, seed=42, and think=false. Temperature zero makes results deterministic enough to compare across runs. The seed locks that in further. The think flag is false by default — models that need it explicitly will be detected and handled.
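    As a rough illustration of the request settings described above, here is a minimal sketch of how such a request body could be built. The OllamaRequest class and its Build method are hypothetical names, not the harness's actual code, and stream=false is an assumption made to keep scoring simple.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: builds the JSON body for Ollama's /api/generate.
public static class OllamaRequest
{
    public static string Build(string model, string prompt, bool think = false)
    {
        var payload = new
        {
            model,
            prompt,
            stream = false,      // assumption: a single complete response is easier to score
            think,               // top-level flag; false turns reasoning mode off
            options = new
            {
                temperature = 0, // greedy decoding so runs are comparable
                seed = 42        // fixed seed, as described in the text
            }
        };
        return JsonSerializer.Serialize(payload);
    }
}
```

    Calling OllamaRequest.Build("qwen3:14b", "Say hi") returns a JSON string that can be posted as the request body.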

    RoslynEvaluator handles the SumEvens code test in-process. It takes whatever the model returns, strips any markdown fences, wraps the bare method in a class, and hands it to the Roslyn CSharp scripting API to compile and execute. If it compiles and SumEvens(new[] {1,2,3,4,5}) returns 6, the model passes. This runs entirely in memory with no disk I/O and no subprocess.
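    The fence-stripping and wrapping steps can be sketched with nothing but the base class library. This is an assumption-laden illustration: ModelOutputCleaner, StripFences, WrapInClass, and the Submission class name are all hypothetical, and the real evaluator hands the wrapped source to Roslyn afterwards, which is not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModelOutputCleaner
{
    // Three backticks, built at runtime so this sample never contains a literal fence.
    private static readonly string Ticks = new string('`', 3);

    // Captures the body of the first fenced block, with or without a language tag.
    private static readonly Regex Fence =
        new Regex(Ticks + @"[a-zA-Z#+]*\s*\n(.*?)" + Ticks, RegexOptions.Singleline);

    // Returns the fenced body if a fence is present, otherwise the trimmed input.
    public static string StripFences(string raw)
    {
        var match = Fence.Match(raw);
        var body = match.Success ? match.Groups[1].Value : raw;
        return body.Trim();
    }

    // Wraps a bare method so it can be compiled as a complete type.
    public static string WrapInClass(string method) =>
        "public static class Submission\n{\n" + method + "\n}\n";
}
```

    Taking only the first fenced block is a design choice: it discards any prose the model adds before or after the code, which is exactly the output the prompt told it not to produce.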

    TempProjectRunner is where it gets more serious. This component scaffolds actual temporary dotnet projects, writes model-generated code into them, builds them with dotnet build, and runs them with dotnet run. It checks stdout for the expected output. For the test suite portion, it scaffolds a second project alongside the first, adds a project reference, drops in model-generated xUnit test code, and runs dotnet test. Every project is cleaned up from the temp directory when the run completes.

    Scorer orchestrates the sequence — chat test, code test, tool test, instruction test, reasoning test, JSON output test, sequence test, Hello World test — and assembles the results into a ModelResult record.

    ModelResult is a straightforward C# record type. Every boolean metric is a property; TotalScore is a computed getter that sums them. The record also carries timing in milliseconds for each test category and a ThinkRequired flag that is informational only and does not affect the score.
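    A cut-down sketch of that shape, assuming only four of the boolean metrics and hypothetical property names (the real record carries more metrics plus the timing fields):

```csharp
using System;
using System.Linq;

// Hypothetical subset of the record described above.
public record ModelResult(
    string Model,
    bool ChatOk,
    bool CodeCompiles,
    bool CodeCorrect,
    bool InstructionFollowed,
    bool ThinkRequired)   // informational only, deliberately left out of the score
{
    // Computed on access: one point per passing metric.
    public int TotalScore =>
        new[] { ChatOk, CodeCompiles, CodeCorrect, InstructionFollowed }.Count(passed => passed);
}
```

    Because TotalScore is derived rather than stored, a result loaded from JSON can never carry a score that disagrees with its own flags.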

    ConsoleReporter prints the final table to the terminal with ANSI color coding. ResultStore writes the raw results to results/model-results.json and a human-readable markdown ledger to results/RESULTS.md after each run.

    The Code Tests

    The first code test is SumEvens: write a C# method that takes IEnumerable<int> and returns the sum of even numbers. Return only the method, no class, no namespace, no explanation. This is deliberately narrow. The narrow scope is the point — it is testing whether a model can follow output constraints and write code that compiles and produces correct results, not whether it can write impressive prose around the code.

    RoslynEvaluator wraps the method in a class, invokes it with {1, 2, 3, 4, 5}, and checks that the result is 6. Compile error means the model scores zero on both compile and correct. Compiles but returns the wrong number means compile point awarded, correct point denied. Compiles and returns 6 means full credit.
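    For reference, this is the kind of answer that earns both points. The enclosing SumEvensSample class is only there to make the snippet compile on its own; the prompt asks the model for the bare method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SumEvensSample
{
    // Sums the even values; an empty sequence yields 0.
    public static int SumEvens(IEnumerable<int> numbers) =>
        numbers.Where(n => n % 2 == 0).Sum();
}
```

    For the evaluator's input {1, 2, 3, 4, 5} the even values are 2 and 4, so the expected result is 6.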

    Hello World: The Real Test

    The Hello World test is where I learned something useful. The prompt asks the model to write a complete C# console application: a Greeter class with a public static GetGreeting() method that returns "Hello, World!", plus a Main method or top-level statements that calls it and prints the result. Separately, it asks the model to write xUnit tests for that Greeter class.

    TempProjectRunner scaffolds a dotnet new console project, replaces Program.cs with whatever the model generated, runs dotnet build, then dotnet run, and checks stdout for "Hello, World!". For the test portion, it scaffolds a dotnet new xunit project in the same temp directory, adds a project reference to the app, drops in the model’s test code as GreeterTests.cs, runs dotnet build, and then dotnet test.

    This turns out to be an excellent proxy for whether a model understands C# project structure. Writing a method is straightforward. Writing a complete application that builds from scratch against a specific framework target, with a class in a form that a separately compiled test project can reference — that is a different problem. Models that understand C# project conventions get it right on the first try. Models that pattern-match on superficial features tend to include the wrong using statements, declare the class in a namespace that the test code does not account for, or produce an entry point that conflicts with the Greeter class definition.

    Each step is gated: if the app does not compile, neither the output check nor the test run happens. If the tests do not compile, the pass/fail result is not recorded. Partial credit is possible: a model whose app builds and prints the right output, but whose tests compile and then fail at runtime, earns three of the four Hello World points.
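    The gating can be sketched as a chain of lazily evaluated steps. HelloWorldGate and its parameter names are hypothetical, and this sketch assumes the test project is still built when the app compiles but prints the wrong output, which the text does not spell out.

```csharp
using System;

public static class HelloWorldGate
{
    // Each delegate is only invoked if the steps it depends on succeeded.
    public static int Score(Func<bool> appBuilds, Func<bool> outputMatches,
                            Func<bool> testsBuild, Func<bool> testsPass)
    {
        if (!appBuilds()) return 0;        // nothing downstream runs
        int points = 1;
        if (outputMatches()) points++;
        if (!testsBuild()) return points;  // pass/fail is never recorded
        points++;
        if (testsPass()) points++;
        return points;
    }
}
```

    Passing delegates instead of booleans means an expensive step such as dotnet test is never started when an earlier gate has already failed.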

    Scoring

    The 10-point scoring breakdown for the initial complete run:

    Category                                         Points
    Chat response (non-empty, sensible)              1
    SumEvens compiles                                1
    SumEvens correct                                 1
    Tool call supported (not HTTP 400)               1
    Tool call valid (structured, correct function)   1
    Instruction followed (exactly three words)       1
    Hello World app compiles                         1
    Hello World app correct output                   1
    Hello World tests compile                        1
    Hello World tests pass                           1

    After the initial runs I extended the suite with three more tests, bringing the maximum to 13: a reasoning test (a word problem with an exact numeric answer — $4.50, no other text), a JSON output test (produce a valid JSON array of at least three programming language names), and a sequence test (output the numbers 1 through 5, one per line, nothing else). All three are binary pass/fail with no partial credit. The reasoning and sequence tests catch models that ignore output constraints even when the constraint is explicit. Several did.
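    Binary checks like these are easy to make mechanical. A minimal sketch of the sequence and JSON checks follows; ConstraintChecks and its method names are hypothetical, and tolerating leading or trailing whitespace is an assumption about how strict the real harness is. It also cannot tell whether the strings are actually programming language names.

```csharp
using System;
using System.Linq;
using System.Text.Json;

public static class ConstraintChecks
{
    // Sequence test: exactly the lines 1..5, nothing before, between, or after.
    public static bool IsExactSequence(string output)
    {
        var lines = output.Trim().Split('\n').Select(l => l.TrimEnd('\r')).ToArray();
        return lines.SequenceEqual(new[] { "1", "2", "3", "4", "5" });
    }

    // JSON test: the whole output must parse as an array of at least three non-empty strings.
    public static bool IsLanguageArray(string output)
    {
        try
        {
            using var doc = JsonDocument.Parse(output);
            var root = doc.RootElement;
            return root.ValueKind == JsonValueKind.Array
                && root.GetArrayLength() >= 3
                && root.EnumerateArray().All(e =>
                       e.ValueKind == JsonValueKind.String &&
                       !string.IsNullOrWhiteSpace(e.GetString()));
        }
        catch (JsonException)
        {
            return false;   // prose around the JSON, or no JSON at all
        }
    }
}
```

    A model that prefixes its answer with a friendly sentence fails both checks, which is the behavior these tests are meant to catch.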

    Unit Tests

    The test project covers 13 cases across five test classes. ModelResultTests verifies that the scoring logic is correct — all true returns the expected sum, all false returns zero, ThinkRequired does not affect the score. RoslynEvaluatorTests covers the markdown fence stripping and three evaluation cases: correct implementation, wrong result, and garbage input. ScorerTests uses a MockRunner that replays canned responses and verifies that the Scorer assembles the ModelResult correctly for the pass case, the tool-rejected case, and the instruction-failure case. ConsoleReporterTests confirms that PrintTable does not throw with null prior results or when a model has regressed since the previous run.

    None of these tests require a running Ollama instance. The mock runner pattern makes the Scorer fully testable without any external dependencies.
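    The mock runner pattern might look roughly like this. IModelRunner and GenerateAsync are hypothetical names for the seam between the Scorer and HTTP; the actual interface in the repository may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical seam: the scorer depends on this interface rather than on HTTP.
public interface IModelRunner
{
    Task<string> GenerateAsync(string model, string prompt);
}

// Replays canned responses in order, so tests never touch a live Ollama instance.
public sealed class MockRunner : IModelRunner
{
    private readonly Queue<string> _responses;

    public MockRunner(params string[] responses) =>
        _responses = new Queue<string>(responses);

    // Ignores the prompt and returns the next canned response.
    public Task<string> GenerateAsync(string model, string prompt) =>
        Task.FromResult(_responses.Dequeue());
}
```

    A test then constructs the scorer with new MockRunner("Paris", "one two three") and asserts on the assembled result, with the responses arriving in the order the scorer asks for them.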

    First Complete Run

    Thirteen models, ten metrics each. This is what came back:

    Model               Score   Notes
    gemma4:latest       10/10   Clean sweep
    glm-4.7-flash       9/10
    gemma4:26b          8/10
    qwen2.5:14b         8/10
    devstral:latest     7/10
    qwen3-coder:30b     7/10
    qwen3:14b           7/10
    mistral:latest      6/10
    gpt-oss:20b         5/10    think_required detected
    phi4:14b            5/10
    llava:7b            5/10
    qwen2.5-coder:14b   4/10
    qwen3:30b           3/10

    gemma4:latest — a ~12B parameter model — scores 10 out of 10. It answers the chat question, writes SumEvens correctly, emits a proper tool call, follows the three-word instruction, builds the Hello World app, writes tests that compile and pass, gets the math problem right, produces valid JSON, and outputs the sequence with no extra text. On every metric the harness defines, it is the best model in the pool by a clean margin over everything larger than it.

    The result is worth sitting with. A model less than half the size of qwen3:30b outscores it by seven points. glm-4.7-flash is a 30B MoE and comes in second at 9/10. The coding-focused variants — qwen2.5-coder and qwen3-coder — score lower than their general-purpose counterparts at similar sizes.

    The obvious interpretation is that gemma4:latest is simply the best model here. The problem is that the harness measures what I built the harness to measure. Before drawing that conclusion, I need to know whether these metrics are the right metrics.


    The full source is at github.com/erichexter/ollama-model-bench.


    Next up: Part 3 digs into what the scores actually mean — and why gemma4:latest’s clean sweep turned out to be almost entirely beside the point.

    Search — The Evolution of the Karpathy LLM Wiki


    My LLM notes wiki outgrew file reads. Agents were pulling entire files to find a single relevant section — burning tokens on context that didn’t matter, missing things that were buried three pages deep. The corpus had just grown past the point where IO-based access was practical.

    The fix was search. And since agents need tools, the obvious move was to build it as an MCP server. But if you’re building search anyway, plain keyword matching felt like leaving half the value on the table: it is too easy to miss conceptual matches that don’t share exact terms. So: something old and something new. SQLite already has FTS5. sqlite-vec adds vector search as a loadable extension. Ollama runs the embedding model locally. Put them together and you get hybrid RAG on hardware you already own, exposed as an MCP tool any agent in the fleet can call.

    This post covers how it’s built — starting from what the agent sees and working inward to the SQL and vector embeddings.


    What the Agent Sees

    From the agent’s perspective, this is just an MCP server with a set of tools. Point an .mcp.json at the host and the tools are available. No setup, no SDK, no awareness of what’s running underneath.

    The primary tool is search_knowledge:

    {
      "method": "tools/call",
      "params": {
        "name": "search_knowledge",
        "arguments": {
          "query": "attention mechanism scaled dot product",
          "top_k": 5,
          "hybrid_alpha": 0.6,
          "sources": ["karpathy-wiki"]
        }
      }
    }
    

    The response comes back as ranked chunks with source context:

    {
      "content": [{
        "type": "text",
        "text": "[
          {
            \"text\": \"Scaled dot-product attention divides the dot products by √d_k to prevent vanishing gradients in high dimensions...\",
            \"source\": \"karpathy-wiki\",
            \"relPath\": \"transformers/attention.md\",
            \"score\": 0.91,
            \"frontmatter\": { \"tags\": [\"attention\", \"transformers\"] }
          },
          ...
        ]"
      }]
    }
    

    The agent gets ranked text chunks, source file paths, and scores. It doesn’t need to know whether the result came from a vector search or keyword search — that’s the server’s problem.

    The Full Tool Set

    Seven tools in total. search_knowledge covers 95% of use.

    Tool              Purpose
    search_knowledge  Hybrid vec+FTS search across one or more sources.
    get_page          Retrieve a full page by source + relative path. Use when search returns a partial chunk and you want the full document.
    list_sources      Lists indexed sources with page/chunk counts and last-indexed timestamps.
    get_stats         Query counts and latencies over 1h / 24h / 7d / 30d windows.
    get_query_log     Recent query history. Useful for understanding what agents are actually asking.
    refresh_ingest    Trigger immediate re-indexing for a source after a write.
    ping              Returns current UTC. Health check.

    list_sources is underrated as a diagnostic. A 200 response from the API tells you nothing about whether the index is populated. If results are poor, check pageCount > 0 and that lastIndexed is recent before assuming the search logic is wrong.

    The hybrid_alpha Parameter

    This is the control knob for the blend between vector search and full-text search.

    • 0.0 — pure FTS (BM25 keyword ranking)
    • 1.0 — pure vector (semantic similarity)
    • 0.5 — equal blend (default)

    In practice, 0.6 to 0.7 (vector-weighted) works better for conceptual queries: “how does attention scale with sequence length.” Drop toward 0.3 when you need an exact term match that the embedding model might paraphrase: specific function names, error codes, version numbers.


    How the Search Works

    When search_knowledge is called, the server runs two queries in parallel and merges the results.

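    // Run both retrievers concurrently; over-fetch (topK * 2) so the merge has candidates to re-rank.
    var vectorTask = SearchByVector(embeddingVector, topK * 2, sources);
    var ftsTask    = SearchByFts(query, topK * 2, sources);
    
    await Task.WhenAll(vectorTask, ftsTask);
    
    // Both tasks have completed here, so awaiting them again returns immediately.
    var merged = Merge(await vectorTask, await ftsTask, hybridAlpha, topK);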
    

    The merge step normalizes each result list’s scores to [0, 1], applies the alpha weight, sums scores per chunk (a chunk can appear in both lists), and returns the top K. Normalization matters because BM25 scores and vector distances are on completely different scales. Skip it and one path dominates every query regardless of alpha.
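    A minimal sketch of such a merge, under stated assumptions: min-max normalization is one plausible choice (the server's exact method is not given), both raw scores are treated as lower-is-better (L2 distance, and FTS5's bm25() which returns more negative values for better matches), and HybridMerge with its tuple shapes is a hypothetical stand-in for the real Merge.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HybridMerge
{
    // Min-max normalize raw scores where LOWER is better into [0, 1] where HIGHER is better.
    private static Dictionary<long, double> Normalize(IReadOnlyList<(long Id, double Raw)> hits)
    {
        if (hits.Count == 0) return new Dictionary<long, double>();
        double min = hits.Min(h => h.Raw), max = hits.Max(h => h.Raw);
        double range = max - min;
        // A single hit (or all-equal scores) gets full credit rather than dividing by zero.
        return hits.ToDictionary(h => h.Id, h => range == 0 ? 1.0 : (max - h.Raw) / range);
    }

    public static List<(long Id, double Score)> Merge(
        IReadOnlyList<(long Id, double Raw)> vectorHits,
        IReadOnlyList<(long Id, double Raw)> ftsHits,
        double alpha, int topK)
    {
        var combined = new Dictionary<long, double>();

        // alpha weights the vector path, (1 - alpha) the FTS path; shared chunks accumulate both.
        foreach (var pair in Normalize(vectorHits))
        {
            combined.TryGetValue(pair.Key, out var current);
            combined[pair.Key] = current + alpha * pair.Value;
        }
        foreach (var pair in Normalize(ftsHits))
        {
            combined.TryGetValue(pair.Key, out var current);
            combined[pair.Key] = current + (1 - alpha) * pair.Value;
        }

        return combined.OrderByDescending(kv => kv.Value)
                       .Take(topK)
                       .Select(kv => (Id: kv.Key, Score: kv.Value))
                       .ToList();
    }
}
```

    With alpha at 0.0 the vector contribution is multiplied away entirely and the ranking is pure FTS, matching the parameter description earlier in the post.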

    Before either query runs, the search query itself gets embedded:

    POST http://<ollama-host>:11434/api/embeddings
    Content-Type: application/json
    
    {
      "model": "nomic-embed-text:latest",
      "prompt": "attention mechanism scaled dot product"
    }
    

    That gives back a 768-dimensional float vector — what the vector search runs against.
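    On the receiving side, the response from this endpoint is a JSON object with an embedding array. A minimal parsing sketch follows; EmbeddingResponse is a hypothetical helper, and the dimension check is an added safeguard rather than something the post describes.

```csharp
using System;
using System.Linq;
using System.Text.Json;

public static class EmbeddingResponse
{
    // Reads {"embedding": [ ...floats... ]} and verifies the vector length.
    public static float[] Parse(string json, int expectedDims = 768)
    {
        using var doc = JsonDocument.Parse(json);
        var vector = doc.RootElement.GetProperty("embedding")
                        .EnumerateArray()
                        .Select(e => e.GetSingle())
                        .ToArray();
        if (vector.Length != expectedDims)
            throw new InvalidOperationException(
                $"Expected {expectedDims} dimensions, got {vector.Length}.");
        return vector;
    }
}
```

    Failing loudly on a dimension mismatch is cheaper than discovering later that a different embedding model was loaded and the stored vectors are no longer comparable.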

    The Vector Query

    sqlite-vec exposes vector search through a virtual table with a MATCH clause. Under the hood, current sqlite-vec releases do a brute-force nearest-neighbor scan rather than an approximate (ANN) index lookup:

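    -- :embedding is the query vector, :k the number of neighbors to return.
    -- :sources is NULL or a JSON array string such as '["karpathy-wiki"]';
    -- SQLite cannot bind a list to IN directly, so json_each() expands it.
    SELECT c.id, c.body, c.source, c.rel_path, c.frontmatter,
           cv.distance
    FROM chunk_vecs cv
    JOIN chunks c ON c.id = cv.chunk_id
    WHERE cv.embedding MATCH :embedding
      AND cv.k = :k
      AND (:sources IS NULL
           OR c.source IN (SELECT value FROM json_each(:sources)))
    ORDER BY cv.distance;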
    

    distance here is L2 distance — lower is closer. sqlite-vec handles all the index internals; from the query side it looks like a regular SQL query.

    The FTS Query

    Standard SQLite FTS5 with BM25 ranking:

    SELECT c.id, c.body, c.source, c.rel_path, c.frontmatter,
           bm25(chunk_fts) AS fts_score
    FROM chunk_fts
    JOIN chunks c ON c.id = chunk_fts.rowid
    WHERE chunk_fts MATCH :query
    ORDER BY bm25(chunk_fts)
    LIMIT :k;
    

    FTS5’s MATCH supports phrase queries, prefix matching, and boolean operators. For agent queries coming in as natural language, the server sanitizes the input to a simple term query before passing it to MATCH.
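    One way to do that sanitization is to keep only letter and digit runs, quote each one so FTS5 treats it as a plain term, and join them with OR. This is a sketch under assumptions: FtsQuery.Sanitize is a hypothetical name, and joining with OR rather than the implicit AND is a guess that favors recall for natural-language questions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FtsQuery
{
    // Turns free text into a MATCH expression made only of quoted terms.
    public static string Sanitize(string naturalLanguage)
    {
        var terms = Regex.Matches(naturalLanguage, @"[\p{L}\p{N}]+")
                         .Select(m => "\"" + m.Value.ToLowerInvariant() + "\"")
                         .Distinct();
        return string.Join(" OR ", terms);
    }
}
```

    Quoting every term also neutralizes words like AND, OR, and NOT that FTS5 would otherwise parse as operators. An input with no usable terms produces an empty string, in which case the caller should skip the FTS query rather than pass it to MATCH.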


    The Data Model

    Three tables carry the retrieval workload:

    -- Chunked text with metadata
    CREATE TABLE chunks (
        id          INTEGER PRIMARY KEY,
        page_id     INTEGER NOT NULL REFERENCES pages(id),
        chunk_index INTEGER NOT NULL,
        body        TEXT    NOT NULL,
        token_count INTEGER,
        source      TEXT,
        rel_path    TEXT,
        frontmatter TEXT
    );
    
    -- Vector index (sqlite-vec extension)
    CREATE VIRTUAL TABLE chunk_vecs USING vec0(
        chunk_id INTEGER PRIMARY KEY,
        embedding FLOAT[768]
    );
    
    -- Full-text search index (FTS5, built into SQLite)
    CREATE VIRTUAL TABLE chunk_fts USING fts5(
        body,
        source    UNINDEXED,
        rel_path  UNINDEXED,
        content='chunks',
        content_rowid='id'
    );
    

    chunk_vecs is a sqlite-vec vec0 virtual table: INSERT a row with the chunk ID and its 768-dim embedding, and sqlite-vec handles the vector storage internally. chunk_fts is a content-backed FTS5 table that stays in sync with chunks via triggers.

    Supporting tables: pages (source files with hash-based change detection), indexer_runs (ingest audit log), query_log (query history for observability).

    One SQLite file. No separate processes, no network hops between storage components, no backup complexity.


    The Write Path

    When a document is added or updated in the source directory, the indexer picks it up:

    1. SHA-256 hash the file. Compare against pages.content_hash. Skip if unchanged.
    2. Parse YAML frontmatter. Extract the body.
    3. Split into chunks — 512-token target, 64-token overlap, break on paragraph boundaries where possible.
    4. For each chunk: POST to Ollama /api/embeddings. Receive a 768-dim float array.
    5. INSERT into chunks. INSERT into chunk_vecs. FTS5 trigger handles chunk_fts sync.
    6. Update pages.content_hash and indexed_at.
    7. Write a row to indexer_runs.
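    Step 3 is the part most worth sketching. The version below is a simplified illustration, not the indexer's code: Chunker is a hypothetical name, whitespace-separated words stand in for real model tokens, and paragraph boundaries are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Chunker
{
    // Sliding window: each chunk repeats the last `overlap` tokens of the previous one.
    public static List<string> Split(string body, int size = 512, int overlap = 64)
    {
        if (overlap >= size) throw new ArgumentException("overlap must be smaller than size");

        // Stand-in tokenizer: whitespace-separated words, not real model tokens.
        var tokens = body.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = size - overlap;

        for (int start = 0; start < tokens.Length; start += step)
        {
            chunks.Add(string.Join(" ", tokens.Skip(start).Take(size)));
            if (start + size >= tokens.Length) break;   // last window reached the end
        }
        return chunks;
    }
}
```

    The overlap exists so that a sentence straddling a chunk boundary is fully present in at least one chunk; with size 4 and overlap 1, ten words come back as three chunks whose edges share one word.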

    nomic-embed-text is 137M parameters — fast on a GPU host, single-digit milliseconds per chunk. The indexer pipelines requests; Ollama queues them.


    Gotchas

    The embed model context limit is a silent failure.

    nomic-embed-text has an 8K token context window. Chunks that exceed it are silently not embedded — present in chunks, retrievable via get_page, invisible to vector search. No error from Ollama. Enforce the chunk size limit at ingest time. Symptom check:

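    -- Chunks that were stored but never received an embedding.
    SELECT c.id, c.source, c.rel_path, c.token_count
    FROM chunks c
    LEFT JOIN chunk_vecs cv ON cv.chunk_id = c.id
    WHERE cv.chunk_id IS NULL;
    

    Any row here is a chunk that exists in chunks but has no vector, so vector search can never return it.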

    Stale bind mount after remount.

    If the CIFS mount backing the source directory remounts — after a network blip or server reboot — the container holds a file descriptor to the old empty mount point. The API returns 200. The indexer runs. It finds zero files. Nothing crashes, nothing complains. Restart the container after any storage remount.

    Shallow health checks miss the real failure mode.

    GET /ping → 200 stays green with an empty index. Real health check: call list_sources, assert pageCount > 0 with a recent lastIndexed. You’re monitoring the retrieval system, not just the process.
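    The deeper check reduces to a small predicate over the values list_sources returns. IndexHealth is a hypothetical helper, and the acceptable age is left to the caller because the post does not say what counts as recent.

```csharp
using System;

public static class IndexHealth
{
    // Healthy only if the source actually has pages AND was indexed recently.
    public static bool IsHealthy(int pageCount, DateTimeOffset lastIndexed,
                                 DateTimeOffset now, TimeSpan maxAge) =>
        pageCount > 0 && now - lastIndexed <= maxAge;
}
```

    Passing now in as a parameter keeps the check deterministic in tests; a monitor would call it with DateTimeOffset.UtcNow.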


    What This Gets You

    ~11K chunks, query results under 100ms on commodity hardware. The Ollama embedding call is the only network hop on the hot path: ~10ms on a GPU host for a short query. The SQLite vector scan is not the bottleneck.

    Hybrid search earns its keep in practice. Pure vector drifts on exact version numbers, function names, and error codes. Pure FTS misses conceptual synonyms. The blend handles both without tuning a separate retriever per query type.

    The MCP wrapper means any agent that speaks the protocol can call it without any awareness of the storage layer. Add a source, re-index, done — consumers don’t change.

    Most databases can store embeddings at this point. The reason to reach for SQLite + sqlite-vec specifically is that you probably already have it, it requires no new infrastructure, and the FTS5 index is already there. The hybrid approach — run both searches, blend by alpha — transfers to any store that can handle both. The schema and the search logic are the portable parts.

    Which Local Models Can Actually Code?


    Part 1 of 5 in the Local LLM Bench series.

    I had ten local models installed and no good answer to a simple question: which of them could actually do useful work? Chat demos are easy to fake. I wanted to know whether these models could write working code, call tools correctly, and follow instructions without needing hand-holding. The only way to find out was to run them.

    The Setup

    Machine is an Alienware Windows 11 box with an RTX 5080 carrying 16GB of VRAM. Ollama is running locally, serving the following ten models:

    • mistral:latest (7B)
    • llava:7b (7B, vision)
    • gemma4:latest (~12B)
    • gemma4:26b (26B)
    • qwen3:14b (14B)
    • qwen3:30b (30B)
    • phi4:14b (14B)
    • qwen2.5:14b (14B)
    • qwen2.5-coder:14b (14B, coding-focused)
    • glm-4.7-flash (30B MoE)

    The size range alone tells you the hardware story. Anything under about 20B fits in VRAM comfortably. The 26B and 30B models spill onto system RAM — which you feel in the latency numbers.

    First Pass: Two Prompts, PowerShell

    The first script was about as minimal as it gets. Two prompts per model: “What is the capital of France?” to confirm the model is responding at all, and “Write an is_prime() function in Python” as a basic code generation check. No scoring, no verification — just checking that something came back.

    Most models answered both prompts without incident. Then I hit the bigger ones. gemma4:26b, glm-4.7-flash, and qwen3:30b all returned empty responses. Not errors — the HTTP calls succeeded, Ollama said everything was fine, the responses just contained no text.

    Working out why took longer than it should have, and the cause was not the same for every model.

    The Think-Mode Wall

    qwen3 models support a reasoning mode where the model works through a problem step by step before producing visible output. The reasoning tokens live inside <think>...</think> blocks and are not part of the visible answer. They do count against the token budget, though, and when I was requesting with a tight num_predict limit, the model was spending the entire budget on internal reasoning and returning nothing to the caller. glm-4.7-flash has its own variant of the same mode: different model family, same symptom.

    The fix for both: add "think": false to the request body. With that flag set, qwen3:14b went from returning a blank response to producing clean, working code in about 2 seconds. The qwen3 and glm models followed.

    gemma4:26b’s blank responses were a separate problem entirely. At 26B it spills to RAM, and with a tight num_predict budget and slow generation speed, the script’s read timeout was firing before any tokens arrived. More headroom fixed it.

    The lesson here is that “model returned empty string” and “model failed” are not the same thing, and you have to understand what each model family expects before you can interpret the output.
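    That distinction can be made mechanical. The sketch below assumes the raw model text is available with its <think> tags intact, which depends on the Ollama version and request options; ResponseTriage and EmptyReason are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public enum EmptyReason { NotEmpty, OnlyReasoning, NothingReturned }

public static class ResponseTriage
{
    // Removes complete <think>...</think> blocks, then any unterminated block
    // left behind when the token budget ran out mid-reasoning.
    public static string StripThink(string text)
    {
        var withoutClosed = Regex.Replace(text, @"<think>.*?</think>", "", RegexOptions.Singleline);
        var withoutOpen = Regex.Replace(withoutClosed, @"<think>.*$", "", RegexOptions.Singleline);
        return withoutOpen.Trim();
    }

    // Separates "the model said nothing" from "the model only reasoned".
    public static EmptyReason Classify(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return EmptyReason.NothingReturned;
        return StripThink(text).Length == 0 ? EmptyReason.OnlyReasoning : EmptyReason.NotEmpty;
    }
}
```

    OnlyReasoning points at a token budget or think-mode problem, while NothingReturned points at a timeout or transport issue; treating both as a failed test would hide which one to fix.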

    Tool-Calling: Where Things Got Interesting

    Once the basic chat and code tests were passing, I added a tool-calling test. The prompt was “What’s the weather in Paris?” with a get_weather function schema attached to the request. A model that handles tool calling correctly should stop generating text and instead emit a structured tool_calls object pointing at get_weather with the right argument. A model that doesn’t understand the protocol either returns prose (“I don’t have access to weather data”), returns a JSON blob as plain text, or refuses the request entirely with an HTTP 400.

    The results split into three clear buckets. mistral, gemma4 (both sizes), qwen3 (both sizes), qwen2.5:14b, and glm-4.7-flash all produced proper structured tool_calls. That is the expected behavior: the model uses the tool schema as intended.

    qwen2.5-coder:14b was the interesting failure. It returned what looked like a tool call, but as a raw JSON string embedded in the message content rather than as a structured tool_calls entry. The model clearly understood what was being asked; it just didn’t output it in the right format. A “coder” model is not necessarily a “tool-aware” model. They are different capabilities.

    llava:7b and phi4:14b both returned HTTP 400 on any request that included the tools field. Those models simply do not accept the parameter — the API rejects it before the model even sees the prompt. llava makes sense here: it is a vision model, not a chat/agent model. phi4 is less obvious.
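    Sorting a response into those three buckets takes only a status code and a look at the message object. This is a sketch, assuming the /api/chat response shape where structured calls appear under message.tool_calls; ToolCallCheck and ToolOutcome are hypothetical names.

```csharp
using System;
using System.Text.Json;

public enum ToolOutcome { Proper, TextOnly, Rejected }

public static class ToolCallCheck
{
    public static ToolOutcome Classify(int httpStatus, string body, string expectedFunction)
    {
        // The API refused the tools field before the model saw the prompt.
        if (httpStatus == 400) return ToolOutcome.Rejected;

        using var doc = JsonDocument.Parse(body);
        if (doc.RootElement.TryGetProperty("message", out var message) &&
            message.TryGetProperty("tool_calls", out var calls) &&
            calls.ValueKind == JsonValueKind.Array)
        {
            foreach (var call in calls.EnumerateArray())
            {
                if (call.TryGetProperty("function", out var fn) &&
                    fn.TryGetProperty("name", out var name) &&
                    name.GetString() == expectedFunction)
                    return ToolOutcome.Proper;
            }
        }

        // Prose, or JSON pasted into message.content, both land here.
        return ToolOutcome.TextOnly;
    }
}
```

    Note that a tool call returned as a JSON string inside message.content is classified as TextOnly, which is the qwen2.5-coder case: the intent was right but nothing downstream can deserialize it as a call.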

    Mid-Phase Additions

    While working through these tests I pulled in three more models that had come up in research as strong candidates for coding benchmarks: devstral:latest (24B, Devstral Small, Mistral’s coding-focused release), qwen3-coder:30b (~30B, Qwen’s coding-tuned variant), and gpt-oss:20b (~20B). All three were added before the formal scoring phase started.

    The Baseline Table

    Here is where every model stood after the initial phase — response times are wall-clock from the PowerShell script, rounded to the nearest second:

    Model               Size      Chat   Code   Tool call               Notes
    mistral:latest      7B        3s     1s     proper
    llava:7b            7B        4s     <1s    rejected                Vision model
    gemma4:latest       ~12B      6s     1s     proper
    qwen3:14b           14B       4s     1s     proper                  think=false required
    phi4:14b            14B       5s     1s     rejected
    qwen2.5:14b         14B       6s     1s     proper
    qwen2.5-coder:14b   14B       6s     1s     text (not structured)   “coder” does not mean tool-aware
    gemma4:26b          26B       9s     3s     proper                  Partial CPU offload
    glm-4.7-flash       30B MoE   8s     4s     proper
    qwen3:30b           30B       14s    8s     proper                  Slowest in pool

    The latency numbers tell one story — size matters, mostly predictably. The tool-call column tells another: ten models, three different behaviors from the same input, and two of them would silently fail in any agentic loop that expected structured output.

    What “Works” Actually Means

    The issue with this baseline is that “passes” hides a lot. A model that returns a tool call in the message content instead of the tool_calls field looks fine until your application tries to deserialize the response. A model that works at num_predict=300 might silently truncate at num_predict=100. A model that answers “capital of France” correctly might write Python is_prime() that has an off-by-one error nobody noticed because nobody ran it.

    Everything in this phase was manual inspection. I was reading outputs and deciding they looked reasonable. That is not a test; that is a vibe check.

    The only way to actually know whether a model can write working code is to compile and run the code. Which meant building something more serious.


    Next up: Part 2 covers building the .NET 10 benchmark harness — including a scoring system that actually executes model-generated C# and runs the tests.

    Back from the dead


    Twelve years. My last post here was April 2014, and I closed it by promising “painstaking detail in the coming months” on what my team was building. Then I wrote exactly zero of those posts. Sorry about that.

    A lot has changed — starting with the site itself. When I last hit publish, lostechies.com was running on WordPress. Today it’s a Jekyll static site, hosted on GitHub Pages, and posting means committing a markdown file to lostechies/blog. Which is honestly delightful. No login, no editor, no plugin upgrades. Write, commit, ship.

    In that spirit of bringing old things back to life: I also just revived Should, the assertion library I built way back when. It’s been dragged forward into modern .NET and is usable again. More on that in a follow-up post.

    The bigger thing on my plate, though, is AI. I’ve been heads-down on agent development and agent frameworks — building them, breaking them, figuring out where the seams are. A few recent threads I’ve been pulling on over on LinkedIn: the economics of AI software delivery, adversarial code reviews run by AI, and why companies forget what they already know. That’s most of what I want to write about going forward.

    I’ve also been using those agents on small static-site experiments, including a homeowner-facing New Braunfels AC emergency repair cost guide. It’s a practical way to keep testing the boring parts of software delivery: content generation, deployment, search visibility, analytics, and production monitoring.

    I’m not going to promise a posting cadence. I learned my lesson in 2014. But if you stumbled back here from an old MvcContrib link or a 2012 SignalR post: welcome. The blog isn’t dead. It just needed a git push.

    Working hard and enjoying every minute of it.


    I have not blogged in almost a year; I am a total slacker. But I really want to share what I have been doing and what my team and I have learned, so in the coming months I will be getting into painstaking detail about some concepts and implementations that I think have really helped my team to deliver value.

     

    Where am I?

    About a year ago I left my role as Chief Architect for the largest .NET ecommerce site, www.dell.com. I found that the role had me spending more time teaching the fundamentals to teams and management, when I really wanted to spend my time moving quickly and getting things done. So I left for a startup, QuarterSpot. My role there is CTO, and I am responsible for all of the technology decisions, which is great, because if something is not working, I am accountable and empowered to change it. QuarterSpot is a peer-to-peer financial company that specializes in lending money to small businesses. I feel great about our mission, which is to help small businesses get money when banks will not lend or when the process is so time-consuming that by the time they get approved, the small business loses the opportunity it needed the money for. (QuarterSpot CEO on Small Business Lending Panel at LendIt Conference)

     

    What am I doing?

    My team is responsible for building all of the technology to enable our business. Since the peer to peer space is a newer business model, this means we need to move fast and innovate, which is what the promise of Agile was all about. Since we are in the financial space, quality is of the highest importance, so this is where my experience in extreme programming (XP) practices really pays off. So, mix this together with Continuous Delivery and we have all the components to deliver software at a rapid pace in a business that needs to rely on technology innovations to stay ahead of its competition.

    We are building the websites and backend systems to process and service loans, utilizing machine learning to analyze our customers so we can discover better algorithms to serve the business. We are able to use whatever tools make the most sense for us to move quickly, and it is so much fun to deploy code to production on a frequent basis.

     

    We push code to production frequently, which means I am usually exhausted after a full day of work. This is very rewarding. It also takes a lot of mental energy to stay diligent about quality and make sure each feature is complete.

     

    Topics that I will be covering in upcoming posts

    • What is continuous delivery and how is it different from continuous deployment?
    • The importance of keeping code out of your UI / Web frameworks.
    • Using the Command Query Separation pattern
    • Transparency in your development and production support process, utilizing dashboards
    • Utilizing cloud infrastructure to move quickly.
    • Automate everything.
    • How my preferred development stack has changed since 2009.
    • Importance of a consistent architecture / application implementation.
    • Keeping your architectural concept count low.
    • Optimizing performance when it matters and not before.
    • Machine Learning and statically typed models.

     

    If any of these topics are interesting to you, let me know in the comments and I will get to those posts first.

    using the asp.net lego blocks to create a synchronized Kanban board.


    Over the last 1-2 years the capabilities of the web lego blocks (libraries) have really come together to allow us, the web development community, to start putting together some really interesting applications. The best part is that all of the plumbing code is in the libraries. You can now write a rich user experience without having to write a lot of code. The example app uses ASP.Net MVC, ASP.Net WebAPI, SignalR, KnockoutJS, jQuery, jQuery UI, and Twitter Bootstrap.

    If you are really interested in this project, fork it on github https://github.com/erichexter/SyncKanbanSample

    A Synchronized Kanban board

    A kanban board is pretty simple: it has a collection of vertical swim lanes and items that move from one lane to the next, from left to right. Below is a screenshot of the application I put together in a few hours. The interesting feature is that you can click and drag a post-it note from one column to another, and the move is saved on the server behind the scenes. Then, if two people are looking at the same board, the changes are synchronized to each other's web browser in real time.

    image

    To allow the drag and drop, I used the jQuery UI Sortable interaction.  To enable the multi-browser synchronization I used a combination of KnockoutJS and SignalR.

    Here is an example of the synchronization.

    To view this on YouTube, go here: http://www.youtube.com/watch?v=MXQwhfHzRls&feature=youtu.be

    The Code:

    To create the initial screen we use the following code:

    ASP.Net MVC Action –

    The code in this action retrieves a board, including its collection of lists and tasks, and passes that model to the MVC view.

    image

    Below is the Board Viewmodel

    image

    image

    Here is the MVC view.  A majority of the code is the client side templating. All of the data-binding is the KnockoutJS client side binding syntax.

    image

    The script on the page wires up the knockout bindings, a jQuery Sortable knockout plugin, and the signalR initialization code.

    image

    The code below shows the SignalR server side code, the “Hub”. The two main server side methods are getAllLists, which sends down all the lists and tasks when the board initializes, and movedTask, which is executed when a card is dropped in a column.

    image
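    Since the hub is only shown as a screenshot, here is a minimal sketch of those two operations in plain C#, without any SignalR types. The names BoardStore, BoardList, and CardTask are hypothetical and are not taken from the sample project, and the in-memory list is only a stand-in for whatever storage the real hub calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model types; the sample project may name and shape these differently.
public class CardTask
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class BoardList
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<CardTask> Tasks { get; } = new List<CardTask>();
}

// In-memory stand-in for the persistence the hub methods would call.
public class BoardStore
{
    private readonly List<BoardList> _lists = new List<BoardList>();

    public BoardList AddList(int id, string name)
    {
        var list = new BoardList { Id = id, Name = name };
        _lists.Add(list);
        return list;
    }

    // Counterpart of getAllLists: everything a browser needs to draw the board.
    public IReadOnlyList<BoardList> GetAllLists()
    {
        return _lists;
    }

    // Counterpart of movedTask: take the card out of its current lane and
    // insert it into the target lane at the position where it was dropped.
    public bool MoveTask(int taskId, int targetListId, int position)
    {
        var source = _lists.FirstOrDefault(l => l.Tasks.Any(t => t.Id == taskId));
        var target = _lists.FirstOrDefault(l => l.Id == targetListId);
        if (source == null || target == null) return false;

        var task = source.Tasks.First(t => t.Id == taskId);
        source.Tasks.Remove(task);

        // Clamp the index so a stale position sent by a browser cannot throw.
        var index = Math.Max(0, Math.Min(position, target.Tasks.Count));
        target.Tasks.Insert(index, task);
        return true;
    }
}
```

    In a real hub, a successful MoveTask would be followed by a notification to the other connected browsers so their Knockout view models can apply the same move. That broadcast is left out of this sketch.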

    The last piece of code which ties this together is some more client side code which is the client side viewmodel.

    This is where the client side code wires up the Sortable Drop with the signalR code to call the server side hub.

    image

    Tip to become a successful software engineer.


    This post is a follow up to Derick’s great post. I could not agree with his viewpoint more, and it struck a chord with me.  There is more to it, though. To actually call yourself a software engineer you need to take into account a few aspects of what an engineer should do.

     

    You’re Not Paid To Type

    Typing code into a code editor or text editor is not what a Software Engineer is paid to do.  At least, it is not the primary reason this profession exists.  Yes, part of the job is to write code in any number of languages and platforms. As Derick pointed out, it is more than writing code; it is about writing tests, and making sure the code you do type works as designed and can be easily maintained.

    All that being said, the actual act of typing is simple and quick.  There is training in keyboard typing and methods to increase how many words per minute one can type. So, does typing more code constructs per minute mean you should get paid more money?  If you turn out more code than the engineer sitting next to you, have you created more value?  See where I am going with this?  Typing is easy, and typing the wrong code is really easy.  I have seen organizations that are fearful of missing deadlines and dates. It's so unhealthy that the developers think they need to start writing code NOW, but they don’t really know what they are supposed to be creating. They do know what to create in the general sense, but they rush into writing software without knowing most of the details.

     

    You are paid to THINK, so start doing that

    So, my main point of this post is that Software Engineers are paid to Think.  You are paid to think about what the correct code to create is, and how it should be constructed to lower the total cost of ownership.

    If you only change one thing about the way you work this year try this.

      If you normally get your requirements verbally, try writing them down.

    Write down your requirements or technical plan in the easiest manner possible. That could be on a whiteboard, you could annotate a screenshot of an existing screen, or you could use a pencil and draw the changes on a printout of a screenshot.  Just do something in terms of thinking about what needs to be done before you start typing.  If you do write down what you plan to do, you can actually communicate it to other developers. You can have someone else review it and think through the problem.  You can also show it to the person who will decide if you created the correct software. Imagine getting some feedback on what you want to build before you mess it up!

    The two most valuable ways I have found to write down what needs to be created are Screen Mockups and Sequence Diagrams. Now, I have been in the web space for a long time, so if you are not creating websites or web applications, you may find that there are better ways to write down what you need for your particular design problem.  Either way, try to write it down. If you are writing mockups today, then add a sequence diagram for the more complicated problems and see if it helps.  I know it helps me and the developers I work with.

    ASP.Net Web Config Transform Console Utility released on nuget


    Overview

    ASP.Net Web.config transformations are a great way to manage configuration differences between environments. You can easily change a database connection string or change the compilation model for ASP.Net.  Here is a link to the syntax documentation on MSDN. The problem with web.config transformations is that it has historically been really hard to run them outside of Visual Studio, because the tooling was buried in the IDE.  The ASP.Net team just released a NuGet library that runs the transformations.
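    As an illustration of the syntax, here is a small transform file. It assumes the base Web.config has a connection string named DefaultConnection and a compilation element with debug="true". The connection string name, server, and database are made up for the example.

```xml
<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <connectionStrings>
    <!-- Overwrite the attributes of the entry whose name matches -->
    <add name="DefaultConnection"
         connectionString="Data Source=PRODSQL;Initial Catalog=MyApp;Integrated Security=True"
         xdt:Transform="SetAttributes" xdt:Locator="Match(name)" />
  </connectionStrings>
  <system.web>
    <!-- Turn off debug compilation outside of development -->
    <compilation xdt:Transform="RemoveAttributes(debug)" />
  </system.web>
</configuration>
```

    Applying this file to the base Web.config changes only the matched connection string and removes the debug attribute. Everything else in the base file is carried over unchanged.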

     

    Installation

    Using that library I created a very simple command line tool to transform config files. WebConfigTransformRunner is the package containing this utility.

     

    install-package WebConfigTransformRunner

     

    Usage

    WebConfigTransformationRunner.exe WebConfigFilename TransformFilename OutputFilename
    

    Scenarios

    I see this package being used in two ways.

    First, using this in an automated build as part of a packaging process to pre-transform configuration files for different environments.  This was the main reason I created this library. I am using it in a TeamCity build to transform my web.config file for an ASP.Net MVC application as part of an automated CI build and deploy.

    The second scenario that seems very useful would be to access this package from the install script (install.ps1) of a NuGet package. The configuration transformations that NuGet currently supports are very limited; they work best when you have static configuration nodes that will never change.  If you have a node with an attribute that a user / developer may change, then using a configuration transformation would be more reliable.  Since this tool is delivered as a NuGet package, the command is available in the path of the NuGet console, so a package that needs to run a transformation would just need to take a dependency on this package. It could then run the exe command from its install script on the files it wants to transform. I could see running the main web.config with a transformation that is located in the package's content folder, for example.

     

    Want to help?

    The project is open source and available on github. Please submit issues, ideas or pull requests!

    Are your unit tests still hard to read? – Should Assertion Library


    I created the Should library to fill a gap in the testing ecosystem in the .Net space.  Simply put, I took what I liked about using extension methods to make a more readable set of assertions, but made the library independent of any specific unit test framework. The last point is important, because this library can be used with all unit test frameworks. There were similar tools to this previously, but they all were tied to specific libraries so I could not have a consistent language when I move between test frameworks.

    Cleaner syntax

    First, consider the syntax and the readability of the assertions of a unit test. This is what a null check looks like using Should.

    foo.ShouldBeNull();

    versus the equivalent syntax using MSTest.

    Assert.IsNull(foo);
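    To show why this style does not need any particular test framework, here is a minimal sketch of how such assertions could be written as extension methods. This is not the Should library's actual source: ShouldSketch and AssertionFailedException are hypothetical names used only for illustration. The one thing the sketch relies on is that a failed assertion throws an exception, which is how test runners generally detect a failing test.

```csharp
using System;

// Hypothetical exception type; a runner only needs the test to throw.
public class AssertionFailedException : Exception
{
    public AssertionFailedException(string message) : base(message) { }
}

public static class ShouldSketch
{
    // Reads as foo.ShouldBeNull() at the call site, even when foo is null,
    // because extension methods are static calls under the covers.
    public static void ShouldBeNull(this object actual)
    {
        if (actual != null)
            throw new AssertionFailedException("Expected null but was: " + actual);
    }

    // Reads as total.ShouldEqual(6) at the call site.
    public static void ShouldEqual<T>(this T actual, T expected)
    {
        if (!object.Equals(actual, expected))
            throw new AssertionFailedException(
                "Expected: " + expected + " but was: " + actual);
    }
}
```

    Because nothing here references a test framework assembly, the same assertions read the same way whichever runner executes the test.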

    Install it now

    Should is available on nuget.

    install-package Should

    Learn more about it

    Start by watching this short video

    There are a number of places to learn more about Should.

    There is a second dialect of Should called Should.Fluent.  Learn about it here:

    Using SQL Compact for integration tests with Entity Framework.


    In my practices using continuous integration, I try to achieve 100% code coverage using integration tests. This is a separate metric from my unit tests; think of these tests as verifying that all of my infrastructure code works properly when wired up to data access or other out of process assets like databases or services.  While it is easy to set up SQL Server on a build server, I have run into instances where organizations using shared build servers do not allow access to create and drop databases as part of the CI process.  A simple way to work around this is to use SQL Compact in process in the integration test suite. This also gives you an advantage on developer workstations: it isolates your integration test data access from your development instance if you are running an end to end application on your workstation, which you should be.

    I have run into a number of issues getting SQL CE working in a unit test project (class library). Here are my notes on how to get it working.

    1. Install the following NuGet packages:
      • EntityFramework.SqlServerCompact
      • Microsoft.SqlServer.Compact
      • SqlServerCompact.IntegrationTestConfiguration – this is a package I created to quickly add the provider configuration into an app.config file.
    2. Add the native DLLs to the integration test project and set Copy to Output Directory to Always. See how to do this in VS2012 in the screenshot below.
      image
    3. The last step is to use a test setup method to remove the data in the database.  There are two ways to accomplish this. Both of these methods help ensure your tests are isolated from each other in terms of data setup and do not try to reuse test data from one test to the next.

      First, you can delete the entire SQL Compact database file. The downside is that the tests will run slower, since the database is recreated for each test.  The advantage of this approach is that as you add new entities to your model, you do not have to update this method in order to keep your test suite clean.

      WithDbContext(x =>
      {
          if (x.Database.Exists())
              x.Database.Delete();
          x.Database.CreateIfNotExists();
      });

      The second approach is to run a delete statement for each table in the test setup. This is faster because Entity Framework does not need to recreate the entire file. The downside is maintenance of your tests: every time you add a new entity to the ORM, you need to add a new line to this setup function.

      WithDbContext(x =>
      {
          x.Database.ExecuteSqlCommand("delete from Users");
          x.Database.ExecuteSqlCommand("delete from ShoppingCarts");
          x.Database.ExecuteSqlCommand("delete from Products");
      });
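    The WithDbContext helper used in both snippets is not shown in the post. A minimal sketch of what such a helper might look like follows, assuming it only creates a context, runs the given action, and disposes the context afterwards. TestDb, WithContext, and the createContext factory are hypothetical names, and the sketch is written against IDisposable rather than a concrete Entity Framework context so that it compiles on its own.

```csharp
using System;

public static class TestDb
{
    // createContext is a hypothetical factory, for example () => new MyDbContext().
    // The context is disposed even if the action throws.
    public static void WithContext<TContext>(
        Func<TContext> createContext, Action<TContext> action)
        where TContext : IDisposable
    {
        using (var context = createContext())
        {
            action(context);
        }
    }
}
```

    A test fixture could wrap this in a method that fixes the factory to its own context type, which would give call sites the same shape as the WithDbContext(x => ...) calls above.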
