Microarchitectural Design & Vector Parallelism

By building a functional single-cycle RISC-V CPU and then working directly at the instruction level, I target the inefficiencies that a standard toolchain leaves behind and tune code to the hardware it actually runs on, a core competency for ML infrastructure engineering.

Core Artifact: Single-Cycle RISC-V CPU Architecture

Before optimizing code at the assembly layer, I built my foundation from the gates up by engineering a functional single-cycle RISC-V processor core. This involved designing the datapath, implementing the arithmetic logic unit (ALU), mapping out the control unit, and structuring the memory interfaces needed to execute the base integer instruction set. Because every instruction in a single-cycle design completes in one clock cycle, the slowest path through the logic sets the clock period. Building this processor gave me direct visibility into clock cycles, propagation delays, and the exact physical pathways that software instructions must travel through silicon.
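As a rough illustration of the ALU stage described above, here is a minimal behavioral model in C. It is a sketch only: the real core is a hardware design, and the names alu_op_t and alu_execute, along with the operation encoding, are hypothetical and chosen for this example. The operations mirror the RV32I register-register arithmetic instructions, assuming a 32-bit core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ALU operation codes; the encoding is illustrative only. */
typedef enum {
    ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR,
    ALU_SLL, ALU_SRL, ALU_SRA, ALU_SLT, ALU_SLTU
} alu_op_t;

/* Behavioral model of a 32-bit ALU: combinational, one result per call. */
uint32_t alu_execute(alu_op_t op, uint32_t a, uint32_t b)
{
    /* RV32I shifts use only the low 5 bits of the second operand. */
    uint32_t shamt = b & 0x1Fu;

    switch (op) {
    case ALU_ADD:  return a + b;            /* wraps modulo 2^32 */
    case ALU_SUB:  return a - b;
    case ALU_AND:  return a & b;
    case ALU_OR:   return a | b;
    case ALU_XOR:  return a ^ b;
    case ALU_SLL:  return a << shamt;
    case ALU_SRL:  return a >> shamt;       /* logical: zero fill */
    case ALU_SRA: {
        /* Arithmetic shift: replicate the sign bit into the vacated bits. */
        uint32_t result = a >> shamt;
        if ((a & 0x80000000u) != 0u && shamt != 0u) {
            result |= ~(0xFFFFFFFFu >> shamt);
        }
        return result;
    }
    case ALU_SLT:  return (int32_t)a < (int32_t)b ? 1u : 0u;  /* signed */
    case ALU_SLTU: return a < b ? 1u : 0u;                    /* unsigned */
    }
    assert(0 && "unknown ALU operation");
    return 0u;
}
```

In hardware, the control unit would derive the operation select from the instruction's opcode and function fields. Here the caller passes it directly, which keeps the model focused on the arithmetic itself.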

The Diagnostic Case: Shifting to Vector Extensions

Building a single-cycle CPU from scratch gave me a brutal look at how instructions actually move through a datapath. As soon as I finished wiring up the base core, the bottleneck became obvious: a scalar, one-instruction-at-a-time architecture is fundamentally ill-equipped for massive, real-time spatial workloads like 3D point clouds and LIDAR data streams. If I tried to process millions of streaming spatial coordinates using high-level code and standard compiler auto-vectorization, throughput would collapse. Compilers often cannot prove that data is aligned, contiguous, and free of aliasing, so they fall back to bloated scalar loops that process elements one by one and leave the hardware's vector lanes idle. For me, this showed that generic software abstractions can blind you to microarchitectural waste. To build things that actually scale, I had to drop the standard toolchain safety nets and pivot to a hardware-aware design using explicit RISC-V Vector (RVV) intrinsics.

The Abstraction Penalty

When a build relies on generic compiler auto-vectorization, the resulting binaries suffer from scalar bloat because the compiler cannot guarantee that the 3D coordinates are contiguous in memory or free of pointer aliasing. For peer researchers, this means the hardware’s vector registers remain unused, with elements processed one by one rather than in parallel chunks. For an infrastructure manager, this performance penalty translates directly to inflated compute times, wasted hardware expenditure, and unacceptable latency spikes in a deployment pipeline.
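The two obstacles named above, possible aliasing and non-contiguous layout, can be sketched in plain C. The function and type names below are hypothetical and not taken from the project. How a given compiler treats each loop depends on its version and flags: some emit runtime overlap checks, others fall back to scalar code.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap,
 * so it cannot freely reorder or batch the loads and stores. */
void scale_may_alias(float *dst, const float *src, float s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * s;
    }
}

/* restrict is a promise from the programmer that the arrays do not overlap,
 * which removes the aliasing obstacle to vectorization. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float s, size_t n)
{
    assert(dst != src);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * s;
    }
}

/* Array-of-structures layout: the x values sit three floats apart,
 * so a loop over x alone touches memory with a stride, not contiguously. */
typedef struct { float x, y, z; } point_aos_t;

void translate_x_aos(point_aos_t *pts, size_t n, float dx)
{
    for (size_t i = 0; i < n; ++i) {
        pts[i].x += dx;
    }
}
```

Storing each coordinate in its own contiguous array (structure of arrays) would turn the strided access in translate_x_aos into the unit-stride pattern that vector loads handle best.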

The Hardware-Aware Redesign

To resolve this bottleneck, I am rewriting the spatial processing core using explicit RISC-V Vector (RVV) intrinsics to take manual control over the vector registers. By issuing the vsetvli instruction (set vector length, with the vector type encoded as an immediate) at the top of each loop iteration, I configure the element width and request a vector length, and the hardware returns the number of elements it can actually process rather than leaving me to guess. Laying out the coordinate arrays contiguously then lets them stream straight into the vector lanes, maximizing the work performed per clock cycle.
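A minimal sketch of the strip-mined loop this approach implies is shown below. It is not the project's actual code. The function name scale_coords is hypothetical, and the vector path assumes a toolchain that provides the `__riscv_`-prefixed RVV intrinsics in riscv_vector.h. On other targets, a scalar fallback imitates the same loop structure with a made-up maximum vector length of 4 so the example still compiles and runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Multiply n contiguous 32-bit floats by s, one hardware-sized chunk at a time. */
void scale_coords(float *restrict dst, const float *restrict src,
                  float s, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__riscv_v_intrinsic)
    while (n > 0) {
        /* Compiles to a vsetvli: request n elements of 32 bits (LMUL = 1);
         * the hardware returns how many it will process this iteration. */
        size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t v = __riscv_vle32_v_f32m1(src, vl);  /* unit-stride load */
        v = __riscv_vfmul_vf_f32m1(v, s, vl);             /* vector * scalar */
        __riscv_vse32_v_f32m1(dst, v, vl);                /* unit-stride store */
        src += vl;
        dst += vl;
        n -= vl;
    }
#else
    /* Portable stand-in: vlmax is an assumed value, not a real hardware limit. */
    const size_t vlmax = 4;
    while (n > 0) {
        size_t vl = n < vlmax ? n : vlmax;  /* mimics the value vsetvli returns */
        for (size_t i = 0; i < vl; ++i) {
            dst[i] = src[i] * s;
        }
        src += vl;
        dst += vl;
        n -= vl;
    }
#endif
}
```

Because the loop asks the hardware for the vector length on every iteration, the same binary handles any array size, including a final partial chunk, and adapts to implementations with different vector register widths.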

Operationalizing Efficiency

This engineering roadmap rests on one principle: in high-performance computing, every layer of unnecessary abstraction costs system efficiency. By operationalizing this principle, I demonstrate a core competency essential for ML infrastructure and research engineering: the ability to profile an inefficient system, diagnose microarchitectural waste, and re-engineer the code at the instruction level. True optimization requires removing the guesswork from the toolchain and enforcing structural efficiency by design.

Generative AI Integration

I integrated Claude and Gemini into my hardware development loop to reduce the manual overhead of documentation, learning assembly, and code formatting. I used them in three specific areas:

Learning Assembly and ISA Concepts: When mapping out low-level execution logic, I used AI as an interactive reference tool to quickly break down base instruction sets, trace propagation patterns, and understand the raw mechanics of hardware registers.
Vector Extension Research: To prepare for migrating my single-cycle CPU design into advanced parallel execution, I used AI to parse dense architectural specifications, helping me learn the syntax, constraints, and structural layout of RISC-V Vector instructions like vsetvli.
Documentation and Code Formatting: I used generative tools to organize my hardware design notes, clean up formatting and layout, and structure my code documentation so that the implementation details remain readable and easy to navigate.

Artifacts & Reference Material

View on GitHub
