Custom Tensor Runtime
Systems-Level Deep Learning Research
Project: Cost–Benefit Analysis of CPU vs GPU Computation in Deep Learning
Institution: GLA University, Mathura
Program: B.Tech – Computer Science & Engineering
Academic Year: 2025–2026
1. Overview
Modern deep learning frameworks such as PyTorch and TensorFlow provide powerful abstractions for building neural networks. However, these abstractions often hide the underlying computational behavior of the hardware.
For small-to-medium workloads, overhead introduced by Python runtimes, automatic memory management, and large software stacks can obscure the true performance characteristics of CPU and GPU computation.
This project explores a different approach.
The goal is to build a minimal, transparent tensor runtime written in C++, designed specifically for studying the interaction between neural network computation and hardware execution.
Instead of prioritizing convenience or ecosystem size, the runtime focuses on:
- explicit memory control
- predictable execution behavior
- direct interaction with CPU and GPU computation
By implementing custom CUDA kernels and manual memory management without relying on heavy libraries such as cuDNN, the project provides a white-box environment for studying hardware efficiency in deep learning workloads.
2. Research Motivation
The project investigates a fundamental systems question:
When does GPU acceleration actually become beneficial compared to CPU execution?
While GPUs excel at large-scale parallel computation, data transfer overhead and kernel launch latency can make CPU execution competitive for smaller workloads.
This runtime allows controlled experiments to explore questions such as:
- At what tensor size does cudaMemcpy overhead outweigh GPU parallelism gains?
- Can a dynamic execution scheduler improve performance by running early network layers on the CPU and wider layers on the GPU?
- How does manual memory management compare to managed runtimes in terms of memory usage and latency?
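The first question can be framed with a simple analytic cost model. The sketch below is illustrative only: the CostModel fields, the function names, and the assumption that every call copies both inputs to the device and the result back are hypothetical, and real parameter values would have to be measured on the target machine rather than assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical machine parameters; real values must be measured, not assumed.
struct CostModel {
    double cpu_flops;         // sustained CPU throughput in FLOP/s
    double gpu_flops;         // sustained GPU throughput in FLOP/s
    double copy_bytes_per_s;  // host<->device copy bandwidth in bytes/s
    double fixed_overhead_s;  // per-call kernel launch + copy latency in seconds
};

// Estimated CPU time for an n x n by n x n float matmul (2*n^3 FLOPs).
double cpu_time(const CostModel& m, std::size_t n) {
    assert(m.cpu_flops > 0.0);
    const double dn = static_cast<double>(n);
    return 2.0 * dn * dn * dn / m.cpu_flops;
}

// Estimated GPU time for the same matmul: fixed overhead, plus copying
// A and B to the device and C back (3*n^2 floats), plus the compute itself.
double gpu_time(const CostModel& m, std::size_t n) {
    assert(m.gpu_flops > 0.0 && m.copy_bytes_per_s > 0.0);
    const double dn = static_cast<double>(n);
    const double bytes = 3.0 * dn * dn * sizeof(float);
    return m.fixed_overhead_s + bytes / m.copy_bytes_per_s
         + 2.0 * dn * dn * dn / m.gpu_flops;
}

// Smallest n in [1, max_n] for which the GPU estimate beats the CPU estimate;
// returns 0 if the GPU never wins in that range.
std::size_t crossover_size(const CostModel& m, std::size_t max_n) {
    for (std::size_t n = 1; n <= max_n; ++n) {
        if (gpu_time(m, n) < cpu_time(m, n)) return n;
    }
    return 0;
}
```

Under the same assumptions, a scheduler of the kind raised in the second question could compare gpu_time and cpu_time per layer and dispatch to whichever estimate is lower. The model ignores caching, overlap of copies with compute, and tensors that already reside on the device, so it is a starting point for experiments rather than a prediction.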
The long-term objective is to identify, for a given hardware configuration and workload, the crossover point at which GPU acceleration becomes advantageous for neural network workloads.
3. Engineering Philosophy
The runtime is built on a strict design principle:
Clarity > Flexibility
Unlike production frameworks designed for broad compatibility, this system intentionally restricts certain behaviors in order to maintain architectural transparency.
Key design decisions include:
Opaque Tensor Architecture
Tensor objects are exposed through stable interfaces such as:
- tensor_t
- tensor_graph_t
- tensor_pool_t
The internal memory layout remains hidden, allowing the backend implementation to evolve without breaking the API.
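One common way to hide a memory layout behind a stable interface is the opaque-handle pattern. The sketch below assumes that tensor_t is an incomplete type that callers only touch through pointers. The functions tensor_create, tensor_destroy, and tensor_numel, and the internal fields, are hypothetical names chosen for illustration and are not taken from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ---- Public side: what a header would expose (names are hypothetical) ----
struct tensor;             // incomplete type, layout invisible to callers
using tensor_t = tensor;

tensor_t*   tensor_create(std::size_t rows, std::size_t cols);
void        tensor_destroy(tensor_t* t);
std::size_t tensor_numel(const tensor_t* t);

// ---- Private side: would live in a .cpp file and can change freely ----
struct tensor {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // illustrative storage; a real backend may differ
};

tensor_t* tensor_create(std::size_t rows, std::size_t cols) {
    // Zero-initialised row-major storage of rows * cols floats.
    return new tensor{rows, cols, std::vector<float>(rows * cols, 0.0f)};
}

void tensor_destroy(tensor_t* t) {
    delete t;  // deleting a null pointer is a no-op
}

std::size_t tensor_numel(const tensor_t* t) {
    assert(t != nullptr);
    return t->rows * t->cols;
}
```

Because callers never see the struct definition, the storage could later be swapped for a pooled or device-side allocation without changing any call site.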
Explicit Output Operations
All tensor operations require pre-allocated output buffers.
Example design pattern:
- tensor_matmul(out, a, b)
This avoids hidden allocations and improves memory predictability.
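A CPU-only sketch of the explicit-output pattern follows. Only the call shape tensor_matmul(out, a, b) comes from the design above. The simplified tensor_t struct with visible fields, the tensor_status codes, and the row-major layout are assumptions made so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>

// Simplified, non-opaque tensor view used only for this illustration.
struct tensor_t {
    std::size_t rows;
    std::size_t cols;
    float* data;  // caller-owned, row-major buffer of rows * cols floats
};

enum tensor_status { TENSOR_OK = 0, TENSOR_ERR_SHAPE = 1 };

// Computes out = a * b into a buffer the caller has already allocated.
// The function itself performs no allocation. out must not alias a or b.
tensor_status tensor_matmul(tensor_t* out, const tensor_t* a, const tensor_t* b) {
    assert(out != nullptr && a != nullptr && b != nullptr);
    if (a->cols != b->rows || out->rows != a->rows || out->cols != b->cols) {
        return TENSOR_ERR_SHAPE;  // reject instead of resizing the output
    }
    for (std::size_t i = 0; i < a->rows; ++i) {
        for (std::size_t j = 0; j < b->cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < a->cols; ++k) {
                acc += a->data[i * a->cols + k] * b->data[k * b->cols + j];
            }
            out->data[i * out->cols + j] = acc;
        }
    }
    return TENSOR_OK;
}
```

Rejecting a wrongly shaped output, rather than reallocating it, keeps every byte of memory traffic visible at the call site.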
Strict Device Coherence
Tensor operations enforce device consistency.
An operation faults if its operands reside on different devices; the runtime never moves data between devices implicitly.
This avoids silent CPU↔GPU transfers and ensures predictable performance measurements.
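The device check could be sketched as below. This is a host-only illustration: the device_t tag, the status codes, and the tensor_add operation are hypothetical. Reporting the mismatch as an error code, rather than aborting, is also an assumption about how a fault might surface.

```cpp
#include <cassert>
#include <cstddef>

enum class device_t { cpu, gpu };  // hypothetical device tag

enum tensor_status {
    TENSOR_OK = 0,
    TENSOR_ERR_DEVICE = 1,       // operands live on different devices
    TENSOR_ERR_SHAPE = 2,        // element counts disagree
    TENSOR_ERR_UNSUPPORTED = 3   // no backend for this device in the sketch
};

// Simplified tensor view for illustration; data is owned by the caller.
struct tensor_t {
    device_t device;
    std::size_t size;  // number of float elements
    float* data;
};

// True only when all three tensors carry the same device tag.
bool same_device(const tensor_t* out, const tensor_t* a, const tensor_t* b) {
    return out->device == a->device && a->device == b->device;
}

// Element-wise out = a + b. Mixed-device calls are refused before any
// data pointer is dereferenced; nothing is copied between devices.
tensor_status tensor_add(tensor_t* out, const tensor_t* a, const tensor_t* b) {
    assert(out != nullptr && a != nullptr && b != nullptr);
    if (!same_device(out, a, b)) return TENSOR_ERR_DEVICE;
    if (a->size != b->size || out->size != a->size) return TENSOR_ERR_SHAPE;
    if (a->device != device_t::cpu) return TENSOR_ERR_UNSUPPORTED;  // host-only sketch
    for (std::size_t i = 0; i < a->size; ++i) {
        out->data[i] = a->data[i] + b->data[i];
    }
    return TENSOR_OK;
}
```

With this shape, any CPU↔GPU transfer has to appear as an explicit call in the experiment code, so its cost can be timed separately from the computation.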
4. Technical Environment
The runtime is developed using the following environment:
| Component | Specification |
|---|---|
| Language | C++17 or later |
| GPU Compute | NVIDIA CUDA Toolkit (11.0+) |
| GPU Hardware | NVIDIA GPU (Compute Capability ≥ 6.0) |
| CPU | x86-64 architecture |
| Build System | CMake |
| Operating System | Linux |
The project intentionally avoids heavy deep-learning libraries to maintain full control over memory management and kernel execution.
5. Experimental Validation
To validate correctness and measure performance, the system will be evaluated using the MNIST handwritten digit dataset.
MNIST provides a controlled workload that allows:
- rapid training iteration
- predictable convergence behavior
- minimal disk I/O interference
The dataset can be fully loaded into RAM or VRAM, ensuring that benchmarking results reflect compute performance rather than storage latency.
Initial experiments will train networks up to 10 layers deep to confirm functional correctness and runtime stability.
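For the timing side of these experiments, a small host-side harness along the following lines may be enough. The median_seconds helper and its parameters are hypothetical. It runs a few untimed warm-up iterations and then reports the median of the timed runs, which is less sensitive to outliers than the mean.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Calls fn() `warmup` times untimed, then `reps` times timed, and returns
// the median wall-clock duration of the timed calls in seconds.
template <typename Fn>
double median_seconds(Fn&& fn, int warmup, int reps) {
    assert(warmup >= 0 && reps > 0);
    for (int i = 0; i < warmup; ++i) {
        fn();  // warm caches and exclude one-time setup costs
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // middle sample (upper median if reps is even)
}
```

CUDA kernel launches are asynchronous with respect to the host, so a GPU workload passed to such a harness would need to synchronize (for example with cudaDeviceSynchronize) before fn returns, or be timed with CUDA events instead. Otherwise the measurement would mostly capture launch time.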
6. Project Scope
This runtime is not intended to replace existing frameworks. Instead, it serves as:
- a research platform for studying ML system performance
- a teaching tool for understanding tensor runtimes
- a systems experiment in hardware-aware neural network execution
By reducing abstraction layers, the project aims to provide deeper insight into how modern machine learning workloads interact with real hardware.
7. Future Documentation
Planned documentation sections include:
- Architecture Overview
- Tensor Runtime API
- Device Scheduling Model
- CUDA Kernel Implementation
- Benchmark Results