
Tensor Runtime (Soft) – Architecture Decisions (Current)

1. Core Philosophy

The library is designed as a research-oriented tensor runtime, prioritizing:

* Performance
* Explicit control
* Predictable execution

It is not a convenience framework.

Key principle: clarity > flexibility

Because of this, some “nice” behaviors (such as implicit device copies) are intentionally disallowed.


2. Opaque Tensor Architecture

The public API exposes opaque types.

Core Concepts:

* tensor_t
* tensor_pool_t
* tensor_graph_t

Users cannot access the internal structure.

Benefits:

* Stable public API
* Internal layout can evolve freely
* Backend implementations can change without breaking user code

Note: This is a common pattern used in systems like the PyTorch tensor abstraction and TensorFlow tensor handles.
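The opaque pattern above can be sketched in C as a forward-declared handle whose layout lives only in the library's source file. The field names and the `tensor_stub_new`/`tensor_device` helpers below are illustrative, not the library's actual API:

```c
#include <stdint.h>
#include <stdlib.h>

/* Public-header side: only a forward declaration is exposed,
 * so user code cannot reach into the struct. */
typedef struct tensor tensor_t;

/* Library side (hypothetical layout, hidden in a .c file).
 * It can change freely without breaking user code. */
struct tensor {
    float    *data;
    uint32_t  dims[8];
    int       device;
};

/* Accessor functions are the only way to create or inspect a tensor. */
static tensor_t *tensor_stub_new(int device) {
    tensor_t *t = calloc(1, sizeof *t);
    if (t) t->device = device;
    return t;
}

static int tensor_device(const tensor_t *t) { return t->device; }
```

Because callers only ever hold a `tensor_t *`, recompiling against a new internal layout requires no source changes on their side.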


3. Memory Model – Tensor Pool

Tensors are allocated from a memory pool.

Concept Flow: tensor_pool → tensor allocations

Initial Implementation: Bump allocator

* Fast allocation
* No fragmentation
* Simple implementation

The free operation is currently a placeholder and not yet implemented.
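A bump allocator of this kind can be sketched as a single offset advancing through a fixed buffer; individual frees are unsupported, matching the placeholder free described above. The `pool_t` name and 16-byte alignment are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump-allocator sketch: allocation just advances an
 * offset; the whole pool is released at once via pool_reset. */
typedef struct {
    uint8_t *base;       /* backing buffer */
    size_t   capacity;   /* total bytes available */
    size_t   offset;     /* bump pointer */
} pool_t;

static void *pool_alloc(pool_t *p, size_t size) {
    size_t aligned = (p->offset + 15) & ~(size_t)15;  /* 16-byte align */
    if (aligned + size > p->capacity)
        return NULL;                                   /* pool exhausted */
    p->offset = aligned + size;
    return p->base + aligned;
}

static void pool_reset(pool_t *p) { p->offset = 0; }   /* frees everything */
```

This is why there is no fragmentation: allocations are strictly sequential, and the only reclamation is resetting the whole pool.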


4. Explicit Output Operations

Operations follow a pre-allocated output pattern.

Example Concept: tensor_op(out, a, b) (Instead of: return new_tensor)

Advantages:

* Avoids heap allocations
* Reuses memory
* Predictable performance

Note: Inspired by HPC libraries such as BLAS and cuBLAS.
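The `tensor_op(out, a, b)` pattern can be illustrated with a minimal element-wise add over raw buffers; `tensor_add_f32` is a hypothetical name, not the library's confirmed API:

```c
#include <stddef.h>

/* Out-parameter sketch: the caller pre-allocates `out`, so the
 * operation itself performs no heap allocation and `out` can be
 * reused across many calls. */
static void tensor_add_f32(float *out, const float *a,
                           const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

This mirrors the BLAS convention (e.g. `gemm` writing into a caller-supplied `C`) rather than returning a freshly allocated result.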


5. Tensor Device Awareness

Each tensor knows the device it resides on (CPU or GPU).

Future Goal: The tensor runtime decides the optimal execution device based on:

* Operation cost
* Tensor size
* Hardware latency

A dynamic scheduling system is planned but currently postponed.


6. Device Mismatch Policy

Strict device coherence rule: device(x) == device(y) == device(out)

If a mismatch occurs, it results in a runtime fault. There is no implicit data transfer.

Reasoning:

* Predictable performance
* Simpler runtime
* Research clarity
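The strict coherence rule can be sketched as a guard every operation runs before touching data. The `device_t` enum and `check_same_device` helper are hypothetical; the sketch returns an error code that the runtime would turn into a fault instead of performing a copy:

```c
typedef enum { DEV_CPU = 0, DEV_GPU = 1 } device_t;

typedef struct {
    device_t device;   /* device the tensor resides on */
    /* ... data, dims, etc. ... */
} tensor_t;

/* Returns 0 when device(out) == device(a) == device(b), nonzero
 * otherwise. No implicit transfer is ever attempted. */
static int check_same_device(const tensor_t *out,
                             const tensor_t *a,
                             const tensor_t *b) {
    return (out->device == a->device &&
            a->device == b->device) ? 0 : -1;
}
```

Keeping the check trivial is the point: every cross-device move must appear explicitly in user code, so the cost is always visible.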


7. Hardware Profiling Runtime

The library includes a small runtime profiler.

First Execution Flow:

1. Run microbenchmarks
2. Measure CPU/GPU latency
3. Generate CONFIG.soft

The CONFIG.soft file stores hardware characteristics, operation performance data, and device selection heuristics.

Runtime Behavior:

* If CONFIG.soft exists → load profile
* Else → run profiler
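The load-or-profile branch can be sketched as below; `runtime_init` is a hypothetical entry point, and the commented-out `load_profile`/`run_profiler` calls stand in for the real microbenchmark machinery:

```c
#include <stdio.h>

/* Check for an existing profile file on disk. */
static int config_exists(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    fclose(f);
    return 1;
}

/* Hypothetical first-run flow: load the stored hardware profile if
 * present, otherwise run the microbenchmarks and write CONFIG.soft. */
static void runtime_init(void) {
    if (config_exists("CONFIG.soft")) {
        /* load_profile("CONFIG.soft"); */
    } else {
        /* run_profiler();  -- measures CPU/GPU latency, then writes CONFIG.soft */
    }
}
```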


8. Offline Autotuning

The profiling system determines optimal execution based on tensor size, operation type, and hardware characteristics.

Example Concept:

* Small ops → CPU
* Large ops → GPU

This information is stored in CONFIG.soft.
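The small-ops-to-CPU / large-ops-to-GPU rule reduces to comparing the tensor size against a measured crossover point. Everything here is illustrative: `pick_device` is a hypothetical helper, and the crossover value would come from CONFIG.soft, not a constant:

```c
#include <stdint.h>

typedef enum { DEV_CPU = 0, DEV_GPU = 1 } device_t;

/* Sketch of the autotuned dispatch rule: below the crossover size the
 * CPU wins (no transfer/launch overhead); above it the GPU wins.
 * `crossover` is loaded from the profiling results in CONFIG.soft. */
static device_t pick_device(uint64_t num_elements, uint64_t crossover) {
    return (num_elements < crossover) ? DEV_CPU : DEV_GPU;
}
```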


9. Future Hardware Profile Database

Possible Future Improvement: Embed known hardware profiles.

Concept: device signature → configuration (Example: GPU model + CPU model → scheduling config)

This allows the system to skip profiling on known hardware.
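A signature-to-configuration table could be as simple as a linear scan over embedded entries, falling back to profiling on a miss. The `hw_profile_t` layout, the example signature string, and the crossover field are all assumptions for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical embedded profile database: device signature → config. */
typedef struct {
    const char *signature;      /* e.g. "GPU model + CPU model" */
    unsigned    gpu_crossover;  /* element count where GPU becomes faster */
} hw_profile_t;

static const hw_profile_t known_profiles[] = {
    { "example-gpu+example-cpu", 4096 },   /* illustrative entry */
};

/* Returns the matching profile, or NULL to signal "unknown hardware,
 * run the profiler instead". */
static const hw_profile_t *lookup_profile(const char *sig) {
    size_t n = sizeof known_profiles / sizeof *known_profiles;
    for (size_t i = 0; i < n; ++i)
        if (strcmp(known_profiles[i].signature, sig) == 0)
            return &known_profiles[i];
    return NULL;
}
```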


10. Tensor Shape Representation

Current Design: uint32_t* dims

The shape is stored as a sentinel-terminated array.

* Example: [3, 4, 5, 0]
* Max dimensions defined as: TENSOR_MAX_DIMS = 8
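Because a real dimension can never be 0, the 0 entry doubles as the terminator. A sketch of counting dimensions under this scheme (the `tensor_ndims` name is hypothetical):

```c
#include <stdint.h>

#define TENSOR_MAX_DIMS 8

/* Count dimensions in a 0-terminated shape array. When all
 * TENSOR_MAX_DIMS slots are used, no sentinel is needed: the
 * bound on the loop terminates the scan instead. */
static int tensor_ndims(const uint32_t *dims) {
    int n = 0;
    while (n < TENSOR_MAX_DIMS && dims[n] != 0)
        ++n;
    return n;
}
```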


11. Autograd System (Planned)

The autograd system is not yet stabilized.

Planned Design:

* Tensor operations build a computation graph.
* Loss tensor triggers the backward pass.

Backward Entry Point: tensor_backward(loss_tensor)

Graph Structure Representation: tensor_graph_t
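Since the autograd design is not yet stabilized, the following is only a minimal sketch of the planned idea: each graph node remembers its inputs and a backward function, and `tensor_backward` would walk the structure from the loss node. Every name here (`graph_node_t`, `add_backward`, the scalar `grad` field) is illustrative, not the final API:

```c
#include <stddef.h>

/* Hypothetical graph node: parents plus a backward callback that
 * propagates this node's gradient into its inputs. */
typedef struct graph_node graph_node_t;
struct graph_node {
    graph_node_t *inputs[2];              /* parents in the graph */
    void (*backward)(graph_node_t *self); /* accumulates input grads */
    float grad;                           /* scalar placeholder */
};

/* For addition, the gradient flows through unchanged to both inputs. */
static void add_backward(graph_node_t *self) {
    for (int i = 0; i < 2; ++i)
        if (self->inputs[i])
            self->inputs[i]->grad += self->grad;
}
```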


12. Optimizer Interface

Initial Optimizer Template: SGD

Example Concept: tensor_sgd_template(parameters, learning_rate)

Future optimizers are expected to seamlessly plug into this same structure.
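The SGD template boils down to the in-place update p ← p − lr·g over each parameter buffer. The `sgd_step` name and flat-buffer signature are assumptions sketching what `tensor_sgd_template` might do internally:

```c
#include <stddef.h>

/* Hypothetical SGD step: update parameters in place, reusing the
 * caller's buffers (consistent with the pre-allocated-output rule). */
static void sgd_step(float *params, const float *grads,
                     size_t n, float lr) {
    for (size_t i = 0; i < n; ++i)
        params[i] -= lr * grads[i];
}
```

Momentum or Adam variants would keep the same shape, adding their own state buffers alongside `params` and `grads`.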


13. Activation and Loss Operations

Initial built-in operations form a minimal core set for neural network experimentation:

* Matrix multiplication
* Transpose
* Addition
* Scalar multiplication
* ReLU activation
* MSE loss
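Two of the listed operations can be sketched over raw buffers; `relu_f32` and `mse_f32` are hypothetical names following the same out-parameter style as the rest of the runtime:

```c
#include <stddef.h>

/* Element-wise ReLU into a pre-allocated output buffer. */
static void relu_f32(float *out, const float *x, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = x[i] > 0.0f ? x[i] : 0.0f;
}

/* Mean squared error: mean of (pred - target)^2. */
static float mse_f32(const float *pred, const float *target, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = pred[i] - target[i];
        sum += d * d;
    }
    return sum / (float)n;
}
```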


14. Project Organization Decisions

Team responsibilities have been clearly separated for this project.

Subsystem Ownership:

  • Mathematical Engine: Aakarsh + Zoya

  • Neural/Data Pipeline: Vishal

  • QA / Testing / CI: Anmol

  • CPython API Integration: Aadya (pending)

Frontend and documentation operations are postponed until the backend architecture stabilizes.


15. Feature Deferral

The dynamic compute switching system has been postponed.

Reason: To avoid premature optimization and maintain focus on stabilizing the core tensor engine first.


Overall Architecture Vision

The system aims to become: Small tensor runtime + Explicit memory control + Hardware-aware execution + Basic autograd

Mission Objective: A minimal deep-learning execution engine, designed strictly for experimentation rather than end-user convenience.