Soft-CUDA Repository — Full Analysis

Overview

soft-cuda is a C++23 library that aims to be a software-emulated tensor computation engine, inspired by CUDA's execution model. It provides a C-style API with opaque types, a bump-allocator memory pool, and a deferred (lazy) evaluation graph. The goal appears to be a PyTorch/tinygrad-like tensor library that can run on CPU and eventually dispatch to a GPU backend.

Current state: Early prototype. Only scalar multiplication actually executes. The vast majority of declared API functions have no implementation.

Architecture Diagram

graph TD
    subgraph "Public API"
        A["include/soft-cuda/tensor/tensor.h<br>(179 lines)"]
    end

    subgraph "Core Engine"
        B["src/core/tensor/tensor.cpp<br>Tensor creation + evaluate dispatch"]
        C["src/core/pool/pool.cpp<br>Bump allocator arena"]
    end

    subgraph "CPU Backend"
        D["src/backend_cpu/math/mul.cpp<br>Scalar mul ✅ | Matrix mul ❌"]
        E["src/backend_cpu/math/scalar.cpp<br>Float32 value extraction"]
        F["src/backend_cpu/base/debug.cpp<br>Custom assert + printf"]
    end

    subgraph "GPU Backend"
        G["src/backend_gpu/placeholder.md<br>❌ Empty"]
    end

    A --> B
    B --> C
    B --> D
    D --> E
    D --> F

What's Working ✅

Feature	File	Status
Bump allocator pool	pool.cpp	✅ Solid — aligned alloc, create/destroy/reset
Tensor creation (all dtypes)	tensor.cpp	✅ Works for scalars and N-D tensors
Scalar × Tensor multiply (float32 only)	mul.cpp	✅ Functional via lazy eval
Lazy evaluation dispatch	tensor.cpp tensor_evaluate()	✅ Switch on `tensor_op_t`
Custom debug/assert system	debug.cpp	✅ Working

What's Declared but NOT Implemented ❌

These functions are declared but have no working implementation anywhere in the codebase. Most have no definition at all; tensor_mul_op_matrix() exists only as a stub that returns false:

Function	Purpose	Priority
tensor_matmul()	Matrix multiplication	🔴 Critical
tensor_add()	Element-wise addition	🔴 Critical
tensor_transpose()	Matrix transpose	🔴 Critical
tensor_scalar_mul()	Scalar multiply (different API from internal tensor_mul)	🟡 Medium
tensor_relu()	ReLU activation	🟡 Medium
tensor_mse_loss()	Mean squared error loss	🟡 Medium
tensor_cross_entropy_loss()	Cross-entropy with softmax	🟡 Medium
tensor_add_bias()	Bias addition with broadcasting	🟡 Medium
tensor_fill_random_normal()	Random initialization	🟡 Medium
tensor_backward()	Autograd backward pass	🔴 Critical for training
tensor_sgd_template()	SGD optimizer	🔴 Critical for training
move_tensor_device()	CPU↔GPU transfer	🟠 Future
tensor_pool_reset()	Pool reset (declared in header, tensor_pool_zero exists in pool.cpp)	🟢 Easy fix — just alias
tensor_mul_op_matrix()	Matrix multiply kernel	🔴 Returns `false` stub

Bugs & Issues 🐛

1. Stale check.txt: tensor_mul_op_scalar mutates its input there, but not in the current code

In mul.cpp (current code), the scalar multiply writes the result to t->data (the output tensor):

// Current — CORRECT (writes to out)
tensor_mul_op_scalar_float32((float*)t->data, (float*)t->a->data, t->a->nvalues, tensor_float32_value(t->b));

But in check.txt (older snapshot), the scalar multiply mutates the input tensor in-place:

// check.txt — BUG (mutates input)
tensor_mul_op_scalar_float32((float*)t->a->data, t->a->nvalues, tensor_float32_value(t->b));

[!IMPORTANT] The current code has been fixed, but this discrepancy between check.txt and the actual code means check.txt is stale and misleading. Either delete it or regenerate it.

2. tensor_mul_op_scalar ignores all non-float32 dtypes

case tensor_dtype_t::INT32_T:
case tensor_dtype_t::UINT32_T:
case tensor_dtype_t::INT64_T:
case tensor_dtype_t::UINT64_T:
case tensor_dtype_t::FLOAT32_T:
      return tensor_mul_op_scalar_float32(...);  // ALL fall through to float32!

Every dtype gets cast to float* and processed as float32 — this will silently produce garbage results for int32, int64, uint32, uint64 tensors.
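One way to avoid the fallthrough is to give each dtype its own typed kernel and let the switch select the matching instantiation. The sketch below is illustrative only: `sketch_dtype`, `sketch_mul_scalar`, and `sketch_mul_scalar_dispatch` are hypothetical names, not the library's API, and passing the factor as a `double` is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for tensor_dtype_t; the real enum may differ.
enum class sketch_dtype { INT32, UINT32, INT64, UINT64, FLOAT32 };

// Typed kernel: out[i] = in[i] * factor, with the arithmetic done in T.
// Assumption: factor is representable in T (e.g. non-negative for unsigned T).
template <typename T>
bool sketch_mul_scalar(void *out, const void *in, std::size_t n, double factor) {
    assert(out != nullptr && in != nullptr);
    auto *dst = static_cast<T *>(out);
    const auto *src = static_cast<const T *>(in);
    const T f = static_cast<T>(factor);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = static_cast<T>(src[i] * f);
    }
    return true;
}

// Each case picks its own instantiation instead of sharing the float32 path.
bool sketch_mul_scalar_dispatch(sketch_dtype dtype, void *out, const void *in,
                                std::size_t n, double factor) {
    switch (dtype) {
    case sketch_dtype::INT32:
        return sketch_mul_scalar<std::int32_t>(out, in, n, factor);
    case sketch_dtype::UINT32:
        return sketch_mul_scalar<std::uint32_t>(out, in, n, factor);
    case sketch_dtype::INT64:
        return sketch_mul_scalar<std::int64_t>(out, in, n, factor);
    case sketch_dtype::UINT64:
        return sketch_mul_scalar<std::uint64_t>(out, in, n, factor);
    case sketch_dtype::FLOAT32:
        return sketch_mul_scalar<float>(out, in, n, factor);
    }
    return false;  // unknown dtype: report failure rather than guess
}
```

With this shape, an unsupported dtype fails loudly through the `false` return instead of being reinterpreted as float32.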

3. Stride array is never computed

tensor_instance has a stride[TENSOR_MAX_DIMS] field, but it's never initialized in tensor_dtype_create(). There's a // TODO: Implement the stride logic comment. Without strides, you cannot support:

- Non-contiguous views
- Transpose (which is just a stride/shape swap)
- Broadcasting
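As a rough illustration of what the missing stride logic could look like, the sketch below computes row-major strides (in elements) from a shape and uses them to locate an element. The function names are hypothetical, and `std::vector` is used for brevity where the library has a fixed-size `stride[TENSOR_MAX_DIMS]` array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (C-order) strides in elements: the last dimension is contiguous,
// and each earlier stride is the product of all later dimension sizes.
std::vector<std::size_t> sketch_row_major_strides(const std::vector<std::uint32_t> &dims) {
    std::vector<std::size_t> stride(dims.size(), 1);
    for (std::size_t i = dims.size(); i-- > 1;) {
        stride[i - 1] = stride[i] * dims[i];
    }
    return stride;
}

// Flat element offset of a multi-dimensional index under the given strides.
std::size_t sketch_offset(const std::vector<std::size_t> &index,
                          const std::vector<std::size_t> &stride) {
    assert(index.size() == stride.size());
    std::size_t offset = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        offset += index[i] * stride[i];
    }
    return offset;
}
```

For a 2×3 tensor the strides are {3, 1}, so element (1, 2) sits at offset 5. A transposed view swaps both shape and strides to {1, 3}, and element (2, 1) of that view resolves to the same offset 5 without copying any data.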

4. `num_dims` parameter in tensor_create() is ignored

tensor_t *tensor_create(tensor_pool_t *pool, tensor_dtype_t dtype, uint32_t num_dims, uint32_t *dims, float *elems) {
    return tensor_dtype_create(pool, dtype, dims, elems);
    // TODO: Handle num_dims to proccede with stride logic
}

The num_dims parameter is accepted but never used. The actual ndims is inferred from the zero-terminated dims array, making num_dims redundant and confusing.
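If both conventions are kept, one option is to validate them against each other so a mismatch is caught at creation time. The following is a minimal sketch under that assumption; `sketch_count_dims`, `sketch_dims_consistent`, and the `max_dims` cap (in the spirit of TENSOR_MAX_DIMS) are hypothetical and not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Counts the entries before the terminating 0 in a zero-terminated dims array.
// Scans at most max_dims entries, so the array must hold a terminator or at
// least max_dims readable entries.
std::uint32_t sketch_count_dims(const std::uint32_t *dims, std::uint32_t max_dims) {
    assert(dims != nullptr);
    std::uint32_t n = 0;
    while (n < max_dims && dims[n] != 0) {
        ++n;
    }
    return n;
}

// True when the caller-supplied num_dims agrees with the zero-terminated array.
bool sketch_dims_consistent(std::uint32_t num_dims, const std::uint32_t *dims,
                            std::uint32_t max_dims) {
    return dims != nullptr && sketch_count_dims(dims, max_dims) == num_dims;
}
```

The alternative design is to drop one of the two conventions entirely, which removes the possibility of disagreement.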

5. Duplicate declarations in headers

In tensor.h (core):

size_t tensor_dtype_sizeof(tensor_dtype_t dtype);  // line 55
size_t tensor_dtype_sizeof(tensor_dtype_t dtype);  // line 58 — DUPLICATE

tensor_t *tensor_dtype_create(...);  // line 47
tensor_t *tensor_dtype_create(...);  // line 60 — DUPLICATE

bool tensor_evaluate(...);  // line 52
bool tensor_evaluate(...);  // line 62 — DUPLICATE

6. `tensor_graph_t` declared but never defined or used

typedef struct tensor_graph_instance tensor_graph_t;  // public header

No implementation exists anywhere. This is a dangling forward declaration.

7. Missing `#include <cstdint>` in private tensor header

The private tensor.h uses uint8_t, uint32_t etc. but relies on the public header having included <cstdint> first. It works due to internal_header.h include order, but is fragile.

Design Issues & Recommendations 🏗️

1. No op enum entries for most declared operations

tensor_op_t only has NONE, CAST, MUL_SCALAR, MUL_MATRIX. For the lazy evaluation graph to work, you need entries for every operation:

enum class tensor_op_t {
    NONE,
    CAST,
    MUL_SCALAR,
    MUL_MATRIX,
    ADD,           // ← missing
    TRANSPOSE,     // ← missing
    RELU,          // ← missing
    SCALAR_MUL,    // ← missing (public API version)
    ADD_BIAS,      // ← missing
    // ... etc
};

2. Autograd design is incomplete

The tensor_instance struct has grad_compute and grad fields, and the public API declares tensor_backward(), but:

- No gradient functions are implemented for any op
- No computation graph traversal (topological sort) exists
- The a and b pointers form a tree, not a DAG: if a tensor is reused in two operations, the graph breaks
- No mechanism exists to mark tensors as requiring gradients (leaf vs intermediate)

Recommendation: Before implementing backward, redesign the computation graph:

- Use a proper DAG with topological sort
- Add a requires_grad flag
- Implement gradient functions alongside each forward op
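The topological sort mentioned above can be sketched as a depth-first post-order walk with a visited set, which is what lets a tensor reused by two operations appear only once. `sketch_node` is a hypothetical stand-in that keeps only the two operand pointers; it is not the library's tensor_instance.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical minimal graph node: a and b mirror the operand pointers.
struct sketch_node {
    sketch_node *a = nullptr;
    sketch_node *b = nullptr;
};

// Depth-first post-order: operands are appended before the node that uses them.
// The visited set ensures a node shared by several operations is emitted once.
void sketch_visit(sketch_node *n, std::unordered_set<sketch_node *> &seen,
                  std::vector<sketch_node *> &order) {
    if (n == nullptr || !seen.insert(n).second) {
        return;
    }
    sketch_visit(n->a, seen, order);
    sketch_visit(n->b, seen, order);
    order.push_back(n);
}

// Returns the nodes reachable from root in forward (topological) order.
// A backward pass would walk this vector in reverse, accumulating gradients.
std::vector<sketch_node *> sketch_topo_order(sketch_node *root) {
    assert(root != nullptr);
    std::unordered_set<sketch_node *> seen;
    std::vector<sketch_node *> order;
    sketch_visit(root, seen, order);
    return order;
}
```

For a graph where x feeds both y and z, and y also feeds z, the order comes out as x, y, z, with x listed once even though it is reachable by three paths.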

3. Memory model limitations

The bump allocator is great for forward-only inference, but:

- No individual tensor deallocation: you can only reset the entire pool
- Training requires multiple pools: one for weights (persistent), one for activations (reset per iteration)
- The tensor_sgd_template signature takes a pool, suggesting you're aware of this, but there's no documentation of the multi-pool strategy
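To make the multi-pool strategy concrete, here is a small sketch with two independent arenas: one that holds weights for the lifetime of training and one that is reset after every iteration. `sketch_arena` and `sketch_train` are hypothetical and deliberately simpler than the library's tensor_pool_t; the sizes are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump arena, not the library's pool type.
class sketch_arena {
public:
    explicit sketch_arena(std::size_t capacity) : buf_(capacity), used_(0) {}

    // Bumps the offset and returns a pointer, or nullptr when exhausted.
    // Aligns the offset only, assuming the buffer base is suitably aligned.
    void *alloc(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);  // power of two
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > buf_.size() || size > buf_.size() - start) {
            return nullptr;
        }
        used_ = start + size;
        return buf_.data() + start;
    }

    void reset() { used_ = 0; }  // frees everything at once
    std::size_t used() const { return used_; }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t used_;
};

// Usage pattern: weights are allocated once, activations are discarded per step.
bool sketch_train(sketch_arena &weights, sketch_arena &activations, int iterations) {
    if (weights.alloc(64 * sizeof(float)) == nullptr) {
        return false;  // persistent storage did not fit
    }
    for (int i = 0; i < iterations; ++i) {
        if (activations.alloc(128) == nullptr) {
            return false;  // per-iteration scratch did not fit
        }
        activations.reset();  // drop activations only; weights stay intact
    }
    return true;
}
```

Because the activation arena is reset each step, its capacity only needs to cover one iteration, regardless of how many iterations run.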

4. API inconsistency between tensor_mul and other ops

tensor_mul() returns a new tensor (lazy)
tensor_matmul(), tensor_add(), tensor_relu() all take an out parameter (eager-style)
tensor_mse_loss(), tensor_cross_entropy_loss() return a new tensor (lazy-style)

Pick one pattern. The lazy approach (return new tensor) is better for autograd since it naturally builds the computation graph. The out-parameter approach requires the caller to pre-allocate and breaks the graph.

5. No test framework

The only "test" is main.cpp, which hard-codes a single scalar multiply test. You should add:

- A proper test framework (e.g., Google Test, or even simple assert-based test functions)
- Tests for each implemented operation
- Shape validation tests
- Pool exhaustion tests
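As an example of the simple assert-based option, the sketch below tests a stand-in kernel whose argument order follows the scalar multiply call quoted in bug 1 (out, in, count, factor). The kernel and all names here are hypothetical; the real tests would call the library's own functions instead.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in kernel; the library's actual signature may differ.
bool sketch_mul_scalar_float32(float *out, const float *in, std::size_t n, float factor) {
    if (out == nullptr || in == nullptr) {
        return false;
    }
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
    return true;
}

// Results must land in the output buffer.
void test_scalar_mul_writes_output() {
    const float in[3] = {1.0f, 2.0f, 3.0f};
    float out[3] = {0.0f, 0.0f, 0.0f};
    assert(sketch_mul_scalar_float32(out, in, 3, 2.0f));
    assert(out[0] == 2.0f && out[1] == 4.0f && out[2] == 6.0f);
}

// The input buffer must not be mutated (the regression described in bug 1).
void test_scalar_mul_keeps_input() {
    float in[2] = {1.5f, -0.5f};
    float out[2] = {0.0f, 0.0f};
    assert(sketch_mul_scalar_float32(out, in, 2, 2.0f));
    assert(in[0] == 1.5f && in[1] == -0.5f);
}

// Null buffers are rejected instead of dereferenced.
void test_scalar_mul_rejects_null() {
    float out[1] = {0.0f};
    assert(!sketch_mul_scalar_float32(out, nullptr, 1, 2.0f));
}

// Runs every test and returns how many ran; a failed assert aborts the process.
int sketch_run_tests() {
    void (*const tests[])() = {test_scalar_mul_writes_output,
                               test_scalar_mul_keeps_input,
                               test_scalar_mul_rejects_null};
    int ran = 0;
    for (auto *test : tests) {
        test();
        ++ran;
    }
    return ran;
}
```

The same structure extends naturally to shape validation and pool exhaustion cases, and it can be migrated to a full framework later without rewriting the checks.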

6. CMake doesn't include GPU backend path

target_include_directories(soft
  PUBLIC  "${CMAKE_CURRENT_SOURCE_DIR}/include"
  PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}/src/backend_cpu/include"
  PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}/src/core/include"
  PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}/src"
)
# No backend_gpu include path

Summary Scorecard

Area	Score	Notes
Architecture	7/10	Clean separation (public/core/backend), opaque types, good layering
Memory management	8/10	Bump allocator is well-implemented and appropriate
API design	5/10	Inconsistent patterns (return vs out-param), redundant `num_dims`
Implementation completeness	2/10	Only float32 scalar multiply works; most of the declared API is unimplemented
Code quality	6/10	Readable but has duplicate declarations, stale check.txt, missing strides
Autograd readiness	2/10	Fields exist but no graph traversal or gradient functions
Testing	1/10	Single hard-coded test in main

Recommended Next Steps (Priority Order)

1. Implement strides; everything downstream depends on this
2. Implement tensor_add (element-wise) and the tensor_mul_op_matrix kernel, the foundational ops
3. Implement tensor_matmul, which is needed for any neural network
4. Fix the dtype fallthrough bug in tensor_mul_op_scalar
5. Standardize the API to one pattern (lazy return is recommended)
6. Add tensor_op_t entries for all operations
7. Build a proper test suite
8. Design the autograd graph before implementing backward
9. Delete or regenerate check.txt

This documentation was generated with generative AI (Claude Opus 4.6) on 14/03/2026 at 12:39 AM.