Ren internals overview
Ren is a lazy polynomial execution engine. User code creates Poly objects, but most operations do not run immediately. They build an IR graph of Node objects. Work starts when code asks for concrete data, usually by calling realize(), tolist(), tobytes(), or by passing values through ren.jit.
The short version:
Poly API
-> Node DAG
-> rewrite and scheduling passes
-> materialized buffers and scheduled kernels
-> memory planning
-> backend programs
-> execution or JIT replay
This is close in spirit to tinygrad: keep a compact IR, let user-facing objects stay lazy, and force the compiler pipeline only at realization time. Ren's difference is that the core object is a polynomial over an RNS ring, so the compiler has to track ring metadata, coefficient vs NTT domain, RNS basis changes, rescale semantics, and backend-specific polynomial kernels.
Ren's core polynomial object lives in a quotient ring such as \(R_q = \mathbb{Z}_q[X] / (X^N + 1)\). In CKKS-like paths, data is usually carried in an RNS basis \(q = \prod_i q_i\), and many fast products happen after moving from coefficient form into NTT form.
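To make the ring arithmetic concrete, here is a minimal pure-Python sketch of multiplication in \(\mathbb{Z}_q[X] / (X^N + 1)\) and of the RNS idea that a product modulo \(q\) can be computed limb by limb modulo each \(q_i\). It does not use Ren. The helper names negacyclic_mul and crt_combine are hypothetical, and the schoolbook product stands in for the NTT-based product a real backend would use.

```python
from math import prod

def negacyclic_mul(a, b, q):
    """Schoolbook product of two length-N polynomials modulo (X^N + 1) and q."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                # X^N = -1 in this ring, so wrapped terms flip sign.
                out[k - n] = (out[k - n] - ai * bj) % q
    return out

def crt_combine(residues, moduli):
    """Recombine one coefficient from its residues modulo pairwise-coprime moduli."""
    q = prod(moduli)
    total = 0
    for r, qi in zip(residues, moduli):
        q_hat = q // qi
        total += r * q_hat * pow(q_hat, -1, qi)
    return total % q

moduli = (17, 97)  # toy basis; both primes are 1 mod 2N for N = 8
q = prod(moduli)
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]

# Product computed directly modulo q ...
direct = negacyclic_mul(a, b, q)

# ... and computed independently per RNS limb, then recombined.
limbs = [
    negacyclic_mul([c % qi for c in a], [c % qi for c in b], qi)
    for qi in moduli
]
recombined = [crt_combine(rs, moduli) for rs in zip(*limbs)]
```

Both routes give the same coefficients, which is why per-limb kernels are enough for ring products.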
Recommended reading order
Start here for the compiler/runtime shape. Then read CKKS for the user-facing encrypted-number API, and JIT for capture and replay. The rest of this page is the internal pipeline in the order Ren executes it: IR creation, realization, scheduling, memory planning, backend lowering, and JIT.
As the docs grow, this page should split into smaller internals pages:
- Concepts: polynomial IR, rings, domains, and CKKS data types
- Pipeline: creation, scheduling, realization, memory planning, and backend lowering
- Guides: CKKS examples, JIT examples, and common failure modes
- Reference: API and source-link pages
flowchart TB
A["Poly API"]
B["Node DAG"]
C["Compile at realization"]
D["Plan memory"]
E["Lower to backend"]
F["Execute now"]
G["JIT replay"]
A --> B --> C --> D --> E
E --> F
E --> G
The IR
The core IR type is Node. A node has:
- an Op, such as ADD, MUL, NTT, RESCALE, BUFFER, or STORE
- a dtype
- zero or more source nodes in src
- optional operation metadata in arg
Nodes form a DAG, not a tree. If the same polynomial expression is reused, downstream nodes can point at the same producer node. Ren also caches structurally identical nodes, so repeated construction of the same IR shape can share identity.
Ring semantics live on metadata nodes. AS_RING attaches a RingSpec and a domain to storage or computation:
- Domain.COEFF means coefficient representation.
- Domain.NTT means NTT representation.
- RingSpec carries the degree, modulus basis, and coefficient dtype.
That separation matters. A buffer can hold limbs for a ring, while the graph still decides whether an operation needs the coefficient domain, the NTT domain, or a view into a smaller modulus basis.
Polynomial creation
Poly is the front-end wrapper around a Node.
Creating a polynomial from coefficients encodes the data into Montgomery RNS limbs, allocates a Python-side buffer node, and tags it as coefficient-domain ring data:
coefficients
-> encoded RNS limb array
-> BUFFER(device="python")
-> AS_RING(..., domain=COEFF)
-> Poly
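The encoding step can be sketched in plain Python. This is a toy model under stated assumptions: the radix R = 2^32, the limb-per-modulus layout, and the helper names are all hypothetical, and the modular inverse stands in for the REDC routine a real Montgomery implementation would use.

```python
R = 1 << 32  # assumed Montgomery radix; must be coprime to every modulus

def to_montgomery_rns(coeffs, moduli):
    """One limb per modulus; each entry stores (c * R) mod q."""
    return [[(c % q) * R % q for c in coeffs] for q in moduli]

def mont_mul(x, y, q):
    """Multiply two Montgomery-form residues; the result stays in Montgomery form."""
    # (xR)(yR)R^-1 = (xy)R. Real code would use REDC instead of pow().
    return x * y * pow(R, -1, q) % q

def from_montgomery(limb, q):
    """Strip the R factor to recover plain residues modulo q."""
    r_inv = pow(R, -1, q)
    return [x * r_inv % q for x in limb]

coeffs = [1, 2, 3, 4, 5, 6, 7, -1]
moduli = (17, 97)
limbs = to_montgomery_rns(coeffs, moduli)
```

Negative coefficients are reduced into the range [0, q) before scaling, so -1 becomes q - 1 in each limb.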
Random constructors like uniform_on_interval(), ternary(), and sparse_ternary() also produce RNS-backed coefficient buffers. Constant constructors are different: Poly.const() creates a symbolic CONST node and attaches ring metadata. Constants stay allocation-free until a later pass needs to lower them into coefficient or NTT form.
Operations on Poly generally create more nodes:
- a + b, a - b, and a * b create ALU nodes.
- a.ntt() and a.intt() create domain conversion nodes.
- a.to_ring(ring) creates a TO_RING node, which may become either a real conversion or a buffer view.
- a.rescale(moduli) creates a RESCALE node.
- a.rotate(k) creates a high-level ROTATE node that is lowered later.
- a.to(device) creates a COPY node.
No backend kernel is launched during these operations.
Realization
Realization is the point where lazy polynomial IR becomes buffers with concrete contents.
When Poly.realize() runs, Ren gathers the requested output nodes into a single SINK node.
The single sink gives the scheduler one root that covers every output requested together. This is important for shared work: if two output polynomials use the same intermediate, scheduling can see that shared dependency in one graph.
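A small sketch of why one root helps, using dictionaries as stand-in nodes. The node and toposort helpers are hypothetical and are not Ren's scheduler. The point is only that a traversal from one SINK reaches a shared intermediate once, no matter how many outputs use it.

```python
def node(op, *src):
    """Toy graph node: an op name plus its source nodes."""
    return {"op": op, "src": src}

def toposort(root):
    """Post-order walk from the root; each node is emitted exactly once."""
    order, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for s in n["src"]:
            visit(s)
        order.append(n)

    visit(root)
    return order

a, b = node("BUFFER"), node("BUFFER")
shared = node("MUL", a, b)     # intermediate used by both outputs
out0 = node("ADD", shared, a)
out1 = node("SUB", shared, b)
sink = node("SINK", out0, out1)

ops = [n["op"] for n in toposort(sink)]
```

Scheduling out0 and out1 from two separate roots would see the MUL twice. From the sink it appears once, before both consumers.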
The realization path runs through Poly.realize(), Poly._schedule_and_lower(), schedule(), plan_memory(), make_exec_units(), and run_exec_units():
Poly.realize()
-> Poly._schedule_and_lower()
-> schedule(SINK(...))
-> plan_memory(...)
-> make_exec_units(...)
-> run_exec_units(...)
After scheduling, Ren records a becomes_map from old lazy outputs to their new buffer-backed outputs. Existing Poly objects whose graphs overlap the realized computation are updated to point at their realized BUFFER or CONST nodes.
Scheduling
Scheduling is the main compiler pass. It turns a high-level polynomial DAG into a list of ScheduledKernel objects.
The pass sequence is roughly:
1. Simplify algebra, ring rules, constants, and copies
2. Lower high-level ring ops such as ROTATE
3. Inject NTT or INTT nodes for domain requirements
4. Simplify domain conversions
5. Decide materialization boundaries
6. Lower BUFFERIZE markers to BUFFER + STORE + AFTER
7. Build the becomes_map for realized outputs
8. Lower automorphisms and ring-specific reduction rules
9. Lower ring constants into explicit coefficient or NTT expressions
10. Turn quotient views into BUFFER_VIEW where possible
11. Indexify elementwise computation with GIDX, INDEX, LOAD, and STORE
12. Kernelize top-level stores into scheduled kernel roots
A few parts are worth calling out.
Domain injection
The scheduler does not require callers to manually put every polynomial in the right representation. It inserts conversions based on operation requirements:
- MUL and ROTATE require NTT inputs.
- RESCALE requires coefficient-domain input.
- Some TO_RING conversions require coefficient-domain input.
- Mixed-domain inputs are pushed toward NTT when possible.
After insertion, domain simplification removes round trips such as NTT(INTT(x)).
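The two steps can be sketched on a toy tuple-based expression tree. Everything here is illustrative: the tuple encoding, the REQUIRED table, and the assumption that MUL and ROTATE produce NTT output while RESCALE produces coefficient output are simplifications for this example, not Ren's rewrite rules.

```python
# Nodes are ("LEAF", name, domain) or (op, *sources).
REQUIRED = {"MUL": "NTT", "ROTATE": "NTT", "RESCALE": "COEFF"}
INVERSE = {"NTT": "INTT", "INTT": "NTT"}

def domain_of(node):
    op = node[0]
    if op == "LEAF":
        return node[2]
    if op == "NTT":
        return "NTT"
    if op == "INTT":
        return "COEFF"
    if op in REQUIRED:
        return REQUIRED[op]      # toy assumption: output domain equals input domain
    return domain_of(node[1])    # elementwise ops follow their (converted) inputs

def convert(node, want):
    if domain_of(node) == want:
        return node
    return ("NTT", node) if want == "NTT" else ("INTT", node)

def inject(node):
    """Wrap sources in NTT or INTT so every op sees the domain it requires."""
    op = node[0]
    if op == "LEAF":
        return node
    srcs = [inject(s) for s in node[1:]]
    if op in INVERSE:
        return (op, *srcs)
    want = REQUIRED.get(op)
    if want is None:
        # Mixed-domain inputs are pushed toward NTT.
        want = "NTT" if any(domain_of(s) == "NTT" for s in srcs) else "COEFF"
    return (op, *[convert(s, want) for s in srcs])

def simplify(node):
    """Remove NTT(INTT(x)) and INTT(NTT(x)) round trips."""
    op = node[0]
    if op == "LEAF":
        return node
    srcs = [simplify(s) for s in node[1:]]
    if op in INVERSE and srcs[0][0] == INVERSE[op]:
        return srcs[0][1]
    return (op, *srcs)

x = ("LEAF", "x", "COEFF")
y = ("LEAF", "y", "COEFF")
a_ntt = ("LEAF", "a", "NTT")
b_ntt = ("LEAF", "b", "NTT")

injected = inject(("RESCALE", ("MUL", x, y)))
round_trip = simplify(inject(("MUL", ("INTT", a_ntt), b_ntt)))
```

In the second expression the caller converted a to coefficient form by hand. Injection wraps it back into NTT for the MUL, and simplification then cancels the pair.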
Materialization
Not every expression can remain fused. Ren marks materialization boundaries for outputs, conversion operations, copies, contiguous requests, and stores. A boundary is represented first as BUFFERIZE(x), then lowered into BUFFER + STORE + AFTER.
AFTER keeps the dependency ordering explicit without changing the value that flows through the graph.
Views
Some TO_RING operations do not need a new computation. If the destination moduli are a contiguous subset of the source basis, the conversion can become a BUFFER_VIEW with an element offset. If a view feeds an elementwise compute chain, the scheduler can also fuse the view into indexed loads instead of creating a standalone view kernel.
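The contiguity check can be sketched as follows. This assumes a limb-major layout in which limb i of a degree-n ring occupies elements i*n through (i+1)*n - 1 of the backing buffer. That layout and the view_offset helper are assumptions for illustration, not confirmed details of Ren.

```python
def view_offset(src_moduli, dst_moduli, n):
    """Element offset of dst inside src, or None when a real conversion is needed."""
    k = len(dst_moduli)
    if k == 0 or k > len(src_moduli):
        return None
    for start in range(len(src_moduli) - k + 1):
        if tuple(src_moduli[start:start + k]) == tuple(dst_moduli):
            return start * n  # skip `start` whole limbs of n coefficients each
    return None

src = (17, 97, 113)
n = 8
buffer = list(range(len(src) * n))  # stand-in for 3 limbs of 8 coefficients

off = view_offset(src, (97, 113), n)
# Fusing the view into a load means reading buffer[off + i] directly.
first_viewed = buffer[off]
```

A destination such as (17, 113) is a subset but not a contiguous one, so it cannot be expressed as a single offset and falls back to a real conversion.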
Kernelization
After indexing, top-level STORE nodes become KERNEL nodes. Concrete buffers are replaced by abstract SLOT parameters in the kernel body. The scheduled item keeps:
- a lowered kernel AST, usually SINK(STORE(...))
- the buffers passed in slot order
- KernelInfo, including device, ring, and domain metadata
Copy and buffer-view items are represented as lightweight scheduled items instead of normal compute kernels.
Memory planning
Before lowering scheduled kernels to executable units, Ren plans intermediate memory.
The memory planner computes first and last use for buffers in the scheduled kernel list. For optimizable intermediate buffers, it simulates a TLSF allocator and packs them into one arena per device. The output is a remapped schedule where many short-lived buffers become views into shared backing storage.
Buffers are excluded from this optimization when identity matters, such as final outputs, already allocated buffers, copy buffers, and explicit no-opt buffers.
The key idea is that allocation decisions are made once from the schedule. Runtime execution can then reuse offsets into an arena instead of allocating and freeing each temporary independently.
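A minimal sketch of the planning idea, with one deliberate substitution: it uses a simple first-fit placement instead of a TLSF simulation, because first-fit is enough to show lifetime-based reuse. The schedule format (a list of buffer-name lists) and the helper names are hypothetical.

```python
def liveness(schedule):
    """Map each buffer name to the first and last kernel index that touches it."""
    first, last = {}, {}
    for step, bufs in enumerate(schedule):
        for name in bufs:
            first.setdefault(name, step)
            last[name] = step
    return first, last

def plan_arena(schedule, sizes):
    """Assign arena offsets to the buffers in `sizes` using first-fit (not TLSF)."""
    first, last = liveness(schedule)
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    for name in sorted(sizes, key=lambda b: first[b]):
        size = sizes[name]
        # Only ranges whose lifetime overlaps this buffer are off limits.
        busy = sorted(
            (off, sz) for off, sz, f, l in placed
            if not (l < first[name] or f > last[name])
        )
        offset = 0
        for off, sz in busy:
            if offset + size <= off:
                break                     # the gap before this range is big enough
            offset = max(offset, off + sz)
        offsets[name] = offset
        placed.append((offset, size, first[name], last[name]))
    arena = max((offsets[b] + sizes[b] for b in offsets), default=0)
    return offsets, arena

# "a" and "out" are excluded from `sizes`: inputs and final outputs keep their identity.
schedule = [["a", "t0"], ["t0", "t1"], ["t1", "t2"], ["t2", "out"]]
sizes = {"t0": 64, "t1": 64, "t2": 64}
offsets, arena = plan_arena(schedule, sizes)
```

Here t2 reuses the space of t0 because t0 is dead by the time t2 is first written, so three 64-element temporaries fit in a 128-element arena.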
Backend lowering and execution
make_exec_units() lowers each scheduled item to an ExecUnit.
Compute kernels go through the selected backend:
- the Python backend interprets the lowered kernel AST with NumPy and Montgomery helpers
- the CUDA backend renders kernel code, compiles it, builds launch specs, and runs it on CUDA queues
Copies lower to either device transfer or host-mediated copy depending on allocator support. Buffer views lower to validation-only exec units that bind a view to its base.
Execution is deliberately simple. The core loop below is from run_exec_units():
while units:
    unit = units.pop(0)
    if recorder is not None:
        recorder.add_exec(unit)
    for buf in unit.bufs:
        unwrap(buf).ensure_allocated()
    with trace(f"exec_unit[{idx}]", device=unit.device):
        last = unit()
    idx += 1
The list is consumed as it runs so Python refcounts can release intermediate buffers after their final use.
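The effect of consuming the list can be observed with a weak reference. This sketch relies on CPython's reference counting, and ToyBuffer and ToyUnit are hypothetical stand-ins rather than Ren classes.

```python
import weakref

class ToyBuffer:
    """Stand-in for an intermediate buffer."""

class ToyUnit:
    """Stand-in for an exec unit that holds references to its buffers."""
    def __init__(self, name, bufs):
        self.name = name
        self.bufs = bufs

    def __call__(self):
        return self.name

temp = ToyBuffer()
probe = weakref.ref(temp)  # observes the buffer without keeping it alive
units = [ToyUnit("k0", [temp]), ToyUnit("k1", [])]
del temp                   # from here on, only k0 references the buffer

alive = []
while units:
    unit = units.pop(0)    # rebinding `unit` drops the previous exec unit
    alive.append(probe() is not None)
    unit()
```

The buffer is alive while k0 runs and gone by the time k1 starts. Iterating with a plain for loop over the list would keep every unit, and therefore every buffer, alive until the loop finished.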
JIT
ren.jit wraps a Python function whose arguments and outputs contain Poly objects. The Poly values can be direct arguments, or they can be nested inside ordinary containers and dataclasses such as CKKS Plaintext and Ciphertext objects. It is a replay cache for a realized polynomial program.
On every call, JIT first walks the argument structure and collects the contained polynomial inputs. Inputs are realized before capture so the replay boundary is concrete buffers, not arbitrary lazy input graphs.
The cache key is a tuple of argument specs:
- realized polynomial ring, size, dtype, device, and domain
- or constant polynomial ring, value, device, and domain
If the key is new, Ren captures:
call user function
collect output Polys
run Poly.realize() while recording ExecUnits
optionally prune input-independent work
replan captured intermediate memory
build input replacement tables
try to create a backend graph runner
store the captured plan
If the key is already known, Ren replays the captured plan with new input buffers. Replay swaps the current inputs into the recorded exec units, runs either the backend graph runner or the exec units directly, then clears bound inputs from the captured plan so stale buffers are not retained by accident.
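The capture-or-replay decision can be sketched as a cache keyed on argument specs. ToyPoly, ToyJit, and arg_spec are hypothetical, the captured "plan" is just the wrapped function, and the plan limit is not modelled. The sketch only shows that the key depends on metadata, not on buffer contents.

```python
from dataclasses import dataclass

@dataclass
class ToyPoly:
    ring: str
    domain: str
    data: list

def arg_spec(p):
    """Metadata that identifies a compatible input; the contents are not part of the key."""
    return (p.ring, len(p.data), p.domain)

class ToyJit:
    def __init__(self, fn):
        self.fn = fn
        self.plans = {}    # spec key -> captured plan
        self.captures = 0

    def __call__(self, *args):
        key = tuple(arg_spec(a) for a in args)
        plan = self.plans.get(key)
        if plan is None:
            # Capture path: a real system would record exec units here.
            self.captures += 1
            plan = self.plans[key] = self.fn
        # Replay path: run the stored plan against the current input buffers.
        return plan(*[a.data for a in args])

jit_add = ToyJit(lambda x, y: [u + v for u, v in zip(x, y)])

p = ToyPoly("ring8", "COEFF", [1, 2])
q = ToyPoly("ring8", "COEFF", [3, 4])
r1 = jit_add(p, q)                                  # new key: capture
r2 = jit_add(q, q)                                  # same spec, new data: replay
r3 = jit_add(ToyPoly("ring8", "NTT", [1, 2]), q)    # domain changed: capture again
```

Only the first and third calls pay the capture cost. The second call reuses the stored plan even though its input data differs.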
JIT sharp edges
People usually run into JIT issues at the capture boundary. A jitted function must receive at least one argument structure containing a Poly, must return a structure containing at least one Poly, and cannot trigger a nested JIT capture. The cache key is also strict: changed constant values or changed realized argument specs create a new captured plan, and a single JIT object stores only a fixed number of plans.
See the dedicated JIT guide for examples, dos and don'ts, and common failure modes.
The practical model is: without JIT, every realization builds and schedules the graph again, while JIT pays that cost once per argument spec and then reuses the captured execution plan.
Reading the pipeline
The most useful files to read first are:
- ren/poly.py for the user-facing polynomial API and realization entry point
- ren/ir/op.py for the operation set
- ren/ir/node.py for graph structure, metadata, constants, and formatting
- ren/engine/schedule.py for the scheduling pass order
- ren/engine/rules/ for the individual rewrite passes
- ren/engine/memory.py for arena planning
- ren/engine/realize.py for lowering scheduled items into executable units
- ren/engine/jit.py for capture and replay
- ren/backends/python.py and ren/backends/cuda/ for backend execution
For a quick feel of the graph, build a small expression and print the node:
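from ren.poly import Poly
from ren.ring import RingSpec

# Degree-8 ring with a two-prime RNS basis
ring = RingSpec(n=8, moduli=(17, 97))
a = Poly([1, 2, 3, 4, 5, 6, 7, 8], ring)
b = Poly([8, 7, 6, 5, 4, 3, 2, 1], ring)

# Builds IR nodes only; no backend kernel runs here
expr = (a + b) * (a - 1)
print(expr.node)

# Forces scheduling, memory planning, lowering, and execution
expr.realize()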
Before realize(), the printed node shows the lazy DAG. During realization, that DAG is rewritten, materialized, split into scheduled kernels, memory-planned, lowered, and executed.